This commit attempts to resolve some potential deadlock
scenarios during a peer disconnect.
Currently, writeMessage returns a nil error when disconnecting.
This should have minimal impact on the writeHanlder, as the
subsequent loop selects on the quit chan, and will cause it to
exit. However, if this happens when sending the init message,
the Start() method will attempt to proceed even though the peer
has been disconnected.
In addition, this commit changes the behavior of synchronous
write errors, by using a non-blocking select. Though unlikely,
this prevents any cases where multiple errors are returned, and
the errors are not being pulled from the other side of the errChan.
This removes any naked sends on the errChan from stalling the peer's
shutdown.
In this commit, we ensure that any time we send a TempChannelFailure
that's destined for a multi-hop source sender, then we'll always package
the latest channel update along with it.
In this commit, we fix a bug that could at times cause a deadlock when a
peer is attempting to disconnect. The issue was that when a peer goes to
disconnect, it needs to stop any active msgStream instances. The Stop()
method of the msgStream would block until an atomic variable was set to
indicate that the stream had fully exited. However, in the case that we
disconnected lower in the msgConsumer loop, we would never set the
streamShutdown variable, meaning that msgStream.Stop() would never
unblock.
The fix for this is simple: set the streamShutdown variable within the
quit case of the second select statement in the msgConsumer goroutine.
In this commit, we might a very small change to the way writing messages
works in the peer, which should have large implications w.r.t reducing
memory usage amongst chatty nodes.
When profiling the heap on one of my nodes earlier, I noticed this
fragment:
```
Showing top 20 nodes out of 68
flat flat% sum% cum cum%
0 0% 0% 75.53MB 54.61% main.(*peer).writeHandler
75.53MB 54.61% 54.61% 75.53MB 54.61% main.(*peer).writeMessage
```
Which points to an inefficiency with the way we handle allocations when
writing new messages, drilling down further we see:
```
(pprof) list writeMessage
Total: 138.31MB
ROUTINE ======================== main.(*peer).writeMessage in /root/go/src/github.com/lightningnetwork/lnd/peer.go
75.53MB 75.53MB (flat, cum) 54.61% of Total
. . 1104: p.logWireMessage(msg, false)
. . 1105:
. . 1106: // As the Lightning wire protocol is fully message oriented, we only
. . 1107: // allows one wire message per outer encapsulated crypto message. So
. . 1108: // we'll create a temporary buffer to write the message directly to.
75.53MB 75.53MB 1109: var msgPayload [lnwire.MaxMessagePayload]byte
. . 1110: b := bytes.NewBuffer(msgPayload[0:0:len(msgPayload)])
. . 1111:
. . 1112: // With the temp buffer created and sliced properly (length zero, full
. . 1113: // capacity), we'll now encode the message directly into this buffer.
. . 1114: n, err := lnwire.WriteMessage(b, msg, 0)
(pprof) list writeHandler
Total: 138.31MB
ROUTINE ======================== main.(*peer).writeHandler in /root/go/src/github.com/lightningnetwork/lnd/peer.go
0 75.53MB (flat, cum) 54.61% of Total
. . 1148:
. . 1149: // Write out the message to the socket, closing the
. . 1150: // 'sentChan' if it's non-nil, The 'sentChan' allows
. . 1151: // callers to optionally synchronize sends with the
. . 1152: // writeHandler.
. 75.53MB 1153: err := p.writeMessage(outMsg.msg)
. . 1154: if outMsg.errChan != nil {
. . 1155: outMsg.errChan <- err
. . 1156: }
. . 1157:
. . 1158: if err != nil {
```
Ah hah! We create a _new_ buffer each time we want to write a message
out. This is unnecessary and _very_ wasteful (as seen by the profile).
The fix is simple: re-use a buffer unique to each peer when writing out
messages. Since we know what the max message size is, we just allocate
one of these 65KB buffers for each peer, and keep it around until the
peer is removed.
In this commit, we follow up to the prior commit by ensuring we won't
accept a co-op close request for a chennel with active HTLCs. When
creating a chanCloser for the first time, we'll check the set of HTLC's
and reject a request (by sending a wire error) if the target channel
still as active HTLC's.
In this commit, we fix a minor deviation in our implementation from the
specification. Before if we encountered an unknown error type, we would
disconnect the peer. Instead, we’ll now just continue along parsing the
remainder of the messages. This was flared up recently by some
c-lightning related incompatibilities that emerged on main net.
In this commit, we fix a goroutine leak that could occur if while we
were loading an error occurred in any of the steps after we created the
channel object, but before it was actually loaded in to the script. If
an error occurs at any step, we ensure that we’ll stop toe channel.
Otherwise, the sigPool goroutines would still be lingering and never be
stopped.
This commit adds a set used to track channels we consider failed. This
is done to ensure we don't end up in a connect/disconnect loop when we
attempt to re-sync the channel state of a failed channel with a peer.
In this commit, we remove the DecodeHopIterator method from the
ChannelLinkConfig struct. We do this as we no longer use this method,
since we only ever use the DecodeHopIterators method now.
In this commit, we modify the msgStream struct to ensure that it has a
cap at which it’ll continue to buffer messages. Currently we have two
msgStream structs per peer: the first for the discovery messages, and
the second for any messages that modify channel state. Due to
inefficiencies in the current protocol for reconciling graph state upon
connection (just dump the entire damn thing), when a node first starts
up, this can lead to very high memory usage as all peers will
concurrently send their initial message dump which can be in the
thousands of messages on testate.
Our fix is simple: make the message stream into a _bounded_ message
stream. The newMsgStream function now has a new argument: bufSize.
Internally, we’ll take this bufSize and create more or less an internal
semaphore for the producer. Each time the producer gets a new message,
it’ll try and read an item from the channel. If the queue still has
size, then this will succeed immediately. If not, then we’ll block
until the consumer actually finishes processing a message and then
signals by sending a new item into the channel.
We choose an initial value of 1000. This was chosen as there’s already
a max limit of outstanding adds on the commitment, and a value of 1000
should allow any incoming messages to be safely flushed and processed
by the gossiper.
In this commit, we fix a slight miscalculation within the GetInfo call.
Before this commit, we would list any channel that the peer knew of as
active, instead of those which are, well, actually *active*. We fix
this by skipping any channels that we don’t have the remote revocation
for.
In order to reduce high CPU utilization during the initial network view
sync, we slash down the total number of active in-flight jobs that can
be launched.
In this commit, we modify the way that notifications are dispatched
within the chainWatcher. Before we would *always* wait for an ack back
before we started to clean up he database state. This would at times
lead to deadlocks. To remedy this, we now allow callers to decide if
they want notifications to be sync or not. The only current caller that
requires this is the breach arbiter.
In this commit, we modify the interaction between the chanCloser
sub-system and the chain notifier all together. This fixes a series of
bugs as before this commit, we wouldn’t be able to detect if the remote
party actually broadcasted *any* of the transactions that we signed off
upon. This would be rejected to the user by having a “zombie” channel
close that would never actually be resolved.
Rather than the chanCloser watching for on-chain closes, we’ll now open
up a co-op close context to the chainWatcher (via a layer of
indirection via the ChainArbitrator), and report to it all possible
closes that we’ve signed. The chainWatcher will then be able to launch
a goroutine to properly update the database state once any of the
possible closure transactions confirms.
We no longer need to hand off new channels that come online as the
chainWatcher will be persistent, and always have an active signal for
the entire lifetime of the channel.
In this commit, we modify the logic within the Stop() method for
msgStream to ensure that the main goroutine properly exits. It has been
observed on running nodes with tens of connections, that if a node is
very flappy, then the node can end up with hundreds of leaked
goroutines.
In order to fix this, we’ll continually signal the msgConsumer to wake
up after the quit channel has been closed. We do this until the
msgConsumer sets a bool indicating that it has exited atomically.
This commit adds an overlooked case into the main type switch statement
within the peer’s readHandler. Before this commit, we would fail to
process any UpdateFailMalformedHTLC messages, possibly leading to a
commitment desynchronization. To avoid this case, we’ll no properly
process the UpdateFailMalformedHTLC message by sending the message to
an active link registered to the switch.
In this commit, we modify the logWireMessage function to ensure that we
don't attempt to nil out the LocalUnrevokedCommitPoint.Curve field
unless it's actually set. We need to do this as the field as actually
optional, and we may be reading a message from a node that doesn't
support the option.
Fixes#461.