Sometimes when performing an initial sync, the remote
node isn't able to pull messages off the wire because
of long running tasks and queues are saturated. With
a shorter write timeout, we will give up trying to send
messages and teardown the connection, even though the
peer is still active.
This commit adds additional synchronization logic to
WaitForDisconnect, such that it can be spawned before
Start has been executed by the server. Without
modification, the current version will return
immediately since no goroutines will have been
spawned.
To solve this, we modify WaitForDisconnect to block until:
1) the peer is disconnected,
2) the peer is successfully started,
before watching the waitgroup.
In the first case, the waitgroup will block until all
(if any) spawned goroutines have exited. Otherwise, if
the Start is successful, we can switch to watching the
waitgroup, knowing that waitgroup counter is positive.
In this commit, we modify the existing message sending functionality
within the fundingmanager. Due to each mesage send requiring to hold the
server's lock to retrieve the peer, we might run into a case where the
lock is held for a larger than usual amount of time and would therefore
block on sending the message within the fundingmanager. We remedy this
by taking a similar approach to some recent changes within the gossiper.
We now keep track of each peer within the internal fundingmanager
messages and send messages directly to them.
In this commit, we add a timeout within the writeMessage method when we go to write to the socket. We do this as otherwise, if the other peer is blocked for some reason, we'll never actually unblock ourselves, which may cause issues in other sub-systems waiting on this write call. For now, we use a value of 10 seconds, and will adjust in the future if we deem this time period too short.
In this commit, we move the block height dependency from the links in
the switch to the switch itself. This is possible due to a recent change
on the links no longer depending on the block height to update their
commitment fees.
We'll now only have the switch be alerted of new blocks coming in and
links will retrieve the height from it atomically.
In this commit, we modify the behavior of links updating their
commitment fees. Rather than attempting to update the commitment fee for
each link every time a new block comes in, we'll use a timer with a
random interval between 10 and 60 minutes for each link to determine
when to update their corresponding commitment fee. This prevents us from
oscillating the fee rate for our various commitment transactions.
In this commit we fix an existing bug which could cause internal state
inconsistency between then switch, funding manager, and the peer. Before
this commit, we would _always_ add a new channel to the channelManager.
However, due to recent logic, it may be the case that this isn't the
channel that will ultimately reside in the link. As a result, we would
be unable to process incoming FundingLocked messages properly, as we
would mutate the incorrect channel in memory.
We remedy this by moving the inserting of the new channel into the
activeChannels map until the end of the loadActiveChannels method, where
we know that this will be the link that persists.
In this commit, we fix a recently introduced bug. The issue is that
while we're failing the link, the peer we're attempting to force close
on may disconnect. As a result, if the peerTerminationWatcher exits
before we can add to the wait group (it's waiting on that), then we'll
run into a panic as we're attempting to increment the wait group while
another goroutine is calling wait.
The fix is to first check that the server isn't shutting down, and then
use the server's wait group rather than the peer to synchronize
goroutines.
Fixes#1285.
This commit makes the peer aware of the LinkFailureErrors that can
happen during link operation, and making it start a goroutine to
properly remove the link and force close the channel.
This commit attempts to resolve some potential deadlock
scenarios during a peer disconnect.
Currently, writeMessage returns a nil error when disconnecting.
This should have minimal impact on the writeHanlder, as the
subsequent loop selects on the quit chan, and will cause it to
exit. However, if this happens when sending the init message,
the Start() method will attempt to proceed even though the peer
has been disconnected.
In addition, this commit changes the behavior of synchronous
write errors, by using a non-blocking select. Though unlikely,
this prevents any cases where multiple errors are returned, and
the errors are not being pulled from the other side of the errChan.
This removes any naked sends on the errChan from stalling the peer's
shutdown.
In this commit, we ensure that any time we send a TempChannelFailure
that's destined for a multi-hop source sender, then we'll always package
the latest channel update along with it.
In this commit, we fix a bug that could at times cause a deadlock when a
peer is attempting to disconnect. The issue was that when a peer goes to
disconnect, it needs to stop any active msgStream instances. The Stop()
method of the msgStream would block until an atomic variable was set to
indicate that the stream had fully exited. However, in the case that we
disconnected lower in the msgConsumer loop, we would never set the
streamShutdown variable, meaning that msgStream.Stop() would never
unblock.
The fix for this is simple: set the streamShutdown variable within the
quit case of the second select statement in the msgConsumer goroutine.
In this commit, we might a very small change to the way writing messages
works in the peer, which should have large implications w.r.t reducing
memory usage amongst chatty nodes.
When profiling the heap on one of my nodes earlier, I noticed this
fragment:
```
Showing top 20 nodes out of 68
flat flat% sum% cum cum%
0 0% 0% 75.53MB 54.61% main.(*peer).writeHandler
75.53MB 54.61% 54.61% 75.53MB 54.61% main.(*peer).writeMessage
```
Which points to an inefficiency with the way we handle allocations when
writing new messages, drilling down further we see:
```
(pprof) list writeMessage
Total: 138.31MB
ROUTINE ======================== main.(*peer).writeMessage in /root/go/src/github.com/lightningnetwork/lnd/peer.go
75.53MB 75.53MB (flat, cum) 54.61% of Total
. . 1104: p.logWireMessage(msg, false)
. . 1105:
. . 1106: // As the Lightning wire protocol is fully message oriented, we only
. . 1107: // allows one wire message per outer encapsulated crypto message. So
. . 1108: // we'll create a temporary buffer to write the message directly to.
75.53MB 75.53MB 1109: var msgPayload [lnwire.MaxMessagePayload]byte
. . 1110: b := bytes.NewBuffer(msgPayload[0:0:len(msgPayload)])
. . 1111:
. . 1112: // With the temp buffer created and sliced properly (length zero, full
. . 1113: // capacity), we'll now encode the message directly into this buffer.
. . 1114: n, err := lnwire.WriteMessage(b, msg, 0)
(pprof) list writeHandler
Total: 138.31MB
ROUTINE ======================== main.(*peer).writeHandler in /root/go/src/github.com/lightningnetwork/lnd/peer.go
0 75.53MB (flat, cum) 54.61% of Total
. . 1148:
. . 1149: // Write out the message to the socket, closing the
. . 1150: // 'sentChan' if it's non-nil, The 'sentChan' allows
. . 1151: // callers to optionally synchronize sends with the
. . 1152: // writeHandler.
. 75.53MB 1153: err := p.writeMessage(outMsg.msg)
. . 1154: if outMsg.errChan != nil {
. . 1155: outMsg.errChan <- err
. . 1156: }
. . 1157:
. . 1158: if err != nil {
```
Ah hah! We create a _new_ buffer each time we want to write a message
out. This is unnecessary and _very_ wasteful (as seen by the profile).
The fix is simple: re-use a buffer unique to each peer when writing out
messages. Since we know what the max message size is, we just allocate
one of these 65KB buffers for each peer, and keep it around until the
peer is removed.
In this commit, we follow up to the prior commit by ensuring we won't
accept a co-op close request for a chennel with active HTLCs. When
creating a chanCloser for the first time, we'll check the set of HTLC's
and reject a request (by sending a wire error) if the target channel
still as active HTLC's.
In this commit, we fix a minor deviation in our implementation from the
specification. Before if we encountered an unknown error type, we would
disconnect the peer. Instead, we’ll now just continue along parsing the
remainder of the messages. This was flared up recently by some
c-lightning related incompatibilities that emerged on main net.
In this commit, we fix a goroutine leak that could occur if while we
were loading an error occurred in any of the steps after we created the
channel object, but before it was actually loaded in to the script. If
an error occurs at any step, we ensure that we’ll stop toe channel.
Otherwise, the sigPool goroutines would still be lingering and never be
stopped.
This commit adds a set used to track channels we consider failed. This
is done to ensure we don't end up in a connect/disconnect loop when we
attempt to re-sync the channel state of a failed channel with a peer.
In this commit, we remove the DecodeHopIterator method from the
ChannelLinkConfig struct. We do this as we no longer use this method,
since we only ever use the DecodeHopIterators method now.
In this commit, we modify the msgStream struct to ensure that it has a
cap at which it’ll continue to buffer messages. Currently we have two
msgStream structs per peer: the first for the discovery messages, and
the second for any messages that modify channel state. Due to
inefficiencies in the current protocol for reconciling graph state upon
connection (just dump the entire damn thing), when a node first starts
up, this can lead to very high memory usage as all peers will
concurrently send their initial message dump which can be in the
thousands of messages on testate.
Our fix is simple: make the message stream into a _bounded_ message
stream. The newMsgStream function now has a new argument: bufSize.
Internally, we’ll take this bufSize and create more or less an internal
semaphore for the producer. Each time the producer gets a new message,
it’ll try and read an item from the channel. If the queue still has
size, then this will succeed immediately. If not, then we’ll block
until the consumer actually finishes processing a message and then
signals by sending a new item into the channel.
We choose an initial value of 1000. This was chosen as there’s already
a max limit of outstanding adds on the commitment, and a value of 1000
should allow any incoming messages to be safely flushed and processed
by the gossiper.
In this commit, we fix a slight miscalculation within the GetInfo call.
Before this commit, we would list any channel that the peer knew of as
active, instead of those which are, well, actually *active*. We fix
this by skipping any channels that we don’t have the remote revocation
for.
In order to reduce high CPU utilization during the initial network view
sync, we slash down the total number of active in-flight jobs that can
be launched.
In this commit, we modify the way that notifications are dispatched
within the chainWatcher. Before we would *always* wait for an ack back
before we started to clean up he database state. This would at times
lead to deadlocks. To remedy this, we now allow callers to decide if
they want notifications to be sync or not. The only current caller that
requires this is the breach arbiter.
In this commit, we modify the interaction between the chanCloser
sub-system and the chain notifier all together. This fixes a series of
bugs as before this commit, we wouldn’t be able to detect if the remote
party actually broadcasted *any* of the transactions that we signed off
upon. This would be rejected to the user by having a “zombie” channel
close that would never actually be resolved.
Rather than the chanCloser watching for on-chain closes, we’ll now open
up a co-op close context to the chainWatcher (via a layer of
indirection via the ChainArbitrator), and report to it all possible
closes that we’ve signed. The chainWatcher will then be able to launch
a goroutine to properly update the database state once any of the
possible closure transactions confirms.
We no longer need to hand off new channels that come online as the
chainWatcher will be persistent, and always have an active signal for
the entire lifetime of the channel.
In this commit, we modify the logic within the Stop() method for
msgStream to ensure that the main goroutine properly exits. It has been
observed on running nodes with tens of connections, that if a node is
very flappy, then the node can end up with hundreds of leaked
goroutines.
In order to fix this, we’ll continually signal the msgConsumer to wake
up after the quit channel has been closed. We do this until the
msgConsumer sets a bool indicating that it has exited atomically.
This commit adds an overlooked case into the main type switch statement
within the peer’s readHandler. Before this commit, we would fail to
process any UpdateFailMalformedHTLC messages, possibly leading to a
commitment desynchronization. To avoid this case, we’ll no properly
process the UpdateFailMalformedHTLC message by sending the message to
an active link registered to the switch.
In this commit, we modify the logWireMessage function to ensure that we
don't attempt to nil out the LocalUnrevokedCommitPoint.Curve field
unless it's actually set. We need to do this as the field as actually
optional, and we may be reading a message from a node that doesn't
support the option.
Fixes#461.
This commit is a follow up to the prior commit: as it’s possible for
the channel_reestablish message to be sent *before* the channel has
been fully confirmed, we’ll now ensure that we process it to the link
even if the channel isn’t yet open.
In this commit, we modify the logic within loadActiveChannels to
*always* load a channel, even if it isn’t yet fully confirmed. With
this change, we ensure that we’ll always send a channel_reestablish
message upon reconnection.
Fixes#458.
In this commit, we modify the logic within the channelManager to be
able to process any retransmitted FundingLocked messages. Before this
commit, we would simply ignore any new channels sent to us, iff, we
already had an active channel with the same channel point. With the
recent change to the loadActiveChannels method in the peer, this is now
incorrect.
When a peer retransmits the FundingLocked message, it goes through to
the fundingManager. The fundingMgr will then (if we haven’t already
processed it), send the channel to the breach arbiter and also to the
peer’s channelManager. In order to handle this case properly, if we
already have the channel, we’ll check if our current channel *doesn’t*
already have the RemoteNextRevocation field set. If it doesn’t, then
this means that we haven’t yet processed the FundingLcoked message, so
we’ll process it for the first time.
This new logic will properly:
* ensure that the breachArbiter still has the most up to date channel
* allow us to update the state of the link has been added to the
switch at this point
* this link will now be eligible for forwarding after this
sequence
In this commit we revert a prior change which was added after
FundingLocked retransmission was implemented. This prior change didn’t
factor in the fact that the FundingLocked message will *only* be
re-sent after both sides receive the ChannelReestablishment message.
With the prior code, as we never added the channel to the link, we’d
never re-send the ChannelReestablishment, meaning the other side would
never send the FundingLocked message.
By unconditionally adding the channel to the switch, we ensure that
we’ll always properly retransmit the FundingLocked message.
This new field was added as a recent modification to the spec, but the
curve parameter within the attribute wasn’t set to nil. As a result
this would result in a large degree of spam within the logs when set to
trace mode. This commit fixes this issue by setting it to nil along
with all the other pub keys within messages.
It dictates in the spec, that the error message should be an ASCII
string to allow other implementations to easily discern the type of
error. The other implementations do this, but we don’t yet, but we’ll
go ahead and display it anyway as it’s helpful when debugging.
In this commit, we add a new ResetState method to the channel state
machine which will reset the state of the channel to `channelOpen`. We
add this as before this commit, it was possible for a channel to shift
into the closing state, the closing negotiation be cancelled for
whatever reason, resulting the the channel held by the breachArbiter
unable to act to potential on-chain events.