This commit makes the peer aware of the LinkFailureErrors that can
happen during link operation, and making it start a goroutine to
properly remove the link and force close the channel.
This commit attempts to resolve some potential deadlock
scenarios during a peer disconnect.
Currently, writeMessage returns a nil error when disconnecting.
This should have minimal impact on the writeHanlder, as the
subsequent loop selects on the quit chan, and will cause it to
exit. However, if this happens when sending the init message,
the Start() method will attempt to proceed even though the peer
has been disconnected.
In addition, this commit changes the behavior of synchronous
write errors, by using a non-blocking select. Though unlikely,
this prevents any cases where multiple errors are returned, and
the errors are not being pulled from the other side of the errChan.
This removes any naked sends on the errChan from stalling the peer's
shutdown.
In this commit, we ensure that any time we send a TempChannelFailure
that's destined for a multi-hop source sender, then we'll always package
the latest channel update along with it.
In this commit, we fix a bug that could at times cause a deadlock when a
peer is attempting to disconnect. The issue was that when a peer goes to
disconnect, it needs to stop any active msgStream instances. The Stop()
method of the msgStream would block until an atomic variable was set to
indicate that the stream had fully exited. However, in the case that we
disconnected lower in the msgConsumer loop, we would never set the
streamShutdown variable, meaning that msgStream.Stop() would never
unblock.
The fix for this is simple: set the streamShutdown variable within the
quit case of the second select statement in the msgConsumer goroutine.
In this commit, we might a very small change to the way writing messages
works in the peer, which should have large implications w.r.t reducing
memory usage amongst chatty nodes.
When profiling the heap on one of my nodes earlier, I noticed this
fragment:
```
Showing top 20 nodes out of 68
flat flat% sum% cum cum%
0 0% 0% 75.53MB 54.61% main.(*peer).writeHandler
75.53MB 54.61% 54.61% 75.53MB 54.61% main.(*peer).writeMessage
```
Which points to an inefficiency with the way we handle allocations when
writing new messages, drilling down further we see:
```
(pprof) list writeMessage
Total: 138.31MB
ROUTINE ======================== main.(*peer).writeMessage in /root/go/src/github.com/lightningnetwork/lnd/peer.go
75.53MB 75.53MB (flat, cum) 54.61% of Total
. . 1104: p.logWireMessage(msg, false)
. . 1105:
. . 1106: // As the Lightning wire protocol is fully message oriented, we only
. . 1107: // allows one wire message per outer encapsulated crypto message. So
. . 1108: // we'll create a temporary buffer to write the message directly to.
75.53MB 75.53MB 1109: var msgPayload [lnwire.MaxMessagePayload]byte
. . 1110: b := bytes.NewBuffer(msgPayload[0:0:len(msgPayload)])
. . 1111:
. . 1112: // With the temp buffer created and sliced properly (length zero, full
. . 1113: // capacity), we'll now encode the message directly into this buffer.
. . 1114: n, err := lnwire.WriteMessage(b, msg, 0)
(pprof) list writeHandler
Total: 138.31MB
ROUTINE ======================== main.(*peer).writeHandler in /root/go/src/github.com/lightningnetwork/lnd/peer.go
0 75.53MB (flat, cum) 54.61% of Total
. . 1148:
. . 1149: // Write out the message to the socket, closing the
. . 1150: // 'sentChan' if it's non-nil, The 'sentChan' allows
. . 1151: // callers to optionally synchronize sends with the
. . 1152: // writeHandler.
. 75.53MB 1153: err := p.writeMessage(outMsg.msg)
. . 1154: if outMsg.errChan != nil {
. . 1155: outMsg.errChan <- err
. . 1156: }
. . 1157:
. . 1158: if err != nil {
```
Ah hah! We create a _new_ buffer each time we want to write a message
out. This is unnecessary and _very_ wasteful (as seen by the profile).
The fix is simple: re-use a buffer unique to each peer when writing out
messages. Since we know what the max message size is, we just allocate
one of these 65KB buffers for each peer, and keep it around until the
peer is removed.
In this commit, we follow up to the prior commit by ensuring we won't
accept a co-op close request for a chennel with active HTLCs. When
creating a chanCloser for the first time, we'll check the set of HTLC's
and reject a request (by sending a wire error) if the target channel
still as active HTLC's.
In this commit, we fix a minor deviation in our implementation from the
specification. Before if we encountered an unknown error type, we would
disconnect the peer. Instead, we’ll now just continue along parsing the
remainder of the messages. This was flared up recently by some
c-lightning related incompatibilities that emerged on main net.
In this commit, we fix a goroutine leak that could occur if while we
were loading an error occurred in any of the steps after we created the
channel object, but before it was actually loaded in to the script. If
an error occurs at any step, we ensure that we’ll stop toe channel.
Otherwise, the sigPool goroutines would still be lingering and never be
stopped.
This commit adds a set used to track channels we consider failed. This
is done to ensure we don't end up in a connect/disconnect loop when we
attempt to re-sync the channel state of a failed channel with a peer.
In this commit, we remove the DecodeHopIterator method from the
ChannelLinkConfig struct. We do this as we no longer use this method,
since we only ever use the DecodeHopIterators method now.
In this commit, we modify the msgStream struct to ensure that it has a
cap at which it’ll continue to buffer messages. Currently we have two
msgStream structs per peer: the first for the discovery messages, and
the second for any messages that modify channel state. Due to
inefficiencies in the current protocol for reconciling graph state upon
connection (just dump the entire damn thing), when a node first starts
up, this can lead to very high memory usage as all peers will
concurrently send their initial message dump which can be in the
thousands of messages on testate.
Our fix is simple: make the message stream into a _bounded_ message
stream. The newMsgStream function now has a new argument: bufSize.
Internally, we’ll take this bufSize and create more or less an internal
semaphore for the producer. Each time the producer gets a new message,
it’ll try and read an item from the channel. If the queue still has
size, then this will succeed immediately. If not, then we’ll block
until the consumer actually finishes processing a message and then
signals by sending a new item into the channel.
We choose an initial value of 1000. This was chosen as there’s already
a max limit of outstanding adds on the commitment, and a value of 1000
should allow any incoming messages to be safely flushed and processed
by the gossiper.
In this commit, we fix a slight miscalculation within the GetInfo call.
Before this commit, we would list any channel that the peer knew of as
active, instead of those which are, well, actually *active*. We fix
this by skipping any channels that we don’t have the remote revocation
for.
In order to reduce high CPU utilization during the initial network view
sync, we slash down the total number of active in-flight jobs that can
be launched.
In this commit, we modify the way that notifications are dispatched
within the chainWatcher. Before we would *always* wait for an ack back
before we started to clean up he database state. This would at times
lead to deadlocks. To remedy this, we now allow callers to decide if
they want notifications to be sync or not. The only current caller that
requires this is the breach arbiter.
In this commit, we modify the interaction between the chanCloser
sub-system and the chain notifier all together. This fixes a series of
bugs as before this commit, we wouldn’t be able to detect if the remote
party actually broadcasted *any* of the transactions that we signed off
upon. This would be rejected to the user by having a “zombie” channel
close that would never actually be resolved.
Rather than the chanCloser watching for on-chain closes, we’ll now open
up a co-op close context to the chainWatcher (via a layer of
indirection via the ChainArbitrator), and report to it all possible
closes that we’ve signed. The chainWatcher will then be able to launch
a goroutine to properly update the database state once any of the
possible closure transactions confirms.
We no longer need to hand off new channels that come online as the
chainWatcher will be persistent, and always have an active signal for
the entire lifetime of the channel.
In this commit, we modify the logic within the Stop() method for
msgStream to ensure that the main goroutine properly exits. It has been
observed on running nodes with tens of connections, that if a node is
very flappy, then the node can end up with hundreds of leaked
goroutines.
In order to fix this, we’ll continually signal the msgConsumer to wake
up after the quit channel has been closed. We do this until the
msgConsumer sets a bool indicating that it has exited atomically.
This commit adds an overlooked case into the main type switch statement
within the peer’s readHandler. Before this commit, we would fail to
process any UpdateFailMalformedHTLC messages, possibly leading to a
commitment desynchronization. To avoid this case, we’ll no properly
process the UpdateFailMalformedHTLC message by sending the message to
an active link registered to the switch.
In this commit, we modify the logWireMessage function to ensure that we
don't attempt to nil out the LocalUnrevokedCommitPoint.Curve field
unless it's actually set. We need to do this as the field as actually
optional, and we may be reading a message from a node that doesn't
support the option.
Fixes#461.
This commit is a follow up to the prior commit: as it’s possible for
the channel_reestablish message to be sent *before* the channel has
been fully confirmed, we’ll now ensure that we process it to the link
even if the channel isn’t yet open.
In this commit, we modify the logic within loadActiveChannels to
*always* load a channel, even if it isn’t yet fully confirmed. With
this change, we ensure that we’ll always send a channel_reestablish
message upon reconnection.
Fixes#458.
In this commit, we modify the logic within the channelManager to be
able to process any retransmitted FundingLocked messages. Before this
commit, we would simply ignore any new channels sent to us, iff, we
already had an active channel with the same channel point. With the
recent change to the loadActiveChannels method in the peer, this is now
incorrect.
When a peer retransmits the FundingLocked message, it goes through to
the fundingManager. The fundingMgr will then (if we haven’t already
processed it), send the channel to the breach arbiter and also to the
peer’s channelManager. In order to handle this case properly, if we
already have the channel, we’ll check if our current channel *doesn’t*
already have the RemoteNextRevocation field set. If it doesn’t, then
this means that we haven’t yet processed the FundingLcoked message, so
we’ll process it for the first time.
This new logic will properly:
* ensure that the breachArbiter still has the most up to date channel
* allow us to update the state of the link has been added to the
switch at this point
* this link will now be eligible for forwarding after this
sequence
In this commit we revert a prior change which was added after
FundingLocked retransmission was implemented. This prior change didn’t
factor in the fact that the FundingLocked message will *only* be
re-sent after both sides receive the ChannelReestablishment message.
With the prior code, as we never added the channel to the link, we’d
never re-send the ChannelReestablishment, meaning the other side would
never send the FundingLocked message.
By unconditionally adding the channel to the switch, we ensure that
we’ll always properly retransmit the FundingLocked message.
This new field was added as a recent modification to the spec, but the
curve parameter within the attribute wasn’t set to nil. As a result
this would result in a large degree of spam within the logs when set to
trace mode. This commit fixes this issue by setting it to nil along
with all the other pub keys within messages.
It dictates in the spec, that the error message should be an ASCII
string to allow other implementations to easily discern the type of
error. The other implementations do this, but we don’t yet, but we’ll
go ahead and display it anyway as it’s helpful when debugging.
In this commit, we add a new ResetState method to the channel state
machine which will reset the state of the channel to `channelOpen`. We
add this as before this commit, it was possible for a channel to shift
into the closing state, the closing negotiation be cancelled for
whatever reason, resulting the the channel held by the breachArbiter
unable to act to potential on-chain events.
In this commit, we refactor the existing channel closure logic for
co-op closes to use the new channelCloser state machine. This results
in a large degree of deleted code as all the logic is now centralized
to a single state machine.
This commit addresses an issue that could occur if a
message was attempted added to the sendQueue by the
queueHandler before the writeHandler had started.
If a message was sent to the queueHandler before the
writeHandler was ready to accept messages on the
sendQueue, the message would be added to the
pendingMsg queue, but would not be attempted sent
on the sendQueue again before a new incoming message
triggered a new attempt.
In this commit BOLT№2 retranmission logic for the channel link have
been added. Now if channel link have been initialised with the
'SyncState' field than it will send the lnwire.ChannelReestablish
message and will be waiting for receiving the same message from remote
side. Exchange of this message allow both sides understand which
updates they should exchange with each other in order sync their
states.
This commit refactors the core logic of the
chanMsgStream to support an additional stream
that is used to asynchronously queue for in-order
delivery to the authenticated gossiper. The channel
streams are slightly adapted to use the more flexible
primitive. We may look to refactor this using more
isolated interfaces, but for now this provides a
minimal change to resolving known flakes.
In this commit we fix an existing bug within the msgConsumer grouting
of the chanMsgStream that could result in a partial deadlock, as the
readHandler would no longer be able to add messages to the message
queue. The primary cause of this issue would be if we got an update for
a channel that “we don’t know of”. The main loop would continue,
leaving the mutex unlocked. We would then try to re-lock at the top of
the loop, leading to a deadlock.
We avoid this situation by properly unlocking the condition variable as
soon as we’re done modifying the condition itself.
In this commit we add the set of local features advertised as a
parameter to the newPeer function. With this change, the server will be
able to programmatically determine _which_ bits should be set on a
connection basis, rather than re-using the same global set of bits for
each peer.
This commit fills in an existing logging gap by adding a new set of
message summaries that is shown for the debug logging level.
Before this commit, if a user wanted to get a close up feel for what
lnd was doing under the covers, they had to use the trace logging
level. Trace can be very verbose, so we now provide a debug logging
level with message “summaries”. The summaries may not contain all the
data in the message, hut have been crafted in order to provide
sufficient detail at a glance.
This commit fixes an existing bug within the iteration between the
queueHandler and the writeHandler. Under certain scenarios, if the
writeHandler was blocked for a non negligible period of time, then the
queueHandler would enter a very tight spinning loop. This was due to
the fact that the break statement in the inner select loop of the
queueHandler wouldn’t actually break the inner for loop, instead it
would cause the execution logic to re-enter that same select loop,
causing a very tight spin.
In this commit, we fix the issue by adding to things: we now label the
inner select loop so we can break out of it if we detect that the
writeHandler has blocked. Secondly, we introduce a new channel between
the queueHandler and the writeHandler to signal the queueHandler that
the writeHandler has finished processing the last message.
In this commit, we add an idle timer to the readHandler itself. This
will serve to slowly prune away inactive TCP connections as a result of
remote peer being blocked either upon reading or writing to the socket.
Our ping timer interval is 1 minute, so an idle timer interval of 5
minutes seem reasonable.
This commit fixes a bug to wrap up the recently merged PR to properly
handle duplicate FundingLocked retransmissions and also ensure that we
reliably re-send the FundingLocked message if we’re unable to the first
time around.
In this commit, we skip processing a channel that does not yet have a
set remote revocation as otherwise, if we attempt to trigger a state
update, then we’ll be attempting to manipulate a nil commitment point.
Therefore, we’ll rely on the fundingManager to properly send the
channel all relevant subsystems.
This commit adds a new debug mode for lnd
called hodlhtlc. This mode instructs a node
to refrain from settling incoming HTLCs for
which it is the exit node. We plan to use
this in testing to more precisely control
the states a node can take during
execution.
This commit adds a precautionary check for the error returned if the
channel hasn’t yet been announced when attempting to read the our
current routing policy to initialize the channelLink for a channel.
Previously, if the channel wasn’t they announced, the function would
return early instead of using the default policy.
We also include another bug fix, that avoids a possible nil pointer
panic in the case that the ChannelEdgeInfo reread form the graph is
nil.
This commit fixes a bug that could arise if either we had not, or the
remote party had not advertised a routing policy for either outgoing
channel edge. In this commit, we now detect if a policy wasn’t
advertised, falling back to the default routing policy if so.
Fixes#259.
This commit fixes a lingering bug within the logic for the
peer/htlcswitch/channellink. When the link needs to fetch the latest
update to send to a sending party due to a violation of the set routing
policy, previously it would modify the timestamp on the message read
from disk. This was incorrect as it would invalidate the signature
within the message itself. We fix this by instead
This commit adds another conditional send select statement to ensure
that when sending the finalized contract to the breach arbiter, the
peer doesn’t possible cause the daemon to hang on shutdown.
This commit modifies the logic when we are loading alll the channels
that we have with a particular peer to grab the current committed
forwarding policy from disk rather then using the default forwarding
policy. We do this as it’s now possible for active channels to have
distinct forwarding policies.
This commit fixes a possible deadlock bug that may arise during
shutdown due to an unconditional send on a channel to the breach
arbiter. We do this on two occasions within the peer: when loading a
new contract to give it the live version, and also when closing a
channel to ensure that it no longer watches over it.
Previously it was possible for these sends to block indefinitely in the
scenario that the server was shutting down (which means the breach
arbiter) is. As a result, the channel would never be drained, meaning
the server couldn’t complete shutdown as the peer hadn’t exited yet.
This commit adds the fee negotiation procedure performed
on channel shutdown. The current algorithm picks an ideal
a fee based on the FeeEstimator and commit weigth, then
accepts the remote's fee if it is at most 50%-200% away
from the ideal. The fee negotiation procedure is similar
both as sender and receiver of the initial shutdown
message, and this commit also make both sides use the
same code path for handling these messages.
This commit fixes a bug which was covered by the recent server
refactoring wherein the grouting would be stuck on the send over the
message channel in the case that the handshake failed. This blockage
would create a deadlock now that the ConnectToPeer method is full
synchronous.
We fix this issue by ensuring the goroutine properly exits.
In addition to improved synchronization between the client
and server, this commit also moves the channel snapshotting
procedure such that it is handled without submitting a query
to the primary select statement. This is primarily done as a
precaution to ensure that no deadlocks occur, has channel
snapshotting has the potential to block restarts.