In this commit, we fix a bug that caused us to be unable to properly
handle malformed HTLC failures from our direct link. Before this commit,
we would attempt to decrypt the failure and fail since it wasn't well
formed. Now, if it's an error for a local payment and it needed to be
converted, we'll decode it without decrypting, since it's already
plaintext.
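As a rough sketch of that branch (the function, flags, and signature are
hypothetical; only bytes, errors, and lnd's lnwire package are assumed,
and the lnwire.DecodeFailure call is an assumption about its decoding
helper):

    // decodeLinkFailure interprets a failure reason received from our
    // direct link. Names and structure here are illustrative only.
    func decodeLinkFailure(reason []byte, isLocalPayment,
        needsConversion bool) (lnwire.FailureMessage, error) {

        // A failure converted from an UpdateFailMalformedHTLC for one of
        // our own payments is already plaintext, so decode it directly
        // rather than trying to peel onion layers off of it.
        if isLocalPayment && needsConversion {
            return lnwire.DecodeFailure(bytes.NewReader(reason), 0)
        }

        // Otherwise the reason is onion encrypted and must be decrypted
        // before it can be decoded (that path is omitted from this
        // sketch).
        return nil, errors.New("decryption path omitted from sketch")
    }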
In this commit, we add a new method to the ErrorEncrypter interface:
`EncryptMalformedError`. This takes a raw error (no encryption or MAC),
and encrypts it as if we were the originator of this error. This will be
used by the switch to convert malformed fail errors to regular fully
encrypted errors.
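A sketch of the interface surface this adds; the signatures approximate
lnd's htlcswitch.ErrorEncrypter rather than quoting it, and the real
interface carries additional methods not shown here:

    // ErrorEncrypterSketch captures the relevant methods. A raw failure
    // arriving via UpdateFailMalformedHTLC has no encryption or MAC, so
    // EncryptMalformedError wraps it as if this node originated it, after
    // which it can travel backwards like any other encrypted failure.
    type ErrorEncrypterSketch interface {
        // EncryptFirstHop encrypts a failure that we ourselves originate.
        EncryptFirstHop(lnwire.FailureMessage) (lnwire.OpaqueReason, error)

        // IntermediateEncrypt adds our layer to an already encrypted
        // failure that is merely passing through us.
        IntermediateEncrypt(lnwire.OpaqueReason) lnwire.OpaqueReason

        // EncryptMalformedError takes a raw error (no encryption or MAC)
        // and encrypts it as if we were the originator of this error.
        EncryptMalformedError(lnwire.OpaqueReason) lnwire.OpaqueReason
    }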
In this commit, we start the first phase of fixing an existing bug
within the switch. As is, we don't properly convert
`UpdateFailMalformedHTLC` to regular `UpdateFailHTLC` messages that are
fully encrypted. When we receive an `UpdateFailMalformedHTLC` today,
we'll convert it into a regular fail message by simply encoding the
failure message raw. This failure message doesn't have a MAC yet, so if
we sent it backwards, the destination wouldn't be able to decrypt it. We
can recognize this type of failure because it'll be the raw failure
message's maximum size plus 4 extra bytes for the encoding. When we come
across such a message, we'll mark it as needing conversion so the switch
can take care of it.
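A minimal illustrative predicate for that size check, assuming lnd's
lnwire.FailureMessageLength constant; the actual check in the switch may
be expressed differently:

    // looksLikeRawFailure reports whether an encoded failure reason
    // appears to be a raw, unencrypted failure produced from an
    // UpdateFailMalformedHTLC, and therefore still needs conversion by
    // the switch: per the rule above, such a reason is the maximum
    // failure message size plus 4 bytes of encoding overhead.
    func looksLikeRawFailure(reason lnwire.OpaqueReason) bool {
        return len(reason) == lnwire.FailureMessageLength+4
    }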
In this commit, we update the process that we use to generate a sphinx
packet to send our onion routed HTLC. Due to recent changes in the
`sphinx` package we use, we now need to use a new PaymentPath struct. As
a result, it no longer makes sense to split up the nodes in a route and
their per-hop payloads, as they're now contained in the same struct. All
tests have been updated accordingly.
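A sketch of how a route's hops map into the new struct; the OnionHop
field names (NodePub, HopData) reflect one version of the lightning-onion
package and may differ in others, so treat this as an illustration rather
than the exact lnd code:

    // buildPaymentPath pairs each hop's node key with its per-hop payload
    // inside a single sphinx.PaymentPath, instead of threading two
    // parallel slices through the code.
    func buildPaymentPath(hopKeys []*btcec.PublicKey,
        hopData []sphinx.HopData) (*sphinx.PaymentPath, error) {

        var path sphinx.PaymentPath
        if len(hopKeys) != len(hopData) || len(hopKeys) > len(path) {
            return nil, fmt.Errorf("invalid route: %d keys, %d payloads",
                len(hopKeys), len(hopData))
        }

        // Each entry now carries the node and its payload side by side.
        for i := range hopKeys {
            path[i] = sphinx.OnionHop{
                NodePub: *hopKeys[i],
                HopData: hopData[i],
            }
        }

        return &path, nil
    }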
Prior to this change, the numQueryResponses that we calculated would be
one more than what we actually wanted since it didn't account for the
initial QueryChannelRange msg. This resulted in the test sending one
more delayed query than was configured. This doesn't fundamentally
impact the test, but does make what happens in the test more reflective
of the configuration.
This commit makes all replies in the gossip syncer synchronous, meaning
that they will wait for each message to be successfully written to the
remote peer before attempting to send the next. This helps throttle
messages the remote peer has requested, preventing unintended
disconnects when the remote peer is slow to process messages. This
change also helps reduce congestion in the peer by forcing the syncer to
buffer the messages instead of dumping them into the peer's queue.
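Conceptually, the reply path now looks like the following sketch, where
the peer interface mirrors lnd's lnpeer.Peer.SendMessage with its sync
flag; the helper itself is illustrative:

    // sendReplies writes each queued reply to the remote peer, blocking
    // until a message has been handed to the wire before sending the
    // next.
    func sendReplies(peer lnpeer.Peer, replies []lnwire.Message) error {
        for _, msg := range replies {
            // sync=true: wait for the write to complete instead of
            // dumping the message into the peer's outgoing queue, which
            // throttles the backlog to the remote peer's read speed.
            if err := peer.SendMessage(true, msg); err != nil {
                return err
            }
        }
        return nil
    }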
As a preparatory step to making gossip replies synchronous, we will move
the ErrPeerExiting error into the lnpeer package so that it can be
imported by the discovery package. With synchronous sends, this error
can now be detected and handled by goroutines in the syncer, causing
them to exit instead of continuing to send their backlogs.
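An illustrative goroutine body showing how the exported error is used
once sends are synchronous (structure and names are not the exact lnd
code):

    // replyWorker streams a backlog of gossip replies; with synchronous
    // sends, lnpeer.ErrPeerExiting can now be observed here, so the
    // worker exits instead of continuing to send to a peer that is going
    // away.
    func replyWorker(peer lnpeer.Peer, backlog []lnwire.Message) {
        for _, msg := range backlog {
            err := peer.SendMessage(true, msg)
            if errors.Is(err, lnpeer.ErrPeerExiting) {
                // The peer is shutting down: abandon the rest of the
                // backlog and let this goroutine exit.
                return
            }
            if err != nil {
                // Any other send failure also ends the worker.
                return
            }
        }
    }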
This commit creates a distinct replyHandler, completely isolating the
requesting state machine from the processing of queries from the remote
peer. Before, the two were interlaced, and the syncer could only reply
to messages in certain states. Now the two are completely separated,
which is a preliminary step toward making the replies synchronous (as
otherwise we would be blocking our own requesting state machine).
With this change, the channelGraphSyncer of each peer will drive the
replyHandler of the other. The two can now operate independently, or
even be spun up conditionally depending on advertised support for gossip
queries, as shown below:
           A                                        B

    channelGraphSyncer  ---control-msg--->
                                                replyHandler
    channelGraphSyncer  <--control-msg----
         gossiper       <--gossip-msgs----

                        <--control-msg----  channelGraphSyncer
       replyHandler
                        ---control-msg--->  channelGraphSyncer
                        ---gossip-msgs--->       gossiper
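The conditional spin-up can be pictured with the following illustrative
sketch; lnd's GossipSyncer does not expose this interface, it simply runs
the two methods as goroutines:

    // syncerParts names the two independent state machines; this
    // interface exists purely for illustration.
    type syncerParts interface {
        // channelGraphSyncer drives our own requests against the peer.
        channelGraphSyncer()

        // replyHandler serves the remote peer's queries.
        replyHandler()
    }

    // startParts spins up each half only when needed, mirroring the
    // diagram above.
    func startParts(s syncerParts, serveRemoteQueries, syncFromRemote bool) {
        if serveRemoteQueries {
            go s.replyHandler()
        }
        if syncFromRemote {
            go s.channelGraphSyncer()
        }
    }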
In this commit, we modify the way we detect local force closes. Before
this commit, we would directly check the broadcast commitment's txid
against what we know to be our best local commitment. In the case of DLP
recovery from an SCB, it's possible that the user force closed, _then_
attempted to recover their channels. As a result, we need to check the
outputs directly in order to also handle this rare, but
possible recovery scenario.
The new detection method uses the outputs to detect if it's a local
commitment or not. Based on the state number, we'll re-derive the
expected scripts, and check to see if they're on the commitment. If not,
then we know it's a remote force close. A new test has been added to
exercise this new behavior, ensuring we catch local closes both when we
have a direct output and when we don't.
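A sketch of that output-based check with a hypothetical
deriveLocalScripts helper (the real derivation and edge-case handling
live in the chain watcher):

    // isLocalCommitment re-derives the scripts we expect on our own
    // commitment at the given state number and checks whether any of
    // them appear on the broadcast transaction. A fuller implementation
    // also handles the case where our direct output is absent (e.g.
    // dust).
    func isLocalCommitment(commitTx *wire.MsgTx, stateNum uint64,
        deriveLocalScripts func(uint64) ([][]byte, error)) (bool, error) {

        expected, err := deriveLocalScripts(stateNum)
        if err != nil {
            return false, err
        }

        // If any output pays to a script derived for this state, the
        // broadcast commitment is our own; otherwise it must be the
        // remote party's.
        for _, txOut := range commitTx.TxOut {
            for _, script := range expected {
                if bytes.Equal(txOut.PkScript, script) {
                    return true, nil
                }
            }
        }

        return false, nil
    }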
This race was possible due to us making a subscription request before
the ChannelRouter had started. We address it by creating a dummy
subscription before proceeding to the real one to ensure we can do so
successfully. We use a dummy one in order to not consume an update from
the real one. This addresses the common "timed out waiting for opened
channel" flake within the integration test suite since the subscription
was never properly created, so we'd never be notified when new graph
updates were received.
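The dummy-subscription trick looks roughly like this, with a hypothetical
subscribe callback standing in for the harness's actual subscription
call:

    // ensureRouterReady opens a throwaway topology subscription purely
    // to confirm the ChannelRouter can serve one, then tears it down so
    // it never consumes an update meant for the real subscriber.
    func ensureRouterReady(subscribe func() (cancel func(), err error)) error {
        cancel, err := subscribe()
        if err != nil {
            return fmt.Errorf("router not ready for subscriptions: %w", err)
        }

        // Drop the dummy subscription immediately; the real one created
        // afterwards will receive all graph updates.
        cancel()
        return nil
    }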
In this commit, we speed up the `TestChainWatcherDataLossProtect`
_considerably_ by enumerating the relevant cases as table-driven tests
rather than generating random tests via the `testing/quick` package.
Each of these test cases is also run in parallel, bringing down the
execution time of this test from a few minutes to a few seconds.
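The resulting structure is the standard Go table-driven, parallel
pattern; the case fields below are illustrative, not the actual test's:

    func TestChainWatcherDataLossProtect(t *testing.T) {
        testCases := []struct {
            name            string
            broadcastHeight uint64
            numUpdates      uint64
        }{
            {name: "stale state", broadcastHeight: 5, numUpdates: 1},
            {name: "lost state", broadcastHeight: 5, numUpdates: 10},
        }

        for _, tc := range testCases {
            tc := tc // capture for the parallel subtest closure
            t.Run(tc.name, func(t *testing.T) {
                t.Parallel()

                // Drive the chain watcher with tc and assert on the
                // detected close type (elided in this sketch).
                _ = tc
            })
        }
    }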
In this commit, we begin to queue any active syncers until the initial
historical sync has completed. We do this to ensure we can properly
handle any new channel updates at tip. This is required for fresh nodes
that are syncing the channel graph for the first time. If we begin
accepting updates at tip while the initial historical sync is still
ongoing, then we risk not processing certain updates since we've yet to
learn of the channels themselves.
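A self-contained sketch of the queueing idea, using hypothetical types
and field names rather than the SyncManager's actual internals:

    // Minimal hypothetical types, used only for this sketch.
    type gossipSyncer struct{}

    func (s *gossipSyncer) startActive() {}

    type syncManager struct {
        // initialHistoricalSyncDone is closed once the backfill finishes.
        initialHistoricalSyncDone chan struct{}
        pendingActiveSyncers      []*gossipSyncer
    }

    // maybeStart parks a would-be active syncer until the initial
    // historical sync has completed, so updates at tip aren't accepted
    // before the channels they refer to are known.
    func (m *syncManager) maybeStart(s *gossipSyncer) {
        select {
        case <-m.initialHistoricalSyncDone:
            s.startActive()
        default:
            m.pendingActiveSyncers = append(m.pendingActiveSyncers, s)
        }
    }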
In this commit, we add logic to handle a peer with whom we're performing
an initial historical sync disconnecting. This is required to ensure we
get as much of the graph as possible when starting a fresh node. It will
also prove useful in ensuring we do not stall once we prevent active
GossipSyncers from starting until the initial historical sync has
completed.
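A hypothetical sketch of the reaction to such a disconnect; names and
structure are illustrative only:

    // handleHistoricalSyncPeerDisconnect restarts the initial historical
    // sync with another candidate when the peer serving it disconnects
    // before the sync completes, so nothing gated on its completion
    // stalls.
    func handleHistoricalSyncPeerDisconnect(departed, syncPeer string,
        syncDone bool, candidates []string, restartWith func(peer string)) {

        if syncDone || departed != syncPeer {
            return
        }
        for _, c := range candidates {
            if c != departed {
                restartWith(c)
                return
            }
        }
    }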
Now that the roundRobinHandler is no longer present, this commit aims to
clean up and simplify some of the logic surrounding initializing/tearing
down new/stale GossipSyncers from the SyncManager. Along the way, we
also synchronize these calls with the syncerHandler, which will prove
useful in future work that allows us to recover from initial historical
sync disconnections.
Since ActiveSync GossipSyncers no longer synchronize our state with the
remote peers, none of the logic surrounding the round-robin is required
within the SyncManager.
In this commit, we remove the ability for ActiveSync GossipSyncers to
synchronize our graph with our remote peers. This serves as a starting
point towards allowing the daemon to only synchronize our graph through
historical syncs, which will be routinely done by the SyncManager.