In this commit, we modify the way we detect local force closes. Before
this commit, we would directly check the broadcast commitment's txid
against what we know to be our best local commitment. In the case of DLP
recovery from an SCB, it's possible that the user force closed, _then_
attempted to recover their channels. As a result, we need to check the
outputs directly in order to also handle this rare, but
possible recovery scenario.
The new detection method uses the outputs to detect if it's a local
commitment or not. Based on the state number, we'll re-derive the
expected scripts, and check to see if they're on the commitment. If not,
then we know it's a remote force close. A new test has been added to
exercise this new behavior, ensuring we catch local closes where we have
and don't have a direct output.
This race was possible due to us making a subscription request before
the ChannelRouter has started. We address it by creating a dummy
subscription before proceeding to the real one to ensure we can do so
successfully. We use a dummy one in order to not consume an update from
the real one. This addresses the common "timed out waiting for opened
channel" flake within the integration test suite since the subscription
was never properly created, so we'd never be notified of when new graph
updates were received.
In this commit, we speed up the `TestChainWatcherDataLossProtect`
_considerably_ by enumerating relevant tests using table driven tests
rather than generating random tests via the `testing/quick` package.
Each of these test cases are also run in parallel bringing down the
execution time of this test from a few minutes, to a few seconds.
In this commit, we begin to queue any active syncers until the initial
historical sync has completed. We do this to ensure we can properly
handle any new channel updates at tip. This is required for fresh nodes
that are syncing the channel graph for the first time. If we begin
accepting updates at tip while the initial historical sync is still
ongoing, then we risk not processing certain updates since we've yet to
learn of the channels themselves.
In this commit, we add logic to handle a peer with whom we're performing
an initial historical sync disconnecting. This is required to ensure we
get as much of the graph as possible when starting a fresh node. It will
also serve useful to ensure we do not get stalled once we prevent active
GossipSyncers from starting until the initial historical sync has
completed.
Now that the roundRobinHandler is no longer present, this commit aims to
clean up and simplify some of the logic surrounding initializing/tearing
down new/stale GossipSyncers from the SyncManager. Along the way, we
also synchronize these calls with the syncerHandler, which will serve
useful in future work that allows us to recovery from initial historical
sync disconnections.
Since ActiveSync GossipSyncers no longer synchronize our state with the
remote peers, none of the logic surrounding the round-robin is required
within the SyncManager.
In this commit, we remove the ability for ActiveSync GossipSyncers to
synchronize our graph with our remote peers. This serves as a starting
point towards allowing the daemon to only synchronize our graph through
historical syncs, which will be routinely done by the SyncManager.
Now that the write pool no longer executes blocking i/o operations, it
is safe to reduce this number considerably. The write pool now only
handles encoding and encryption of messages, making problem
computationally bound and thus dependent on available CPUs. The
descriptions of the workers configs is also updated to explain how users
should set these on their on machines.
This commit removes the write backoff, since subsequent retries no
longer need to access the write pool. Subsequent flushes will resume
writing any partial writes that occurred before the timeout until the
message is received or the idle write timer is triggered.
In the future, we can add callbacks that execute on write events
timeouts. This can be useful for mobile clients that might roam, as the
timeout could indicate the connection is dead even if the OS has not
reported it closed. The callback can be used then, for example, to
initiate another outbound connection to test whether or not the issue is
related to the connection.
This commit modifies the way the link writes messages to the wire, by
first buffering ciphertexts to the connection using WriteMessage, and
then calling Flush separately. Currently, the call to Write tries to do
both, which can result in a blocking operation for up to the duration of
the write timeout. Splitting these operations permits less blocking in
the write pool, since now we only need to use a write worker to
serialize and encrypt the plaintext.
After the write pool is released, the peer then attempts to flush the
message using the appropriate write timeout. If a timeout error occurs,
the peer will continue to flush the message w/o serializing or
encrypting the message again, until the message is fully written to the
wire or the write idle timer disconnects the peer.
As a preliminary step to integrating the separated WriteMessage and
Flush calls in the peer, we'll modify the peer to only set a timestamp
on Ping messages once. This makes sense for two reasons, 1) if the
message has already been partially written, we have already committed to
a ping time, and 2) a ciphertext containing the first ping time will
already be buffered in the connection, and we will only be attempting to
Flush on timeout errors.