The recently added gossip throttling was shown to be too aggressive,
especially with our auto channel enable/disable signaling. We switch to
a token bucket based system instead as it's based on time, rather than a
block height which isn't constantly updated at a given rate.
We do this instead of using the source of the AnnounceSignatures
message, as we filter out the source when broadcasting any
announcements, leading to the remote node not receiving our channel
update. Note that this is done more for the sake of correctness and to
address a flake within the integration tests, as channel updates are
sent directly and reliably to channel counterparts.
AFAICT it's not possible to flip back from bein synced_to_chain, so we
remove the underlying call that could reflect this. The method is moved
into the test file since it's still used to test correctness of other
portions of the flow.
Rather than performing this call in the SyncManager, we give each
gossipSyncer the ability to mark the first sync completed. This permits
pinned syncers to contribute towards the rpc-level synced_to_graph
value, allowing the value to be true after the first pinned syncer or
regular syncer complets. Unlinke regular syncers, pinned syncers can
proceed in parallel possibly decreasing the waiting time if consumers
rely on this field before proceeding to load their application.
A pinned syncer is an ActiveSyncer that is configured to always remain
active for the lifetime of the connection. Pinned syncers do not count
towards the total NumActiveSyncer count, which are rotated periodically.
This features allows nodes to more tightly synchronize their routing
tables by ensuring they are always receiving gossip from distinguished
subset of peers.
We do this instead of using the source of the AnnounceSignatures
message, as we filter out the source when broadcasting any
announcements, leading to the remote node not receiving our channel
update. Note that this is done more for the sake of correctness and to
address a flake within the integration tests, as channel updates are
sent directly and reliably to channel counterparts.
As similarly done with premature channel announcements, we'll no longer
allow premature channel updates to be rebroadcast once mature. This is
no longer necessary as channel announcements that we're not aware of are
usually broadcast to us with their accompanying channel updates.
In this commit, we add a new option to toggle gossip rate limiting. This
new option can be useful in contexts that require near instant
propagation of gossip messages like integration tests.
This change was largely motivated by an increase in high disk usage as a
result of channel update spam. With an in memory graph, this would've
gone mostly undetected except for the increased bandwidth usage, which
this doesn't aim to solve yet. To minimize the effects to disks, we
begin to rate limit channel updates in two ways. Keep alive updates,
those which only increase their timestamps to signal liveliness, are now
limited to one per lnd's rebroadcast interval (current default of 24H).
Non keep alive updates are now limited to one per block per direction.
This allows for a 1000 different validation operations to proceed
concurrently. Now that we are batching operations at the db level, the
average number of outstanding requests will be higher since the commit
latency has increased. To compensate, we allow for more outstanding
requests to keep the gossiper busy while batches are constructed.
Similarly as with kvdb.View this commits adds a reset closure to the
kvdb.Update call in order to be able to reset external state if the
underlying db backend needs to retry the transaction.
This commit adds a reset() closure to the kvdb.View function which will
be called before each retry (including the first) of the view
transaction. The reset() closure can be used to reset external state
(eg slices or maps) where the view closure puts intermediate results.
This commit moves all localized instances of mock implementations of
the Signer interface to the lntest/mock package. This allows us to
remove a lot of code and have it housed under a single interface in
many cases.
Modifies syncer.replyChanRangeQuery method to use the LastBlockHeight
method on the query. LastBlockHeight safely calculates the ending
block height and prevents an overflow of start_block + num_blocks.
Prior to this change, query messages that had a start_block +
num_blocks that overflows uint32_max would return zero results in the
reply message.
Tests are added to fix the bug and ensure proper start and end values
are supplied to the channel graph filter.
This reworks the locking behavior of the Gossiper so that a race
condition on channel updates and block notifications doesn't cause any
loss of messages.
This fixes an issue that manifested mostly as flakes on itests during
WaitForNetworkChannelOpen calls.
The previous behavior allowed ChannelUpdates to be missed if they
happened concurrently to block notifications. The
processNetworkAnnoucement call would check for the current block height,
then lock the gossiper and add the msg to the prematureAnnoucements
list. New blocks would trigger an update to the current block height
then a lock and check of the aforementioned list.
However, specially during itests it could happen that the missing lock
before checking the height could case a race condition if the following
sequence of events happened:
- A new ChannelUpdate message was received and started processing on a
separate goroutine
- The isPremature() call was made and verified that the ChannelUpdate
was in fact premature
- The goroutine was scheduled out
- A new block started processing in the gossiper. It updated the block
height, asked and was granted the lock for the gossiper and verified
there was zero premature announcements. The lock was released.
- The goroutine processing the ChannelUpdate asked for the gossiper lock
and was granted it. It added the ChannelUpdate in the
prematureAnnoucements list. This can never be processed now.
The way to fix this behavior is to ensure that both isPremature checks
done inside processNetworkAnnoucement and best block updates are made
inside the same critical section (i.e. while holding the same lock) so
that they can't both check and update the prematureAnnoucements list
concurrently.
The linter complains about not checking the return value from
WipeChannel in certain places. Instead of checking we simply remove the
returned error because the in-memory modifications cannot fail.
If the provided ChainHash in a QueryChannelRange message does not match
that of our current chain, then we should send a blank response, rather
than reply with channels for the wrong chain.
We move from our legacy way of interpreting ReplyChannelRange messages
which was incorrect. Previously, we'd rely on the Complete field of the
ReplyChannelRange message to determine when our peer had sent all of
their replies. Now, we properly adhere to the specification by
interpreting the block ranges of these messages as intended.
Due to the large number of nodes deployed with the previous method, we
still maintain and detect when we are communicating with them, such that
we are still able to sync with them for backwards compatibility.
It's not possible to send another reply once all replies have been sent
without another request. The purpose of the check is also done within
another test, TestGossipSyncerReplyChanRangeQueryNoNewChans, so it can
be removed from here.
In order to properly adhere to the spec, when handling a
QueryChannelRange message, we must reply with a series of
ReplyChannelRange messages, that when consumed together cover the
entirety of the block range requested.