Previously every payment had its own local mission control state which
was in effect only for that payment. In this commit most of the local
state is removed and payments all tap into the global mission control
probability estimator.
Furthermore the decay time of pruned edges and nodes is extended, so
that observations about the network can better benefit future payment
processes.
Last, the probability function is transformed from a binary output to a
gradual curve, allowing for a better trade off between candidate routes.
This PR replaces the previously used edge and node ignore lists in path
finding by a probability based system. It modifies path finding so that
it not only compares routes on fee and time lock, but also takes route
success probability into account.
Allowing routes to be compared based on success probability is achieved
by introducing a 'virtual' cost of a payment attempt and using that to
translate probability into another cost factor.
TestRouterPaymentStateMachine tests that the router interacts as
expected with the ControlTower during a payment lifecycle, such that it
payment attempts are not sent twice to the switch, and results are
handled after a restart.
This commit makes the router use the ControlTower to drive the payment
life cycle state machine, to keep track of active payments across
restarts. This lets the router resume payments on startup, such that
their final results can be handled and stored when ready.
This encapsulates all state needed to resume a payment from any point of
the payment flow, and that must be shared between the different stages
of the execution. This is done to prepare for breaking the send loop
into smaller parts, and being able to resume the payment from any point
from persistent state.
In this commit we move handing the deobfuscator from the router to the
switch from when the payment is initiated, to when the result is
queried.
We do this because only the router can recreate the deobfuscator after a
restart, and we are preparing for being able to handle results across
restarts.
Since the deobfuscator cannot be nil anymore, we can also get rid of
that special case.
This lets us distinguish an critical error from a actual payment result
(success or failure). This is important since we know that we can only
attempt another payment when a final result from the previous payment
attempt is received.
This commit moves the responsibility of generating a unique payment ID
from the switch to the router. This will make it easier for the router
to keep track of which HTLCs were successfully forwarded onto the
network, as it can query the switch for existing HTLCs as long as the
paymentIDs are kept.
The router is expected to maintain a map from paymentHash->paymentID,
such that they can be replayed on restart. This also lets the router
check the status of a sent payment after a restart, by querying the
switch for the paymentID in question.
This commit reevaluates the router's quit channel between each block
during the initial call to syncGraphWithChain, which, in the worst case,
may have to scan several thousand blocks on startup if the node has not
been active for some time. Without this, attempting to stop the daemon
will not exit until the rescan has completed, which for certain backends
could be several hours.
In this commit, we update the process that we use to generate a sphinx
packet to send our onion routed HTLC. Due to recent changes in the
`sphinx` package we use, we now need to use a new PaymentPath struct. As
a result, it no longer makes sense to split up the nodes in a route and
their per hop paylods as they're now in the same struct. All tests have
been updated accordingly.
In this commit, we make our findPath function use an edge's MaxHTLC as
its available bandwidth instead of its Capacity. We do this as it's
possible for the capacity of an edge to not exist when operating as a
light client. For channels that do not support the MaxHTLC optional
field, we'll fall back to using the edge's Capacity.
In this commit, we refactor DeleteChannelEdge to use ChannelIDs rather
than ChannelPoints. We do this as the only use of DeleteChannelEdge is
when we are pruning zombie channels from our graph. When running under a
light client, we are unable to obtain the ChannelPoint of each edge due
to the expensive operations required to do so. As a stop-gap, we'll
resort towards using an edge's ChannelID instead, which is already
gossiped between nodes.
Since light clients no longer have access to an edge's capacity, they
are unable to validate whether the max HTLC value for an updated edge
policy respects the capacity limit. As a stop-gap, we'll skip this
check.
This serves as a stop-gap for light clients as blocks need to be
downloaded from the P2P network, and even with caches, would be too
costly for them to verify. Doing this has two side effects however:
we'll no longer know of the channel capacity and outpoint, which are
essential for some of lnd's responsibilities.
In this commit, we disable attempting to determine when a channel has
been closed out on-chain whenever AssumeChannelValid is active. Since
the flag indicates that performing this operation is expensive, we do
this as a temporary optimization until we can include proofs of channels
being closed in the gossip protocol.
With this change, the only way for channels being removed from the graph
will be once they're considered zombies: which can happen when both
edges of a channel have their disabled bits set or when both edges
haven't had an update within the past two weeks.
To ensure we don't mark an edge as live again just because an update
with a fresh timestamp was received, we'll ensure that we reject any
new updates for zombie channels if they remain disabled when running
with AssumeChannelValid.
In this commit, we add an additional heuristic when running with
AssumeChannelValid. Since AssumeChannelValid being present assumes that
we're not able to quickly determine whether channels are valid, we also
assume that any channels with the disabled bit set on both sides are
considered zombie. This should be relatively safe to do, since the
disabled bits are usually set when the channel is closed on-chain. In
the case that they aren't, we'll have to wait until both edges haven't
had a new update within two weeks to prune them.
We do this to ensure we don't prune too aggressively, as it's possible
that we've only received the channel announcement for a channel, but not
its accompanying channel updates.
This commit removes the QueryRoutes route cache. It is causing wrong
routes to be returned because not all of the request parameters are
stored.
The cache allowed high frequency QueryRoutes calls to the same
destination and with the same amount to be returned fast. This behaviour
can also be achieved by caching the request on the client side. In case
a route is invalidated because of for example a channel update,
the subsequent SendToRoute call will fail. This is a trigger to call
QueryRoutes again for a fresh route.
In this commit, we extend the graph's FetchChannelEdgesByID and
HasChannelEdge methods to also check the zombie index whenever the edge
to be looked up doesn't exist within the edge index. We do this to
signal to callers that the edge is known, but only as a zombie, and the
only information that we have about the edge are the node public keys of
the two parties involved in the edge.
In the event that an edge does exist within the zombie index, we make
an additional check on edge policies to ensure they are not within the
router's pruning window, indicating that it is a fresh update.
In this commit, we update the build to point to the latest version of
neutrino and btcwallet. The latest version of neutrino includes a number
of bug fixes, and new features like reliably transaction broadcast. The
latest version of btcwallet contains a number of bug fixes related to
properly remove invalid transactions from its database.
Currently public keys are represented either as a 33-byte array (Vertex) or as a
btcec.PublicKey struct. The latter isn't useable as index into maps and
cannot be used easily in compares. Therefore the 33-byte array
representation is used predominantly throughout the code base.
This commit converts the argument types of source and target nodes for
path finding to Vertex. Path finding executes no crypto operations and
using Vertex simplifies the code.
Additionally, it prepares for the path finding source parameter to be
exposed over rpc in a follow up commit without requiring conversion back
and forth between Vertex and btcec.PublicKey.
This commit allows the execution of QueryRoutes to be controlled using
lists of black-listed edges and nodes. Any path returned will not pass
through the edges and/or nodes on the list.
In this commit, we update the path finding logic to
ignore a channel if the HTLC value (including the fees
at the point) exceeds the max HTLC value (if set) of the
link.
Since the MaxHTLC field was recently added to the ChannelEdgePolicy struct,
and the Flags field was broken into ChannelFlags and MessageFlags, the
test edge policies should be updated accordingly.
This commit is a step to split the lnwallet package. It puts the Input
interface and implementations in a separate package along with all their
dependencies from lnwallet.
In this commit, we deprecate the `IncorrectHtlcAmount` onion error.
We'll still decode this error to use when retrying paths, but we'll no
longer send this ourselves. The `UnknownPaymentHash` error has been
amended to also include the value of the payment as well. This allows us
to worry about one less error.
In this commit, we ensure that when we update an edge
as a result of a ChannelUpdate being returned from an
onion failure, the max htlc portion of the channel update
is included in the edge update.
In this commit, we alter the ValidateChannelUpdateAnn function in
ann_validation to validate a remote ChannelUpdate's message flags
and max HTLC field. If the message flag is set but the max HTLC
field is not set or vice versa, the ChannelUpdate fails validation.
Co-authored-by: Johan T. Halseth <johanth@gmail.com>
In this commit:
* we partition lnwire.ChanUpdateFlag into two (ChanUpdateChanFlags and
ChanUpdateMsgFlags), from a uint16 to a pair of uint8's
* we rename the ChannelUpdate.Flags to ChannelFlags and add an
additional MessageFlags field, which will be used to indicate the
presence of the optional field HtlcMaximumMsat within the ChannelUpdate.
* we partition ChannelEdgePolicy.Flags into message and channel flags.
This change corresponds to the partitioning of the ChannelUpdate's Flags
field into MessageFlags and ChannelFlags.
Co-authored-by: Johan T. Halseth <johanth@gmail.com>
In this commit we introduce pruning of channel edges instead of channels.
Channel failures apply to a single direction and it is unnecessarily
restricting to prune both directions.
Hop maps were used in a test to verify the population of the hop map
itself and further only in a single function (getFailedChannelID).
Rewrote that function and removed the hop maps completely.
There is the general assumption that channel edge policy nodes are
ordered such that the node1 pubkey is smaller than the key of node 2. In
the test graph, this assumption didn't hold. This commit fixes the test
graph and also adds a check to prevent this from happening again.
This commit adds a new test that checks that the bandwidth hints are
considered correclty for local channels, and that disable flags are
ignored in this case.
To decouple our own path finding from the graph state, we don't consider
the disable bit when attempting to use local channels. Instead the
bandwidth hints will be zero for local inactive channels.
We alos modify the unit test to check that the disable flag is ignored
for local edges.
Fixes the following issues:
- If the channel update of FailFeeInsufficient contains an invalid channel
update, it is not possible to properly add to the failed channels set.
- FailAmountBelowMinimum may apply a channel update, but does not retry.
- FailIncorrectCltvExpiry immediately prunes the vertex without
trying one more time.
In this commit, the logic for all three policy related errors is
aligned.
In this commit we add a check to HtlcSatifiesPolicy to verify that the
time lock for the outgoing htlc that is requested in the onion packet
isn't too far in the future.
Without this check, anyone could force an unreasonably long time lock on
the forwarding node.
In this commit the dependency of unmarshallRoute on edge policies being
available is removed. Edge policies may be unknown and reported as nil.
SendToRoute does not need the policies, but it does need pubkeys of the
route hops. In this commit, unmarshallRoute is modified so that it
takes the pubkeys from edgeInfo instead of channelEdgePolicy.
In addition to this, the route structure is simplified. No more connection
to the database at that point. Fees are determined based on incoming and
outgoing amounts.
Previously, gossiper was the only object that validated channel
updates. Because updates can also be received as part of a
failed payment session in the routing package, validation logic
needs to be available there too. Gossiper already depends on
routing and having routing call the validation logic inside
gossiper would be a circular dependency. Therefore the validation
was moved to routing.
We make sure to return an error other than ErrIgnored, as ErrIgnored is
expected to only be returned for updates where we already have the
necessary information in the database.
In case of a channel ID found in the rejectCache, there was a
possibility that we had rejected an invalid update for this channel
earlier, and when attempting to add the current update we wouldn't
distinguish the failure to add from an outdated/ignored update.
The commit ensures that for every channel, there will always
be two entries in the edges bucket. If the policy from one or
both ends of the channel is unknown, it is marked as such.
This allows efficient lookup of incoming edges. This is
required for backwards payment path finding.
In this commit, we introduce a nice optimization with regards to lnd's
interaction with a bitcoind backend. Within lnd, we currently have three
different subsystems responsible for watching the chain: chainntnfs,
lnwallet, and routing/chainview. Each of these subsystems has an active
RPC and ZMQ connection to the underlying bitcoind node. This would incur
a toll on the underlying bitcoind node and would cause us to miss ZMQ
events, which are crucial to lnd. We remedy this issue by sharing the
same connection to a bitcoind node between the different clients within
lnd.
In this commit, we modify the test to explitlcy give the neutrino
backend more time to catch up compared to the RPC backends. We do this
as a recent change has been made in the neutrino backend to wait for the
filter headers to finish syncing before proceeding with the rescan
itself. As a result, we'll need to account for this in the test and
sleep enough to give the backend a chance to catch up.
In this commit, we ensure that the neutrino backend meets the target
interface, and also we update the API usage for the internal neutrino
rescan struct to use the new InputWithScript struct.
In this commit, we update the existing UpdateFilter method to take the
new channeldb.EdgePoint struct in place of the prior wire.OutPoint. We
must do this as the caller now typically has this type due to the
preparation to enable lnd to be able to be compatible with the new
neutrino protocol.
In this commit, we fix a slight race condition that can occur when we go
to add a shell node for a node announcement, but then right afterwards,
a new block arrives that causes us to prune an unconnected node. To
ensure this doesn't happen, we now add shell nodes within the same db
transaction as AddChannelEdge. This ensures that the state is fully
consistent and shell nodes will be added atomically along with the new
channel edge.
As a result of this change, we no longer need to add shell nodes within
the ChannelRouter, as the database will take care of this operation as
it should.
In this commit, we fix an existing bug that could at times lead to a
panic if a user manually crafts a route via SendToRoute, and that route
results in a payment error. The fix is simple: create the map even
though it won't be used in the sessions since the user is feeding the
router manual routes.
In this commit, we modify the granularity of the locking
around the filterMtx in the bitcoind chainview, such that
we only lock once per block connected or filter update.
Currently, we acquire and release the lock for every
update to the map.
We also fix a bug that would cause us to not fully remove
all previous outpoints spent by a txn when doing manual
filter, as we previously would only remove the first output
detected.
In this commit, we modify the granularity of the locking
around the filterMtx in the btcd chainview, such that we
only lock once per block connected or filter update.
Currently, we acquire and release the lock for every
update to the map.
We also fix a bug that would cause us to not fully remove
all previous outpoints spent by a txn when doing manual
filter, as we previously would only remove the first output
detected.
In this commit, we update the generateSphinxPacket to use newLogClosure
to delay the spew evaluation until log print time. Before this commit,
even if we weren't on the trace logging level, the spew call would
always be evaluated.
In this commit, a new weight function is introduced. This will create a
meaningful effect of time lock on route selection. Also, removes the
squaring of the fee term. This led to suboptimal routes.
Unit test added that covers the weight function and asserts that the
lowest fee route is indeed returned.
This comment extends the unit tests for NewRoute with checks
on the total time lock for a route as well as the expected time
lock values for every hop along the route.
This commit fixes the logic inside the newRoute function to
address the following problems:
- Fee calculation for a hop does not include the fee that needs
to be paid to the next hop.
- The incoming channel capacity "sanity" check does not include
the fee to be paid to the current hop.
In this commit, we fix the incorrect expiry values in the
spec_example.json test file. Many of the time locks were incorrect which
allowed bugs within the path finding logic related to CLTV deltas to go
un-detected.
In this commit, we fix an existing bug in the newRoute method. Before
this commit we would use the time lock delta of the current hop to
compute the outgoing time lock for the current hop. This is incorrect as
the time lock delta of the _outgoing_ hop should be used, as this is
what we're paying for "transit" on. This is a bug left over from when we
switched the meaning of the CLTV delta on the ChannelUpdate message
sometime last year.
The fix is simple: use the CLTV delta of the prior (later in the route)
hop.
- Extend SendRequest and QueryRoutesRequest protos
- newRoute function takes fee limit and cuts off routes that exceed it
- queryRoutes, payInvoice and sendPayment commands take the feeLimit inputs and pass them down to newRoute
- When no feeLimit is included, don't enforce any feeLimits at all (by setting feeLimit to maxValue)
In this commit, we modify the recent refactoring of the mission control
sub-system to overload the existing payment session, rather than create
a brand new one. This allows us to re-use more of the existing logic, and
also feedback into mission control the failures incurred by any user
selected routes.
In this commit, we introduce a new method to the channel router's config
struct: QueryBandwidth. This method allows the channel router to query
for the up-to-date available bandwidth of a particular link. In the case
that this link emanates from/to us, then we can query the switch to see
if the link is active (if not bandwidth is zero), and return the current
best estimate for the available bandwidth of the link. If the link,
isn't one of ours, then we can thread through the total maximal
capacity of the link.
In order to implement this, the missionControl struct will now query the
switch upon creation to obtain a fresh bandwidth snapshot. We take care
to do this in a distinct db transaction in order to now introduced a
circular waiting condition between the mutexes in bolt, and the channel
state machine.
The aim of this change is to reduce the number of unnecessary failures
during HTLC payment routing as we'll now skip any links that are
inactive, or just don't have enough bandwidth for the payment. Nodes
that have several hundred channels (all of which in various states of
activity and available bandwidth) should see a nice gain from this w.r.t
payment latency.
This commit alters the neutrino chainview such that it
caches the filter entries corresponding to watched
outpoints at the moment they are added to the filter.
Previously, we would rederive each filter entry when
reconstructing the relevant filter entries, which
would lead to unnecessary work on the gc. Now, each is
created at most once, and reused across subsequent
reconstructions.
Adds a new error ErrVBarrierShuttingDown that is returned
from WaitForDependants if the validation barrier's quit
chan is closed. This allows any blocked goroutines to
distinguish whether the dependent task has been completed,
or if validation should be aborted entirely.
This commit improves the shutdown of the router's
pending validation tasks, by ensuring the pending
tasks exit early if the validation barrier
receives a shutdown request.
Currently, any goroutines blocked by WaitForDependants
will continue execution after a shutdown is signaled.
This may lead to unnexpected behavior as the relation
between updates is no longer upheld. It also has the
side effect of slowing down shutdown, since we
continue to process the remaining updates.
To remedy this, WaitForDependants now returns an error
that signals if a shutdown was requested. The blocked
goroutines can exit early upon seeing this error,
without also signaling completion of their task to
the dependent tasks, which should will now properly
wait to read the validation barrier's quit signal.
In this commit, we update the TestSendPaymentErrorPathPruning test to
reflect the new behavior w.r.t how we respond to UnknownPeer errors. In
this new test, we expect that we'll find alternative route in light of
us getting an UnknownPeer error "pointing" to our destination node.
In this commit we fix an lingering bug in the Mission Control logic we
execute in response to the FailUnknownNextPeer error. Historically, we
would treat this as the _next_ node not being online. As a result, we
would then prune away the vertex from the current reachable graph all
together. It was recently realized, that this would at times be a bit
_tooo_ aggressive if the channel we attempt to route over was faulty,
down, or the incoming node had connectivity issues with the outgoing
node.
In light of this realization, we'll now instead only prune the _edge_
that we attempted to route over. This ensures that we'll continue to
explore the possible edges. Additionally, this guards us against failure
modes where nodes report FailUnknownNextPeer to other nodes in an
attempt to more closely control our retry logic.
This change is a stop gap on the path to a more intelligent set of
autopilot heuristics.
Fixes#1114.