This commit brings us in line with recent modifications to the spec,
which say we shouldn't pay nodes whose feature vectors signal unknown
required features, and also that we shouldn't route through nodes
signaling unknown required features.
Currently we assert that invoices don't have such features during
decoding, but now that users can specify feature vectors via the rpc
interface, it makes sense to perform this check deeper in the call
stack.
This will also allow us to remove the check from decoding entirely,
making decodepayreq more useful for debugging.
In this commit, we update the routing package to use the new
`sphinx.NewOnionPacket` method. The new version of this method allows us
to specify _how_ the packet should be filled before it's used to create
a mix-header. This isn't a fundamental change (it's fully backwards
compatible); instead, it plugs a privacy leak that may have revealed to
the destination how long the true route was.
This commit adds success mission control results for all hops along the
route in an mpp timeout and takes no action for the final hop along the
route. This is a temporary measure to prevent the default logic from
penalizing the final node while we decide how to penalize mpp timeouts.
This commit adds a getResolutionFailure function
which returns an appropriate wire failure based
on the outcome of an htlc resolution. It also updates
the MissionControlStore test to ensure that lnd
can handle failures which occur due to mpp timeout.
Also the max hop count check can be removed, because the real bound is
the payload size. By moving the check inside the search loop, we now
also backtrack when we hit the limit.
This commit fixes a potential bug in our test harness, by ensuring that
the constructed node policies are configured _after_ sorting. Currently
the node pubkeys are sorted, but additional parameters (max htlc,
disabled, etc) are applied using the unsorted policies.
Most of the constructors used today use the symmetric channel
constructor, so this shouldn't cause an issue with the majority of our
tests. We recently introduced an asymmetric channel constructor for
which this could have been an issue, however, no known issues were
discovered.
Lastly, we remove the direction from the configuration altogether, and
derive it purely from the final sorting of the pubkeys.
We move up the check for TLV support, since we will later use it to
determine if we can use dependent features, e.g. TLV records and payment
addresses.
This commit creates a wrapper struct, grouping all parameters that
influence the final hop during route construction. This is a preliminary
step for passing in the receiver's invoice feature bits, which will be
used to select an appropriate payment or payload type.
In this commit, we overwrite the final hop's features with either the
destination features or those loaded from the graph fallback. This
ensures that the same features used in pathfinding will be provided to
route construction.
An earlier commit validated the final hop's transitive feature
dependencies; here we add the same validation for non-final nodes.
This commit adds an optional PaymentAddr field to the RestrictParams,
so that we can verify that the final hop can support it before
performing an expensive round of pathfinding.
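A minimal sketch of that up-front check, with the destination's feature
set modelled as a plain bit map rather than lnd's actual feature-vector
type (the function and parameter names are illustrative):

    // hasPaymentAddrSupport reports whether the destination advertises
    // the payment_secret feature (bits 14/15 in BOLT 9), which is what
    // carrying a payment address requires.
    func hasPaymentAddrSupport(destFeatures map[uint16]bool) bool {
        return destFeatures[14] || destFeatures[15]
    }

If RestrictParams carries a payment address but this check fails, the
expensive pathfinding round can be skipped entirely.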
In this commit, we fix a bug that prevents us from sending custom
records to nodes that aren't in the graph. Previously we would simply
fail if we were unable to retrieve the node's features.
To remedy this, we add the option of supplying the destination's feature bits
into path finding. If present, we will use them directly without
consulting the graph, resolving the original issue. Instead, we will
only consult the graph as a fallback, which will still fail if the node
doesn't exist since the TLV features won't be populated in the empty
feature vector.
Furthermore, this also permits us to provide "virtual features" into the
pathfinding logic, where we make assumptions about what the receiver
supports even if the feature vector isn't actually taken from an
invoice. This can be useful in cases like keysend, where we don't have an
invoice, but we can still attempt the payment if we assume the receiver
supports TLV.
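A hedged sketch of the fallback order described above (the names and
the map-based feature representation are illustrative, not lnd's actual
types):

    // destinationFeatures returns the caller-supplied feature bits when
    // present (taken from an invoice, or assumed as for keysend), and
    // only falls back to the graph otherwise.
    func destinationFeatures(supplied map[uint16]bool,
        fromGraph func() (map[uint16]bool, error)) (map[uint16]bool, error) {

        if supplied != nil {
            return supplied, nil
        }
        return fromGraph()
    }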
This commit allows custom node features to be populated in specific test
instances. For consistency, we auto-populate an empty feature vector for
nodes that have nil feature vectors before writing them to the database.
Previously if a payment was sent with custom records attached, path
finding wouldn't check whether the final node was capable of receiving
custom records in a tlv payload.
This commit prepares for more manipulation of custom records. A list of
tlv.Record types is more difficult to use than the more basic
map[uint64][]byte.
Furthermore fields and variables are renamed to make them more
consistent.
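As a rough illustration of why the map form is easier to work with (the
type name here is hypothetical): a record can be added, overridden, or
checked with a single map operation instead of a linear scan over a
[]tlv.Record slice.

    // customRecords mirrors the map[uint64][]byte representation
    // described above, keyed by TLV type.
    type customRecords map[uint64][]byte

    // setIfAbsent adds a record only if the type isn't already present.
    func (c customRecords) setIfAbsent(typ uint64, value []byte) {
        if _, ok := c[typ]; !ok {
            c[typ] = value
        }
    }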
A unified policy treats local channels differently from other channels
on the network: there is more information available for local channels,
and the unified policy makes use of it.
Previously we used the pathfinding source pubkey to determine whether to
apply the local channel logic or not. If queryroutes is executed with a
source node that isn't the self node, this wouldn't work.
When the (virtual) payment attempt cost is set to zero, probabilities
are no longer a factor in determining the best route. In case of routes
with equal costs, we'd just go with the first one found. This commit
refines this behavior by picking the route with the highest probability.
So even though probability doesn't affect the route cost, it is still
used as a tie breaker.
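A simplified sketch of the comparison this describes (names and types
are illustrative): cost remains the primary criterion, and probability
only decides between routes of equal cost.

    // isBetterRoute reports whether the candidate should replace the
    // current best route. Probability is consulted only as a tie breaker.
    func isBetterRoute(candCost, bestCost int64,
        candProb, bestProb float64) bool {

        if candCost != bestCost {
            return candCost < bestCost
        }
        return candProb > bestProb
    }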
This prepares for routing to self. If the condition were checked at
the start, the loop would terminate immediately because the source is
equal to the target.
This commit modifies the FetchPayment method to return MPPayment structs
converted from the legacy on-disk format. This allows us to attach the
HTLCs to the events given to clients subscribing to the outcome of an
HTLC.
This change also bubbles up to the routerrpc/router_server: we populate
HTLCAttempts in the response and extract the legacy route field from
the HTLCAttempts.
Previously we used the a priori probability also for our own untried
channels. This led to local channels that had seen a success already
being prioritized over untried local channels. In some cases, depending
on the configured payment attempt cost, this could lead to the payment
taking a two hop route while a direct payment was also possible.
An InvalidOnionPayload implies that the onion was successfully received
by the reporting node, but that they were unable to extract the
contents. Since we assume our own behavior is correct, this most
likely points to an error in the reporter's implementation or that we
sent an unknown required type. Therefore we only penalize that single
hop, and consider the failure terminal if the receiver reported it.
Probabilities are no longer returned for querymc calls. To still provide
some insight into the mission control internals, this commit adds a new
rpc that calculates a success probability estimate for a specific node
pair and amount.
This prepares for decoupling the result interpretation of a single
payment attempt from the information stored in mission control memory
on the history of a node pair. A planned follow-up where we store both
the last success and last failure requires this decoupling.
In this commit we change path finding to no longer consider all channels
between a pair of nodes individually. We assume that nodes forward
non-strict and when we attempt a connection between two nodes, we don't
want to try multiple channels because their policies may not be identical.
Having distinct policies for channels to the same peer is against the
recommendation in the spec, but it happens in the wild, especially
since we recently changed the default cltv delta value.
What this commit introduces is a unified policy. This can be looked upon
as the greatest common denominator of all policies and should maximize
the probability of getting the payment forwarded.
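As a rough sketch of the "greatest common denominator" idea (a
deliberately simplified policy struct, not lnd's actual types): for
each parameter we keep the value that would be acceptable on any of the
parallel channels, so the resulting unified policy is forwardable no
matter which channel the peer ends up using.

    // policy is a stripped-down channel policy used for illustration.
    type policy struct {
        feeRatePPM    uint64 // proportional fee in parts per million
        timeLockDelta uint16 // required CLTV delta
    }

    // unifyPolicies keeps, per parameter, the most demanding value
    // across all parallel channels to the same peer.
    func unifyPolicies(policies []policy) policy {
        var u policy
        for _, p := range policies {
            if p.feeRatePPM > u.feeRatePPM {
                u.feeRatePPM = p.feeRatePPM
            }
            if p.timeLockDelta > u.timeLockDelta {
                u.timeLockDelta = p.timeLockDelta
            }
        }
        return u
    }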
The distance map now holds the edge the current path came from,
removing the need for the next map.
Both the distance map and distanceHeap now hold pointers instead of the
full struct to reduce allocations and copies.
Together these changes reduced path finding time by ~5% and memory
usage by ~2 MB.
Pre-sizing these structures avoids a lot of map resizing, which causes
copies and rehashing of entries. We mostly know that the map won't
exceed that size, and it doesn't affect memory usage in any significant
way.
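In Go this is just a matter of passing a size hint to make; a small
fragment of the idea (the variable names are illustrative):

    // Pre-size the distance map with an upper bound on the number of
    // graph nodes, so the runtime doesn't have to grow and rehash it
    // repeatedly while the search runs.
    distance := make(map[[33]byte]float64, nodeCount)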
Calling `ForEachNode` hits the DB, and allocates and parses every node
in the graph. Walking the channels also loads nodes from the DB, so this
meant that each node was read/parsed/allocated several times per run.
This reduces runtime by ~10 ms and memory usage by ~4 MB.
This commit changes mission control to partially base the estimated
probability for untried connections on historical results obtained in
previous payment attempts. This incentivizes routing nodes to keep all
of their channels in good shape.
Probability estimates are amount dependent. Previously we assumed an
amount, but that starts to make less sense when we make probability more
dependent on amounts in the future.
This commit modifies the interpretation of node-level failures.
Previously only the failing node was marked. With this commit, also the
incoming and outgoing connections involved in the route are marked as
failed.
The change prepares for the removal of node-level failures in mission
control probability estimation.
This commit changes the in-memory structure of the mission control
state. It prepares for calculation of a node probability. For this we
need to be able to efficiently look up the last results for all channels
of a node.
With the introduction of the max CLTV limit parameter, nodes are able
to reject HTLCs that exceed it. This should also be applied to path
finding; otherwise, HTLCs crafted by the same node that exceed it would
never leave the switch. This wasn't a big deal when the previous max
CLTV limit was ~5000 blocks, but once it was lowered to 1008, the issue
became more apparent. Therefore, all of our path finding attempts are
now restricted to said limit in order to properly carry out HTLCs to
the network.
In the process of moving to use the new package, we no longer need to
fetch the outpoint directly, and instead only need to pass the funding
transaction into the new verification logic.
In this commit, we update the router and link to support users
updating the max HTLC policy for their channels. By updating these internal
systems before updating the RPC server and lncli, we protect users from
being shown an option that doesn't actually work.
The policy update logic that resided partly in the gossiper and partly
in the rpc server is extracted into its own object.
This prepares for additional validation logic to be added for policy
updates that would otherwise make the gossiper heavier.
It is also a small first step towards separation of our own channel data
from the rest of the graph.
Extends the invalid payment details failure with the new accept height
field. This allows the sender to distinguish between a genuine invalid
details situation and a delay caused by intermediate nodes.
Currently the underlying array backing the hop's TLVRecords is modified
when combining custom records with the primitive forwarding info. This
commit uses a fresh slice to prevent modifications from mutating the
hop itself.
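The underlying fix is the usual Go aliasing pattern: copy into a
freshly allocated slice before appending, so an append can never grow
into the hop's backing array. A generic sketch:

    // appendFresh returns a new slice containing src followed by extra,
    // without ever writing into src's backing array.
    func appendFresh(src, extra [][]byte) [][]byte {
        out := make([][]byte, len(src), len(src)+len(extra))
        copy(out, src)
        return append(out, extra...)
    }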
This commit modifies paymentLifecycle so that it not only feeds
failures into mission control, but successes as well.
This allows for more accurate probability estimates. Previously,
the success probability for a successful pair and a pair with
no history was equal. There was no force that pushed towards
previously successful routes.
In this commit, we extend the path finding to be able to recognize when
a node needs the new TLV format, or the legacy format based on the
feature bits they expose. We also extend the `LightningPayment` struct
to allow the caller to specify an arbitrary set of TLV records which can
be used for a number of use-cases including various variants of
spontaneous payments.
In this commit, we extend the Hop struct to carry an arbitrary set of
TLV values, and add a new field that allows us to distinguish between
the modern TLV payload and the legacy payload.
We add a new `PackPayload` method that will be used to encode the
combined required routing TLV fields along with any set of TLV fields
that were specified as part of path finding.
Finally, `ToSphinxPath` has been extended to be able to recognize
whether a hop needs the modern or legacy payload.
This commit overhauls the interpretation of failed payments. It changes
the interpretation rules so that we always apply the strongest possible
set of penalties, without making assumptions that would hurt good nodes.
Main changes are:
- Apply different rule sets for intermediate and final nodes. Both types
of nodes have different sets of failures that we expect. Penalize nodes
that send unexpected failure messages.
- Distinguish between direct payments and multi-hop payments. For direct
payments, we can infer more about the performance of our peer because we
trust ourselves.
- In many cases it is impossible for the sender to determine which of
the two nodes in a pair is responsible for the failure. In this
situation, we now penalize bidirectionally. This does not hurt the good
node of the pair, because only its connection to a bad node is
penalized.
- Previously we always penalized the outgoing connection of the
reporting node. This is incorrect for policy related failures. For
policy related failures, it could also be that the reporting node
received a wrongly crafted htlc from its predecessor. By penalizing the
incoming channel, we surely hit the responsible node.
- FailExpiryTooSoon is a failure that could have been caused by any node
up to the reporting node by delaying forwarding of the htlc. We don't
know which node is responsible, therefore we now penalize all node pairs
in the route.
When an undecryptable failure comes back for a payment attempt, we
previously only penalized our own outgoing connection. However,
any node could have caused this failure. It is therefore better to
penalize all node connections along the route. Then at least we know for
sure that we will hit the responsible node.
This commit updates existing tests to not rely on mission control for
pruning of local channels. Information about local channels should
already be up to date before path finding starts. If not, the problem
should be fixed where bandwidth hints are set up.
This commit moves the payment outcome interpretation logic into a
separate file. Also, mission control isn't updated directly anymore, but
results are stored in an interpretedResult struct. This allows the
mission control state to be locked for a minimum amount of time and
makes it easier to unit test the result interpretation.
This commit converts several functions from returning a bool and a
failure reason to a nillable failure reason as return parameter. This
will take away confusion about the interpretation of the two separate
values.
Previously mission control tracked failures on a per node, per channel basis.
This commit changes this to tracking on the level of directed node pairs. The goal
of moving to this coarser-grained level is to reduce the number of required
payment attempts without compromising payment reliability.
Align naming better with the lightning spec. The full name of the
failure (FailIncorrectOrUnknownPaymentDetails) is not used, because it
would cause too many long lines in the code.
This commit adds the BlockPadding value (currently 3) to sendpayment
calls so that if some blocks are mined while the htlc is in-flight, the
exit hop won't reject it.
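Conceptually the padding is just added on top of the final CLTV delta
when the expiry is computed; a small sketch using the value mentioned
above (the function name is illustrative):

    // BlockPadding is the number of extra blocks added to the final
    // expiry.
    const BlockPadding uint32 = 3

    // finalCltvExpiry pads the expiry so blocks mined while the htlc is
    // in flight don't push it below what the exit hop requires.
    func finalCltvExpiry(currentHeight, finalCltvDelta uint32) uint32 {
        return currentHeight + finalCltvDelta + BlockPadding
    }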
The current approach iterates all channels in the graph in order to
filter those in need of pruning. This approach is time consuming
(several seconds for ~40,000 channels on my mobile device), and during
this time the db is locked in a transaction.
The proposed change is to use existing functionality that utilizes the
fact that channel updates are saved indexed by date. This method
enables us to go over only a small subset of the channels, only those
that were updated before the "channel expiry" time, and further filter
them for our needs.
The same graph that took several seconds to prune was pruned, after
the change, in several milliseconds.
In addition, for testing purposes, I added an Initiator field to the
testChannel structure to reflect the ChannelEdgePolicy direction.
If nodes return a channel policy related failure, they may get a second
chance. Our graph may not be up to date. Previously this logic was
contained in the payment session.
This commit moves that into global mission control and thereby removes
the last mission control state that was kept on the payment level.
Because mission control is not aware of the relation between payment
attempts and payments, the second chance logic is no longer based on
tracking second chances given per payment.
Instead a time based approach is used. If a node reports a policy
failure that prevents forwarding to its peer, it will get a second
chance. But it will get it only if the previous second chance was
long enough ago.
Also those second chances are no longer dependent on whether an
associated channel update is valid. It will get the second chance
regardless, to prevent creating a dependency between mission control and
the graph. This would interfere with (future) replay of history, because
the graph may not be the same anymore at that point.
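A minimal sketch of that time-based rule, assuming a tunable minimum
interval between second chances (the names are illustrative):

    import "time"

    // allowSecondChance grants a policy-failure second chance only if
    // the previous one for this node was long enough ago (or never
    // happened).
    func allowSecondChance(last *time.Time, now time.Time,
        minInterval time.Duration) bool {

        if last == nil {
            return true
        }
        return now.Sub(*last) >= minInterval
    }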
This commit adds the pubkeyIndices map to the distanceHeap to avoid
duplicate entries on the heap. This happened in the earlier iteration
of the findPath algorithm and would cause the driving loop to
evaluate already evaluated entries when there was no need.
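A sketch of the idea (simplified types, not lnd's exact
implementation): the index map lets us update a queued node in place
with heap.Fix instead of pushing a duplicate entry.

    import "container/heap"

    type nodeWithDist struct {
        pubkey [33]byte
        dist   int64
    }

    // distanceHeap is a min-heap keyed on dist. pubkeyIndices tracks
    // each queued node's position so it can be updated in place.
    type distanceHeap struct {
        nodes         []nodeWithDist
        pubkeyIndices map[[33]byte]int
    }

    func (h *distanceHeap) Len() int           { return len(h.nodes) }
    func (h *distanceHeap) Less(i, j int) bool { return h.nodes[i].dist < h.nodes[j].dist }

    func (h *distanceHeap) Swap(i, j int) {
        h.nodes[i], h.nodes[j] = h.nodes[j], h.nodes[i]
        h.pubkeyIndices[h.nodes[i].pubkey] = i
        h.pubkeyIndices[h.nodes[j].pubkey] = j
    }

    func (h *distanceHeap) Push(x interface{}) {
        n := x.(nodeWithDist)
        h.pubkeyIndices[n.pubkey] = len(h.nodes)
        h.nodes = append(h.nodes, n)
    }

    func (h *distanceHeap) Pop() interface{} {
        n := h.nodes[len(h.nodes)-1]
        h.nodes = h.nodes[:len(h.nodes)-1]
        delete(h.pubkeyIndices, n.pubkey)
        return n
    }

    // pushOrFix avoids duplicate heap entries: an already-queued node is
    // updated and re-heapified in place, a new node is pushed.
    func (h *distanceHeap) pushOrFix(n nodeWithDist) {
        if i, ok := h.pubkeyIndices[n.pubkey]; ok {
            h.nodes[i] = n
            heap.Fix(h, i)
            return
        }
        heap.Push(h, n)
    }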
This commit modifies the nodeWithDist struct to use a route.Vertex
instead of a *channeldb.LightningNode. This change, coupled with
the new ForEachNodeChannel function, allows the findPath Dijkstra's
algorithm to cut down on database lookups since we no longer need
to call the FetchOtherNode function.
This commit moves the call to PruneGraph outside of the loop
that collates all of the spentOutputs. With this change, if
a node has been offline for a long period of time, resyncing
with the chain no longer takes up as much memory (1MB vs 200MB
in some cases) or time. Previously, PruneGraph was called
for every block and allocated a very large map further down
in the pruneGraphNodes function. Now, pruneGraphNodes is only
called once.
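In rough terms the restructuring looks like the following sketch, where
the block, outpoint and prune helpers are stand-ins for the real
channeldb/btcd types and calls, not their actual signatures:

    // block and outpoint are stand-ins for the real btcd types.
    type block struct{}
    type outpoint struct{}

    // pruneRange collects every spent funding outpoint across the whole
    // block range first and prunes the graph once, instead of calling
    // the prune routine for every block.
    func pruneRange(blocks []block,
        spentIn func(block) []outpoint,
        pruneGraph func([]outpoint) error) error {

        var spent []outpoint
        for _, b := range blocks {
            spent = append(spent, spentIn(b)...)
        }
        return pruneGraph(spent)
    }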
Since nilling the pubkey curve will lead to a nil-pointer exception if
the key is later used for signature verification, we make sure to make a
copy before nilling and spewing.
This commit moves the default timeout out of router and thereby fixes a
bug that caused SendToRoute to not return the actual error, but a
timeout result instead. SendToRoute only tries a single route, so a
timeout should never happen.
This commit exposes the three main parameters that influence mission
control and path finding to the user as command line or config file
flags. It allows for fine-tuning for optimal results.
This commit adds an assertion to the SendToRoute test that the payment
value stored to the DB during SendToRoute execution is the correct one.
This assertion would fail before the previous commit that fixed a
missing value initialization.
Previously we would mistakenly use the payment value from the dummy
LightningPayment struct, which would obviously be 0 always. Now we
instead calculate the value from the given route.
Previously every payment had its own local mission control state which
was in effect only for that payment. In this commit most of the local
state is removed and payments all tap into the global mission control
probability estimator.
Furthermore the decay time of pruned edges and nodes is extended, so
that observations about the network can better benefit future payment
processes.
Lastly, the probability function is transformed from a binary output
to a gradual curve, allowing for a better trade-off between candidate
routes.
This PR replaces the previously used edge and node ignore lists in path
finding by a probability based system. It modifies path finding so that
it not only compares routes on fee and time lock, but also takes route
success probability into account.
Allowing routes to be compared based on success probability is achieved
by introducing a 'virtual' cost of a payment attempt and using that to
translate probability into another cost factor.
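One way to read "translating probability into another cost factor" is
to amortize the virtual attempt cost over the route's success
probability; a simplified sketch of the idea, not lnd's exact formula:

    import "math"

    // routeCost folds success probability into a comparable cost: a
    // route that succeeds only half the time effectively costs one
    // extra attempt.
    func routeCost(feeMsat, attemptCostMsat int64,
        successProb float64) float64 {

        if successProb <= 0 {
            return math.Inf(1)
        }
        return float64(feeMsat) + float64(attemptCostMsat)/successProb
    }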
TestRouterPaymentStateMachine tests that the router interacts as
expected with the ControlTower during a payment lifecycle, such that
payment attempts are not sent twice to the switch, and results are
handled after a restart.
This commit makes the router use the ControlTower to drive the payment
life cycle state machine, to keep track of active payments across
restarts. This lets the router resume payments on startup, such that
their final results can be handled and stored when ready.
This encapsulates all state needed to resume a payment from any point of
the payment flow, and that must be shared between the different stages
of the execution. This is done to prepare for breaking the send loop
into smaller parts, and being able to resume the payment from any point
from persistent state.
In this commit, we move the handing of the deobfuscator from the
router to the switch from the moment the payment is initiated to the
moment the result is queried.
We do this because only the router can recreate the deobfuscator after a
restart, and we are preparing for being able to handle results across
restarts.
Since the deobfuscator cannot be nil anymore, we can also get rid of
that special case.
This lets us distinguish a critical error from an actual payment result
(success or failure). This is important since we know that we can only
attempt another payment when a final result from the previous payment
attempt is received.
This commit moves the responsibility of generating a unique payment ID
from the switch to the router. This will make it easier for the router
to keep track of which HTLCs were successfully forwarded onto the
network, as it can query the switch for existing HTLCs as long as the
paymentIDs are kept.
The router is expected to maintain a map from paymentHash->paymentID,
such that they can be replayed on restart. This also lets the router
check the status of a sent payment after a restart, by querying the
switch for the paymentID in question.
This commit reevaluates the router's quit channel between each block
during the initial call to syncGraphWithChain, which, in the worst case,
may have to scan several thousand blocks on startup if the node has not
been active for some time. Without this, attempting to stop the daemon
will not exit until the rescan has completed, which for certain backends
could be several hours.
In this commit, we update the process that we use to generate a sphinx
packet to send our onion routed HTLC. Due to recent changes in the
`sphinx` package we use, we now need to use a new PaymentPath struct. As
a result, it no longer makes sense to split up the nodes in a route and
their per-hop payloads, as they're now in the same struct. All tests have
been updated accordingly.
In this commit, we make our findPath function use an edge's MaxHTLC as
its available bandwidth instead of its Capacity. We do this as it's
possible for the capacity of an edge to not exist when operating as a
light client. For channels that do not support the MaxHTLC optional
field, we'll fall back to using the edge's Capacity.
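The selection itself is a simple fallback; a hedged sketch (parameter
names are illustrative, while the real code works with lnd's edge
policy types):

    // edgeBandwidth picks the value used as an edge's available
    // bandwidth in pathfinding: the advertised max HTLC when the
    // optional field is set, otherwise the channel capacity.
    func edgeBandwidth(maxHTLCMsat uint64, hasMaxHTLC bool,
        capacityMsat uint64) uint64 {

        if hasMaxHTLC {
            return maxHTLCMsat
        }
        return capacityMsat
    }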
In this commit, we refactor DeleteChannelEdge to use ChannelIDs rather
than ChannelPoints. We do this as the only use of DeleteChannelEdge is
when we are pruning zombie channels from our graph. When running under a
light client, we are unable to obtain the ChannelPoint of each edge due
to the expensive operations required to do so. As a stop-gap, we'll
resort towards using an edge's ChannelID instead, which is already
gossiped between nodes.
Since light clients no longer have access to an edge's capacity, they
are unable to validate whether the max HTLC value for an updated edge
policy respects the capacity limit. As a stop-gap, we'll skip this
check.
This serves as a stop-gap for light clients as blocks need to be
downloaded from the P2P network, and even with caches, would be too
costly for them to verify. Doing this has two side effects, however:
we'll no longer know of the channel capacity and outpoint, which are
essential for some of lnd's responsibilities.
In this commit, we disable attempting to determine when a channel has
been closed out on-chain whenever AssumeChannelValid is active. Since
the flag indicates that performing this operation is expensive, we do
this as a temporary optimization until we can include proofs of channels
being closed in the gossip protocol.
With this change, the only way for channels being removed from the graph
will be once they're considered zombies: which can happen when both
edges of a channel have their disabled bits set or when both edges
haven't had an update within the past two weeks.
To ensure we don't mark an edge as live again just because an update
with a fresh timestamp was received, we'll ensure that we reject any
new updates for zombie channels if they remain disabled when running
with AssumeChannelValid.
In this commit, we add an additional heuristic when running with
AssumeChannelValid. Since AssumeChannelValid being present assumes that
we're not able to quickly determine whether channels are valid, we also
assume that any channels with the disabled bit set on both sides are
considered zombie. This should be relatively safe to do, since the
disabled bits are usually set when the channel is closed on-chain. In
the case that they aren't, we'll have to wait until both edges haven't
had a new update within two weeks to prune them.
We do this to ensure we don't prune too aggressively, as it's possible
that we've only received the channel announcement for a channel, but not
its accompanying channel updates.
This commit removes the QueryRoutes route cache. It is causing wrong
routes to be returned because not all of the request parameters are
stored.
The cache allowed high frequency QueryRoutes calls to the same
destination and with the same amount to be returned fast. This behaviour
can also be achieved by caching the request on the client side. If a
route is invalidated because of, for example, a channel update, the
subsequent SendToRoute call will fail. This is a trigger to call
QueryRoutes again for a fresh route.