Now that SendToRoute is no longer using the payment lifecycle, we move
the max hop check out of the payment shard's launch() method, and return
the error directly, such that it can be handled in SendToRoute.
Now that SendToRoute is no longer using the payment lifecycle, we
remove the error structs and vars used to cache the last encountered
error. For SendToRoute this will now be returned directly after a shard
has failed.
For SendPayment this means that the last error encountered durinng
pathfinding no longer will be returned. All errors encounterd can
instead be inspected from the HTLC list.
Instead of having SendToRoute pull routes from the payment session in
the payment lifecycle, we utilize the new methods on the paymentShard to
launch and collect the result for this single route.
This also let us remove the check for noRouteError, as we will always
have the result from the tried attempt returned. A result of this is
that we can finally remove lastError from the payment lifecycle (see
next commits).
Fetching the final shard result will also be done for calls to
SendToRoute, so we extract this code into a new method.
We move the call to the ControlTower to set the payment level failure
out into the payment loop, as this must be handled differently when
multiple shards are in flight, and for SendToRoute.
Define shardHandler which is a struct holding what is needed to send
attempts along given routes. The reason we define the logic on this
struct instead of the paymentLifecycle is that we later will make
SendToRoute calls not go through the payment lifecycle, but only using
this struct.
The launch shard is responsible for registering the attempt with the
control tower, failing it if the launch fails. Note that it is NOT
responsible for marking the _payment_ failed in case a terminal error is
encountered. This is important since we will later reuse this method for
SendToRoute, where whether to fail the payment cannot be decided on the
shard level.
We replace the cached attempt, and instead use the control tower
(database) to fetch any in-flight attempt. This is done as a
preparation for having multiple attempts in flight.
In addition we remove the cached circuit, as it won't be applicable when
multiple shards are in flight.
Instead of tracking the attemp we consult the database on every
iteration, and pick up any existing attempt. This also let us avoid
having to pass in the existing attempts from the payment loop, as we
just fetch them direclty.
This method is used to fetch a payment and all HTLC attempt that have
been made for that payment. It will both be used to resume inflight
attempts, and to fetch the final outcome of previous attempts.
We also update the the mock control tower to mimic the real control
tower, by letting it track multiple HTLC attempts for a given payment
hash, laying the groundwork for later enabling it for MPP.
The test case's preimage was (mistakenly) overwritten after crafting the
lightning payment, causing the parts of the testcases use the same
preimage causing problems when we are using the payment hash and
preimage in the mock control tower to distinguish paymennts.
In our quest to move calls to the ControlTower into the main payment
lifecycle loop, we move the edge case of a too long route out of
createNewPaymentAttempt.
loop
To prepare for multiple in flight payment attempts, we move
checkpointing the payment attempt out of createNewPaymentAttempt and
into the main payment lifecycle loop.
We'll attempt to move all calls to the DB via the ControlTower into this
loop, so we can more easily handle them in sequence.
active shards
In preparation for doing pathfinding for routes sending a value less
than the total payment amount, we let the payment session take the max
amount to send and the fee limit as arguments to RequestRoute.
This commit moves supplying of the information in the LightningPayment
to the initialization of the paymentSession, away from every call to
RequestRoute.
Instead the paymentSession will store this information internally, as it
doesn't change between payment attempts.
This is done to rid the RequestRoute call of the LightingPayment
argument, as for SendToRoute calls, it is not needed to supply the next
route.
This commit extends the htlc fail info with the full failure reason that
was received over the wire. In a later commit, this info will also be
exposed on the rpc interface. Furthermore it serves as a building block
to make SendToRoute reliable across restarts.
This commit converts the database structure of a payment so that it can
not just store the last htlc attempt, but all attempts that have been
made. This is a preparation for mpp sending.
In addition to that, we now also persist the fail time of an htlc. In a
later commit, the full failure reason will be added as well.
A key change is made to the control tower interface. Previously the
control tower wasn't aware of individual htlc outcomes. The payment
remained in-flight with the latest attempt recorded, but an outcome was
only set when the payment finished. With this commit, the outcome of
every htlc is expected by the control tower and recorded in the
database.
Co-authored-by: Johan T. Halseth <johanth@gmail.com>
To better distinguish payments from HTLCs, we rename the attempt info
struct to HTLCAttemptInfo. We also embed it into the HTLCAttempt struct,
to avoid having to duplicate this information.
The paymentID term is renamed to attemptID.
Adds an integrated routing test of probability extrapolation for untried
channels. The larger part of this commit is mock code to simulate the
Lightning Network.
The difference between this test and the existing pathfinding tests, is that
this test focuses on the feedback loop from result interpretation via
mission control updates and probability estimation back to pathfinding.
Improvements like probability extrapolation were previously only
validated by reasoning, while this setup makes it possible to assert the
improvement in a test and guard it for the future.
Previously we only penalized the outgoing connections of a failing node.
This turned out not to be sufficient, because the next route sometimes
went into the same failing node again to try a different outgoing
connection that wasn't yet known to mission control and therefore not
penalized before.
This shortcut does not work when the destination is a private node. We
also don't have this shortcut for regular payments. This commit
aligns the behavior between SendPayment and QueryRoutes.
The default was increased for the main sendpayment RPC in commit
d3fa9767a9729756bab9b4a1121344b265410b1a. This commit sets the
same default for QueryRoutes, routerrpc.SendPayment and
router.EstimateRouteFee.
Update the type check used for checking local payment
failures to check on the ClearTextError interface rather
than on the ForwardingError type. This change prepares
for splitting payment errors up into Link and Forwarding
errors.
This commit adds a ClearTextError interface
which is implemented by non-opaque errors that
we know the underlying wire failure message for.
This interface is implemented by ForwardingErrors,
because we can fully decrypt the onion blob to
obtain the underlying failure reason. This interface
will also be implemented by errors which originate
at our node in following commits, because we know
the failure reason when we fail the htlc.
The lnwire interface is un-embedded in the
ForwardingError struct in favour of implementing
this interface. This change is made to protect
against accidental passing of a ForwardingError
to the wire, where the embedded FailureMessage
interface will present as wire failure but
will not serialize properly.
Add a constructor for the creation of forwarding errors.
A special constructor is added for the case where we have
an unknown wire failure, and must set a nil failure message.
Modifies TestMissingFeatureDep and TestDestPaymentAddr to use the test
ctx directly instead of generating a closure and using local state to
modify restrictions.
This commit brings us inline with recent modifications to the spec, that
say we shouldn't pay nodes whose feature vectors signal unknown required
features, and also that we shouldn't route through nodes signaling
unknown required features.
Currently we assert that invoices don't have such features during
decoding, but now that users can specify feature vectors via the rpc
interface, it makes sense to perform this check deeper in call stack.
This will also allow us to remove the check from decoding entirely,
making decodepayreq more useful for debugging.
In this commit, we update the routing package to use the new
`sphinx.NewOnionPacket` method. The new version of this method allows us
to specify _how_ the packet should be filled before it's used to create
a mix-header. This isn't a fundamental change (totally backwards
compatible), instead it plugs a privacy leak that may have revealed to
the destination how long the true route was.
This commit adds success mission control
results for all hops along the route in
a mpp timeout and takes no action for
the final hop along the route. This is a
temporary measure to prevent the default
logic from penalizing the final node while
we decide how to penalize mpp timeouts.
This commit adds a getResolutionFailure function
which returns an appropriate wire failure based
on the outcome of a htlc resolution. It also updates
the MissionControlStore test to ensure that lnd
can handle failures which occur due to mpp timeout.
Also the max hop count check can be removed, because the real bound is
the payload size. By moving the check inside the search loop, we now
also backtrack when we hit the limit.
This commit fixes a potential bug in our test harness, by ensuring that
the constructed node policies are configured _after_ sorting. Currently
the node pubkeys are sorted, but additional parameters (max htlc,
disabled, etc) are applied using the unsorted policies.
Most of the constructors used today use the symmetric channel
constructor, so this shouldn't cause an issue with the majority of our
tests. We recently introduced an asymmetric channel constructor for
which this could have been an issue, however, no known issues were
discovered.
Lastly, we remove the direction from the configuration altogether, and
derive it purely from the final sorting of the pubkeys.
We move up the check for TLV support, since we will later use it to
determine if we can use dependent features, e.g. TLV records and payment
addresses.
This commit creates a wrapper struct, grouping all parameters that
influence the final hop during route construction. This is a preliminary
step for passing in the receiver's invoice feature bits, which will be
used to select an appropriate payment or payload type.
In this commit, we overwrite the final hop's features with either the
destination features or those loaded from the graph fallback. This
ensures that the same features used in pathfinding will be provided to
route construction.
In an earlier commit, we validated the final hop's transitive feature
dependencies, so we also add validation to non-final nodes.
This commit adds an optional PaymentAddr field to the RestrictParams, so
that we can verify the final hop can support it before doing an
expensive round of pathfindig.
In this commit, we fix a bug that prevents us from sending custom
records to nodes that aren't in the graph. Previously we would simply
fail if we were unable to retrieve the node's features.
To remedy, we add the option of supplying the destination's feature bits
into path finding. If present, we will use them directly without
consulting the graph, resolving the original issue. Instead, we will
only consult the graph as a fallback, which will still fail if the node
doesn't exist since the TLV features won't be populated in the empty
feature vector.
Furthermore, this also permits us to provide "virtual features" into the
pathfinding logic, where we make assumptions about what the receiver
supports even if the feature vector isn't actually taken from an
invoice. This can useful in cases like keysend, where we don't have an
invoice, but we can still attempt the payment if we assume the receiver
supports TLV.
This commit allows custom node features to be populated in specific test
instances. For consistency, we auto-populate an empty feature vector for
nodes that have nil feature vectors before writing them to the database.
Previously if a payment was sent with custom records attached, path
finding wouldn't perform a check whether the final node was capable of
receiving custom records in a tlv payload.
This commit prepares for more manipulation of custom records. A list of
tlv.Record types is more difficult to use than the more basic
map[uint64][]byte.
Furthermore fields and variables are renamed to make them more
consistent.
A unified policy differs between local channels and other channels on
the network. There is more information available for local channels and
this is used in the unified policy.
Previously we used the pathfinding source pubkey to determine whether to
apply the local channel logic or not. If queryroutes is executed with a
source node that isn't the self node, this wouldn't work.
When the (virtual) payment attempt cost is set to zero, probabilities
are no longer a factor in determining the best route. In case of routes
with equal costs, we'd just go with the first one found. This commit
refines this behavior by picking the route with the highest probability.
So even though probability doesn't affect the route cost, it is still
used as a tie breaker.
This prepares for routing to self. When checking the condition at the
start, the loop would terminate immediately because the source is equal
to the target.
This commit modifies the FetchPayment method to return MPPayment structs
converted from the legacy on-disk format. This allows us to attach the
HTLCs to the events given to clients subscribing to the outcome of an
HTLC.
This commit also bubbles up to the routerrpc/router_server, by
populating HTLCAttempts in the response and extracting the legacy route
field from the HTLCAttempts.
Previously we used the a priori probability also for our own untried
channels. This led to local channels that had seen a success already
being prioritized over untried local channels. In some cases, depending
on the configured payment attempt cost, this could lead to the payment
taking a two hop route while a direct payment was also possible.
An InvalidOnionPayload implies that the onion was successfully received
by the reporting node, but that they were unable to extract the
contents. Since we assume our own behavior is correct, this mostly
likely poins to an error in the reporter's implementation or that we
sent an unknown required type. Therefore we only penalize that single
hop, and consider the failure terminal if the receiver reported it.
Probabilities are no longer returned for querymc calls. To still provide
some insight into the mission control internals, this commit adds a new
rpc that calculates a success probability estimate for a specific node
pair and amount.
This prepares for decoupling the result interpretation of a single
payment attempt from the information stored in mission control memory
on the history of a node pair. A planned follow-up where we store both
the last success and last failure requires this decoupling.
In this commit we change path finding to no longer consider all channels
between a pair of nodes individually. We assume that nodes forward
non-strict and when we attempt a connection between two nodes, we don't
want to try multiple channels because their policies may not be identical.
Having distinct policies for channel to the same peer is against the
recommendation in the spec, but it happens in the wild. Especially since
we recently changed the default cltv delta value.
What this commit introduces is a unified policy. This can be looked upon
as the greatest common denominator of all policies and should maximize
the probability of getting the payment forwarded.
distance map now holds the edge the current path is coming from,
removing the need for next map.
Both distance map and distanceHeap now hold pointers instead of the full
struct to reduce allocations and copies.
Both these changes reduced path finding time by ~5% and memory usage by
~2mb.
Pre-sizing these structures avoids a lot of map resizing, which causes
copies and rehashing of entries. We mostly know that the map won't
exceed that size, and it doesn't affect memory usage in any significant
way.
Calling `ForEachNode` hits the DB, and allocates and parses every node
in the graph. Walking the channels also loads nodes from the DB, so this
meant that each node was read/parsed/allocated several times per run.
This reduces runtime by ~10ms and memory usage by ~4mb.
This commit changes mission control to partially base the estimated
probability for untried connections on historical results obtained in
previous payment attempts. This incentivizes routing nodes to keep all
of their channels in good shape.
Probability estimates are amount dependent. Previously we assumed an
amount, but that starts to make less sense when we make probability more
dependent on amounts in the future.
This commit modifies the interpretation of node-level failures.
Previously only the failing node was marked. With this commit, also the
incoming and outgoing connections involved in the route are marked as
failed.
The change prepares for the removal of node-level failures in mission
control probability estimation.
This commit changes the in-memory structure of the mission control
state. It prepares for calculation of a node probability. For this we
need to be able to efficiently look up the last results for all channels
of a node.
With the introduction of the max CLTV limit parameter, nodes are able to
reject HTLCs that exceed it. This should also be applied to path
finding, otherwise HTLCs crafted by the same node that exceed it never
left the switch. This wasn't a big deal since the previous max CLTV
limit was ~5000 blocks. Once it was lowered to 1008, the issue became
more apparent. Therefore, all of our path finding attempts now have a
restriction of said limit in in order to properly carry out HTLCs to the
network.
In the process of moving to use the new package, we no longer need to
fetch the outpoint directly, and instead only need to pass the funding
transaction into the new verification logic.
In this commit, we update the router and link to support users
updating the max HTLC policy for their channels. By updating these internal
systems before updating the RPC server and lncli, we protect users from
being shown an option that doesn't actually work.
The policy update logic that resided part in the gossiper and
part in the rpc server is extracted into its own object.
This prepares for additional validation logic to be added for policy
updates that would otherwise make the gossiper heavier.
It is also a small first step towards separation of our own channel data
from the rest of the graph.
Extends the invalid payment details failure with the new accept height
field. This allows sender to distinguish between a genuine invalid
details situation and a delay caused by intermediate nodes.
Currently the underlying array backing the hop's TLVRecords is modified
when combining custom records with the primitive forwarding info. This
commit uses a fresh slice to prevent modifications from mutating the
hop itself.
This commit modifies paymentLifecycle so that it not only feeds
failures into mission control, but successes as well.
This allows for more accurate probability estimates. Previously,
the success probability for a successful pair and a pair with
no history was equal. There was no force that pushed towards
previously successful routes.
In this commit, we extend the path finding to be able to recognize when
a node needs the new TLV format, or the legacy format based on the
feature bits they expose. We also extend the `LightningPayment` struct
to allow the caller to specify an arbitrary set of TLV records which can
be used for a number of use-cases including various variants of
spontaneous payments.
In this commit, we extend the Hop struct to carry an arbitrary set of
TLV values, and add a new field that allows us to distinguish between
the modern and legacy TLV payload.
We add a new `PackPayload` method that will be used to encode the
combined required routing TLV fields along any set of TLV fields that
were specified as part of path finding.
Finally, the `ToSphinxPath` has been extended to be able to recognize if
a hop needs the modern, or legacy payload.
This commit overhauls the interpretation of failed payments. It changes
the interpretation rules so that we always apply the strongest possible
set of penalties, without making assumptions that would hurt good nodes.
Main changes are:
- Apply different rule sets for intermediate and final nodes. Both types
of nodes have different sets of failures that we expect. Penalize nodes
that send unexpected failure messages.
- Distinguish between direct payments and multi-hop payments. For direct
payments, we can infer more about the performance of our peer because we
trust ourselves.
- In many cases it is impossible for the sender to determine which of
the two nodes in a pair is responsible for the failure. In this
situation, we now penalize bidirectionally. This does not hurt the good
node of the pair, because only its connection to a bad node is
penalized.
- Previously we always penalized the outgoing connection of the
reporting node. This is incorrect for policy related failures. For
policy related failures, it could also be that the reporting node
received a wrongly crafted htlc from its predecessor. By penalizing the
incoming channel, we surely hit the responsible node.
- FailExpiryTooSoon is a failure that could have been caused by any node
up to the reporting node by delaying forwarding of the htlc. We don't
know which node is responsible, therefore we now penalize all node pairs
in the route.
When an undecryptable failure comes back for a payment attempt, we
previously only penalized our own outgoing connection. However,
any node could have caused this failure. It is therefore better to
penalize all node connections along the route. Then at least we know for
sure that we will hit the responsible node.
This commit updates existing tests to not rely on mission control for
pruning of local channels. Information about local channels should
already be up to date before path finding starts. If not, the problem
should be fixed where bandwidth hints are set up.
This commit moves the payment outcome interpretation logic into a
separate file. Also, mission control isn't updated directly anymore, but
results are stored in an interpretedResult struct. This allows the
mission control state to be locked for a minimum amount of time and
makes it easier to unit test the result interpretation.
This commit converts several functions from returning a bool and a
failure reason to a nillable failure reason as return parameter. This
will take away confusion about the interpretation of the two separate
values.
Previously mission control tracked failures on a per node, per channel basis.
This commit changes this to tracking on the level of directed node pairs. The goal
of moving to this coarser-grained level is to reduce the number of required
payment attempts without compromising payment reliability.
Align naming better with the lightning spec. Not the full name of the
failure (FailIncorrectOrUnknownPaymentDetails) is used, because this
would cause too many long lines in the code.
This commit adds the BlockPadding value (currently 3) to sendpayment
calls so that if some blocks are mined while the htlc is in-flight, the
exit hop won't reject it.
The current approach iterates all channels in the graph in order to
filter those in need. This approach is time consuming, several seconds
on my mobile device for ~40,000 channels, while during this time the
db is locked in a transaction.
The proposed change is to use an existing functionality that utilize the
fact that channel update are saved indexed by date. This method enables
us to go over only a small subset of the channels, only those that
were updated before the "channel expiry" time and further filter
them for our need.
The same graph that took several seconds to prune was pruned, after
the change, in several milliseconds.
In addition for testing purposes I added Initiator field to the
testChannel structure to reflect the channeldEdgePolicy direction.
If nodes return a channel policy related failure, they may get a second
chance. Our graph may not be up to date. Previously this logic was
contained in the payment session.
This commit moves that into global mission control and thereby removes
the last mission control state that was kept on the payment level.
Because mission control is not aware of the relation between payment
attempts and payments, the second chance logic is no longer based
tracking second chances given per payment.
Instead a time based approach is used. If a node reports a policy
failure that prevents forwarding to its peer, it will get a second
chance. But it will get it only if the previous second chance was
long enough ago.
Also those second chances are no longer dependent on whether an
associated channel update is valid. It will get the second chance
regardless, to prevent creating a dependency between mission control and
the graph. This would interfer with (future) replay of history, because
the graph may not be the same anymore at that point.