Move our more generic terminal check forward so that we only
need to handle a single class of expected errors. This change
is mirrored in our mock, and our reproducing tests are updated
to assert that this move catches both classes of errors we get.
Add an additional stuck-payment case, where our payment gets
a terminal error while it has other htlcs in-flight, and a
shard fails with ErrTerminalPayment. This payment also falls in
our class of expected errors, but is not currently handled. The
mock is updated accordingly, using the same ordering as in our
real RegisterAttempt implementation.
This commit adds a test which demonstrates that payments can
get stuck if we receive a payment failure while we're pathfinding
for another shard, then try to dispatch a shard after we've
recorded a permanent failure. It also updates our mock to
only consider payments with no in-flight htlcs as in-flight,
to more closely represent our actual RegisterAttempt.
This commit adds a step to our payment lifecycle test to add
control over when we find a path for our payment, This is
required for testing race conditions around pathfinding
completing and payment failures being reported.
This commit updates our mock to more closely follow the behavior of the
switch for mocked calls to GetPaymentResult. As it stands, our tests
send a test-created error from the switch when we want to mock shutdown.
In reality, the switch will close its result channel, so we update this
test to follow that behavior. This matters for the commit that follows,
because we start checking the error our payments return. If we have an
error from the switch, our tests will fail with an error that we do
not encounter in practice.
In our payment lifecycle tests, we have two goroutines that
compete for the lock in our mock control tower: the resumePayment
loop which tries to call RegisterAttempt, and the collectResult
handler which is launched in a goroutine by collectResultAsync
and is responsible for various settle/fail calls.
The order that the lock is acquired by these goroutines is
arbitrary, and can lead to flakes in our tests if the step
that we do not intend to execute first gets the lock (eg,
we want to fail a payment in collectResult, but RegisterAttempt
gets there first). This commit moves contention for this lock
after our mock's various "state driving" channels, so that the
lock will be acquired in the order that the test intends it.
This commit fixes the inconsistency between the payment state as
reported by routerrpc.SendPayment/routerrpc.TrackPayment and the main
rpc ListPayments call.
In addition to that, payment state changes are now sent out for every
state change. This opens the door to user interfaces giving more
feedback to the user about the payment process. This is especially
interesting for multi-part payments.
We whitelist a set of "expected" errors that can be returned from
RequestRoute, by converting them into a new type noRouteError. For any
other error returned by RequestRoute, we'll now exit immediately.
This commit redefines how the control tower handles shard and payment
level settles and failures. We now consider the payment in flight as
long it has active shards, or it has no active shards but has not
reached a terminal condition (settle of one of the shards, or a payment
level failure has been encountered).
We also make it possible to settle/fail shards regardless of the payment
level status (since we must allow late shards recording their status
even though we have already settled/failed the payment).
Finally, we make it possible to Fail the payment when it is already
failed. This is to allow multiple concurrent shards that reach terminal
errors to mark the payment failed, without havinng to synchronize.
This method is used to fetch a payment and all HTLC attempt that have
been made for that payment. It will both be used to resume inflight
attempts, and to fetch the final outcome of previous attempts.
We also update the the mock control tower to mimic the real control
tower, by letting it track multiple HTLC attempts for a given payment
hash, laying the groundwork for later enabling it for MPP.
active shards
In preparation for doing pathfinding for routes sending a value less
than the total payment amount, we let the payment session take the max
amount to send and the fee limit as arguments to RequestRoute.
This commit moves supplying of the information in the LightningPayment
to the initialization of the paymentSession, away from every call to
RequestRoute.
Instead the paymentSession will store this information internally, as it
doesn't change between payment attempts.
This is done to rid the RequestRoute call of the LightingPayment
argument, as for SendToRoute calls, it is not needed to supply the next
route.
This commit converts the database structure of a payment so that it can
not just store the last htlc attempt, but all attempts that have been
made. This is a preparation for mpp sending.
In addition to that, we now also persist the fail time of an htlc. In a
later commit, the full failure reason will be added as well.
A key change is made to the control tower interface. Previously the
control tower wasn't aware of individual htlc outcomes. The payment
remained in-flight with the latest attempt recorded, but an outcome was
only set when the payment finished. With this commit, the outcome of
every htlc is expected by the control tower and recorded in the
database.
Co-authored-by: Johan T. Halseth <johanth@gmail.com>
To better distinguish payments from HTLCs, we rename the attempt info
struct to HTLCAttemptInfo. We also embed it into the HTLCAttempt struct,
to avoid having to duplicate this information.
The paymentID term is renamed to attemptID.
Update the type check used for checking local payment
failures to check on the ClearTextError interface rather
than on the ForwardingError type. This change prepares
for splitting payment errors up into Link and Forwarding
errors.
This commit modifies paymentLifecycle so that it not only feeds
failures into mission control, but successes as well.
This allows for more accurate probability estimates. Previously,
the success probability for a successful pair and a pair with
no history was equal. There was no force that pushed towards
previously successful routes.
This commit converts several functions from returning a bool and a
failure reason to a nillable failure reason as return parameter. This
will take away confusion about the interpretation of the two separate
values.
Previously mission control tracked failures on a per node, per channel basis.
This commit changes this to tracking on the level of directed node pairs. The goal
of moving to this coarser-grained level is to reduce the number of required
payment attempts without compromising payment reliability.
If nodes return a channel policy related failure, they may get a second
chance. Our graph may not be up to date. Previously this logic was
contained in the payment session.
This commit moves that into global mission control and thereby removes
the last mission control state that was kept on the payment level.
Because mission control is not aware of the relation between payment
attempts and payments, the second chance logic is no longer based
tracking second chances given per payment.
Instead a time based approach is used. If a node reports a policy
failure that prevents forwarding to its peer, it will get a second
chance. But it will get it only if the previous second chance was
long enough ago.
Also those second chances are no longer dependent on whether an
associated channel update is valid. It will get the second chance
regardless, to prevent creating a dependency between mission control and
the graph. This would interfer with (future) replay of history, because
the graph may not be the same anymore at that point.
Previously every payment had its own local mission control state which
was in effect only for that payment. In this commit most of the local
state is removed and payments all tap into the global mission control
probability estimator.
Furthermore the decay time of pruned edges and nodes is extended, so
that observations about the network can better benefit future payment
processes.
Last, the probability function is transformed from a binary output to a
gradual curve, allowing for a better trade off between candidate routes.
TestRouterPaymentStateMachine tests that the router interacts as
expected with the ControlTower during a payment lifecycle, such that it
payment attempts are not sent twice to the switch, and results are
handled after a restart.
In this commit we move handing the deobfuscator from the router to the
switch from when the payment is initiated, to when the result is
queried.
We do this because only the router can recreate the deobfuscator after a
restart, and we are preparing for being able to handle results across
restarts.
Since the deobfuscator cannot be nil anymore, we can also get rid of
that special case.
This lets us distinguish an critical error from a actual payment result
(success or failure). This is important since we know that we can only
attempt another payment when a final result from the previous payment
attempt is received.
This commit moves the responsibility of generating a unique payment ID
from the switch to the router. This will make it easier for the router
to keep track of which HTLCs were successfully forwarded onto the
network, as it can query the switch for existing HTLCs as long as the
paymentIDs are kept.
The router is expected to maintain a map from paymentHash->paymentID,
such that they can be replayed on restart. This also lets the router
check the status of a sent payment after a restart, by querying the
switch for the paymentID in question.