diff --git a/docs/leader_election.md b/docs/leader_election.md
new file mode 100644
index 00000000..2ca77e15
--- /dev/null
+++ b/docs/leader_election.md
@@ -0,0 +1,112 @@
# Increasing LND reliability by clustering

Normally LND nodes use the embedded bbolt database to store all important state.
This mode of operation has been proven to work well in a variety of environments,
from mobile clients to large nodes serving hundreds of channels. At scale, however,
it is desirable to replicate LND's state so that nodes can be moved quickly and
reliably, updates can be rolled out and the deployment is more resilient to
datacenter failures.

It is now possible to store all essential state in a replicated etcd DB and to
run multiple LND nodes on different machines, where only one of them (the leader)
is able to read and mutate the database. In such a setup, if the leader node fails
or is decommissioned, a follower node is elected as the new leader and quickly
comes online to minimize downtime.

The leader election feature currently relies on etcd both for the election
itself and for the replicated data store.

## Building LND with leader election support

To create a dev build of LND with leader election support, use the following
command:

```shell
⛰ make tags="kvdb_etcd"
```

## Running a local etcd instance for testing

To start a local etcd instance for testing, run:

```shell
⛰ ./etcd \
  --auto-tls \
  --advertise-client-urls=https://127.0.0.1:2379 \
  --listen-client-urls=https://0.0.0.0:2379 \
  --max-txn-ops=16384 \
  --max-request-bytes=104857600
```

The large `max-txn-ops` and `max-request-bytes` values are currently recommended
but may not be required in the future.

## Configuring LND to run on etcd and participate in leader election

To run LND with etcd, additional configuration is needed, specified either
through command line flags or in `lnd.conf`.

Sample command line:

```shell
⛰ ./lnd-debug \
  --db.backend=etcd \
  --db.etcd.host=127.0.0.1:2379 \
  --db.etcd.certfile=/home/user/etcd/bin/default.etcd/fixtures/client/cert.pem \
  --db.etcd.keyfile=/home/user/etcd/bin/default.etcd/fixtures/client/key.pem \
  --db.etcd.insecure_skip_verify \
  --cluster.enable-leader-election \
  --cluster.leader-elector=etcd \
  --cluster.etcd-election-prefix=cluster-leader \
  --cluster.id=lnd-1
```

The `cluster.etcd-election-prefix` option sets the election's etcd key prefix.
The `cluster.id` is used to identify the individual nodes in the cluster
and should be set to a different value for each node.

Optionally, users can specify `db.etcd.user` and `db.etcd.pass` for database
user authentication. If the database is shared, it is possible to separate our
data from other users by setting `db.etcd.namespace` to an (already existing)
etcd namespace. To test without TLS, we can set the `db.etcd.disabletls` flag
to `true`.
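Instead of command line flags, the same options can also be placed in
`lnd.conf`. The snippet below is only a sketch: the keys are assumed to mirror
the flags shown above and all values are placeholders, so consult
`sample-lnd.conf` in the repository for the authoritative option names and
section layout.

```
; Illustrative sketch only: keys mirror the command line flags shown above and
; all values are placeholders. See sample-lnd.conf for the authoritative layout.

[db]
db.backend=etcd

[etcd]
db.etcd.host=127.0.0.1:2379
db.etcd.certfile=/home/user/etcd/bin/default.etcd/fixtures/client/cert.pem
db.etcd.keyfile=/home/user/etcd/bin/default.etcd/fixtures/client/key.pem
db.etcd.insecure_skip_verify=true

[cluster]
cluster.enable-leader-election=true
cluster.leader-elector=etcd
cluster.etcd-election-prefix=cluster-leader
cluster.id=lnd-1
```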
Once the node is up and running, we can start further nodes with the same
command line, changing only the `cluster.id`.

## Identifying the leader node

The above setup is useful for testing but is not viable when running in a
production environment. For users relying on containers and orchestration
services, it is essential to know which node is the leader in order to
automatically route network traffic to the right instance. For example, in
Kubernetes the load balancer will route traffic to all "ready" nodes, and this
readiness may be monitored by a readiness probe.

For readiness probing we can use LND's state RPC service, where the special
state `WAITING_TO_START` indicates that the node is waiting to become the
leader and has not started yet. To check this, the readiness probe can simply
curl the REST endpoint of the state RPC:

```
readinessProbe:
  exec:
    command: [
      "/bin/sh",
      "-c",
      "set -e; set -o pipefail; curl -s -k -o - https://localhost:8080/v1/state | jq .'state' | grep -E 'NON_EXISTING|LOCKED|UNLOCKED|RPC_ACTIVE'",
    ]
  periodSeconds: 1
```

## Replication of non-critical data

All critical data is written to the replicated database, including LND's wallet
data, which contains the key material and node identity. Some less critical
data, however, is currently not written to that same database for performance
reasons and is instead still kept in local `bbolt` files.

For example, the graph data is kept locally to speed up path finding. Other
examples are the macaroon database and the watchtower client database. To make
sure a node can become active and take over quickly if the leader fails, it is
therefore still recommended to have the LND data directory on a shared volume
that all active and passive nodes can access; otherwise the node that is taking
over might first need to sync its graph. One way to share the data directory in
Kubernetes is sketched at the end of this document.

As we evolve our cluster support we'll provide more solutions to make
replication and clustering even more seamless.
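For illustration, the following is one way to share the LND data directory
between all instances in Kubernetes, as recommended in the section on
non-critical data above. This is a sketch under assumptions, not a tested
manifest: the claim name, size and mount path are placeholders, and it requires
a storage class that supports `ReadWriteMany` access (for example NFS or
CephFS).

```
# Sketch only: placeholder names and sizes; requires an RWX-capable storage class.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: lnd-shared-data
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
---
# In each LND pod template, mount the claim at the LND data directory, e.g.:
#
#   volumeMounts:
#     - name: lnd-data
#       mountPath: /root/.lnd        # assumes the default lnddir for root
#   volumes:
#     - name: lnd-data
#       persistentVolumeClaim:
#         claimName: lnd-shared-data
```

With the graph and other local databases on such a volume, a follower that wins
the election can start serving without first rebuilding them.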