Byzantine Generals Problem, again

My illegitimate father, Leslie Lamport, along with Shostak and Pease, wrote about the Byzantine Generals Problem. The Byzantine Generals Problem takes a distributed systems problem and places it in the Byzantine Empire, to connect with the history-buff computer-nerd crowd. How does one coordinate a multi-militia attack without falling prey to bad actors altering transmissions, or to messages never being received at all?
Relationship experts at The Gottman Institute would say it is all about communication and establishing trust. With effective communication we build resiliency into our system, or our militia. We must also trust that the messages we receive are from who they say they are. In the Byzantine Generals Problem, each lieutenant communicates with the others to prove the correctness of a message they receive. A commander can send a message to attack the following day at dawn, but the attack will only happen if the lieutenants can reach consensus that this is the plan.
This introduces a class of failures in computing called “Byzantine failures”, where a component can wobble between functioning and failed, appearing to be both at the same time depending on who is observing it. The Gottmans would just call this marriage.
Crash Fault Tolerance versus Byzantine Fault Tolerance
Many systems are designed around consensus algorithms such as Raft, or, for the cool complicated C kids, Paxos. These are crash fault tolerant, which means they can withstand system failures such as a node going offline. A crash-fault-tolerant cluster of N nodes keeps working as long as the number of failed nodes (f) is less than half the cluster: f < N/2, or equivalently N ≥ 2f + 1.
However, when we fire Greg from accounting for embezzlement, he may become a malicious actor within our production system. Crash fault tolerance helps us with predictable system failures, but Byzantine Fault Tolerance withstands bad actors in our system. A Byzantine-fault-tolerant cluster of N nodes tolerates arbitrary or malicious behavior from f nodes as long as f is less than a third of the cluster: f < N/3, or equivalently N ≥ 3f + 1.
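To make the arithmetic concrete, here is a minimal sketch in Go (the function names are mine, not from any paper) of how many faults each model tolerates for a given cluster size:
```go
package main

import "fmt"

// maxCrashFaults: crash fault tolerance holds while f < N/2,
// i.e. a cluster of N nodes survives up to (N-1)/2 crashed nodes.
func maxCrashFaults(n int) int {
	return (n - 1) / 2
}

// maxByzantineFaults: Byzantine fault tolerance holds while f < N/3,
// i.e. a cluster of N nodes survives up to (N-1)/3 malicious nodes.
func maxByzantineFaults(n int) int {
	return (n - 1) / 3
}

func main() {
	for _, n := range []int{3, 4, 5, 7} {
		fmt.Printf("N=%d: tolerates %d crashes, %d Gregs\n",
			n, maxCrashFaults(n), maxByzantineFaults(n))
	}
}
```
Note how much more expensive bad actors are: seven nodes buy you three crash failures but only two Gregs.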
Byzantine Fault Tolerance is what most of us should want to achieve to avoid being paged, as it withstands more human error (misconfiguration) as well as bad humans. However, it is not commonly adopted outside of “The Blockchain” due to cost and complexity.

Greg from accounting, probably
A Byzantine Fault Tolerant Raft
There are many papers on Byzantine Fault Tolerant Raft and Paxos, but today I chose this one. It starts off similarly to our ordinary Raft, with three stages: pre-append, append, and commit.
Pre-append and append establish the order in which servers commit entries, while append and rollup ensure that communications between servers and clients have been properly stored on both sides.
A client will send a transaction to a server. These transactions contain a body, a timestamp (an integer the client increments with each request), a client identifier, and a signature. That transaction needs to be replicated throughout the cluster in a way that withstands bad actors inside the cluster, but also in a way that is resilient and timely. If a client sends a transaction and does not hear a response within a timeframe, it will broadcast the transaction to all nodes in the cluster; it is then on those nodes to forward the transaction to whoever they believe the current leader is.
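The paper doesn’t pin down a wire format, so here is my guess at what the transaction and the broadcast-on-timeout behavior might look like in Go (every name here is an assumption):
```go
package raftbft

import "time"

// ClientTransaction is a guess at the request shape described above.
type ClientTransaction struct {
	Body      []byte // the operation to apply
	Timestamp uint64 // per-client counter, incremented on every request
	ClientID  string // identifies which client key signed this
	Signature []byte // client's signature over Body, Timestamp, and ClientID
}

// Client holds just enough state for the retry behavior; sendTo is a
// stubbed-out transport that reports when a reply arrives.
type Client struct {
	leader  string
	nodes   []string
	timeout time.Duration
	sendTo  func(addr string, tx ClientTransaction) <-chan error
}

// Send submits tx to the presumed leader. If nothing comes back in
// time, broadcast to every node and let them forward the transaction
// to whoever they believe the current leader is.
func (c *Client) Send(tx ClientTransaction) {
	select {
	case <-c.sendTo(c.leader, tx):
		return // heard back in time
	case <-time.After(c.timeout):
		for _, n := range c.nodes {
			c.sendTo(n, tx)
		}
	}
}
```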
When the leader receives this client transaction, it adds it to a queue to be validated: check that the client signature is valid and matches the client identifier, that the client timestamp is greater than previous timestamps from that client, that there are no conflicts between transactions, and that the transaction is well formed.
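Building on the sketch above, the leader-side checks might look something like this (txDigest and conflicts are hypothetical helpers I’m assuming into existence):
```go
import (
	"crypto/ed25519"
	"errors"
	"fmt"
)

// Leader carries the minimum state the validation needs.
type Leader struct {
	clientKeys    map[string]ed25519.PublicKey // client ID -> public key
	lastTimestamp map[string]uint64            // client ID -> highest timestamp seen
}

// validate runs the four checks: signature, timestamp monotonicity,
// conflicts, and well-formedness.
func (l *Leader) validate(tx ClientTransaction) error {
	pub, ok := l.clientKeys[tx.ClientID]
	if !ok {
		return fmt.Errorf("unknown client %q", tx.ClientID)
	}
	// txDigest (not shown) would hash Body, Timestamp, and ClientID.
	if !ed25519.Verify(pub, txDigest(tx), tx.Signature) {
		return errors.New("client signature does not verify")
	}
	if tx.Timestamp <= l.lastTimestamp[tx.ClientID] {
		return errors.New("stale timestamp: replayed or reordered request")
	}
	if l.conflicts(tx) { // hypothetical check against queued transactions
		return errors.New("conflicts with a queued transaction")
	}
	if len(tx.Body) == 0 { // stand-in for a real well-formedness check
		return errors.New("malformed transaction")
	}
	return nil
}
```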
Pre-Append
The leader will eventually combine many transactions into a single pre-append entry, which should reduce the amount of data being transmitted and give a slight performance increase. Because the ordering is already decided when the pre-append entry is created, only the new log entry number for this entry needs to be transmitted.
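My reading of the batching, again with invented names (PreAppendEntry, drainValidatedQueue, hashBatch, and sign are all assumptions, as is the Leader tracking a lastLogIndex):
```go
// PreAppendEntry batches validated transactions into one log entry.
// Ordering is already fixed, so the entry carries a single log index
// rather than per-transaction sequence numbers.
type PreAppendEntry struct {
	Index        uint64              // the new log entry number
	Transactions []ClientTransaction // validated, in decided order
	TxHash       []byte              // hash over the whole batch
	LeaderSig    []byte              // leader's signature over Index and TxHash
}

// buildPreAppend drains the validated queue into a single signed entry.
func (l *Leader) buildPreAppend() PreAppendEntry {
	batch := l.drainValidatedQueue()
	e := PreAppendEntry{
		Index:        l.lastLogIndex + 1,
		Transactions: batch,
		TxHash:       hashBatch(batch),
	}
	e.LeaderSig = l.sign(e.Index, e.TxHash)
	return e
}
```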

The leader requires 2f + 1 pre-append signatures, where f is the number of faulty nodes the cluster can tolerate. In the case of a cluster with one client, one leader, and three followers, N = 4 and f = 1, so the quorum is 2 * 1 + 1 = 3. When the leader receives 3 valid pre-append signatures, we move to the append stage.
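Counting that quorum is mostly bookkeeping, but the detail worth getting right is deduplication: you want 2f + 1 distinct signers, not 2f + 1 messages. A sketch:
```go
// sigQuorum collects pre-append acknowledgements until 2f+1 distinct
// nodes have signed off, where f is the tolerated fault count (N-1)/3.
type sigQuorum struct {
	f    int
	sigs map[string][]byte // node ID -> signature; a map deduplicates
}

// add records an already-verified signature and reports whether the
// quorum has been reached.
func (q *sigQuorum) add(nodeID string, sig []byte) bool {
	q.sigs[nodeID] = sig // repeat messages from one node overwrite, not add
	return len(q.sigs) >= 2*q.f+1
}
```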
When a follower receives the pre-append message, it validates the transactions the same way the leader does in the pre-append phase. Then it validates the hashes of the transactions, any follower signatures the leader attached as proof, and that the log entry number is one higher than the latest committed entry. If all of this succeeds, it appends the entry to its log.
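Put together, the follower-side check might look like this (verifyTx, hashBatch, and verifyLeaderProof are hypothetical helpers, and I’m guessing at what “proof” the leader attaches):
```go
import (
	"bytes"
	"errors"
	"fmt"
)

type Follower struct {
	lastCommittedIndex uint64
	log                []PreAppendEntry
}

// checkPreAppend mirrors the checks described above before appending.
func (f *Follower) checkPreAppend(e PreAppendEntry) error {
	for _, tx := range e.Transactions {
		if err := f.verifyTx(tx); err != nil { // same checks the leader ran
			return err
		}
	}
	if !bytes.Equal(e.TxHash, hashBatch(e.Transactions)) {
		return errors.New("batch hash mismatch")
	}
	if !f.verifyLeaderProof(e) { // leader's signature plus attached follower signatures
		return errors.New("leader's proof did not verify")
	}
	if e.Index != f.lastCommittedIndex+1 {
		return fmt.Errorf("expected entry %d, got %d", f.lastCommittedIndex+1, e.Index)
	}
	f.log = append(f.log, e) // every check passed: append to our log
	return nil
}
```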
After it has done so, it responds to the client for the previous entry (the one committed, not the one just appended). The leader also gets a message with the follower’s node number, a signature, and a hash of the current append entry along with the last entry.
Commit
To fully commit, the leader again has to collect 2f + 1 valid signatures. Once it does, it stores the entry in its stable log. The client then receives another message, this time from the leader, containing a hash of the client request, the leader’s node number, and a signature.
Append
This is where, because everything hinges on trust, we turn into everyone vigorously shaking each other’s hands. The leader sends all the signatures it received in the previous stage as part of an AppendEntries message to the backups. The backups also need to validate this entry before applying it to their stable logs, and once they have, they send a message to the client. When the client has received f + 1 identical messages, the commit is successful.
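From the client’s side, “identical” presumably means byte-identical replies from distinct nodes; f + 1 matching replies guarantee at least one of them came from an honest node. Roughly:
```go
// replyTracker counts matching replies per digest until some reply has
// been vouched for by f+1 distinct nodes.
type replyTracker struct {
	f     int
	votes map[string]map[string]bool // reply digest -> set of node IDs
}

// committed records a reply and reports whether it is now backed by
// f+1 distinct nodes, at least one of which must be honest.
func (r *replyTracker) committed(digest, nodeID string) bool {
	if r.votes[digest] == nil {
		r.votes[digest] = map[string]bool{}
	}
	r.votes[digest][nodeID] = true
	return len(r.votes[digest]) >= r.f+1
}
```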
Opinions from a three-time college dropout

Are you proud of me, dad?
I realize I am from THE privileged class of being an executive and therefore could naturally speak at Stanford without having to go to college. However, after reading this, I am left with some questions.
The paper speaks of three phases, one of them being the “rollup” phase, which I am either conveniently missing or which is never expanded upon. I will assume it means something similar to rollout, as in replicating to followers and backup nodes. (See 5.3 Log Replication of the original Raft paper.)
It states that it reduces data by creating a single entry for multiple transactions which doesn’t need previous entry numbers from the leader, but I don’t believe that removing an integer field will be that impactful. The paper shies away from discussing queueing to keep things understandable, but as a reader, the part that most needs expanding to be understandable is the coordination of signatures and trust.
The key item in Raft-BFT is that everything is signed. Not only do we need to trust that every message we receive is valid, but for every step we need 2f + 1 signatures and validation that the step is trusted. Attaching a signature to every message would drastically increase network traffic. Also, how do we trust that the signatures themselves are valid? Either there needs to be a management plane acting as a trusted third party to distribute keys (a Vault, for example) and serve as a source of truth, or one server must be trusted on startup to distribute keys to the others.
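To be fair, the signing itself is the cheap, boring part; it’s the coordination that costs. A complete ed25519 round trip using Go’s standard library, which conveniently dodges the key-distribution problem by generating the keypair locally:
```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"fmt"
)

func main() {
	// In a real cluster this keypair would have to be distributed by
	// some trusted party; here we just generate it and trust ourselves.
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		panic(err)
	}

	msg := []byte("attack at dawn")
	sig := ed25519.Sign(priv, msg)

	fmt.Println("valid:", ed25519.Verify(pub, msg, sig))                // true
	fmt.Println("forged:", ed25519.Verify(pub, []byte("retreat"), sig)) // false
}
```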
If you pair Raft with a Gossip Protocol then you may be able to achieve BFT - just as long as the Gossip Protocol leverages some way to establish shared trust.
It does draw us back to reality, though, where security start-ups are once again pushing “zero trust”. They’re happy to take your money for hardened whatevers without so much as mentioning that investing in BFT may cost the same as their subscriptions while providing more resiliency.
Resources: