Playing around with replication
About two months ago, I was joking with colleagues that I wanted to build service X, which should be a fault-tolerant, highly available service. But I didn’t want to build the replication needed for it.
Because it is hard to do, and there are a ton of anecdotes on the internet proving as much: it is really hard to create safe and correct replication.
But as luck would have it, maybe I will implement something myself anyway. At the very least as a learning exercise, as I do believe that implementing something yourself is the best way to learn how it works.
So I have been reading up on State Machine Replication (SMR) for the last couple of weeks, and experimenting with implementing one, or even more of these protocols. And oh boy, I’m very disappointed with the state of things. Here I want to summarize what I have learned, and rant a little bit about what I didn’t like. This is not a guide on how to implement a replication protocol yourself, and unfortunately, I can’t offer a readable, understandable, and correct project either.
# What is SMR?
State Machine Replication, or Replicated State Machines (RSM), are, well, state machines. Which means they hold some internal state, and can deterministically change that state in reaction to outside operations/commands.
One simple example we could think of is just a number register, or accumulator. The internal state is just a number, and the operations could be `add(N)`, `subtract(N)`, `clear`, `multiply(N)`, etc. Given an initial state (`0`), you can apply operations one after another, and you will always end up with the same state.
You can already see in this example that these operations are order-dependent: they have to be executed in a specific order. Only then can you guarantee a deterministic outcome.
This order is often established by a leader, which appends operations to a log, and replicates that log to other replicas. Once enough replicas have received those operations, we can consider them committed and actually apply those operations to the underlying state machine.
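To make that concrete, here is a tiny sketch of the accumulator example in Rust (the names and exact operations are mine, not from any particular paper): the same log applied in the same order always yields the same state, and reordering it does not.

```rust
// A minimal sketch of the accumulator state machine from above,
// plus the log replay every replica would perform.
#[derive(Debug, Clone, Copy)]
enum Op {
    Add(i64),
    Subtract(i64),
    Multiply(i64),
    Clear,
}

fn apply(state: i64, op: Op) -> i64 {
    match op {
        Op::Add(n) => state + n,
        Op::Subtract(n) => state - n,
        Op::Multiply(n) => state * n,
        Op::Clear => 0,
    }
}

fn main() {
    // The replicated log: every replica applies the same ops in the same order.
    let log = [Op::Add(5), Op::Multiply(3), Op::Subtract(2)];
    let state = log.iter().fold(0, |s, &op| apply(s, op));
    assert_eq!(state, 13);

    // Order matters: swapping two entries changes the result.
    let reordered = [Op::Multiply(3), Op::Add(5), Op::Subtract(2)];
    let other = reordered.iter().fold(0, |s, &op| apply(s, op));
    assert_eq!(other, 3);
}
```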
There is also leaderless SMR, if you actually have an underlying state machine that is either fully or partially order-independent.
Most of the protocols agree on the following assumptions:
- We assume an unreliable network, meaning messages between replicas can be delayed, dropped, delivered out of order, or delivered multiple times.
- We assume there are no malicious replicas that intentionally send wrong messages (so-called Byzantine faults).
- We want our system to handle, and continue to function correctly with, up to `f` failures (we will discuss what a failure is).
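For context, this is the standard majority-quorum arithmetic these protocols use, not something specific to one paper: tolerating `f` crash failures means running `2f + 1` replicas, and a majority of them (`f + 1`) has to acknowledge an operation before it counts as committed.

```rust
// Majority-quorum sizing used by Paxos/Raft/VR-style protocols (crash faults only):
// to tolerate f failed replicas you need n = 2f + 1 replicas in total,
// and a majority of them must acknowledge an operation.
fn replicas_needed(f: usize) -> usize {
    2 * f + 1
}

fn quorum(n: usize) -> usize {
    n / 2 + 1
}

fn main() {
    let f = 2;
    let n = replicas_needed(f);
    assert_eq!(n, 5);         // 5 replicas to tolerate 2 failures
    assert_eq!(quorum(n), 3); // any 3 of the 5 form a majority
}
```

With 5 replicas you can lose any 2 and the remaining 3 can still make progress.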
There are three big schools of thought: Paxos, Raft, and Viewstamped Replication.
Paxos is always said to be hard to understand, and it is also very bare-bones. The original Paxos does not even talk about replicating a log, but only about agreeing on a single value. Raft aims specifically to be easier to understand and teachable, though I might disagree a bit. Viewstamped Replication (VR) is the third in the group, and it has some interesting properties.
Of the three, Paxos is probably the most studied one in academia, as you will find a plethora of X-Paxos papers, extending it in some form or another. E(galitarian)-Paxos is one example, a leaderless protocol replicating mostly order-independent operations. There are a ton more. Though surprisingly, not for the other two major protocols.
There are also some fundamental disagreements between these schools of thought. For example, Raft emphasizes that a replicated log entry absolutely has to be written to stable storage, whereas Viewstamped Replication is supposed to run completely in memory.
The question here is really: what do you consider a valid failure, and how do you recover from it? One reason given why Raft needs stable storage is that a replica has to remember what it did prior to a crash. VR has also been shown to be incorrect when a replica rejoins with stale, non-up-to-date state.
If a Raft node fails and has partial memory loss, it cannot safely rejoin; instead, you have to replace it with a completely new replica and remove the failed one entirely. There are also ways to fix VR in that regard, so a replica can safely rejoin after partial memory loss.
Another reason given why Raft needs stable storage is durability. Once the cluster replies that an operation is committed, it should not be lost, no matter what. Where the "no matter what" in Raft includes all the replicas going down at once. Which is not really what fault tolerant up to `f` failures means, now, does it?

If we turn this around and are talking about an always-available system, tolerating up to `f` failures can also be rephrased as having `N-f` replicas, well, always available. In which case VR being advertised as working purely in-memory works out just fine.
# Batteries included?
We have talked a bit about recovering from a fault already. But that is something that I believe most of the Paxos-related literature (and I must admit, I have only read EPaxos) does not talk about at all.
The whole topic of SMR covers a bunch of sub-topics, which some of the papers also call sub-protocols (a rough sketch of how these could show up in an implementation follows the list). They deal with:
- Replication itself, making sure operations are appended to the log and replicated across the replicas.
- Leader-election, which covers agreeing on who the leader is that maintains the order of said log.
- The mentioned recovery, or what to do when a replica has missed some log updates, or has failed completely.
- Reconfiguration, which deals with changes to the cluster, like adding or removing replicas.
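Just to illustrate how these sub-protocols might surface in code, here is a hypothetical grouping of the messages a replica would have to handle. None of these names or fields come from a specific paper; Raft, VR, and the Paxos variants each use their own.

```rust
// Hypothetical message types, grouped by sub-protocol.
enum Message {
    // Replication: the leader appends and replicates log entries.
    Append { term: u64, entries: Vec<Entry> },
    AppendAck { term: u64, last_index: u64 },

    // Leader election / view change: agreeing on who orders the log.
    RequestVote { term: u64, last_index: u64 },
    Vote { term: u64, granted: bool },

    // Recovery / state transfer: a replica catching up after a restart.
    NeedState { from_index: u64 },
    StateChunk { snapshot_part: Vec<u8>, done: bool },

    // Reconfiguration: adding or removing replicas.
    ChangeMembers { new_members: Vec<ReplicaId> },
}

// Placeholder types so the sketch compiles.
struct Entry;
type ReplicaId = u32;
```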
Here, Raft and VR are batteries-included, meaning they cover all those topics, even though some only as an afterthought.
Leaderless protocols obviously skip leader election, as that becomes irrelevant there. But most of those papers also completely gloss over recovery and reconfiguration.
VR is interesting in the sense that it splits recovery up further into what it calls state transfer. I don’t really understand the reasoning behind that. Whether a replica has to catch up on just a small tail of the log it has missed because of a restart, or the complete log has been lost because of a disk failure, both are pretty much the same thing, conceptually.
Though doing state transfer or recovery efficiently is quite challenging. Raft tries to answer this by introducing the concept of snapshots, and transferring those snapshots. VR just mentions this topic superficially, but does not give an actual implementation.
I think this is a very important piece of the puzzle for a real-life system, one that cannot be neglected. Some parts of the VR protocol transfer the whole log within a message, which is just not possible in reality. Also, a monotonically growing log is often infeasible, and a proper way to compact or prune the log should be baked into these protocols, not left as an afterthought.
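To make the compaction point concrete, this is roughly the shape Raft-style snapshotting takes (the names and details here are my own simplification): once the state machine has been snapshotted up to some log index, the log prefix up to that index can be discarded, and a lagging replica gets the snapshot plus the remaining tail.

```rust
// Minimal sketch of snapshot-based log compaction. A snapshot captures the
// state machine up to `last_included_index`, so all entries up to and
// including that index can be dropped from the log. Indices are 1-based.
#[derive(Clone)]
struct Entry; // placeholder for a real log entry / operation

struct Snapshot {
    last_included_index: u64, // the snapshot covers the log up to this index
    state: i64,               // e.g. the accumulator value at that point
}

struct Log {
    first_index: u64, // log index of entries[0]
    entries: Vec<Entry>,
}

impl Log {
    /// Drop every entry that the snapshot already covers.
    fn compact(&mut self, snap: &Snapshot) {
        if snap.last_included_index >= self.first_index {
            let covered = (snap.last_included_index - self.first_index + 1) as usize;
            self.entries.drain(..covered.min(self.entries.len()));
            self.first_index = snap.last_included_index + 1;
        }
    }
}

fn main() {
    let mut log = Log { first_index: 1, entries: vec![Entry; 5] };
    let snap = Snapshot { last_included_index: 3, state: 42 };
    log.compact(&snap);
    // Entries 1..=3 are gone; a lagging replica now gets the snapshot plus entries 4 and 5.
    assert_eq!(log.first_index, 4);
    assert_eq!(log.entries.len(), 2);
}
```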
A bit related to all of this is the topic of reconfiguration, or how to add new replicas to the cluster. Those replicas
need to be brought up to speed by conceptually transmitting the whole log, or a mix of snapshot and log tail, if we
want to focus more on a real world scenario.
These snapshots are also very much application-dependent. In our starting example above, it’s trivially just a number. But real-world systems have a lot more they have to take care of, and snapshots can be a lot more complex than a `Vec<u8>`. And depending on the application, they can be gigabytes or even terabytes big, and they definitely cannot be transmitted in a single message.
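One way to deal with that is to stream a snapshot in chunks, similar in spirit to how the Raft paper describes its InstallSnapshot RPC (offset, data, done). The exact shape below is made up by me.

```rust
// A made-up chunked snapshot transfer. Each chunk has to fit comfortably
// into a single network message.
struct SnapshotChunk {
    last_included_index: u64, // which snapshot this chunk belongs to
    offset: u64,              // byte offset of `data` within the snapshot
    data: Vec<u8>,
    done: bool,               // true for the final chunk
}

struct IncomingSnapshot {
    buf: Vec<u8>,
}

impl IncomingSnapshot {
    // Returns the completed snapshot bytes once the final chunk has arrived.
    fn receive(&mut self, chunk: SnapshotChunk) -> Option<Vec<u8>> {
        // A real implementation has to handle re-delivered and out-of-order
        // chunks; here we assume they arrive in order, exactly once.
        debug_assert_eq!(chunk.offset as usize, self.buf.len());
        self.buf.extend_from_slice(&chunk.data);
        chunk.done.then(|| std::mem::take(&mut self.buf))
    }
}
```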
# State machines all the way down
Speaking of which, one of the core assumptions these protocols are built around is the idea of an unreliable network, which can drop, delay, reorder, or re-deliver messages. But none of them talk about fragmentation of messages. In my last blog post, I briefly mentioned the MTU (maximum transmission unit), which is the largest number of bytes that can be transmitted as a single network packet without being fragmented into multiple, where each one of those fragments could individually be dropped, delayed, delivered multiple times, or out of order.
Transmitting the complete log like in VR, or a snapshot like in Raft in a single message is just not going to fly.
Raft is also explained in terms of RPCs (remote procedure calls), which kind of implies that they are bidirectional request/response pairs. This is also a deviation from the fundamental assumptions this group of papers makes about the network.
The style in which most of these papers are written is, I think, also way too high-level. The pseudo-code given in some papers actually assumes a request-response style, like `broadcast message M, wait for (f+1) replies`.
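In a message-passing model, that "wait for `f+1` replies" is not a blocking call but a piece of state the replica has to carry around and update as acknowledgements trickle in. Roughly like this (my own sketch, not from any paper):

```rust
// The "wait for f+1 replies" from the pseudo-code, turned into explicit
// state: the replica just counts acknowledgements as they happen to arrive.
struct PendingOp {
    acks: std::collections::HashSet<u32>, // replica ids that acknowledged
    quorum: usize,                        // f + 1
    committed: bool,
}

impl PendingOp {
    // Called for every incoming ack; returns true the moment a quorum is reached.
    fn on_ack(&mut self, from: u32) -> bool {
        self.acks.insert(from); // duplicate acks from re-delivered messages are harmless
        if !self.committed && self.acks.len() >= self.quorum {
            self.committed = true;
            return true;
        }
        false
    }
}
```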
A somewhat related question is what kind of networking layer one would want to build this on. TCP handles reliable transmission for us already, so we probably don’t need to handle some of the finer details ourselves, while UDP does not deal with packet fragmentation well. Not sure.
I think what I would like to see here is that we should rather embrace:
Yo dawg, I heard you like state machines. So I built a state machine that can replicate your state machine.
Or phrased differently, I would love these papers to rather think about, and more importantly explain the algorithms/protocols in terms of a state machine. Given the current state and a side effect (either an incoming message, or a timeout), how will the state change, and what side effects (outgoing messages, registering timeouts) will it cause?
I believe that explaining these protocols as state machines themselves would make it much easier to understand, implement, and also verify them.
Some papers only have prose, others have some pseudo-code which is not readily translatable to a real programming language. Yet more papers have TLA+ specifications, a popular formal specification language with a model checker. Which is also a completely different language, and cannot be translated directly into real code.
It is nice that a protocol is formally verified with something like TLA+, but it is also completely meaningless if what is being verified is something completely different from what is actually running in the real world.
Another benefit of modelling the protocol as a state machine is that it becomes deterministic itself, and lends itself to a more generic, sans-IO implementation. If we could define these things as:
`fn tick(state: State, effects: SideEffects) -> (State, SideEffects);`
We could abstract all the networking, timeouts, etc, as incoming and outgoing side-effects. A lot of these papers gloss over some very fine details, which make actually implementing one of the protocols a real pain.
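To sketch what I mean (every type and name here is made up by me, this is not an existing crate’s API), the protocol core never touches a socket or a clock; it only consumes inputs and emits outputs:

```rust
// A made-up sans-IO shape for a replication protocol core: the state machine
// only consumes side effects coming in (messages, expired timeouts, client
// requests) and produces side effects going out (messages to send, timeouts
// to register). All actual IO lives outside of it.
enum Input {
    Message { from: u32, payload: Vec<u8> },
    TimeoutFired { id: u64 },
    ClientRequest { op: Vec<u8> },
}

enum Output {
    Send { to: u32, payload: Vec<u8> },
    RegisterTimeout { id: u64, after_ms: u64 },
    Committed { op: Vec<u8> },
}

struct State {
    // view/term, log, commit index, ... whatever the concrete protocol needs
}

// The whole protocol boils down to one deterministic function, which makes it
// easy to test (feed it inputs, check the outputs) and to drive from any
// event loop.
fn tick(state: State, input: Input) -> (State, Vec<Output>) {
    match input {
        Input::ClientRequest { .. } => (state, vec![/* append, replicate, ... */]),
        Input::Message { .. } => (state, vec![]),
        Input::TimeoutFired { .. } => (state, vec![]),
    }
}
```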
# Where to go from here?
As I said, I have been reading quite a bit about this topic, and I have tried, multiple times even, to implement such a protocol myself. I tried implementing EPaxos, but hit a wall because I didn’t want to deal with conflict resolution, and because EPaxos is only about replication, not about recovery or reconfiguration at all.
I then shifted my focus to VR. The main selling point for me is the idea that it should supposedly work only in-memory, without having to commit everything to stable storage. I also think that the leader election part in VR is a bit simpler, as there is no election at all. VR is more like a hereditary monarchy than a democracy: you don’t need to elect a new ruler, as the next succession is predetermined. All replicas just have to agree.
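That predetermined succession is quite literal in VR (in the "VR Revisited" formulation): the primary for a view is just a function of the view number, so every replica can compute who the next leader is.

```rust
// In Viewstamped Replication, the primary for a given view is picked
// round-robin from the replica list, so there is nothing to elect.
fn primary_for_view(view_number: u64, replica_count: u64) -> u64 {
    view_number % replica_count
}

fn main() {
    let n = 3;
    assert_eq!(primary_for_view(0, n), 0);
    assert_eq!(primary_for_view(1, n), 1); // after one view change
    assert_eq!(primary_for_view(4, n), 1); // it wraps around
}
```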
I think this might make graceful shutdowns easier as well, as the current leader could just pass on its role to the next in line during its graceful shutdown sequence. I was also thinking about sharding, in which case it should be easier to rebalance by intentionally redistributing the leader-role and thus the load across the replicas. But hey, those are just ideas and I haven’t actually implemented something that is good enough to share with the world.
Most recently, I have also looked into using a ready-made Raft crate, `openraft` in my case. Though this crate also makes some interesting design choices. Some may be related to Raft itself, like how reconfiguration (membership change) is being handled, or how the state and log have to be persisted. The crate tries to bake in the concept of snapshots and log compaction/truncation, but it does so in a way that I do not fully understand. It is also based around the assumption that a snapshot is a `Vec<u8>`, with a feature flag that can relax that assumption a little, though I also don’t quite understand how those things are supposed to work.
So here I am. I have read a lot about this topic, but I am also quite unsatisfied with the state of how things are looking. The papers I read are really hard to understand. They gloss over a lot of details, and pseudo-code is either missing completely, incomplete, or at best written in a style that does not fit a state machine well.
But hey, I’m still interested in exploring this a bit further. So maybe at some point I might end up with a sans-IO state machine implementing state machine replication. Or maybe I will just lose interest and give up.
So long…