Swatinem Blog Resume

Finding a usable sans-io pattern

— 5 min

In my last blogpost about State Machine Replication, I discovered the interesting property that you can think of a state machine replication (SMR) protocol as a state machine itself. One that receives messages from the network, and sends messages to other replicas, along with handling some timeout logic.

I mentioned that this lends itself very well to the sans-io pattern, so I want to explore and elaborate that idea a bit further. I have started tinkering in a not-yet-public repo with these ideas, and I already implemented the sans-io bits in at least two different ways. So lets talk about it.


On the topic of sans-io, I would highly recommend this blog post. It gives a deep dive, along with a simple implementation of a sans-io protocol implementation.

The essence of the blog post is a sans-io implementation with essentially the following functions that abstract over input/output and time, and a struct representing a message:

impl SansIo {
    fn handle_input(&mut self, packet: &[u8], now: Instant);
    fn handle_timeout(&mut self, now: Instant);
    fn poll_transmit(&mut self) -> Option<Transmit>;
    fn poll_timeout(&self) -> Option<Instant>;
}

pub struct Transmit {
    dst: SocketAddr,
    payload: Vec<u8>
}

Now, I think there is at least three things with that particular pattern that I would do differently, or improve if you will.

First, having four different functions, for handling, and emitting both messages and timeouts is good from an illustration perspective, the duplication of the now: Instant parameter on the handle_input and handle_timeout methods kind of gives you a hint that maybe its a good idea to just have one method there instead. In fact, the handle_XXX methods do not return anything, and thats why you have the poll_XXX methods.

Second, I think that a Rust Instant is very opaque, and hard to work with in particular in tests. Instead of using Instance, I would rather work with Duration, as you can more easily construct one, so it is better to use for testing in my opinion.

Last but not least, the Transmit type has a SocketAddr, which is too specific, and a Vec<u8> as payload, which is too opaque. Mind you, those might be the right choice for implementing a STUN protocol which that blogpost is all about. Though for my usecase, I would rather change this to an opaque sender/recipient Id, but a fully concrete message type on the other hand.

These changes would give the driver / event loop the freedom to chose whichever transport mechanism, and message serialization format it wants. Along with just not going through a network at all, but staying within a single process for testing.


With all that, I would suggest the following interface for my sans-io implementation:

impl Replica<M> {
    pub fn tick(&mut self, messages: Vec<Message<M>>, elapsed: Duration)
        -> (Vec<Message<M>>, Duration);
}

pub struct Message<M> {
    remote: RemoteId,
    message: M
}

This does put a bit more responsibility on the driver. It does have to manage a mapping between an opaque RemoteId (which itself could just be a usize). Also for my example, I’m using the same type for the input and output messages. As my goal is to implement replication, the output of one replica is the input to another, so thats a good fit. Though there is a bit of more complexity involves when it comes to client communication.

An event-loop could roughly look like this (just pseudo-code):

let mut replica = todo!();
let mut messages = vec![];
let mut last_tick = Instant::now();
loop {
    let elapsed = last_tick.elapsed();
    last_tick = Instant::now();

    let mut timeout;
    (messages, timeout) = replica.tick(messages, elapsed);

    for Message { remote, message } in messages.drain(..) {
        network.send(remote, message);
    }

    while let Some(message) = network.try_recv() {
        messages.push(message);
    }
    if let Some(message) = network.recv_deadline(last_tick + timeout) {
        messages.push(message)
    }
}

This is just a made-up toy example, but it also highlights a couple of optimizations I’m after. First, it is reusing the list of messages for send/receive, and second, it is using try_recv/recv_deadline to process as many messages in one go as directly available within a receive buffer.

# Return-value or out-parameter

As I mentioned earlier, I already went through a couple iterations of ideas. One idea has been inspired by the stateright crate, which is a model checker that is well suited for state machines, and networked/replication protocols in particular.

Stateright defines an Actor trait, and an Out struct, which look a little bit like this (heavily simplified):

trait Actor {
    fn on_msg(&mut self, src: Id, msg: Message, out: &mut Out);
    fn on_timeout(&mut self, timer: Timer,  out: &mut Out);
}

impl Out {
    fn set_timer(&mut self, timer: Timer, duration: Duration);
    fn send(&mut self, recipient: Id, msg: Message);
}

We can see a very simple a very similar sans-io pattern here. We have two methods to handle messages and timeouts respectively, but instead of returning messages and new timeouts to be set, it is using an out-parameter approach.

This is also an interesting approach, as it would free up the implementation to not having to worry about buffering messages, and dealing with the associated allocations.

However, after trying the out-parameter approach myself, it became quite clear that it is a pain to actually work with at scale, as that out-parameter is quite infectious, and you have to thread it through all the functions, which becomes very annoying very soon. In particular if it is not a concrete type, but rather a trait impl.


In the end, I think the approach I chose with just taking and returning a list of messages should be flexible enough to also adapt to stateright, so I don’t think there is anything fundamentally wrong with it.

It is dealing with concrete types, which are easier to handle than generics.

One thing that I haven’t yet thought about is abstracting randomness though. Stateright does offer methods for that though.