The next step for my humble key value store is to be able to scale reads, the two typical solutions for that are sharding and replication, or even, in some cases, both of them.
In my case I’m going to start adding replication because I think this is going to lead me to learn more things about Distributed Systems. Supporting Replication means that I have to understand at least these things:
- Types of Replication: Master Slave (leader based replication), MultiMaster Replication, Quorum-Based Replication…..
- Failover. What happens when a node crashes? What happens if your replication is Master/Slave and the master goes down?
- How do you solve conflicts? is It possible to avoid conflicts?
- Strong consistency, Eventual consistency, Weak consistency.
For this first approach I’m going to implement Leader based replication, this is probably the most simple case of replication and once I’ve mastered it I’ll move forward to a more difficult model.
One of the advantages of this replication is that I don’t have to worry about conflicts as the Master is the only one accepting writes. This makes the implementation much much easier. When you have to control different versions of the data you have to start worrying about the order of the events and then, synchronization becomes part of your problem. After reading about that you realize that this is quite tricky and solutions like NTP are not going to solve the problem for you, which leads you to vector clocks or version vectors, which seems easy but they aren’t.
When you’re implementing replication one of the first decision you have to make is whether the replication is going to be asynchronous or synchronous.
- If it’s synchronous the database can’t accept any more writes until all the replicas have confirmed the operation. This means that one simple node can stop the database from accepting writes. In exchange you have strong consistency.
This model is also affects the performance of the database. The more replicas you have the more time it takes to complete the write operation and the more likely is to have a problem in a node.
- With asynchronous replication the performance is not affected and the database continue accepting writes even if several replicas go down. The tradeoff is that the replicas can have stale data. t some point this data will be updated but you don’t have any guarantee of when that will happen. This is know as Eventual Consistency.
This model has a big drawback, if the leader crashes and the replica take over, you can have data lost.
There is a mixed model in which the database has synchronous replication with one node and asynchronous replication with the rest. In case the leader goes down the asynchronous replica can take over.
In my case I decided to start implementing asynchronous replication.
Ok, so how do I implement replication?
One option would be statement-based replication where each update is sent to the replicas, then each replica applies the change and that’s it. However this presents some flaws:
- What happen when a node couldn’t apply the changes due to a problem? e.g a connectivity problem. In this case if the node just continues receiving the last change It would have inconsistent data, it wouldn’t have the changes that were sent while it was down.
- If we just send the last change we couldn’t start new nodes on demand. As they will only have the data from the moment in which they joined the cluster.
- In case of failover It would be really difficult to decide which node has to take over as they would have inconsistent data.
To solve these problems each node should receive the all the changes after the last one they have confirmed. With all this in mind it is clear that we need to store the changes and we need to know the time order of the changes.
In addition to that as of today if the leader goes down there is no way to recover the data because everything is in memory. So I would also need the stored changes to recover a node from a crash. In fact I would need this even if the database weren’t distributed.
The common way of doing this is using a write ahead log where every operation is stored in the log and then applied to the database. The file is append-only so operations are much faster.
So I’m going to kill two birds with one stone and implement something similar to BitCask for Riak, this will be used both for replication and crash recovery.
- We store every key and value in an append-only file, so that operations are much much faster and simple.
- The in-memory hashmap works as an Index for byte offsets. Every key in the hashmap points to an offset in the file.
- Every write that arrives will only do the above operations. No replication operations will be done here.
- Once a replica receives a change it will apply it to its storage and it will store the offset in the file. Remember that it is append-only so this data represents where this replica is in the feed of changes.
- There will be a background process polling each node every X ms:
- Tell me you’re last offset
- Send to the replica every change since that offset. This means that we need to do random reads to that file which will affect the performance. (Should I have an In-Memory cache for the last N changes?)
This solution has the advantage which is that we don’t even have to wait for the ack, as every X ms we will start again sending all the changes a replica needs.
There are quite a few things that BitCask does that I’m not going to implement for now, for example in my implementation the file grows for ever, I would have to do merge and compaction.
Designing Data Intensive Application http://dataintensive.net/
An Introduction to Distributed Systems http://webdam.inria.fr/Jorge/html/wdmch15.html
Eventually Consistent http://www.allthingsdistributed.com/2007/12/eventually_consistent.html
Conflict Resolution http://pl.atyp.us/wordpress/index.php/2010/03/conflict-resolution/
The trouble with timestamps https://aphyr.com/posts/299-the-trouble-with-timestamps
Why vectors clock are hard http://basho.com/posts/technical/why-vector-clocks-are-hard/