Why I hate CAP theorem and why you should too
When it comes to distributed systems, the CAP theorem is highly ubiquitus. Named after its creator Eric Brewer, this theorem states that it is impossible for a distributed system to simultaneously provide more than two of the following three guarantees:
- Consistency (C)
- Availability (A)
- Partition tolerance (P)
Except it doesn’t. This interpretation just doesn’t make sense, and I’ll tell you why.
First, let’s start with consistency (C). A system is consistent if all nodes see the same data at the same time. And every read operation guarantees that it fetches up-to-date information regardless which node the data was written to or being read from. If a data has been written successfully, all subsequent reads must fetch that information.
Second, availability (A) refers to how often the system is available for use. If a node goes down, or network partitions occur, then the system as a whole must still be able to function correctly and provide access to data.
Lastly, partition tolerance (P) ensures that even in cases of network failure or unexpected outages, the distributed system can continue operating without issue. It’s important to note that P does not imply zero downtime - rather it means that when failures do happen, they are handled gracefully by the system so users experience minimal impact.
The three musketeers of CAP Theorem - CA, CP and AP
Let’s start with all the situation and what they say about a DB system
- CA - If a distributed system is Consistent and Available, but not partition tolerant, that means it can’t handle partition failures at all, effectively, it’s a single node system, i.e., it’s no longer distributed
- CP - If a distributed system is Consistent and Partition tolerant, but not available, that means, when a partition failure does happen, the affected partition becomes unavailable.
- AP - f a distributed system is Available and Partition tolerant, but not consistent, that means, in case of a partition failure, it maintains availability at the risk of returning stale data.
The Problem with the CAP Theorem
If you look at the above scenarios, the CA scenario doesn’t make sense. Why would a theorem about distributed systems include a non-distributed component? I mean, if a system is CA, that effectively means it doesn’t handle partitions at all, because if a system does handle partitions, it needs to do something to handle partition failures, right?
A Better interpretation of CAP Theorem
The aforementioned definition, while ubiquitous, is misleading. A better, more accurate interpretation of CAP Theorem is as follows
In case of a Partition Failure, a distributed system can either choose to be available or consistent but not both.
The above definition makes intuitive sense, but it isn’t what is being taught all over the world. And to add water to sodium metal, CAP Theorem applies only in the worst case. This is the reason I hate CAP Theorem, the theorem itself is benign, but the way people (mis)interpret it, and the way it is still being asked in interviews and all is kind of ridiculous.
CAP Theorem misses a key factor about Distributed Systems - Latency
When evaluating a distributed systems’ performance, one common question keeps popping up - How does it handle latency? And CAP Theorem does not address this question at all.
And if we agree on the better, simpler definition of CAP Theorem, then, first of all, the acronym is backwards, because PAC suits the definition a hell lot more than CAP. This leads us to…
PACELC Theorem: The solution to all your CAP Problems
PACELC is a newer, better and more robust framework to evaluate distributed systems, it takes into account partition failures as well as latency.
PACELC simply states,
In case of a Partition failure (P), we can choose Availability (A) or Consistency (C), Else (E), we can choose Latency (L) or Consistency (C)
What does it mean? PACELC is simply an extension and beautification of CAP Theorem to make it more useful.
PACELC helps us in two different scenarios, first, how does it handle partition failures (worst case) and how does it handle latency (general case).
Let’s imagine twitter, the number of likes in a tweet isn’t that important, so, we can easily make it eventually consistent, and instead prefer latency, and in case of a partition failure, we should choose availability. Thus, we need a PA/EL system.
On the other hand, let’s say, you are requesting your bank statement, will you rather have your statement reflect a stale value just to receive it faster? And will you rather have a failed statement or a partial statement leading to incorrect balance? As the answer to both of these is No, a bank needs a PC/EC system.
There are scenarios when we need a PC/EL system as well. Let’s say, you have a ticketing system, its fine, if your booking status gets confirmed or declined after some time, as long as it is consistent, so, we might prefer latency normally, but during partition failure, we absolutely prefer consistency.
Conclusion: CAP Theorem is a relic of the past, use PACELC
The CAP Theorem has rightfully secured its place in the heart of the general population. But its utility in a modern day distributed system is highly limited at best, and absolute nightmarish at worst.
But my biggest criticism is the zombiesque rote-learning of the theorem by the general population at large, without paying much thought about the broader system.