Placid ACID or embrace BASE? Mind the CAP

SQL, ACID and BASE

The term “NoSQL” has been used for several years referring to non-relational databases. Some of the concepts attached with NoSQL implied schema-less and scaling out capabilities. However, relational databases (RDBMSs) are adding features like JSON compatibility, that can make them schema-free, as well as architecture designs involving clustering and partitioning. On the other hand, NoSQL DBMSs are adding features from RDBMSs, such as higher levels of isolation and durability.

SQL is actually a defined standard language, typically related to RDBMSs, but some of the labeled “NoSQL” products offer a SQL-like language also.

Therefore, the term “NoSQL” is starting to get blurred and soon will be not useful as a product distinction. Nevertheless, there are a couple of terms that still remain (and will do) as a differential factor between two worlds.

RDBMSs are, or must be, ACID in order to operate efficiently and accurately. The term stands for atomicity, consistency, isolation and durability.

In contrast with ACID, we have non-relational databases, with a set of properties summed up in BASE: basically available, soft-state and eventual consistency.

The low pH of a chemical substance implies a high concentration of Hydrogen atoms. Acids are substances with a sub-pair (< 7) pH, while bases has a high (> 7)  pH. To keep on with the analogy, the high concentration of Hydrogen could be seen as a high number of checking operations in ACID databases, while BASE systems avoid this kind of checks because they have looser requirements of consistency and availability. These concepts drive us to the next acronym.

CAP

The CAP theorem states that there are three requirements needed for a right design in distributed computing systems but, sadly, it is not possible for the system to satisfy all of them.

  • Consistency.  C in CAP has nothing to do with the C in ACID (it has more to do with Atomic and Isolation, indeed). A system is consistent if an update is applied to all nodes at the same time.
  • Availability.  It is defined as “every request received by a non-failing database node must result in a non-error response, without guarantee that it contains the most recent write”.
  • Partition Tolerance. Some definitions of the concept state that the system keeps on working even if some nodes, or the connection between two of them, fail. What is most important about that is that P in CAP can’t be excluded by decision, since a multi-node system can not be designed to run on a network that never drops messages and whose nodes can’t fail.

If there is a two-node database (examples with more nodes, or even between datacenters, are also valid), with replication between both nodes, and a network partition occurs, there are two possibilities:

  1. Writes to the database are allowed, so both nodes remain fully available. However, as long as the replication has been interrupted, any changes that are written in one node will not appear in the other one. This election violates CAP-consistency.
  2. If the election is not to lose consistency, all the reads and writes must be made in one (master) node. The other node can not be up-to-date, due to the network partition, so it must stop accepting reads and writes until the replication is restored and the database is in sync again. Thus, although the non-master node has not failed, it can not process requests, so it is not CAP-available.

This is a typical graph to show the classification of databases according with the election made to satisfy the CAP theorem:

 

 

Availability, in a practice way, does not correspond to CAP-availability. Applications’ availability is often measured with some SLA (X% of requests must return a successful response within Y seconds), but such an SLA can be reached also with CAP-unavailable systems.

Concrete definitions of CAP-consistency and CAP availability make invalid some classifications of CP and AP systems (for instance, MongoDB can’t be considered as consistent at all). Some of them are just P, which means that they keep working in the case of a network partition, but with no CAP-consistency neither CAP-availability. Since both concepts are rather strong, it does not imply that systems unable to satisfy CP or CA are bad, or worse than those that do it. They have their features and capabilities, and they can fit for several use cases.

The typical criteria to choose one kind of system over the other include scalability and normalization, as well as “transactionality”, what is sometimes referred as ACID-capabilities.

ACID systems tend to scale worse than BASE systems both from complexity and from performance points of view, but their trade-offs of consistency and availability are less polarized (i.e., a higher level of both concepts can be reached).

BASE systems usually avoid joins and most have no need of a schema (maybe not a key factor, since a majority of cases actually uses it). Typically, they get a burst in throughput by not performing too many check operations during the inserts. On the other side, they usually must set a strong trade-off between consistency and availability.

CONCLUSION

So, as a rule of thumb, the recommendation is to use BASE systems when:

  • scalability is a concern
  • transactionality is not a requirement
  • fashion is a concern (this happens!)

and, then, to choose the CAP combination that better fits you.

 

mm

Raúl Alonso

As a physicist, I've always been interested on how things work. I started my career as a developer and analyst for a consulting company and in 2011 I signed for BBVA. After some years dedicated to project management, since 2015 I am part of the Data Architecture team, where I'm specialized in databases and DaaS.

More Posts