Depending on the circles you travel in, you might have heard people talking about ditching their relational database management systems (RDBMS) and moving on to a different class of data stores… something they call “NoSQL”. The first time I heard of it was a few years ago (I don’t remember exactly when or how) and, although it seemed like a relevant and interesting concept, it wasn’t interesting enough for me to free up some time to dig deeper. It all changed five months ago after I “bumped” into an unexpected and very interesting opinion (more about that shortly). Until that moment I was asking myself questions like “When should I start using a NoSQL data store?” and “What NoSQL product should I choose?”; from that moment on the questions were “What is NoSQL?”, “What’s the reason for this movement?” and “Do NoSQL data stores fit my needs better than traditional SQL ones do?”
This post is a summary of some of the knowledge I gained through the process of “digging deeper” into the world of relational and non-relational data stores. So if you are confused and wondering what this NoSQL movement is all about (exactly where I was not so long ago), or if this concept is all news to you and you’re only being curious, or even if you have mastered this world and are just looking for a victim to criticize, then I hope this post will help you fulfill your needs… We start with the original NoSQL.
The original NoSQL
Back in the 90’s, Carlo Strozzi (a member of the Italian Linux society) decided to implement his own breed of a relational database management system. The requirements were for a simple, lightweight, reliable, not heavily featured relational database. Strozzi also wanted his database to have a special “shell-level approach”, one that follows the philosophy described in an article titled “The UNIX Shell As a Fourth Generation Language” [Evan Schaffer and Mike Wolf, 1991]. Here’s a peek:
"UNIX provides hundreds of programs that can be piped together to easily perform almost any function imaginable. Nothing comes close to providing the functions that come standard with UNIX…"
"The shell, extended with a few relational operators, is the fourth generation language most appropriate to the UNIX environment."
That’s exactly what the final product was (and, in fact, still is) all about – an RDBMS accessible through common Unix shell utilities (ls, mv, cp, cat, head, more, less, etc.) instead of traditional SQL… It was named “NoSQL”. When asked about the relation between his NoSQL and the new NoSQL movement, Strozzi replied with this:
NoSQL has been around for more than a decade now and it has nothing to do with the newborn NoSQL Movement, which has been receiving much hype lately. While the former is a well-defined software package, is a relational database to all effects and just does intentionally not use SQL as a query language, the newcomer is mostly a concept (and by no means a novel one either), which departs from the relational model altogether and it should therefore have been called more appropriately "NoREL", or something to that effect.
If you remember, in the first paragraph I wrote about “an unexpected and very interesting opinion” I ran into… well, this is it! This short quote from a man I had never heard of until that moment made me go back to the roots of relational/non-relational databases and learn about the role they play in modern enterprise applications.
For me, the most significant part of this quote is “… mostly a concept (and by no means a novel one either)”, particularly “mostly a concept”. Concepts are problematic because:
- They are open for subjective interpretations.
- There’s nothing that stops a concept from being changed over time (end users aren’t guaranteed to keep the initial feature set or backward compatibility as the product evolves).
- If you’re disappointed with one implementation of the “concept” there’s no promise you’ll get a replacement. Even if you do get one, there’s no standard (like, for example, SQL) to ease your migration pain.
There are other arguments supporting the problematic nature of concepts, but I think these three are enough for you to at least acknowledge that building your product around a NoSQL database involves uncertain risks. That’s a very big deal! Especially when the only thing that is common to most popular NoSQL products is that they have nothing in common.
But still, you can’t argue with success, right? So what’s missing? We’ll get to this in a moment… Let’s first try to understand what Strozzi meant by saying “…by no means a novel one…”
The CAP Theorem
Ten years ago [July 2000], Eric Brewer, a professor in the Computer Science Division at UC Berkeley, said (in what became known as Brewer's Conjecture) that as applications become more web-based we should stop worrying about data consistency, because if we want high availability in these new distributed applications, then guaranteed consistency of data is something we cannot have.
Brewer described three core systemic requirements and the special relationship between them when applied in a distributed system: Consistency, Availability and Partition tolerance (CAP).
Consistency – Means that after data is written to the database, subsequent read operations will always return the latest version of the written data. Database systems without strong consistency, where written data won’t necessarily be available to subsequent read operations right away, are said to support eventual consistency (or weak consistency).
Availability – Means you can always expect the database to be responsive whenever you need it. This is usually accomplished when a large number of physical servers act as a single database (through sharding – splitting the data between various database nodes, or replication – storing multiple copies of each piece of data on different nodes).
Partition tolerance – Means that the database remains operational when parts of it are completely inaccessible (like when the network link between a number of database nodes is interrupted). One way to achieve Partition tolerance involves some sort of a mechanism whereby writes destined for unreachable nodes are sent to nodes that are still accessible. Later, when the failed nodes come back, they receive the writes they missed.
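To make these three definitions a bit more tangible, here is a minimal Python sketch of my own (it is not taken from any real product). The `Node` and `Cluster` classes, the node names and the hash-based placement are all assumptions made up for illustration. Keys are spread over nodes by hashing (sharding), each key is stored on two nodes (replication), and a write aimed at an unreachable node is parked as a "hint" on a node that is still accessible, to be replayed later. A real system would also attach versions or timestamps so that an old hint can't overwrite a newer value; this sketch skips that.

```python
import hashlib


class Node:
    """One in-memory database node (hypothetical, for illustration only)."""

    def __init__(self, name):
        self.name = name
        self.data = {}        # key -> value stored on this node
        self.hints = []       # (target node name, key, value) kept for unreachable peers
        self.reachable = True


class Cluster:
    """A toy cluster combining sharding, replication and hinted writes."""

    def __init__(self, names, replicas=2):
        self.nodes = [Node(name) for name in names]
        self.replicas = replicas

    def owners(self, key):
        # Sharding: hash the key to choose a starting node.
        # Replication: the next `replicas` nodes all hold a copy.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        start = int(digest, 16) % len(self.nodes)
        return [self.nodes[(start + i) % len(self.nodes)]
                for i in range(self.replicas)]

    def write(self, key, value):
        reachable = [node for node in self.nodes if node.reachable]
        if not reachable:
            raise RuntimeError("no reachable node can accept the write")
        for owner in self.owners(key):
            if owner.reachable:
                owner.data[key] = value
            else:
                # The owner is cut off: park the write on an accessible node.
                reachable[0].hints.append((owner.name, key, value))

    def deliver_hints(self):
        # Replay parked writes to every owner that is reachable again.
        by_name = {node.name: node for node in self.nodes}
        for holder in self.nodes:
            if not holder.reachable:
                continue
            still_pending = []
            for target_name, key, value in holder.hints:
                target = by_name[target_name]
                if target.reachable:
                    target.data[key] = value
                else:
                    still_pending.append((target_name, key, value))
            holder.hints = still_pending


def demo():
    cluster = Cluster(["node-a", "node-b", "node-c"], replicas=2)
    cluster.write("user:1", "v1")
    primary, secondary = cluster.owners("user:1")

    secondary.reachable = False          # simulate a network partition
    cluster.write("user:1", "v2")        # the write still succeeds (availability)

    secondary.reachable = True           # the partition heals
    stale = secondary.data["user:1"]     # the replica has not caught up yet
    cluster.deliver_hints()
    fresh = secondary.data["user:1"]     # now it has (eventual consistency)
    return primary.data["user:1"], stale, fresh


print(demo())
```

Running it prints `('v2', 'v1', 'v2')`: the cluster keeps accepting writes during the partition, one replica serves an outdated value for a while, and the copies converge once the missed write is replayed.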
So, web-based applications do not guarantee data consistency… A dramatic statement… But can it really be true? The answer is Yes! Two years later, in 2002, Nancy Lynch and Seth Gilbert of MIT formally proved Brewer to be correct, laying the groundwork for the rise of a new type of data store. To understand what this new type offers and how it differs from the existing one, let’s first review the significant characteristics of an RDBMS.
Historically speaking, databases have almost always tried to implement the relational model and be fully ACID-compliant. It was common to think that if a database’s transactions weren’t ACID, or if the database wasn’t relational, then it wasn’t a “real” database. So what are “relational” and “ACID”?
The ‘R’ in RDBMS
What is it? Amazingly, many of the people working with relational databases fail to answer this question correctly, saying that “relational” describes the way in which tables are related to each other via keys. This is actually wrong! The theory behind the “relational model” (developed by Edgar F. Codd) describes “relation” as “a data structure, which consists of a heading and an unordered set of tuples which share the same type” [Wikipedia]. In other words – a “relation” is what we usually call a “table” (this is the SQL term). This is an important thing to realize since one of the differences between relational and non-relational databases is, well, the “relational” part, which is the way the data is structured.
A relation (based on a Wikipedia image)
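If the definition still feels abstract, the following Python sketch may help. It is only my own illustration, assuming a heading made of attribute names (types are left out for brevity); `Relation`, `make_relation`, `project` and `select` are hypothetical names, not part of any library. The point is that the body is a set: rows have no order and duplicates collapse.

```python
from typing import FrozenSet, NamedTuple, Tuple


class Relation(NamedTuple):
    heading: Tuple[str, ...]   # attribute names
    body: FrozenSet[tuple]     # unordered set of tuples, so no duplicates


def make_relation(heading, rows):
    """Build a relation, checking that every tuple matches the heading."""
    heading = tuple(heading)
    body = set()
    for row in rows:
        row = tuple(row)
        if len(row) != len(heading):
            raise ValueError(f"row {row!r} does not match heading {heading!r}")
        body.add(row)
    return Relation(heading, frozenset(body))


def project(relation, attributes):
    """Keep only the named attributes (the result is again a relation)."""
    positions = [relation.heading.index(name) for name in attributes]
    body = frozenset(tuple(row[i] for i in positions) for row in relation.body)
    return Relation(tuple(attributes), body)


def select(relation, predicate):
    """Keep only the tuples for which predicate(row as a dict) is true."""
    body = frozenset(
        row for row in relation.body
        if predicate(dict(zip(relation.heading, row)))
    )
    return Relation(relation.heading, body)


employees = make_relation(
    ("id", "name", "dept"),
    [(1, "Ann", "R&D"), (2, "Bob", "Sales"), (1, "Ann", "R&D")],  # one duplicate
)
same_employees = make_relation(
    ("id", "name", "dept"),
    [(2, "Bob", "Sales"), (1, "Ann", "R&D")],                     # different order
)
```

Here `employees == same_employees` is true and `len(employees.body)` is 2, because neither row order nor the duplicate matters. Notice that nothing in this structure mentions keys or links between tables.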
What is ACID?
[Daniel Bartholomew, Linux Journal] ACID is the classic measure of determining whether your database is good. A transaction in a database is a single logical operation. An example would be inserting an address or updating a phone number in an employee database. Every database provides methods to do operations like those, but ACID formalizes the process.
Atomicity means that the transaction either succeeds or fails. If the transaction fails, it should fail completely, and the database should be left in the state it was in before the transaction started.
Consistency means that the database is in a known good state both before and after the transaction.
Isolation means that transactions are independent of one another, and if two transactions are trying to modify the same data, one of them must wait for the other to finish before it can begin.
Durability means that once the transaction has completed, the changes made by the transaction will persist, even if there is a system failure. A transaction log of some sort usually is used for this purpose. In MariaDB and MySQL, this is called the binary log.
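Atomicity is the easiest of the four properties to see in action. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the `employee` table, its `UNIQUE` constraint on `phone` and the `update_phones` helper are assumptions I made up for this example. Two phone numbers are updated in one transaction, the second update breaks the constraint, and the whole transaction is rolled back.

```python
import sqlite3


def update_phones(conn, updates):
    """Apply all (employee id, phone) updates in one transaction.

    Returns True if the transaction committed, False if it was rolled back.
    """
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            for emp_id, phone in updates:
                conn.execute(
                    "UPDATE employee SET phone = ? WHERE id = ?",
                    (phone, emp_id),
                )
        return True
    except sqlite3.IntegrityError:
        return False


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE employee ("
        " id INTEGER PRIMARY KEY,"
        " name TEXT NOT NULL,"
        " phone TEXT NOT NULL UNIQUE)"
    )
    with conn:
        conn.executemany(
            "INSERT INTO employee VALUES (?, ?, ?)",
            [(1, "Ann", "555-0100"), (2, "Bob", "555-0101")],
        )

    # The second update violates the UNIQUE constraint, so the first one
    # must be undone as well.
    committed = update_phones(conn, [(1, "555-0199"), (2, "555-0199")])
    phones = [row[0] for row in
              conn.execute("SELECT phone FROM employee ORDER BY id")]
    conn.close()
    return committed, phones


print(demo())
```

This prints `(False, ['555-0100', '555-0101'])`: Ann's number was changed inside the transaction, but because the transaction as a whole failed, the database is left exactly as it was before the transaction started.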
So, now that we know how a “real” database should behave, let’s look at its opposite… What is the opposite of ACID?
The answer is BASE – Basically Available, Soft-state, Eventual consistency. [Daniel Bartholomew, Linux Journal] BASE is a retronym coined by Dan Pritchett in an article in the ACM Queue magazine for describing a database that does not implement the full ACID model, with the main difference being that it is eventually consistent. The idea is that if you give up some consistency, you can gain more availability and greatly improve the scalability of your database.