Thursday, July 16, 2009

NoSQL: A long-time relation(ship) comes to an end

(cross-posting from here)

OK, I admit it, declaring that "the RDBMS is dead" is a meme that has been going around the software industry for a while. Remember the object-oriented databases that were supposed to replace the relational ones? Well, guess who is still here. However, despite the RDBMS's amazing survival skills, I would like to propose a related prediction:

I believe that the year 2009 will go down in history as the year when the "relational model default" ended. I coined the term "relational model default" to describe a peculiar thing that goes on in application development: start talking to your average application developer about some arbitrary business requirement and chances are that he is simultaneously constructing a relational model in his head to fit those requirements.

That relational approach to modeling your problem may or may not be suitable. The real problem is that all too often this default does not get challenged. As a consequence, whatever the fitting data model might be, it gets shoehorned into tables and relations.

This default "thinking" has not yet changed for the masses, but I believe that it has changed for the early adopters (which means that it will inevitably change for the masses in some years).

I see the default changing from:

"I need to store some data, i.e. I need a relational database"
to:
"I need to store something, let me see the data to decide how to store it."

The most concrete and visible manifestation of the rising interest in non-relational data stores is the "NoSQL" movement. NoSQL denotes a group of people interested in exploring and comparing alternatives to traditional relational data stores like MySQL or Postgres. The inaugural get-together has been covered in Computerworld, see also Johan Oskarsson's post and there is, of course, a hashtag.

Other than the NoSQL group I have a second data point to offer: there is a Cambrian Explosion happening in terms of projects exploring non-relational data stores. During the Cambrian Explosion a major diversification of organisms took place. Similarly, a plethora of new projects that explore alternatives to the relational model continues to gain interest. Here is an incomplete list:


AllegroGraph, Amazon's SimpleDB, Cassandra, CouchDB, Dynomite, Google's App Engine datastore, HBase, Hypertable, Kai, MemcacheDB, MongoDB, Neo4j, OpenRDF, Project Voldemort, Redis, Ringo, Scalaris, ThruDB, Tokyo Cabinet (and Tokyo Tyrant and LightCloud)

Last, but certainly not least, there are Apache Jackrabbit and Apache Sling.

From my perspective there are three main areas of innovation in this Cambrian Explosion of data stores:

1. Models

In the relational model you break down your data into tables and relations. This model implies that the data is somewhat tabular. However, in some cases the data simply is not tabular.

Consider web content, which is hierarchical and mixes fine-grained data with binary files (this model is implemented in Jackrabbit). Other (not mutually exclusive) alternative models are document-oriented, key-value pairs, or graphs/RDF.

One very important aspect of many alternative models is that they are schemaless. That means that they accommodate data-first approaches, where it is not required to define the data structure before one can actually store any data. This enables agile approaches to software development in the short term as well as more flexibility in the long-term evolution of business requirements.

Without defining a data structure first it is not possible to store anything at all in an RDBMS. This fact is probably one of the root causes of the relational default thinking. An RDBMS-based developer simply cannot develop anything without thinking about table structure.
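
To make the contrast concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for an RDBMS. The DocumentStore class, the table name and the sample documents are hypothetical and exist only for illustration; they do not represent the API of any of the projects listed above.

```python
import json
import sqlite3

# Relational side: an INSERT fails until the table has been defined.
conn = sqlite3.connect(":memory:")
try:
    conn.execute("INSERT INTO articles (title) VALUES ('Hello')")
    insert_without_schema_worked = True
except sqlite3.OperationalError:
    # The table "articles" does not exist yet, so nothing can be stored.
    insert_without_schema_worked = False

conn.execute("CREATE TABLE articles (title TEXT)")
conn.execute("INSERT INTO articles (title) VALUES ('Hello')")
rows = conn.execute("SELECT title FROM articles").fetchall()


# Schemaless side (hypothetical toy store): data first, structure later.
class DocumentStore:
    def __init__(self):
        self._docs = {}

    def put(self, key, doc):
        # Serialize to JSON so the store keeps its own copy of the document.
        self._docs[key] = json.dumps(doc)

    def get(self, key):
        return json.loads(self._docs[key])


store = DocumentStore()
store.put("a1", {"title": "Hello"})
# A differently shaped document needs no migration or upfront definition.
store.put("a2", {"title": "World", "tags": ["nosql"], "author": {"name": "Ann"}})
```

The second document carries fields the first one never mentioned, and the store accepts it without any change. In the relational variant, the same evolution would require altering the table before the new data could be written.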


2. Scalability

A second area of innovation is scalability. This can be split into two parts: one is scalability achieved by distributing the data store across separate machines, the approach pioneered by Google. As opposed to classical clustering of RDBMSs, the number of machines under consideration is in the hundreds rather than the tens. Obviously, different trade-offs regarding consistency and availability of individual cluster nodes must be made when architecting for such a high number of cluster nodes. Eventual consistency is one of the interesting concepts to emerge in this space.
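
As a rough illustration of eventual consistency, the sketch below models two replicas that accept writes independently and reconcile later. Everything in it is an assumption made for the example: the Replica class, the explicit version numbers, and the last-write-wins rule, which is only one of several possible conflict-resolution strategies and is not claimed to be what any particular store does.

```python
class Replica:
    """A toy replica holding key -> (version, value) pairs."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def write(self, key, value, version):
        # Last-write-wins: keep the entry with the highest version number.
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None

    def sync_from(self, other):
        # Pull every entry from the other replica and merge it locally.
        for key, (version, value) in other.data.items():
            self.write(key, value, version)


a, b = Replica("a"), Replica("b")

a.write("color", "red", version=1)
stale_read = b.read("color")  # b has not seen the write yet

b.write("color", "blue", version=2)  # a later write lands on the other replica

# Background reconciliation in both directions.
a.sync_from(b)
b.sync_from(a)
converged = a.read("color") == b.read("color")
```

Between the first write and the synchronization, the two replicas disagree and a reader may see stale data. Once the replicas have exchanged their state, they agree again. That window of disagreement is the price paid for letting every node accept writes without coordinating with the others first.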


While the commoditization of server hardware triggered this first approach to scalability, a second area is related to the rise of multi-core processors. For a number of years individual CPU cores have not gotten much faster; instead, the number of cores has increased. There is no inherent contradiction in running a classical RDBMS on a multi-core machine and even having the RDBMS take advantage of the additional cores. However, it seems to me that the SQL language is a poor fit for queries in a multi-core environment when compared with alternatives such as Map/Reduce, which are parallel by design.
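
The following sketch shows why Map/Reduce lends itself to parallel execution: the map step runs independently on each chunk of data, and only the reduce step combines results. The word-count task, the function names and the sample documents are made up for illustration. A thread pool keeps the example portable; on standard CPython, CPU-bound work would need a process pool to actually occupy several cores.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_chunk(lines):
    """Map step: count words in one chunk, independent of all other chunks."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial results into one."""
    return left + right


documents = ["the relational model", "the document model", "key value model"]

# Split the input into chunks that can be processed without coordination.
chunks = [documents[:2], documents[2:]]

with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(map_chunk, chunks))

totals = reduce(reduce_counts, partials, Counter())
```

Because map_chunk never looks at data outside its own chunk, adding more workers (or more machines) does not change the program, only the number of chunks processed at the same time.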


3. Web

The third area of innovation revolves around the fact that the web is the dominant paradigm for computing in our time. This is also acknowledged by the two considerations discussed above. However, a third one is that HTTP is used for accessing the data. Other types of connectivity, typically implemented as JDBC or ODBC drivers, are no longer needed or used. In many cases the data store exposes its resources through a RESTful API. An obvious benefit is the ubiquitous availability of clients, including the browser itself. In comparison, the classical RDBMS approach involving a dedicated driver looks like a relic of the client-server mindset (I wrote about this 1.5 years ago).
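
To illustrate what accessing data over plain HTTP can look like, here is a small self-contained sketch built only on Python's standard library. The ResourceHandler class, the in-memory STORE and the /content/articles/hello path are invented for the example; no real data store's API is being reproduced here.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

STORE = {}  # hypothetical in-memory store: URL path -> raw request body


class ResourceHandler(BaseHTTPRequestHandler):
    def do_PUT(self):
        # Store the request body under the resource's URL path.
        length = int(self.headers.get("Content-Length", 0))
        STORE[self.path] = self.rfile.read(length)
        self.send_response(204)
        self.end_headers()

    def do_GET(self):
        body = STORE.get(self.path)
        if body is None:
            self.send_response(404)
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def request(port, method, path, body=None):
    """Send one HTTP request and return (status, response bytes)."""
    conn = http.client.HTTPConnection("127.0.0.1", port)
    conn.request(method, path, body=body)
    response = conn.getresponse()
    data = response.read()
    conn.close()
    return response.status, data


# Port 0 lets the operating system pick a free port.
server = HTTPServer(("127.0.0.1", 0), ResourceHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

payload = json.dumps({"title": "Hello"}).encode("utf-8")
put_status, _ = request(port, "PUT", "/content/articles/hello", payload)
get_status, raw = request(port, "GET", "/content/articles/hello")
fetched = json.loads(raw)
missing_status, _ = request(port, "GET", "/content/articles/unknown")

server.shutdown()
server.server_close()
```

No driver is installed on the client side: any HTTP client, from a command-line tool to a browser, can read the resource at its URL.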

At this point let me reiterate that RDBMSs are here to stay, just like mainframes never went away. Moreover, a couple of the innovation areas cited above are not that new at all, especially when it comes to non-relational data models (for example, I recently dug into the foundations of the Lotus Notes document store and came out very impressed). However, it is only now that the relational model default will disappear.


What about content management systems?

Considering the content management system industry as a whole, I am extremely happy about this shift away from RDBMSs. The model aspect is especially crucial: RDBMSs embody a fundamentally wrong model for content. There are varying opinions in the industry about what "content" really is, but one thing is more or less universally accepted: it is (at least partially) unstructured. Well, RDBMSs are designed for structured data. Duh.

So why are there one gazillion LAMP-based CMSs? I blame the relational model default. But as this default vanishes we will see more and more CMSs that are not based on an RDBMS (see the Jackrabbit wiki for a list of JCR-based ones, as well as the recent PHP-based JCR implementations such as Jackalope and the one for Typo3, or the Midgard content repository).

Don't laugh, but I truly envision a better (CMS) world once more CMSs are built upon proper tools and not forced into a relational model anymore. It will be a better world for developers and consequently for the CMS users.

What about Day?

REST and content repositories were invented and evangelized years ago by Day's Chief Scientist Roy and Day's CTO David. So it is no surprise that Day's content management systems are in excellent shape with respect to these considerations. CQ5 is built upon Apache Jackrabbit, i.e. a data store that implements a content-centric model, and Apache Sling, a web framework designed to be RESTful right from the start.

When it comes to scaling: a week ago we gave a live demonstration of how to install and cluster CQ5 on Amazon's EC2 service. But expect even more exciting news in this area.