Hadoop, NoSQL, and the Relational Model
- by Phil Factor
(Guest Editorial for the IT Pro/SysAdmin Newsletter)
Whereas Relational Databases fit the world of commerce like
a glove, it is useless to pretend that they are a perfect fit for all human
endeavours. Although, with SQL Server, we’ve made great strides in indexing
text and in processing spatial data and markup, there is still a
problem in dealing efficiently with large volumes of ephemeral semi-structured
data.
Key-value stores such as Cassandra, Project Voldemort, and
Riak are of great value for ephemeral data, and seem of equal value as a data
feed that provides aggregations to an RDBMS. Document databases such as
MongoDB and CouchDB, by contrast, are ideal for semi-structured data for which
no fixed schema exists; analytics and logging are obvious examples.
NoSQL products, such as MongoDB, tackle the semi-structured
data problem with panache. MongoDB is designed with a simple document-oriented
data model that scales horizontally across multiple servers. It doesn’t impose
a schema, and relies on the application to enforce the data structure. This is
another take on the old ‘EAV’ problem (where you don’t know in advance all the
attributes of a particular entity). It uses a clever replica set design that
allows automatic failover, and uses journaling for data durability. It allows
indexing and ad-hoc querying.
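To make the point about the application enforcing the structure a little more concrete, here is a minimal sketch in plain Python. It does not touch MongoDB or any driver: the documents are ordinary dictionaries, and the names REQUIRED_FIELDS, validate and log_entries are invented for illustration. It only shows the general idea that entries with different attributes can sit side by side, while the application decides which ones are acceptable.

```python
# Hypothetical sketch: documents are plain dicts and no database is involved.
# REQUIRED_FIELDS is an assumed application rule, not anything MongoDB defines.
REQUIRED_FIELDS = {"event": str, "timestamp": str}


def validate(doc):
    """Return a list of problems; an empty list means the document is accepted."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in doc:
            problems.append(f"missing field: {field}")
        elif not isinstance(doc[field], expected_type):
            problems.append(f"wrong type for field: {field}")
    return problems


# Entries share a few core fields but otherwise carry different attributes,
# which is the situation the 'EAV' pattern tries to model in a relational table.
log_entries = [
    {"event": "login", "timestamp": "2012-01-05T09:30:00", "user": "alice"},
    {"event": "page_view", "timestamp": "2012-01-05T09:31:12",
     "url": "/home", "referrer": "/login"},
    {"event": "error"},  # lacks a timestamp, so the application rejects it
]

accepted = [doc for doc in log_entries if not validate(doc)]
rejected = [doc for doc in log_entries if validate(doc)]
```

The extra attributes (user, url, referrer) need no schema change; only the fields the application cares about are checked.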
However, for SQL Server users, the obvious choice for handling
semi-structured data is Apache Hadoop. There will soon be an ODBC Driver for
Apache Hive and an Add-in for Excel. Additionally, there are now two
Hadoop-based connectors for SQL Server: the Apache Hadoop connector for SQL
Server 2008 R2, and the SQL Server Parallel Data Warehouse (PDW) connector. We
can connect to Hadoop, process the semi-structured data, and then store it in SQL
Server.
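The process-then-store pattern can be illustrated with a small sketch, under some loud assumptions. It uses neither Hadoop nor the SQL Server connectors: the map and reduce steps are ordinary Python functions run in one process, and an in-memory SQLite database stands in for SQL Server purely so that the example is runnable. The sample lines, the function names and the event_counts table are all hypothetical.

```python
import json
import sqlite3
from collections import Counter

# Hypothetical semi-structured input: one JSON record per line, fields vary.
RAW_LINES = [
    '{"event": "login", "user": "alice"}',
    '{"event": "page_view", "url": "/home"}',
    '{"event": "login", "user": "bob", "device": "phone"}',
    'not valid json',
]


def map_line(line):
    """Map step: emit (event, 1) for each usable record, nothing otherwise."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return []
    if not isinstance(record, dict) or "event" not in record:
        return []
    return [(record["event"], 1)]


def reduce_counts(pairs):
    """Reduce step: sum the emitted values for each key."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


pairs = [pair for line in RAW_LINES for pair in map_line(line)]
event_counts = reduce_counts(pairs)

# Store the aggregate in a relational table (SQLite standing in for SQL Server).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE event_counts (event TEXT PRIMARY KEY, total INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO event_counts (event, total) VALUES (?, ?)",
    sorted(event_counts.items()),
)
conn.commit()
rows = conn.execute(
    "SELECT event, total FROM event_counts ORDER BY event"
).fetchall()
```

Only the tidy aggregate reaches the relational side; the untidy raw lines, including the one that fails to parse, never do.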
As one steeped in the culture of Relational SQL Databases,
I might be expected to throw my hands in the air in a gesture of contempt
for a technology that was, judging by the overblown journalism on the subject,
about to make my own profession as archaic as the saggar maker’s bottom knocker
(a potter’s assistant who helped the saggar maker to make the bottom of the
saggar by placing clay in a metal hoop and bashing it). However, on the
contrary, I find that I'm delighted with the advances made by the NoSQL
databases in the past few years. The flow of ideas from the NoSQL
providers will knock any trace of complacency out of the providers of
Relational Databases and inspire them into back-fitting some features, such as
horizontal scaling with sharding, and automatic failover, into SQL-based RDBMSs.
It will do the breed a power of good to benefit from all this lateral thinking.