To store huge datasets effectively, we’ve seen a new breed of databases appear. These are frequently called NoSQL databases, or Non-Relational databases, though neither term is very useful.
They group together fundamentally dissimilar products by telling you what they aren’t. Many of these databases are the logical descendants of Google’s BigTable and Amazon’s Dynamo, and are designed to be distributed across many nodes, to provide “eventual consistency” but not absolute consistency, and to have very flexible schema. While there are two dozen or so products available (almost all of them open source), a few leaders have established themselves:
Cassandra: Developed at Facebook, in production use at Twitter, Rackspace, Reddit, and other large sites. Cassandra is designed for high performance, reliability, and automatic replication. It has a very flexible data model. A new startup, Riptano, provides commercial support.
HBase: Part of the Apache Hadoop project, and modelled on Google’s BigTable. Suitable for extremely large databases (billions of rows, millions of columns), distributed across thousands of nodes. Along with Hadoop, commercial support is provided by Cloudera.