While there is a dizzying array of technologies that have the “NoSQL” label applied to them, I’m looking for one to replace a MySQL cluster. This particular cluster has roughly a billion records in it, uses a few TB of disk space, and is growing all the time. It is currently a MySQL master with several slaves and handles a fairly light query volume. The reasons we’d like to change it are:
- ALTER TABLE takes an unreasonably long time, so we can’t add or remove indexes or columns. Changes take over a month.
- The data footprint is large enough that it requires RAID, and we seem to be really good at breaking RAID controllers.
- Queries are not very efficient, due partly to the underlying data organization, and due partly to the sheer amount of it compared to the available RAM.
- The data really isn’t relational anymore, so a document store is more appropriate. It’s just that when it was set up, MySQL was the tool of choice for just about all data storage.
I’ve spent some time looking around at several options, including MongoDB, CouchDB, and Cassandra. And I like aspects of all of them. If I could pick and choose features from all of them, here’s what it might look like:
- The high-level abstractions provided by CouchDB and MongoDB. Cassandra makes you think more about low-level details and performance (which can be good).
- The performance of Cassandra. By all accounts it is very fast and got faster in the 0.6 release.
- A clear understanding of the performance and storage tradeoffs in various schema/indexing designs.
- The “no single point of failure” and multi-machine replication and sharding features of Cassandra. They’re VERY compelling. CouchDB Lounge is a step in the right direction, but I’d rather see it as part of the core system.
- Map/Reduce support for ad-hoc data analysis and queries.
- Persistent indexes to speed our most common queries. Cassandra’s approach feels very “roll your own”, which is consistent with its lower-level nature. CouchDB’s “all views, all the time” feels a bit odd too. MongoDB seems to get this right, providing traditional indexes and support for more ad-hoc operations similar to CouchDB.
- The documentation of MongoDB. CouchDB is pretty good and there are books available. Cassandra’s docs require a bit more trial and error on the part of the developer as things change with each release.
- A corporate entity that can provide support, consulting, and possibly custom development. Both CouchDB and MongoDB have this. Cassandra is more of a community project. The bulk of contributions come from developers employed by tech companies, but none of those companies appear to be in the business of supporting Cassandra itself.
- I’d love to be able to influence the organization of records on disk so that the most common queries will require very few seeks. I’d like to cluster around a particular key that may not be the primary key. I can see ways to do this with Cassandra. I’m not sure about the low-level details of MongoDB or CouchDB.
- I’d love native compression for large text fields. A substantial portion of our data in this cluster is text.
- Full-text indexing would be nice to have. We’re pretty good with Sphinx already, but having reasonable full-text indexing integrated would simplify things.
- A good Perl API.
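To make the Map/Reduce item in the list above a bit more concrete, here’s a rough sketch of the kind of ad-hoc analysis I have in mind. It’s plain Python running in memory, not the API of MongoDB, CouchDB, or Cassandra, and the document fields (`kind`, `text`) and function names are made up purely for illustration.

```python
from collections import defaultdict

# Hypothetical documents; the field names are invented for this example.
docs = [
    {"id": 1, "kind": "a", "text": "red bike, barely used"},
    {"id": 2, "kind": "b", "text": "sunny two bedroom apartment"},
    {"id": 3, "kind": "a", "text": "oak desk"},
]


def map_doc(doc):
    """Map step: emit (key, value) pairs for a single document."""
    yield doc["kind"], len(doc["text"])


def reduce_values(key, values):
    """Reduce step: combine every value emitted for one key."""
    return {"count": len(values), "total_chars": sum(values)}


def map_reduce(documents, mapper, reducer):
    """Group mapper output by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in mapper(doc):
            grouped[key].append(value)
    return {key: reducer(key, values) for key, values in grouped.items()}


# Per-kind document counts and total text size.
result = map_reduce(docs, map_doc, reduce_values)
```

In a real system the map and reduce steps would run across many machines and the grouping would happen inside the database. The point is only that the query is expressed as two small functions rather than as SQL against a fixed schema.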
Given all that, what else should I be looking at? What misconceptions do I have? What’s your experience been with any or all of them?