Reading the Foursqure/MongoDB post-mortem I’m struck by a few things. First, the folks at 10gen are doing a great job of being very open and upfront about what happened and how they hope to solve this problem so that it doesn’t happen again. As someone currently working on a MongoDB deployment myself, I can tell you that a team like that is worth its weight in gold.
Secondly, I’m reminded of the issues that Reddit had a while back with Cassandra. In both cases, servers were arguably under provisioned and running close to limits. And when it came time to add more servers to alleviate the load, the process actually made things worse (because of data migration happening) and didn’t offer immediate relief (because the original servers needed to be compacted).
I suspect that we’ll see this happen more often as time goes on. And there will be high-profile cases. They may or may not be with Cassandra or MongoDB, but the strategies they use are likely also being used by other sharding-friendly NoSQL database engines.
There is no single solution. This is partly a human issue related to planning and monitoring. You really have to stay on top of your growth rates. And it’s also a deficiency in software systems. There may be a need for more sophisticated or aggressive data migration and compaction routines.
All this goes to show that the folks who like to say “just shard your data and you’ll scale forever” are really over-simplifying the problem. The scary part is that I’m not sure many of them realize how much they’re glossing over.
Sharding is hard.