As I previously noted, I’m in the midst of converting some data (roughly 2 billion records) into documents that will live in a MongoDB cluster. And any time you move data into a new data store, you have to be mindful of its limitations and bottlenecks, since every system makes compromises of one sort or another.
In MySQL one of the biggest compromises we make is deciding what indexes really need to be created. It’s great to have data all indexed when you’re searching it, but not so great when you’re adding and deleting many rows.
In MongoDB, the thing that gets me is the document size limit. Currently an object stored in MongoDB cannot be larger than 4MB (though that’s likely to be raised soon). Now, you can build your own MongoDB binaries and tweak that parameter, but I’ve been advocating for making the default higher and giving the user the ability to adjust it without recompiling.
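Since you can’t raise the limit without recompiling, a natural defensive move is to estimate each document’s size before inserting it. Here’s a minimal Python sketch of that idea; the function names are my own, and it uses JSON length as a rough stand-in for BSON size (the two differ, but not by enough to matter for a sanity check like this):

```python
import json

# MongoDB's per-document cap at the time of writing.
MAX_DOC_BYTES = 4 * 1024 * 1024

def approx_mongo_size(doc):
    """Rough proxy for a document's stored size: the length of its
    JSON serialization. BSON adds type/length overhead, so treat
    this as an approximation, not an exact byte count."""
    return len(json.dumps(doc))

def safe_to_insert(doc):
    """Return True if the document is comfortably under the cap."""
    return approx_mongo_size(doc) <= MAX_DOC_BYTES
```

A loader can then log or divert any document that fails the check instead of having the insert blow up mid-run.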
While doing a test data load recently (partly to exercise the new firmware on a pair of Fusion-io cards and partly to test the import half of my code), I hit the 4MB limit on a document. This was a bit of a surprise to me because I know that the average size of our documents is around 2KB or so. To have something that blows the 4MB cap means there are some real outliers in the data. It’s not too surprising, given that we’re talking about 2 billion documents, all of which can have a fair amount of metadata (mostly from events triggered by Craigslist users).
But how many outliers are there? And how big are they?
To find out I wrote a tool to scan all of the data that’s currently available. As of yesterday that’s 200 million documents. (The export from MySQL is taking a while and I’m staging all the data on disk before loading into MongoDB. And it needs to run for several more days yet!) The code approximates the size of the data when stored in MongoDB and keeps a count of how many documents will fit in various buckets. There’s a 1KB bucket (documents less than or equal to 1KB in size), a 2KB bucket, 4KB, 8KB, and so on. The code just keeps doubling the bucket maximum until a document fits.
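The bucket-doubling logic described above might look something like this. This is a minimal Python sketch, not the actual tool (which I didn’t write in Python); the function names are my own:

```python
from collections import Counter

def bucket_for(size_bytes):
    """Return the smallest power-of-two bucket (in KB) that
    holds a document of the given size, doubling until it fits."""
    bucket_kb = 1
    while size_bytes > bucket_kb * 1024:
        bucket_kb *= 2
    return bucket_kb

def histogram(sizes):
    """Count documents per bucket and return {bucket_kb: count}."""
    return Counter(bucket_for(s) for s in sizes)

def print_histogram(counts):
    """Print buckets, whole-number percentages, and counts."""
    total = sum(counts.values())
    for kb in sorted(counts):
        pct = 100 * counts[kb] // total
        print(f"{kb:6d} KB  {pct:02d}%  {counts[kb]:10d}")
    print(f"total count: {total}")
```

A 1,500-byte document lands in the 2KB bucket; a 5MB document keeps doubling until it hits the 8192KB bucket.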
At the end I get a histogram of the data. It tells me which buckets were used, how many documents fell into each, and what percentage of the total that is. And, not surprisingly, the data was more interesting than I would have predicted.
For our newer (and bigger) cluster:
    1 KB  00%     115594
    2 KB  72%   73095515
    4 KB  20%   20871570
    8 KB  03%    3901456
   16 KB  02%    2177114
   32 KB  00%     513621
   64 KB  00%       7311
  128 KB  00%        467
  256 KB  00%         15
  512 KB  00%          7
 1024 KB  00%          5
 4096 KB  00%          1
 8192 KB  00%          2
16384 KB  00%          3
32768 KB  00%          2
65536 KB  00%          4

total count: 100682687
And our older (smaller) cluster:
    1 KB  35%   26883047
    2 KB  49%   37508498
    4 KB  11%    8816758
    8 KB  02%    1985910
   16 KB  00%     560498
   32 KB  00%     154749
   64 KB  00%      16377
  128 KB  00%        475

total count: 75926312
As you can see, there are only a few outliers so far. Once all the data is available, I can decide how to handle this. Odds are that I’ll either need to come up with some special case or simply truncate some of that data (it’s not all equally valuable).
It’s interesting that I intuitively “knew” that most of our documents would fall into the 2KB range. Based on what I see so far, about 80% of them do. And most of the other 20% aren’t too far off. But those wacky outliers that are a few MB in size would have been difficult to imagine.
If this data is representative of the whole (which is difficult to assume with such outliers), then I only have a small handful of exceptions to the 4MB limit to really worry about. But only time will tell at this point.
When in doubt, check the data. The real data. All of it.