Always Test with Real Data

As I previously noted, I’m in the midst of converting some data (roughly 2 billion records) into documents that will live in a MongoDB cluster. Any time you move data into a new data store, you have to be mindful of the limitations and bottlenecks you might encounter, since every system has made compromises of one sort or another.

In MySQL, one of the biggest compromises we make is deciding which indexes really need to exist. Having everything indexed is great when you’re searching the data, but not so great when you’re inserting and deleting many rows.

In MongoDB, the thing that gets me is the document size limit. Currently an object stored in MongoDB cannot be larger than 4MB (though that’s likely to be raised soon). Now, you can build your own MongoDB binaries and tweak that parameter, but I’ve been advocating for making the default higher and giving the user the ability to adjust it without recompiling.
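
A cheap guard against the cap is to check the serialized size before inserting. Here’s a minimal sketch, using the JSON length as a rough stand-in for the real BSON size (the two encodings differ, so treat this as approximate):

```python
import json

MAX_DOC_BYTES = 4 * 1024 * 1024  # MongoDB's current per-document cap

def fits_in_mongodb(doc):
    # JSON length is only an approximation of the BSON wire size,
    # but it's close enough to flag the serious outliers.
    return len(json.dumps(doc)) <= MAX_DOC_BYTES
```

The drivers can compute the exact BSON size, of course, but an approximation like this is enough for a quick pre-flight check during an import.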

While doing a test data load recently (partly to exercise the new firmware on a pair of Fusion-io cards and partly to test the import half of my code), I hit the 4MB limit on a document. This was a bit of a surprise to me because I know that the average size of our documents is around 2KB or so. To have something that blows past the 4MB cap means there are some real outliers in the data. That’s not too surprising, given that we’re talking about 2 billion documents, all of which can carry a fair amount of metadata (mostly from events triggered by Craigslist users).

But how many outliers are there? And how big are they?

To find out I wrote a tool to scan all of the data that’s currently available. As of yesterday that’s 200 million documents. (The export from MySQL is taking a while and I’m staging all the data on disk before loading into MongoDB. And it needs to run for several more days yet!) The code approximates the size of the data when stored in MongoDB and keeps a count of how many documents will fit in various buckets. There’s a 1KB bucket (documents of 1KB or less), a 2KB bucket, then 4KB, 8KB, and so on. The code just keeps doubling the bucket maximum until the document fits.
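
The bucketing logic is simple enough to sketch. This is a reconstruction in Python (not the actual tool), assuming you already have an approximate serialized size for each document:

```python
from collections import Counter

def bucket_for(size_bytes):
    """Smallest power-of-two KB bucket that holds a document of this size."""
    bucket_kb = 1
    while size_bytes > bucket_kb * 1024:
        bucket_kb *= 2  # keep doubling until the document fits
    return bucket_kb

def histogram(doc_sizes):
    """Print a bucket histogram: bucket size, percentage, and count."""
    counts = Counter(bucket_for(s) for s in doc_sizes)
    total = len(doc_sizes)
    for kb in sorted(counts):
        pct = 100 * counts[kb] // total
        print(f"{kb:>5} KB  {pct:02d}%  {counts[kb]}")
    print(f"total count: {total}")
```

Feeding it a stream of document sizes produces output in the same shape as the tables below.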

At the end I get a histogram of the data: which buckets were used, how many documents fell into each, and what percentage of the total that represents. And, as usual, the real data turned out to be more interesting than I would have predicted.

For our newer (and bigger) cluster:

    1 KB  00%  115594
    2 KB  72%  73095515
    4 KB  20%  20871570
    8 KB  03%  3901456
   16 KB  02%  2177114
   32 KB  00%  513621
   64 KB  00%  7311
  128 KB  00%  467
  256 KB  00%  15
  512 KB  00%  7
 1024 KB  00%  5
 4096 KB  00%  1
 8192 KB  00%  2
16384 KB  00%  3
32768 KB  00%  2
65536 KB  00%  4
total count: 100682687

And our older (smaller) cluster:

  1 KB  35%  26883047
  2 KB  49%  37508498
  4 KB  11%  8816758
  8 KB  02%  1985910
 16 KB  00%  560498
 32 KB  00%  154749
 64 KB  00%  16377
128 KB  00%  475
 total count: 75926312

As you can see, there are only a few outliers so far. Once all the data is available, I can decide how to handle them. Odds are that I’ll either need to come up with some special case or simply truncate some of that data (it’s not all equally valuable).
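
If truncation wins, the sketch is straightforward. Here I assume (hypothetically) that the bulk of an oversized document is a list of event metadata under an "events" key, that dropping the oldest entries first is acceptable, and that JSON length stands in for the true BSON size:

```python
import json

MAX_DOC_BYTES = 4 * 1024 * 1024  # MongoDB's current per-document cap

def truncate_to_fit(doc, limit=MAX_DOC_BYTES):
    """Drop the oldest event metadata until the document fits under the cap.

    The "events" field name is hypothetical; re-serializing on every pop
    is wasteful but fine for a one-off cleanup pass.
    """
    events = doc.get("events", [])
    while events and len(json.dumps(doc)) > limit:
        events.pop(0)  # assumes events were appended oldest-first
    return doc
```

Anything more clever (splitting a document across several, or moving event history to its own collection) would be the "special case" route.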

It’s interesting that I intuitively “knew” that most of our documents would fall into the 2KB range. Based on what I see so far, about 80% of them do. And most of the other 20% aren’t too far off. But those wacky outliers that are a few MB in size would have been difficult to imagine.

If this data is representative of the whole (which is difficult to assume with such outliers), then I only have a small handful of exceptions to the 4MB limit to really worry about. But only time will tell at this point.

When in doubt, check the data. The real data. All of it.

About Jeremy Zawodny

I'm a software engineer and pilot. I work at craigslist by day, hacking on various bits of back-end software and data systems. As a pilot, I fly Glastar N97BM, Just AirCraft SuperSTOL N119AM, Bonanza N200TE, and high performance gliders in the northern California and Nevada area. I'm also the original author of "High Performance MySQL" published by O'Reilly Media. I still speak at conferences and user groups on occasion.
This entry was posted in craigslist, mongodb, mysql, programming, tech.

7 Responses to Always Test with Real Data

  1. jbmorse says:

    Looks like 4 MB will be a good input check when creating a new system with MongoDB at the moment. My guess is that a few big docs like that would also cause some issues elsewhere in the system, if only occasionally.

  2. I think the biggest issue (at least based on what I hear from some of the 10gen folks) is making sure all the various language drivers are ok with increasing the limit.

    Now if you go and try to use a 4GB document, you’re probably asking for trouble. 🙂

  3. There is nothing better than real data when it comes to testing. You run into real world numbers and situations you probably would not have thought of when running your tests.

  4. Dave Ihnat says:

    That’s good, if you have the data. Better is to do data input validation while processing any data–batch input, especially user-provided fields, etc.

    So you could have done–still should do!–a validation check on the import records while processing the data. Oh, and another thing–when you report the exception–either interactively, or in a log file–give *meaningful* information to let you find the bad data. Extra points if you give the program an option to also save the actual offending data for examination–preferably NOT in the log file.
