MongoDB Early Impressions

I’ve been doing some prototyping work to see how suitable MongoDB is for replacing a small (in number, not size) cluster of MySQL servers. The motivation for looking at MongoDB in this role is that we need a flexible and reliable document store that can handle sharding, a small but predictable write volume (1.5 – 2.0 million new documents daily), light indexing, and map/reduce operations for heavier batch queries. Queries to fetch individual documents aren’t that common–let’s say 100/sec in aggregate at peak times.

What I’ve done so far is to create a set of Perl libraries that abstract away the data I need to store and provide a “backend” interface to which I can plug in a number of modules for talking to different data stores (including some “dummy” ones for testing and debugging). This has helped to clarify some assumptions and change a few of my early design ideas. As you may remember from I Want a New Data Store, I’m interested in trying a few options.
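
To give a flavor of what I mean, here’s a rough sketch of the backend idea (the package and method names are invented for illustration, not our actual code): every backend implements the same small interface, so the calling code never has to know which store it’s talking to.

    # A sketch of the pluggable-backend idea (hypothetical names, not our real code).
    package DocStore::Backend;
    use strict;
    use warnings;

    sub new        { my ($class, %args) = @_; return bless { %args }, $class }
    sub put        { die "put() not implemented" }   # store one document
    sub get        { die "get() not implemented" }   # fetch one document by id
    sub disconnect { return 1 }                      # optional cleanup

    # One backend among several: this one talks to MongoDB.
    package DocStore::Backend::Mongo;
    use strict;
    use warnings;
    use MongoDB;
    our @ISA = ('DocStore::Backend');

    sub new {
        my ($class, %args) = @_;
        my $self = $class->SUPER::new(%args);
        my $conn = MongoDB::Connection->new(host => $args{host} || 'localhost');
        $self->{coll} = $conn->get_database($args{db})
                             ->get_collection($args{collection});
        return $self;
    }

    sub put { my ($self, $doc) = @_; return $self->{coll}->insert($doc) }
    sub get { my ($self, $id)  = @_; return $self->{coll}->find_one({ _id => $id }) }

    1;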

Then I created another set of modules that can read the existing data from the MySQL tables, joining as necessary, normalizing some field names, and removing “empty” fields. (I’ve decided that it’s silly to store “NULL” values in MongoDB, since we still have to store the key even when there’s no value.) That code has allowed me to do some performance testing and, more importantly, sizing tests. I now have a pretty reliable estimate of our storage needs, assuming the data I’ve pulled so far is representative of the future. It looks something like this:

  1GB = ~       300,000 items  (0.3M)
 10GB = ~     3,000,000 items    (3M)
100GB = ~    30,000,000 items   (30M)
  1TB = ~   300,000,000 items  (300M)
 10TB = ~ 3,000,000,000 items    (3B)

This gives me a good way to judge how much space and how many machines we’ll end up needing.
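
The extraction and clean-up step described above is straightforward in outline. Here’s a rough sketch (table, column, and host names are made up): pull a joined row set out of MySQL, drop the NULL fields so we don’t pay to store empty keys, and insert the result into MongoDB.

    # A rough sketch of the extraction step (table, column, and host names invented).
    use strict;
    use warnings;
    use DBI;
    use MongoDB;

    my $dbh  = DBI->connect('DBI:mysql:database=legacy;host=db1', 'user', 'pass',
                            { RaiseError => 1 });
    my $coll = MongoDB::Connection->new(host => 'localhost')
                 ->get_database('archive')->get_collection('items');

    my $sth = $dbh->prepare(q{
        SELECT i.id, i.title, i.body, m.category, m.created_at
          FROM items i LEFT JOIN item_meta m ON m.item_id = i.id
    });
    $sth->execute;

    while (my $row = $sth->fetchrow_hashref) {
        # Drop "empty" fields: a NULL in MongoDB still costs us the key.
        my %doc = map { $_ => $row->{$_} } grep { defined $row->{$_} } keys %$row;
        $doc{_id} = delete $doc{id};    # reuse the MySQL primary key as _id
        $coll->insert(\%doc);
    }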

Performance has been very good so far. The bottleneck is clearly getting the data out of MySQL. Because of the scattered nature of the content in the existing system, and the number of queries and joins required, it’s possible to pull at most 50-100 documents per second. Given that we have nearly a billion documents to migrate, it’ll take some time (and a few TB of space). I definitely plan to distribute that work and take advantage of there being multiple copies of the data around.
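
One way that distribution might look (the host names, ranges, and the migrate_range() helper are all hypothetical): carve the primary-key space into chunks and let a pool of workers pull their chunks from different MySQL replicas.

    # Hypothetical sketch: split the id space into chunks, one worker per replica.
    use strict;
    use warnings;
    use Parallel::ForkManager;

    my @replicas = qw(db-slave1 db-slave2 db-slave3 db-slave4);   # made-up hosts
    my $chunk    = 10_000_000;                                    # ids per worker
    my $max_id   = 1_000_000_000;                                 # ~1B documents

    my $pm = Parallel::ForkManager->new(scalar @replicas);

    for (my $start = 1; $start <= $max_id; $start += $chunk) {
        $pm->start and next;                                      # parent: keep spawning
        my $host = $replicas[ int($start / $chunk) % @replicas ];
        migrate_range($host, $start, $start + $chunk - 1);        # hypothetical helper
        $pm->finish;
    }
    $pm->wait_all_children;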

Programming against MongoDB is not hard at all. The MongoDB Perl modules are great. It’s really just a matter of wrapping your head around the API and getting used to the JavaScript console instead of the MySQL shell. The server’s HTTP interface is handy for getting a quick idea of what’s going on, though we’ve hardly stressed this instance so far.
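
For anyone who hasn’t seen it, here’s roughly what the Perl side looks like (database, collection, and field names are invented): connect, insert a document, and fetch it back.

    # A minimal round trip with the MongoDB Perl driver (names invented).
    use strict;
    use warnings;
    use MongoDB;

    my $conn = MongoDB::Connection->new(host => 'localhost', port => 27017);
    my $coll = $conn->get_database('test')->get_collection('posts');

    my $id  = $coll->insert({ title => 'hello', tags => [ 'mongodb', 'perl' ] });
    my $doc = $coll->find_one({ _id => $id });
    print "fetched: $doc->{title}\n";

    # Roughly the same fetch in the JavaScript console:
    #   db.posts.findOne({ _id: ObjectId("...") })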

Limitations

There are two limitations (or missing features) that I’ve encountered so far in MongoDB. The first is a lack of compression. For this particular data set, I’m reasonably confident that both gzip and lzo could easily get somewhere between 3:1 and 5:1 compression. That could mean a substantial space savings (several TB of disk) at a very small CPU cost (and modern CPUs are very fast). The state of compressed file systems in Linux is sad, so the only real hope in the short or medium term is likely MongoDB object compression as requested in SERVER-164, which I have voted for and commented upon.
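
In the meantime, it’s easy enough to sanity-check that compression estimate at the application level. A quick sketch (the sample document is contrived, and this uses zlib via Compress::Zlib rather than gzip or lzo): serialize a document to JSON, compress it, and compare sizes.

    # Rough compression check: JSON-encode a document, deflate it, compare sizes.
    use strict;
    use warnings;
    use JSON::XS qw(encode_json);
    use Compress::Zlib;                 # provides compress()/uncompress()

    my $doc = { title => 'example', body => 'some representative document text ' x 100 };
    my $raw = encode_json($doc);
    my $gz  = compress($raw);

    printf "raw: %d bytes, compressed: %d bytes, ratio %.1f:1\n",
           length($raw), length($gz), length($raw) / length($gz);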

The second issue I’ve encountered is the 4MB object size limit. SERVER-431 discusses this a bit and I’d love to see some more discussion and voting there. My motivation for this is to allow for sequential reads of larger objects (think of them as something like covering indexes from the relational world) so that they don’t need to be chunked up (think GridFS) and force me to incur disk seeks when retrieving them. Amusingly, I was hoping to store lists (or arrays) in some documents using the $push operator when I ran into this. It didn’t take long to also find SERVER-654 and realize that I wasn’t the only one.
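
For reference, this is the kind of update that runs into the limit (collection and field names are invented): appending to an array field with $push. Once a document’s total size reaches 4MB the push fails, so large lists have to be split across documents or stored some other way.

    # Appending to an array field with $push (names invented).
    use strict;
    use warnings;
    use MongoDB;

    my $coll = MongoDB::Connection->new(host => 'localhost')
                 ->get_database('archive')->get_collection('lists');

    $coll->update(
        { _id     => 'user:12345' },                  # which list to extend
        { '$push' => { items => 'new-item-id' } },    # append one element
        { upsert  => 1 },                             # create the list if missing
    );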

My ultimate wish for the 4MB limit would be a configurable limit so that the DBA or programmer can choose what’s right for their dataset. In my case, I went back and did some more analysis to see how many records would be affected. It turns out that it was a very small percentage of the total data. But the outliers do exist and since there’s a limit, I’d have to work around it anyway. That’s still code I need to write, test, and support. This caused me to think about the problem a little differently and consider other solutions too.
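
The analysis itself doesn’t need to be fancy. A rough sketch (the next_source_document() iterator is hypothetical, and JSON length is only a stand-in for BSON size): walk the source data, estimate each document’s serialized size, and count how many would bump into the 4MB ceiling.

    # Count how many source documents would exceed the 4MB limit (rough estimate).
    use strict;
    use warnings;
    use JSON::XS qw(encode_json);

    my $limit = 4 * 1024 * 1024;                     # the 4MB object size limit
    my ($total, $too_big) = (0, 0);

    while (my $doc = next_source_document()) {       # hypothetical iterator
        $total++;
        $too_big++ if length(encode_json($doc)) > $limit;
    }

    printf "%d of %d documents (%.3f%%) exceed the limit\n",
           $too_big, $total, $total ? 100 * $too_big / $total : 0;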

What’s Next

I need to do a bit more work on the storage side and some more testing. The next major piece of this system is an indexing component that may not end up living inside of MongoDB. I’ll write more about that later, but my current thinking is to use Sphinx (which we already know and love). This may give me an excuse to play with two new Sphinx features I’ve been itching to try: string attributes and real-time indexing (via the MySQL protocol).
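
The appealing part of real-time indexing is that Sphinx’s searchd speaks enough of the MySQL wire protocol (SphinxQL) that plain DBI works against it. A sketch of what that might look like (the index name, fields, and the 9306 listener are assumptions about the local Sphinx config):

    # Talking SphinxQL to a real-time index over the MySQL protocol (assumed config).
    use strict;
    use warnings;
    use DBI;

    my $sphinx = DBI->connect('DBI:mysql:host=127.0.0.1;port=9306', '', '',
                              { RaiseError => 1 });

    # Insert a document into a hypothetical RT index, including a string attribute.
    $sphinx->do(
        'INSERT INTO items_rt (id, title, body, category) VALUES (?, ?, ?, ?)',
        undef, 12345, 'example title', 'full text goes here', 'for-sale'
    );

    # And query it back with SphinxQL.
    my $rows = $sphinx->selectall_arrayref(
        "SELECT id FROM items_rt WHERE MATCH('example')"
    );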

The next MongoDB piece I want to play with is sharding and replication. I’m hoping to get a feel for the strengths and weaknesses, as well as figure out the right deployment strategy for us if we end up using MongoDB. Thankfully, our new development environment just got set up last week and there’s enough hardware that I can experiment with different topologies to see how it’ll work out.

I’ll write more about both the Sphinx and MongoDB experimentation as it progresses.

About Jeremy Zawodny

I'm a software engineer and pilot. I work at craigslist by day, hacking on various bits of back-end software and data systems. As a pilot, I fly Glastar N97BM, Just AirCraft SuperSTOL N119AM, Bonanza N200TE, and high performance gliders in the northern California and Nevada area. I'm also the original author of "High Performance MySQL" published by O'Reilly Media. I still speak at conferences and user groups on occasion.
This entry was posted in craigslist, mysql, nosql, programming, tech.

7 Responses to MongoDB Early Impressions

  1. Roger says:

    One gotcha you may not have noticed yet is that Mongo doesn’t index strings larger than 1kb. Something goes into the log, but that is the only indication. Even worse is that when doing a query that uses an index the non-indexed items are not returned – they may as well have disappeared.

    On the positive side the 10gen team are phenomenal. When people post things to the mailing list, they are diagnosed quickly. Issues are fixed quickly. They are really on the ball and seem to work 24 hours a day 7 days a week.

  2. Mike Dirolf says:

    Glad to hear that your early impressions of MongoDB are positive!

    @Roger Yeah, I think that behavior can (and will) be improved. And thanks for the positive feedback – we try hard to provide good support so it’s great to hear that it’s helpful.

  3. jr says:

    had not heard of MongoDB before, interesting piece, thanks.

    how does your “..small (in number, not size) cluster of MySQL servers” measure up against PostgreSQL?

    have you done any comparisons? if not, would you consider testing Postgres?

  4. Jeremy Zawodny says:

    @jr: Given that this is a non-relational deployment, I’m not sure that pulling in PostgreSQL would make sense. That said, it’s a fantastic RDBMS that makes a lot of sense in many situations–just not this one.

    • rmxz says:

      Jeremy wrote “Given that this is a non-relational deployment, I’m not sure that pulling in PostgreSQL would make sense.”

      PostgreSQL has some interesting NoSQL-like features (hstore) and can be configured to perform similarly on simple key-value lookups. An interesting presentation benchmarking Postgres for NoSQL-like applications against MongoDB, Voldemort, Redis, Tokyo, Cassandra, etc. can be found here:

  5. What types of queries are you doing on the server? I assume they are document-oriented (whatever that means) because fetch by ID and simple expressions on attributes can be handled by SQL.

  6. sampath says:

    Hi guys, I am new to MongoDB and am doing some testing in my lab with 64-bit MongoDB.
    1. I backed up one million files to my server using MongoDB as the database storage.
    2. My backup data size is 22.8 GB.
    3. The MongoDB size is 1.95 GB.

    Are there any settings available to compress the DB size before configuring the backup to the server? If so, any pointers would be most welcome.
