
I actually tried learning Rails

A few months back I wanted to prototype some ideas that my wife and I have been tossing around for web sites that’d be fun to build up in our so-called spare time. I thought it would be smart to use that as an opportunity to learn Ruby and Rails.

Well, I was half right. Ruby is a very cool language and Perl’s influence on it is clear. (I’m an old Perl guy.) However, Rails has grown since I last looked at it. In fact, it’s grown a lot, and that means a lot of complexity and more to learn. I felt a bit uneasy about it. The whole thing made me feel a bit inadequate, to be honest.

Today I came across What the hell is happening to Rails? and this struck a chord with me:

But it just feels like we’re making this herculean effort to write elegant code and disappearing off on our own cloud of perfection, leaving behind anyone who wants to learn rails. We’re making it perfect and keeping up the number of new things to learn per month for people writing rails for the last few years. But we’re making it harder and harder for anyone to join the club from scratch.

And as soon as I read that I realized that it wasn’t “just me” but that Rails has, in fact, become a big framework with a lot of culture (some of it rather odd feeling to me) and a lot of change still ahead of it. It was a relief.

In other words, it was exactly what I did NOT want for some prototyping.

Since then I’ve discovered Mojolicious and have been tinkering quite happily. Granted, it doesn’t include everything under the sun (an ORM, for example) but I definitely feel like I can start small and build my way up.
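
For the curious, here’s the sort of “start small” thing I mean: a minimal Mojolicious::Lite app (the route and message are placeholders, not one of our actual prototypes):

```perl
#!/usr/bin/env perl
use Mojolicious::Lite;

# A single route that renders a plain-text response.
get '/' => sub {
    my $self = shift;
    $self->render(text => 'Hello, prototype!');
};

app->start;
```

Save that as app.pl and run it with perl app.pl daemon; the nice part is that it can grow into a full Mojolicious app later without starting over.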

I still think Ruby is a cool language and I’d like to tinker with it more. But Rails is not the best way to learn Ruby.

Posted in perl, programming | 29 Comments

Video from my MongoSF 2011 Talk

To go along with the Slides from my MongoSF 2011 Talk, the video is now on-line thanks to 10gen. I had a great time at MongoSF 2011, learned some important new things about MongoDB and got a sense of just how vibrant the user community is. It all feels like MySQL back in the 2002-2003 time frame. And, yes, that is a good thing. 🙂

If anyone can figure out how to embed that Brightcove video in a WordPress.com blog (which this is), please let me know. I’ve tried a few things but even the “HTML” view of this editor doesn’t seem to cooperate with me pasting in the code. Perhaps WordPress needs an “expert” mode for people who aren’t scared of HTML…

Posted in craigslist, mongodb, nosql | 1 Comment

Slides from my MongoSF 2011 Talk

After a surprisingly long conversion delay, the slides from my talk at MongoSF 2011 are now available:

I’m sure video will be available at some point too, which will make the slides quite a bit more useful. But if you saw the talk and wanted the slides, you’re all set now. 🙂

Posted in craigslist, mongodb | 5 Comments

Speaking at MongoSF on May 24th

I’m a little late to be posting this the night before. In any case, tomorrow afternoon I’ll be presenting at MongoSF 2011. My talk title is Lessons Learned from Migrating 2+ Billion Documents at Craigslist and it picks up where my MongoSV 2010 talk left off. If you were not at MongoSV 2010, the video is available online and provides some background on the project I was just beginning back then.

Tomorrow’s talk will cover a bit of the same information, but it really focuses on what we learned going through the process of migrating many years worth of data from MySQL to MongoDB.

Related to this, I did a short interview that became the blog post titled MongoDB Live at Craigslist over on the MongoDB Blog. That same day, a few others picked up the story, including:

The agenda for tomorrow looks very good and I understand there are around 700 people registered. Should be a fun day!

Posted in craigslist, mongodb, nosql | 2 Comments

Building a GrowCamp Greenhouse

This past weekend we finally assembled the 4’x8′ GrowCamp greenhouse that we bought from Costco online about a year ago. We’re hoping to grow an assortment of tomatoes, peppers, herbs, and other veggies in an environment that’s safe from deer and other predatory critters.


The unit seems fairly well thought out and designed to weather well. It’s mostly heavy plastic and takes about 2-3 hours for a couple of adults to assemble. The hardest part is the base. Once that’s firmed up, the upper parts are quite a bit easier.

The harder part is hauling the roughly 46 cubic feet of dirt required to fill the 20-inch-high base. Thankfully, Kathleen attacked that problem and was able to move the bags into place far faster than either of us expected. With that out of the way, it was a matter of loading all the dirt into the greenhouse, mixing in a bit of fertilizer, planting, and wiring in the drip irrigation system.

All in all, it took about one and a half days over the weekend. That includes some time shopping for dirt, plants, and irrigation bits.

There are a few more pictures in my Greenhouse Construction album on Flickr.

Now we get to decide what (if anything) to do about remote monitoring for temperature and humidity. You know, so we can be really geeky gardeners. 🙂

Posted in cooking | 6 Comments

Perl, MD5, and Unicode

Pro Tip: Perl’s Digest::MD5 hates Unicode (and so should you).

Here’s what I recently learned from perldoc Digest::MD5 (the hard way, of course):

Perl 5.8 support Unicode characters in strings. Since the MD5 algorithm is only defined for strings of bytes, it can not be used on strings that contains chars with ordinal number above 255. The MD5 functions and methods will croak if you try to feed them such input data.

Yes, that’s exactly what happened. I got a semi-cryptic error message. How to fix it?

What you can do is calculate the MD5 checksum of the UTF-8 representation of such strings. This is achieved by filtering the string through encode_utf8() function.

Of course! The exact opposite of what I’d done while trying to be a good Unicode Boy.
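
Here’s a minimal before-and-after sketch (the string is just an example containing a character above 255):

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Encode qw(encode_utf8);

# A string containing a character whose ordinal is above 255.
my $str = "hello \x{263A}";

# This croaks, since MD5 is only defined over bytes:
# my $digest = md5_hex($str);

# Hash the UTF-8 byte representation instead.
my $digest = md5_hex(encode_utf8($str));
print "$digest\n";
```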

I have a much longer blog post brewing in my head about how they never tell you in Computer Science classes that 80-90% of your “programming” time in the real world is spent dealing with failures, exceptional cases, and general debugging.

Posted in perl, programming | 9 Comments

Databases 10 Years Ago

In preparation for the talk I’m giving tomorrow, I was assembling a slide that looks back about 10 years and remembering what our database infrastructure was like back then. I was at Yahoo! during that time and our leading-edge MySQL deployments were on machines with just under 1GHz processors running a 32-bit operating system, with relatively small, slow, and expensive disks (no SSDs!).

Back then MySQL 3.23 was the norm, though brave folks like us were running MySQL 4.0 beta releases in production to take advantage of the new and improved replication. Remember when simply having replication was a big deal?!

And InnoDB was really, really new–not for the faint of heart.

Oh, how times have changed.

Nowadays, I could probably simulate our old Yahoo! Finance “feeds” infrastructure on my little Thinkpad laptop (8GB RAM, SSD, Core i5 CPU) in a handful of virtual machines.

Those were the good old days.

Posted in mysql, tech | 3 Comments

MongoDB Pre-Splitting for Faster Data Loading and Importing

One nice thing about playing with Webdis is that I can watch the import rate of my multi-billion document MongoDB import in nearly real-time.

The downside of that is that I quickly found that once I got a few hundred million documents loaded, the performance not only dropped off quite a bit (expected) but also became highly variable (less expected). In fact, it got so bad that I started to worry how many months (not weeks or days) the whole process might take.

Chunks

After some poking around and reading on the mailing list, I realized that The Balancer was the culprit. To really understand what’s going on, you first need to read about:

  1. Sharding Administration
  2. Splitting Chunks

If you’re lazy, I can explain without you having to click at all.

You see, MongoDB’s sharding infrastructure groups documents into logical “chunks” which are capped at 200MB by default. Chunks are identified by a range of shard keys. Each chunk is assigned to one shard. So if you’re sharding by a key like PostingID (in our case), you end up with a bunch of chunks that look logically like this:

chunk01 (PostingID 10024 -> 20098) lives on shard01
chunk02 (PostingID 20099 -> 30031) lives on shard02
chunk03 (PostingID 30032 -> 30085) lives on shard03

And so on.
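
If you want to see that mapping on a live cluster, the config database holds it. Here’s a rough sketch using the Perl driver (the host and namespace are made up, and I’m glossing over the MinKey/MaxKey bounds on the first and last chunks):

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use MongoDB;

# Connect to a mongos router and read the cluster metadata.
my $conn   = MongoDB::Connection->new(host => 'mongodb://mongos1:27017');
my $chunks = $conn->get_database('config')->get_collection('chunks');

# One document per chunk: its shard key range and the shard it lives on.
my $cursor = $chunks->find({ ns => 'mydb.postings' });
while (my $chunk = $cursor->next) {
    printf "PostingID %s -> %s lives on %s\n",
        $chunk->{min}{PostingID}, $chunk->{max}{PostingID}, $chunk->{shard};
}
```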

Again, MongoDB groups documents into chunks of roughly equal size and maps those chunks to shards. As time goes on, it tries to notice imbalances such as a single shard having more chunks than the others. It then tries to correct these imbalances by moving chunks from the more populated shard(s) to the less populated shard(s). That is the job of the balancer.

In normal operation, this all works pretty well–especially if your data set fits mostly in RAM. However, our case is different in two important ways.

First off, this is a one-time data import. Our goal is to get it done as quickly as possible. We want to write, write, and write more. There are no reads going on.

Secondly, the size of data is expected to far exceed the available RAM on each of the shard servers. Once the migration is done, this is going to be a lightly used deployment most of the time, so that’s OK.

Since MongoDB uses memory-mapped files for all of its data and indexes, what you expect to see is that eventually all available RAM is used up and the kernel eventually needs to start writing dirty pages out to disk so that RAM can be freed up for new data. (This all assumes you’re not pre-emptively making fsync requests.)

If that’s all that happened, you might expect performance to take a temporary hit when these flushes happen and then recover back to some baseline value. But we weren’t just seeing degraded performance: we saw it take a nose dive, not really recover, and vary quite a bit over time (by a factor of 10 or more).

The Problem

After reading on the MongoDB mailing list, double-checking some docs, and looking more closely at logs, I started to see what was happening. This thread was particularly useful.

All writes were initially going to one of the three shards. As the data grew, it was automatically split into chunks. And the aforementioned balancer would eventually sense the need for balancing and start to move chunks from one shard to another.

Many of the chunks the balancer decided to move from the busy shard to a less busy shard contained “older” data that had already been flushed to disk to make room in memory for newer data. That means that the process of migrating those chunks was especially painful, since loading in that older data meant pushing newer data from memory, flushing it to disk, and then reading back the older data only to hand it to another shard and ultimately delete it. All the while newer data is streaming in and adding to the pressure.

That extra I/O and flushing eventually manifested themselves as lower throughput. A lot lower. Needless to say, the situation was not sustainable. At all.

The Solution

The better way to do this involves pre-splitting the key range, which is described in Splitting Chunks, followed by using the moveChunk command as described in Moving Chunks to assign the various chunks to different shards. (For bonus points, you also disable the balancer after this is done.)

Once that is done, you can load data at a fairly rapid rate without fear of chunk migrations happening between shards. And instead of all writes going to a single shard initially, they can be fairly evenly distributed among all shards. That leads to stable performance and greater sustained throughput overall.

Doing this well assumes that you have a good handle on what your data looks like, including the distribution of shard keys (if they’re not unique), average document size, and so on.

Fortunately, we’re using a unique shard key and I have a really good sense of our average document size from earlier testing with MongoDB. That allowed me to build a Perl script that would produce the necessary JavaScript code I could feed to the MongoDB shell (when talking to one of our mongos routing servers) to perform the splitting of chunks and assigning them to different shards in our cluster.

These are the inputs we started with:

  • min shard key: 1
  • max shard key: 2^31 + 20%
  • average doc size: 2,200 bytes
  • default chunk size: 200MB
  • num shards: 3

The resulting output was:

  • 95,325 documents per chunk
  • 27,033 total chunks
  • 9,011 chunks per shard
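
Here’s a stripped-down sketch of the kind of generator script I’m describing (the namespace, shard names, and simple round-robin assignment are placeholders; the real script is a bit more involved):

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use POSIX qw(ceil);

my $min_key      = 1;
my $max_key      = int((2**31) * 1.2);    # 2^31 + 20%
my $avg_doc_size = 2_200;                 # bytes
my $chunk_size   = 200 * 1024 * 1024;     # 200MB default chunk size
my @shards       = qw(shard01 shard02 shard03);

my $docs_per_chunk = int($chunk_size / $avg_doc_size);              # ~95,325
my $total_chunks   = ceil(($max_key - $min_key) / $docs_per_chunk); # ~27,033

warn sprintf "%d docs/chunk, %d chunks, ~%d chunks/shard\n",
    $docs_per_chunk, $total_chunks, ceil($total_chunks / @shards);

# Emit mongo shell commands: split at each boundary, then hand the
# resulting chunk to a shard round-robin.
my $i = 0;
for (my $key = $min_key + $docs_per_chunk; $key < $max_key; $key += $docs_per_chunk) {
    my $shard = $shards[ $i++ % @shards ];
    print qq{db.adminCommand({split: "mydb.postings", middle: {PostingID: $key}});\n};
    print qq{db.adminCommand({moveChunk: "mydb.postings", find: {PostingID: $key}, to: "$shard"});\n};
}
```

Feed the output to the mongo shell while connected to one of the mongos routers, disable the balancer once the moves finish, and then start loading.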

It took probably 6 hours to do the pre-splitting and chunk moving, but the resulting performance has been very good so far. We’re easily able to sustain a higher write rate with far less variance than we saw before.

This image shows what I see now:

The occasional dips are the result of a worker process finishing its current file and stopping to decompress the next one before it can continue adding documents to MongoDB.

Future Enhancements

10gen CTO Eliot tells me that they’re working to implement shard key hashing so that this can happen automatically (if you’re willing to sacrifice range queries based on your shard key). When that is done, loaded documents will be sprayed uniformly across the shards. That should greatly reduce the need for constant balancing to happen during this process.

Since we’re not anticipating range queries that involve our shard key, this would eliminate the need to pre-split and move chunks before importing this type of data in the future. The only question in my mind is what happens when a new shard or two is added. Presumably new documents will start to land on those new shards and the balancer will move chunks to them over time as well.

Posted in craigslist, mongodb, programming, tech | 27 Comments

Finally Learning Python using Google’s Videos

Back in 2006 I asked which language I should learn next: Ruby or Python. Well, I finally got around to feeding my brain a bit. I recently discovered Google’s Python Class and have watched the first few videos. So far it’s all very basic, but learning the fundamentals of Python has proven to be very useful. I already have a good sense of some of the language idioms and am really enjoying learning a new language.

Virtually all my work has been in Perl for the last 10 years or so, and I’ve often run across libraries or frameworks that I’d like to play with but didn’t have the right language in my background. Often that language has been Python.

Thanks to Google for making their internal training materials available to all.

Posted in programming, python, tech | 8 Comments