Sharding is Hard

Reading the Foursquare/MongoDB post-mortem I’m struck by a few things. First, the folks at 10gen are doing a great job of being very open and upfront about what happened and how they hope to solve this problem so that it doesn’t happen again. As someone currently working on a MongoDB deployment myself, I can tell you that a team like that is worth its weight in gold.

Secondly, I’m reminded of the issues that Reddit had a while back with Cassandra. In both cases, servers were arguably under-provisioned and running close to their limits. And when it came time to add more servers to alleviate the load, the process actually made things worse (because of the data migration involved) and didn’t offer immediate relief (because the original servers needed to be compacted).

I suspect that we’ll see this happen more often as time goes on. And there will be high-profile cases. They may or may not be with Cassandra or MongoDB, but the strategies they use are likely also being used by other sharding-friendly NoSQL database engines.

There is no single solution. This is partly a human issue related to planning and monitoring. You really have to stay on top of your growth rates. And it’s also a deficiency in software systems. There may be a need for more sophisticated or aggressive data migration and compaction routines.

All this goes to show that the folks who like to say “just shard your data and you’ll scale forever” are really over-simplifying the problem. The scary part is that I’m not sure many of them realize how much they’re glossing over.

Sharding is hard.

Always Test with Real Data

As I previously noted, I’m in the midst of converting some data (roughly 2 billion records) into documents that will live in a MongoDB cluster. And any time you move data into a new data store, you have to be mindful of any limitations or bottlenecks you might encounter (since all systems have had to make compromises of some sort or another).

In MySQL, one of the biggest compromises we make is deciding which indexes really need to be created. It’s great to have data all indexed when you’re searching it, but not so great when you’re adding and deleting many rows.

In MongoDB, the thing that gets me is the document size limit. Currently an object stored in MongoDB cannot be larger than 4MB (though that’s likely to be raised soon). Now, you can build your own MongoDB binaries and tweak that parameter, but I’ve been advocating for making the default higher and giving the user the ability to adjust it without recompiling.

While doing a test data load recently (partly to exercise the new firmware on a pair of Fusion-io cards and partly to test the import half of my code), I hit the 4MB limit on a document. This was a bit of a surprise to me because I know that the average size of our documents is around 2KB. To have something that blows the 4MB cap means there are some real outliers in the data. It’s not too surprising, given that we’re talking about 2 billion documents, all of which can have a fair amount of metadata (mostly from events triggered by Craigslist users).

But how many outliers are there? And how big are they?

To find out I wrote a tool to scan all of the data that’s currently available. As of yesterday that’s 200 million documents. (The export from MySQL is taking a while and I’m staging all the data on disk before loading into MongoDB. And it needs to run for several more days yet!) The code approximates the size of the data when stored in MongoDB and keeps a count of how many documents will fit in various buckets. There’s a 1KB bucket (documents less than or equal to 1KB in size), a 2KB bucket, 4KB, 8KB, and so on. The code just keeps doubling the bucket maximum until a document fits.
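
Here’s a minimal sketch of that bucketing logic. It’s not the actual tool: next_document() is a placeholder for reading the staged export, and the length of encode_json() output is only a rough stand-in for the real BSON size.

#!/usr/bin/perl -w
$|++;

## sketch of the size histogram: approximate each document's size and
## keep doubling a 1KB bucket limit until the document fits

use strict;
use JSON::XS qw(encode_json);

my %buckets;
my $total = 0;

sub bucket_kb {
    my $size  = shift;
    my $limit = 1024;                 ## smallest bucket: 1KB
    $limit *= 2 while $size > $limit; ## double until the document fits
    return $limit / 1024;             ## bucket label in KB
}

while (my $doc = next_document()) {
    $buckets{ bucket_kb(length(encode_json($doc))) }++;
    $total++;
}

for my $kb (sort { $a <=> $b } keys %buckets) {
    printf "%6d KB  %02d%%  %d\n", $kb, 100 * $buckets{$kb} / $total, $buckets{$kb};
}
print "\ntotal count: $total\n";

## placeholder: would yield the next decoded document from the staging area
sub next_document { return }

__END__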

At the end I get a histogram of the data. It tells me which buckets were used, how many documents fell into each, and what percentage of the total that is. And, not surprisingly, the data was more interesting than I would have predicted.

For our newer (and bigger) cluster:

    1 KB  00%  115594
    2 KB  72%  73095515
    4 KB  20%  20871570
    8 KB  03%  3901456
   16 KB  02%  2177114
   32 KB  00%  513621
   64 KB  00%  7311
  128 KB  00%  467
  256 KB  00%  15
  512 KB  00%  7
 1024 KB  00%  5
 4096 KB  00%  1
 8192 KB  00%  2
16384 KB  00%  3
32768 KB  00%  2
65536 KB  00%  4
 
total count: 100682687

And our older (smaller) cluster:

  1 KB  35%  26883047
  2 KB  49%  37508498
  4 KB  11%  8816758
  8 KB  02%  1985910
 16 KB  00%  560498
 32 KB  00%  154749
 64 KB  00%  16377
128 KB  00%  475
 
 total count: 75926312

As you can see, there are only a few outliers so far. Once all the data is available, I can decide how to handle this. Odds are that I’ll either need to come up with some special case or simply truncate some of that data (it’s not all equally valuable).

It’s interesting that I intuitively “knew” that most of our documents would fall into the 2KB range. Based on what I see so far, about 80% of them do. And most of the other 20% aren’t too far off. But those wacky outliers that are a few MB in size would have been difficult to imagine.

If this data is representative of the whole (which is difficult to assume with such outliers), then I only have a small handful of exceptions to the 4MB limit to really worry about. But only time will tell at this point.

When in doubt, check the data. The real data. All of it.

Tools and Technology I’d Like To Use

This is a short list of technology and tools that I’ve been looking at off and on over the last few months and would like to try out to solve real problems (not just toy projects).

  • RabbitMQ, because queuing makes a lot more sense than polling in a lot of applications.
  • 0MQ, because I like the higher level network connection abstractions.
  • Hadoop (probably from Cloudera), because we have a lot of machines and a lot of data and are always Doing It Wrong.  MapReduce and HDFS could simplify A LOT and make things way faster.
  • Redis 2.2, because having the expire semantics that people expect will make some things a lot easier.  2.0 is working great but 2.2 will fix one of its biggest warts, IMO.
  • Apache Mahout, because I’m curious what we could learn if we feed it some of our data.

MongoDB would have been on this list a few months ago, but I’m in the middle of a project using it now.  Come to MongoSV and you can hear about that (of course, I’ll talk about it some here as well).

Since I work in Perl a lot but haven’t kept up on some of what the community has built in the last few years, there are some modules/frameworks that I feel like I should be paying more attention to and trying out:

  • Moose, because it seems to make OO not suck.
  • POE, because I like event-driven stuff in some cases.
  • Coro, because it seems over the top and crazy, but also quite useful.
  • Plack, because I’m starting to think we’d be better off ditching Apache/mod_perl since we’re really not using much of Apache (and we’re still on 1.3).
  • AnyEvent, because I’ve played with it some but would really like to do more.

Now, does anyone have some spare time I can borrow?

My Motorola Droid Keyboard is Unnecessary

I’ve had a Motorola Droid for over six months now and have decided that the slide-out keyboard is totally unnecessary for my use. That’s a little surprising to me in retrospect, since I originally got the Droid for three reasons:

  1. it ran Android on the Verizon network
  2. it had a good touchscreen
  3. it had a keyboard I could actually type on

In reality the on-screen keyboard is good enough for most circumstances that I wish I could give back the physical keyboard in return for a thinner phone (or maybe more battery). I just don’t find myself doing nearly as much typing as I thought I would.

App designers have done a very good job of making it easy to live without much keyboard activity, and the auto-complete stuff that some apps take advantage of makes it easy to use the “slower” software keyboard in most cases. The only real exception has been using an SSH client. But that’s rarely something I do more than once a month anyway.

I’m guessing my next phone will have a better screen (of course), run Android, and not have a physical keyboard at all.

How I Comment Perl Code

I realized a few days ago that I have a particular way of commenting my Perl code. I wonder if I’m unique in this way, or if these are habits I’ve picked up reading others’ code over the years.

A documentation comment is prefixed by two hashes (##) followed by a single space and indented to the level that matches the code:

    ## this loop iterates over each foo item to compute bar

A line or block of code that’s being commented out for the long term (such as extra debugging that I might only turn on once in a blue moon) is prefixed by a single hash (#) followed by a single space and indented to the appropriate level.

    # if ($foo->validateIndex()) {
    #     print STDERR "blah!";
    # }

A line or block of code that’s temporarily commented out and expected to be turned back on soon is prefixed by a single hash (#) with no space to separate it from the code itself. Oftentimes it’s not indented appropriately either.

    #unlink $tmpfile or die "unlink: $!";

What commenting habits do you have?

1,250,000,000 Key/Value Pairs in Redis 2.0.0-rc3 on a 32GB Machine

Following up on yesterday’s 200,000,000 Keys in Redis 2.0.0-rc3 post, which was a worst-case test scenario to see what the overhead for top-level keys in Redis is, I decided to push the boundaries in a different way. I wanted to use the new Hash data type to see if I could store over 1 billion values on a single 32GB box. To do that, I modified my previous script to create 25,000,000 top-level hashes, each of which had 50 key/value pairs in it.

The code for redisStressHash was this:

#!/usr/bin/perl -w
$|++;

use strict;
use lib 'perl-Redis/lib';
use Redis;

my $r = Redis->new(server => 'localhost:63790') or die "$!";

## 25M hashes x 50 field/value pairs each = 1.25B pairs (2.5B strings)

for my $key (1..25_000_000) {
	my @vals;

	for my $k (1..50) {
		my $v = int(rand($key));
		push @vals, $k, $v;
	}

	$r->hmset("$key", @vals) or die "$!";
}

exit;

__END__

Note that I added a use lib in there to use a modified Redis Perl library that speaks the multi-bulk protocol used throughout the Redis 2.0 series.

If you do the math, that yields 1.25 billion (1,250,000,000) key/value pairs stored. This time I remembered to time the execution as well:

real	160m17.479s
user	58m55.577s
sys	5m53.178s

So it took about 2 hours and 40 minutes to complete. The resulting dump file (.rdb file) was 13GB in size (compared to the previous 1.8GB) and the memory usage was roughly 17GB.

Here’s the INFO output again on the master:

redis_version:1.3.16
redis_git_sha1:00000000
redis_git_dirty:0
arch_bits:64
multiplexing_api:epoll
process_id:21426
uptime_in_seconds:12807
uptime_in_days:0
connected_clients:1
connected_slaves:1
blocked_clients:0
used_memory:18345759448
used_memory_human:17.09G
changes_since_last_save:774247
bgsave_in_progress:1
last_save_time:1280092860
bgrewriteaof_in_progress:0
total_connections_received:22
total_commands_processed:32937310
expired_keys:0
hash_max_zipmap_entries:64
hash_max_zipmap_value:512
pubsub_channels:0
pubsub_patterns:0
vm_enabled:0
role:master
db0:keys=25000000,expires=0

Not bad, really. This provides a slightly more reasonable use case of storing many values in Redis. In most applications, I suspect people will have a number of “complex” values stored behind their top-level keys (unlike my previous simple test).
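
A guess as to why the per-pair overhead is so low here (my inference, not something this test isolates): with only 50 fields each, these hashes stay under the zipmap thresholds shown in the INFO output above, so Redis 2.0 stores each one as a compact zipmap instead of a full hash table. The relevant redis.conf settings (shown here with their defaults) are:

# hashes with <= 64 fields and values <= 512 bytes are kept in the
# memory-efficient zipmap encoding rather than a real hash table
hash-max-zipmap-entries 64
hash-max-zipmap-value 512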

I’m kind of tempted to re-run this test using LISTS, then SETS, then SORTED SETS just to see how they all compare from a storage point of view.

In any case, a 10 machine cluster could handle 12.5 billion key/value pairs this way. Food for thought.

200,000,000 Keys in Redis 2.0.0-rc3

I’ve been testing Redis 2.0.0-rc3 in the hopes of upgrading our clusters very soon. I really want to take advantage of hashes and various tweaks and enhancements that are in the 2.0 tree. I was also curious about the per-key memory overhead and wanted to get a sense of how many keys we’d be able to store in our ten machine cluster. I assumed (well, hoped) that we’d be able to handle 1 billion keys, so I decided to put it to the test.

I installed redis-2.0.0-rc3 (reported as the 1.3.16 development version) on two hosts: host1 (master) and host2 (slave).

Then I ran two instances of a simple Perl script on host1:

#!/usr/bin/perl -w
$|++;

use strict;
use Redis;

my $r = Redis->new(server => 'localhost:63790') or die "$!";

for my $key (1..100_000_000) {
	my $val = int(rand($key));
	$r->set("$$:$key", $val) or die "$!";
}

exit;

__END__

Basically that creates 100,000,000 keys with randomly chosen integer values. The keys are “$pid:$num” where $pid is the process id (so I could run multiple copies). In Perl the variable $$ is the process id. Before running the script, I created a “foo” key with the value “bar” to check that replication was working. Once everything looked good, I fired up two copies of the script and watched.

I didn’t time the execution, but I’m pretty sure it took a bit longer than 1 hour, and definitely less than 2 hours. The final memory usage on both hosts was right about 24GB.

Here’s the output of INFO from both:

Master:

redis_version:1.3.16
redis_git_sha1:00000000
redis_git_dirty:0
arch_bits:64
multiplexing_api:epoll
process_id:10164
uptime_in_seconds:10701
uptime_in_days:0
connected_clients:1
connected_slaves:1
blocked_clients:0
used_memory:26063394000
used_memory_human:24.27G
changes_since_last_save:79080423
bgsave_in_progress:0
last_save_time:1279930909
bgrewriteaof_in_progress:0
total_connections_received:19
total_commands_processed:216343823
expired_keys:0
hash_max_zipmap_entries:64
hash_max_zipmap_value:512
pubsub_channels:0
pubsub_patterns:0
vm_enabled:0
role:master
db0:keys=200000001,expires=0

Slave:

redis_version:1.3.16
redis_git_sha1:00000000
redis_git_dirty:0
arch_bits:64
multiplexing_api:epoll
process_id:5983
uptime_in_seconds:7928
uptime_in_days:0
connected_clients:2
connected_slaves:0
blocked_clients:0
used_memory:26063393872
used_memory_human:24.27G
changes_since_last_save:78688774
bgsave_in_progress:0
last_save_time:1279930921
bgrewriteaof_in_progress:0
total_connections_received:11
total_commands_processed:214343823
expired_keys:0
hash_max_zipmap_entries:64
hash_max_zipmap_value:512
pubsub_channels:0
pubsub_patterns:0
vm_enabled:0
role:slave
master_host:host1
master_port:63790
master_link_status:up
master_last_io_seconds_ago:512
db0:keys=200000001,expires=0

This tells me that on a 32GB box, it’s not unreasonable to host 200,000,000 keys (if their values are sufficiently small). Since I was hoping for 100,000,000 with likely larger values, I think this looks very promising. With a 10 machine cluster, that easily gives us the 1,000,000,000 keys I was hoping for.

In case you’re wondering, the redis.conf on both machines looked like this.

daemonize yes
pidfile /var/run/redis-0.pid
port 63790
timeout 300
save 900 10000
save 300 1000
dbfilename dump-0.rdb
dir /u/redis/data/
loglevel notice
logfile /u/redis/log/redis-0.log
databases 64
glueoutputbuf yes

The resulting dump file (dump-0.rdb) was 1.8GB in size.
I’m looking forward to the official 2.0.0 release. 🙂

Coding Outside My Comfort Zone: Front-End Hacking with jQuery and flot

To folks who’ve read my tech ramblings over the years, it’s probably no surprise that I generally avoid doing front-end development (HTML, CSS, JavaScript) like the plague. In fact, that’s probably one of the reasons I finally migrated my blog from a self-hacked and highly-tweaked MovableType install to WordPress. I spend the majority of my time dealing with back-end stuff: MySQL, Sphinx, Redis, and the occasional custom data store (for a feature we’re launching soon). I try to build and maintain fast, stable, and reliable services upon which people who actually have front-end talent can build usable and useful stuff.

But every now and then I have an itch to scratch.

For the last year and a half, I’ve wanted to “fix” a piece of our internal monitoring system at craigslist. We have a home-grown system for gathering metrics across all our systems every minute as well as storing, alerting, and reporting on that data. One piece of that is a plotting tool that has a web interface which lets you choose a metric (like CpuUser or LoadAverage), time frame, and hosts. When you click the magic button, it sends those selections to a server that pulls the data, feeds it to gnuplot, and then you get to see the chart. It’s basic but useful.

However, I wanted a tool that gave me more control, took advantage of the fact that I have a lot of CPU power and RAM right here on my computer, and made prettier charts. I wanted easier selection of hosts and metrics (with auto-complete as you type instead of really big drop-down lists), plotting of multiple metrics per chart, and a bunch of other stuff. So I went back to a few bookmarks I’d collected over the last year or two and set about building it.

I ended up using JSON::XS and building a mod_perl handler to serve as a JSON endpoint that could serve up lists of hosts and metrics (for the auto-completion) from MySQL as well as the time series data for the plots. That was the easy part. For the front-end I used jQuery (the wildly popular JavaScript toolkit) and flot (a simple but flexible and powerful charting library based on jQuery). It took a lot of prototyping and messing around to get the JavaScript bits right. That is due largely to my lack of knowledge and experience. It’s frustrating when nothing happens and I find myself wondering what good debugging tools might exist. But instead of actually bothering to install and learn something like Firebug, I just charge ahead and try to reason out WTF is going on. Eventually I get somewhere.
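
For the curious, the shape of that JSON endpoint is easy to sketch. This isn’t the actual handler, just a minimal illustration of the idea written against the mod_perl 2 API; the package, database, table, and column names are all made up:

package CL::MetricsJSON;

## hypothetical mod_perl 2 handler: given ?term=web, return a JSON list
## of matching host names for the jQuery auto-complete widget
use strict;
use warnings;
use Apache2::RequestRec ();
use Apache2::RequestIO ();
use Apache2::Const -compile => qw(OK);
use JSON::XS qw(encode_json);
use DBI;

sub handler {
    my $r = shift;

    ## quick and dirty query string parsing (no URL decoding, to keep it short)
    my %args   = map { split /=/, $_, 2 } split /&/, ($r->args || '');
    my $prefix = $args{term} || '';

    my $dbh = DBI->connect('DBI:mysql:database=monitoring;host=dbhost',
                           'user', 'pass', { RaiseError => 1 });

    my $hosts = $dbh->selectcol_arrayref(
        'SELECT name FROM hosts WHERE name LIKE ? ORDER BY name LIMIT 25',
        undef, "$prefix%");

    $r->content_type('application/json');
    $r->print(encode_json($hosts));

    return Apache2::Const::OK;
}

1;

The time series endpoint is the same idea, just returning arrays of [timestamp, value] pairs, which is a format flot can plot directly.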

As with most things, the initial learning curve was steep. But I eventually started to feel a little comfortable and productive. I had a few patterns for how to get things done and, most importantly, I understood how they worked. So I was able to piece together a first version with all the minimal functionality I thought would be good to have. Yesterday I made that first version available internally. There’s already a wishlist for future features, but I’m happy with what I have as a starting point.

It was fun to step out of the normal stuff I do. Getting out of my comfort zone to work on front-end stuff gave me a renewed appreciation for all this browser technology (it works on Firefox, Chrome, and Safari), the libraries I built upon, and all the collective work that must have produced them. Aside from scratching my own itch, I now feel a little less resistant to hack on JavaScript. So that means I’m just a bit more likely to dive into something similar in the future.

It feels like I stretched parts of my brain that don’t normally get much of a workout. I like that.

MySQL 5.5.4-m3 in Production

Back in April I wrote that MySQL 5.5.4 is Very Exciting and couldn’t wait to start running it in production. Now here we are several months later, using 5.5.4-m3 on all the slaves in what is arguably our most visible user-facing cluster (and one of our busiest). Along the way we deployed some new hardware (Fusion-io) but not a complete replacement. Some boxes are Fusion-io, some local RAID, and some SAN. We have too many eggs for any one basket.

We also converted tables to the Barracuda format in InnoDB, dropped an index or two, converted some important columns to BIGINT UNSIGNED, and enabled 2:1 compression for the table that has big chunks of text in it. Aside from a few false starts with the Barracuda conversion and compression, things went pretty well. Coming from 5.0 (skipping 5.1 entirely) we had some my.cnf work to do to take advantage of all the new InnoDB tuning (especially on the boxes with Fusion-io cards and more memory). Hopefully switching the master goes smoothly too.
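
For anyone who hasn’t done the Barracuda and compression dance on 5.5 yet, the moving parts look roughly like this. This is a sketch, not our actual settings; the table name, column name, and KEY_BLOCK_SIZE are just for illustration:

# my.cnf: compressed tables require per-table tablespaces and the
# Barracuda file format
innodb_file_per_table = 1
innodb_file_format    = Barracuda

-- per table: compressing 16KB pages into 8KB blocks is the 2:1 target
ALTER TABLE postings ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;

-- and the BIGINT UNSIGNED conversion for a column that outgrew INT
ALTER TABLE postings MODIFY id BIGINT UNSIGNED NOT NULL;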

Needless to say, we have some very good and dedicated folks working behind the scenes to make it all happen (hi, Josh). Most of my involvement was initial testing, prodding, schema changes, and finding my.cnf options for the InnoDB tuning.

Database Drama

There’s been a surprising amount of drama (in some circles, at least) about database technology recently.  I shouldn’t be surprised, given the volume of reactions to the I Want a New Datastore post that I wrote. (Hint: I still hear from folks pitching the newest data storage systems.)

The two things that caught my eye recently involve Cassandra and MongoDB (and, indirectly, MySQL). First was what I read as a poorly thought out and whiny critique of MongoDB’s durability model: MongoDB Performance & Durability. Just because something is the default doesn’t mean you have to use it that way. Thankfully there was reasoned discussion and reaction elsewhere, including the Hacker News thread about it.

Look. Building fast, feature-rich, scalable systems is Really Hard Work. You’re always making tradeoffs. You can have the ultimate in single-server durability (with all the fancy hardware that implies) but you’re going to really sacrifice performance (or budget!). But at least you won’t have a lot of complexity. Or you can build something that scales out really well using many machines. But that adds a lot of complexity and different sacrifices.

Next comes the Twitter Engineering blog post Cassandra at Twitter Today in which we learn that Twitter loves Cassandra but they’re opting to use their sharded MySQL infrastructure for storing tweets. This surprised a lot of people and even became “news” at TechCrunch. To me, though, it’s hardly surprising. The long version of why I say that is captured in the Reddit comments on the story.

But if you’re not interested in reading the 80+ comments currently there, maybe I can simplify it a bit. Have you ever wondered why there are so damned many NoSQL systems out there?

Simple. Different circumstances dictate making different choices when presented with the list of tradeoffs. This includes durability, performance, data model, scalability, richness of query language, replication model, atomicity, indexing, transactions, administration and support, etc.

Each and every one of those NoSQL projects exists because someone needed it. And sometimes you need to start using a shiny new thing before really understanding its limitations and what those tradeoffs REALLY mean in your environment. And once you’ve done that you might realize that sticking with the tried and true is the best path forward. The same is true of programming languages (Ruby vs. Python vs. PHP vs. JavaScript vs. Go vs. whatever) and the frameworks that programmers decide to use. Lots of drama and fan-boy arguments that really boil down to different people having different needs and priorities.

I’m not saying any of this to promote MySQL and knock Cassandra or MongoDB. I lost at least a day of work last week due to some legacy MySQL issues that seem completely insane in the modern world. But years ago those issues were edge cases. Nowadays they’re very easy to hit.

I’ve actually spent some time recently playing with both Cassandra and MongoDB in the hopes of replacing  a big (in data size, not query volume) MySQL cluster. Both are impressive (and frustrating) in different ways. But ultimately, I do expect that one of them will work quite nicely in this role–and possibly others later on. Not having to contemplate another multi-week ALTER TABLE will be a welcome change!

Which one?  Stay tuned. 🙂

Maybe what I should do in the meantime is spend more time reading stories about what works WELL for people, instead of how they’re unhappy with their choice of tool. All this drama is a real time sink.
