Redis Sharding at Craigslist

After reading Salvatore’s Redis Presharding, I thought I’d write up the sharding technique we use at Craigslist to take advantage of multiple cores and multiple hosts, and to allow for painless growth (up to a point).

Nodes

The primary building block of our Redis infrastructure is a node, which is simply a host:port pair that points to a single redis-server instance. Each node is given a symbolic node name to uniquely identify it in a way that doesn’t tie it to a specific host (or port).

The node name to node mappings are stored in a global configuration file that our Redis API uses behind the scenes.

Replication

We don’t use on-disk persistence very heavily because Redis primarily stores volatile or semi-volatile data and we prefer to have predictably low latency. It is common to see calls complete in 1ms or less, and many of our calls have timeouts of 100ms to handle exceptional cases. Instead of frequent disk writes, we replicate every node to a slave node on a different physical host and give it an otherwise identical configuration.
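In redis itself, that only takes a single extra directive in the slave’s config file. For example, a slave of the node at 192.168.1.100:63790 would carry something like this, with everything else matching the master:

# in the slave's redis.conf -- otherwise identical to the master's
slaveof 192.168.1.100 63790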

The master/slave mappings are stored in a global configuration file that our Redis API uses behind the scenes.

Configuration

The bulk of our code is Perl, so the configuration is expressed in Perl for easy consumption. Here’s a minimal example that uses 2 hosts, assumed to have 2 cores each. This provides 4 total redis nodes, two primaries (masters) and two secondaries (slaves). Losing a single host will result in virtually zero data loss.

our $redisConfig = {
    # node names
    'nodes' => {
        redis_1 => { address => '192.168.1.100:63790' },
        redis_2 => { address => '192.168.1.100:63791' },
        redis_3 => { address => '192.168.1.101:63790' },
        redis_4 => { address => '192.168.1.101:63791' },
    },

    # replication information: master address => its slave's address
    'master_of' => {
        '192.168.1.100:63790' => '192.168.1.101:63791',
        '192.168.1.101:63790' => '192.168.1.100:63791',
    },
};

In reality, our clusters each have 10 physical hosts with 4 cores apiece, which provides for 40 redis nodes (20 masters and 20 slaves).

Hashing

Our API takes a slightly different approach to hashing keys onto nodes than the presharding scheme Salvatore described. I wrote an API-compatible wrapper library around the CPAN Redis library that transparently provides consistent hashing along with other features like connection timeouts, command timeouts, and graceful handling of down servers with retries.

For most operations, the hashing of a key is automatic:

$redis->get("foo");

That’ll automatically run “foo” through the hash function and map it to a node name, which is then mapped to the proper node (host:port pair). This is a very important detail. By mapping keys to node names instead of directly to nodes (host:port pairs), we have the freedom to relocate a node to a different server without disturbing the consistent hashing ring.

To allow the user to specify their own hash key (so that related keys can all land on a given node), you pass an array reference where you’d normally pass a scalar. The first element is the hash key and the second is the real key:

$redis->get(["userinfo", "foo"]);

In that case “userinfo” is the hash key but “foo” is still the name of the key that is fetched from the redis node that “userinfo” hashes to.
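To make the node-name indirection concrete, here’s a rough sketch of the sort of thing the wrapper does internally. This is illustrative only, not the actual library (it skips virtual points on the ring, error handling, and so on), but it shows a key or [hash_key, real_key] pair being hashed to a node name on a ring, and the name then being resolved to a host:port via $redisConfig:

use Digest::MD5 qw(md5);

sub _ring_position {
    # First 4 bytes of an MD5 digest as a 32-bit position on the ring.
    return unpack('N', md5($_[0]));
}

sub node_for_key {
    my ($config, $key) = @_;
    my ($hash_key, $real_key) = ref $key eq 'ARRAY' ? @$key : ($key, $key);

    # Place each node *name* on the ring.
    my %ring;
    $ring{ _ring_position($_) } = $_ for keys %{ $config->{nodes} };
    my @points = sort { $a <=> $b } keys %ring;

    # Walk clockwise from the key's position to the next node point,
    # wrapping around to the start of the ring if necessary.
    my $pos     = _ring_position($hash_key);
    my ($point) = grep { $_ >= $pos } @points;
    $point      = $points[0] unless defined $point;

    my $name = $ring{$point};
    return ($name, $config->{nodes}{$name}{address}, $real_key);
}

# e.g. my ($name, $addr, $key) = node_for_key($redisConfig, ['userinfo', 'foo']);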

Room to Grow

Given this configuration, we have room to double the cluster in overall capacity twice (at a minimum) before we ever have to think about rehashing keys. First, we could simply double the available memory in our servers and increase the maxmemory setting in redis to match. Second, we could also double the number of hosts and move half our nodes to those new hosts. As long as we update the configuration to map the node names to the new nodes (host:port pairs), the consistent hashing ring is unaffected and things continue to run as expected (aside from a brief downtime during that re-configuration).
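For example, relocating redis_2 to a brand-new host (the 192.168.1.102 address below is hypothetical) only touches the address side of the mapping; the node name, and therefore the hashing ring, never changes:

    'nodes' => {
        redis_1 => { address => '192.168.1.100:63790' },
        redis_2 => { address => '192.168.1.102:63791' },   # relocated to a new host
        redis_3 => { address => '192.168.1.101:63790' },
        redis_4 => { address => '192.168.1.101:63791' },
    },
    # (the matching master_of entry would be updated to the new address as well)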

Real Numbers

As I already mentioned, today we use 10 hosts, each running 4 redis nodes (2 masters, 2 slaves). Each node has a maxmemory setting of 6GB since these are 32GB boxes and we need to account for fragmentation and the possibility of doing an occasional SAVE or BGSAVE or having to re-attach a slave. That gives us an aggregate capacity of (4*10*6)/2 = 120GB of RAM for the keyspace once you take replication into account (it’s 240GB of raw capacity).

If we replaced those 10 32GB boxes with 20 64GB boxes, we could keep the same number of overall nodes (40) and run only 2 per box. That’d mean using a maxmemory of 24GB, which works out to (2*20*24)/2 = 480GB of RAM for the keyspace (960GB raw). No re-hashing would be necessary.

If we’re willing to incur a one-time re-hashing, we could instead double the number of redis nodes (to take advantage of all the cores) and run 4 per box with a maxmemory of 12GB, which works out to (4*20*12)/2 = 480GB of RAM for the keyspace (960GB raw). The storage capacity is the same, but we’d have twice the CPU power available and an easy path to doubling in size again in the future (by doubling the number of hosts once more).

Future Idea: Slave Reads

One item on my wishlist is to take advantage of the otherwise idle slaves for handling “expensive” reads. This would be accomplished by prefixing a call with something like “slave_” so that this would automatically Do The Right Thing:

$redis->slave_zrevrange($key, $start, $stop, 'withscores');

The AUTOLOAD function in our wrapper would just strip the “slave_” prefix and send the command to the slave of the node that holds the given key. In theory, we could also automatically re-route reads (but probably not writes) to a slave node if the master node happens to become unresponsive or unavailable.
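Here’s a hypothetical sketch of how that could look inside the wrapper’s AUTOLOAD; the helper methods (node_for, slave_of, send_command) are made-up names standing in for the wrapper’s internals:

our $AUTOLOAD;

sub AUTOLOAD {
    my ($self, @args) = @_;
    (my $command = $AUTOLOAD) =~ s/.*:://;    # strip the package name
    return if $command eq 'DESTROY';

    # slave_zrevrange -> zrevrange, and remember that a slave was requested
    my $use_slave = ($command =~ s/^slave_//);

    # Hash the key (or [hash_key, real_key] pair) to a node name as usual,
    # then swap in that node's slave if the caller asked for one.
    my $key = shift @args;
    my ($node, $real_key) = $self->node_for($key);    # hypothetical helper
    $node = $self->slave_of($node) if $use_slave;     # hypothetical helper

    return $self->send_command($node, $command, $real_key, @args);
}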

Future Idea: Async Operations

I have a branch of my wrapper API that uses AnyEvent::Redis instead of the blocking Redis client to provide async access to redis. Since we have code that frequently makes 10-50 redis calls to gather up data (and the calls don’t depend on each other), we could see a substantial performance boost by moving to an async API.
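As a rough illustration of the pattern (using a single plain AnyEvent::Redis connection rather than our sharding wrapper; host, port, and keys are placeholders), a batch of independent GETs can be issued at once and collected when they all complete:

use AnyEvent;
use AnyEvent::Redis;

my $redis = AnyEvent::Redis->new(host => '127.0.0.1', port => 6379);

my %result;
my $cv = AE::cv;

for my $key (qw(foo bar baz)) {
    $cv->begin;
    $redis->get($key, sub {
        $result{$key} = shift;    # runs whenever this reply arrives
        $cv->end;
    });
}

$cv->recv;    # returns once every outstanding GET has completed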

Hopefully after a bit more polishing, cleanup, and testing we can start to take advantage of doing these calls in parallel.

Future Idea: Redis Cluster

Given that all our code is using the same API layer to talk to redis, if we move to redis-cluster someday it should be very straightforward to translate our style of hash keys to the {hashkey} style that redis-cluster will likely use. That will give us even more flexibility and remove some of the management overhead involved in (someday) adding nodes.
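If it comes to that, the translation itself looks nearly mechanical; a hypothetical helper might just rewrite our array-reference form into a hash-tagged key:

sub cluster_key {
    my ($key) = @_;
    return $key unless ref $key eq 'ARRAY';
    my ($hash_key, $real_key) = @$key;
    # Keys sharing the same {tag} should hash to the same cluster slot,
    # e.g. ['userinfo', 'foo'] becomes '{userinfo}foo'.
    return "{$hash_key}$real_key";
}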

See Also: Hacker News discussion

Posted in craigslist, nosql, programming, redis | 39 Comments

Webdis is Full of Awesome

I’ve long wanted a lightweight system for gathering and graphing simple performance metrics without a lot of centralized processing. I could use something like StatsD from the Etsy folks but got inspired by reading about Redis at Disqus the other day.

So I decided to start charting the process of migrating data into our MongoDB archive cluster. I already have worker processes that are tracking their own performance. And I have a “manager” process that makes sure the workers have work to do and also aggregates a bit of info about what the workers are doing. The workers all self-report into a Redis instance, so I updated the manager to store a few bits of aggregate data back in Redis as well.
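The manager-side write is nothing fancy. Roughly speaking (the key name, metric, and server address are made up for illustration), it appends a timestamped sample to a capped list:

use Redis;
use Time::HiRes qw(time);

my $redis = Redis->new(server => '127.0.0.1:6379');
my $docs_per_sec = 123.4;    # value aggregated from the workers

# Append a "timestamp:value" sample for the front end to plot,
# and keep only the most recent 1000 samples.
$redis->rpush('stats:docs_per_sec', join(':', time(), $docs_per_sec));
$redis->ltrim('stats:docs_per_sec', -1000, -1);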

To graph this data in the browser, I used a combination of jquery, flot, and Webdis. Before I say more about webdis, here’s what it looks like:

mongo-watch screenshot

If you work with Redis and haven’t heard of webdis, you should really check it out. It’s a tiny event-based web server designed specifically to put a REST interface on Redis. Webdis can transform Redis output into a variety of very useful formats:

  • JSON
  • JSONP
  • XML
  • BSON
  • Text

In other words, it makes it trivial to build little applications using jQuery in the browser without having an app server involved. Just put the logic in the JavaScript and the data in Redis. Webdis gives you a handy way to move data in and out of Redis using methods that feel very natural in JavaScript.
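To give a flavor of the URL convention (the key name and addresses are made up; 7379 is webdis’s default port), the same data can be pulled just as easily from Perl or curl as from jQuery:

use LWP::Simple qw(get);
use JSON qw(decode_json);

# Webdis maps /COMMAND/arg1/arg2/... onto Redis commands; the .json
# extension selects JSON output, keyed by the command name.
my $raw  = get('http://127.0.0.1:7379/LRANGE/stats:docs_per_sec/-60/-1.json');
my $data = decode_json($raw);

print join("\n", @{ $data->{LRANGE} }), "\n";    # the last 60 samples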

I’m toying with the idea of generalizing this a little bit so it’s possible to throw arbitrary stats at Redis and have an interface for exploring and plotting them.

Posted in craigslist, programming, redis, tech | 3 Comments

Slides for: Fusion-io and MySQL 5.5 at Craigslist

The slides from the talk I gave at Percona Live San Francisco yesterday are now available on slideshare (hopefully the embed works here too):

View more presentations from Jeremy Zawodny.

Overall, I enjoyed the conference: not just meeting up with folks I don’t see often enough, but also getting a renewed sense of how active the MySQL ecosystem really is.

Posted in craigslist, mysql, programming, tech | 6 Comments

MongoDB Data Types and Perl

In working to import a couple billion documents from MySQL into MongoDB, I’m trying to make sure I get it all right. And I recently stumbled upon one way in which I was getting it decidedly wrong: data types.

I had often wondered why MongoDB quoted numeric values when I displayed documents in the command-line shell. But I never got curious enough to dig into it. Maybe it was just for consistency. Maybe it was a JavaScript thing of some sort.

But I was re-reading some of the documentation for the MongoDB Perl driver and found myself on the page that describes data types.

It says:

If the type of a field is ambiguous and important to your application, you should document what you expect the application to send to the database and convert your data to those types before sending. There are some object-document mappers that will enforce certain types for certain fields for you. You generally shouldn’t save numbers as strings, as they will behave like strings (e.g., range queries won’t work correctly) and the data will take up more space.

Uh oh. At that moment, I realized what was going on. The MongoDB driver was looking at each Perl variable and asking whether it was a string. If it was, the data was stored as a string. That explained why many (maybe all) of my integers were strings: they had been stringified in Perl at some point in the process of being pulled from MySQL, serialized, compressed, written to disk, and then read back and re-materialized.

What to do?

I ended up creating a list of field names that I know must be integers and running each of them through an int() call before handing the document to MongoDB for insertion. Magically enough, it just works.
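A simplified version of that looks something like the following; the field names here are placeholders, not the actual schema:

my @int_fields = qw(posting_id user_id created_epoch);

sub coerce_ints {
    my ($doc) = @_;
    for my $field (@int_fields) {
        next unless defined $doc->{$field};
        # int() (or any numeric op) restores the value's numeric flag,
        # so the driver stores a number rather than a string.
        $doc->{$field} = int($doc->{$field});
    }
    return $doc;
}

# $collection->insert( coerce_ints($doc) );    # insert_one() on newer drivers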

Word to the wise: double-check your field types. It’s far too easy for an integer to get stringified by Perl without you noticing.

Posted in mongodb, programming, tech | 10 Comments

When an example falls in your lap

As I recently noted, I’m giving a short talk at Percona Live about our experience with Fusion-io for MySQL at Craigslist. As is often the case, I agreed to give the talk before giving much thought to exactly what I’d say. But recently I’ve started to sweat a little at the prospect of having to put together a compelling and understandable presentation.

Thankfully, due to a cache misconfiguration this week, we ended up taking a number of steps that will not only help us deal with future growth but, as a side effect, let us directly quantify some of the benefits of Fusion-io in our big MySQL boxes. For whatever reason, the bulk of the presentation basically fell into my lap today.

Now I just have to put it all together.

I won’t go so far as to claim that this is an argument for procrastination, but it sure is nice when something like this happens. 🙂

Posted in craigslist, mysql, tech | 8 Comments

Speaking at Percona Live in San Francisco

On Wednesday, February 16th, I’ll be attending Percona Live in San Francisco to hear about what’s new in the MySQL ecosystem and to talk about our adoption of Fusion-io storage for some of our systems at Craigslist. Not only do we have a busy web site, but the data itself has posed some unique challenges over the last few years.

Part of getting a handle on that was upgrading to faster storage and moving from years-old MySQL 5.0.xx to more modern releases. I’ll also provide a bit of background on our plans to continue scaling and growing in the coming years.

If you’re in the area and interested in some of the cutting-edge work that’s been going into production as part of major MySQL/XtraDB deployments, check out the conference. It’s an affordable one-day event, considering all the expert brains you can pick and learn from (myself excluded).

If you’re going and have questions you’d like to ask now so I can work them into my talk, please do so in the comments.

Posted in craigslist, mysql, tech | 1 Comment

Redis 2.2 replication consistency with maxmemory and LRU

We’ve been running a version of Redis 2.2 RC at Craigslist for a few months now and it has been flawless. It has rapidly become the backbone of one of our internal systems.

When I upgraded from the 2.0 series and started testing 2.2, I made a few changes to our configuration. The meat of the config file looks like this:

appendonly no
appendfsync no
rdbcompression yes
maxmemory 6gb
maxmemory-policy volatile-lru
maxmemory-samples 3
glueoutputbuf yes
hash-max-zipmap-entries 512
hash-max-zipmap-value 512
activerehashing yes
vm-enabled no

We’re running 4 instances on each of our servers. Two are masters and two are slaves of instances on another server. So our 10 machine cluster has 40 redis-server instances, half of which are masters and half of which are slaves.

The “maxmemory” and “maxmemory-policy” directives are fairly new (part of 2.2) and allow us to make sure that a single redis-server instance doesn’t go over 6GB. Since there are 4 per box, that means redis in total should never use more than 24GB (even though the server has 32GB). We don’t use BGSAVE, so the only time a redis-server should fork and trigger any serious copy-on-write (COW) behavior is when a new slave is attaching to a master.

Now here’s the interesting bit. All the servers are configured identically, but it turns out that the master doesn’t synthesize a delete event into the replication stream when it evicts (deletes) a key after hitting maxmemory under the LRU policy we’ve chosen. This is different from the normal expire/ttl mechanism that applies to “volatile” keys in redis. In that case, the master controls which keys are removed and sends a delete to each slave when that happens.

The result is that the master and slave are no longer identical once you hit maxmemory and LRU-based deletes begin. And the longer this runs, the further they can drift apart. This is a little surprising.
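If you want to see it for yourself, a quick (and decidedly unscientific) check is to push a master past its maxmemory with volatile keys and then compare key counts; the addresses and sizes below are placeholders:

use Redis;

my $master = Redis->new(server => '10.0.0.1:6379');
my $slave  = Redis->new(server => '10.0.0.2:6379');

# Volatile keys (they carry a TTL), so volatile-lru is allowed to evict them.
# Write enough of them to blow well past the master's maxmemory setting.
$master->setex("filler:$_", 3600, 'x' x 1024) for 1 .. 8_000_000;

sleep 10;    # give replication a moment to catch up
printf "master dbsize: %d, slave dbsize: %d\n",
    $master->dbsize, $slave->dbsize;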

The discussion going on now revolves around whether or not this is the right default behavior. I’m a bit on the fence but leaning toward “keep the master and slave consistent” because I’m a fan of the principle of least surprise.

If this matters to you, now is a good time to speak up.

Posted in nosql, programming, redis, tech | 4 Comments

Suggestions for analysis of all craigslist postings?

I’m sure this will end up attracting far more (both in number and complexity) suggestions than I can reasonably implement, but I figured I’d ask anyway…

I’m working on a project (discussed at the recent MongoSV conference) that will migrate the entire Craigslist posting archive from MySQL to MongoDB. While I’m testing some of the migration code, I see a lot of posting titles scroll by on the screen. Millions of them.

That got me wondering what the popular words in the titles might be. I could easily code that into the migration job without appreciably slowing it down. And that got me thinking about the other things I might be able to compute and summarize along the way.

And that made me wonder what smart readers like you would do if you were going to run through all the data a few times on reasonably fast hardware.

Drop a comment and let me know. Hopefully I’ll be able to implement a few of them.

Posted in craigslist, tech | 47 Comments

Kilauea at Night

Kilauea at Night

Last night, after a great dinner at the Kilauea Lodge, we ventured back to the Jaggar Museum along the crater rim for a chance to view the hot gases coming out of the lava pool in the caldera. We managed to snap over a hundred pictures in the low-light conditions and at least some of them came out pretty well. Many were 2.5 second exposures at ISO 3200.

We’re looking forward to some night lava viewing near the shoreline tonight. Should be fun!

Posted in travel | 3 Comments

Heading to Hawaii

In less than 24 hours we’ll arrive in Hawaii (staying on the big island) for a belated holiday vacation. It’s a lot cheaper if you go a week or two after the big holiday rush.

We’ll be there for about 7 days with 6 days where we can really get out and do things. The agenda is not set in stone at all yet, but we’re hoping to do some or all of the following:

  • visit the volcanoes
  • see whales
  • swim with dolphins
  • go parasailing
  • scuba diving or snorkeling
  • enjoy the beaches and pools

We’d appreciate any tips, suggestions, cautions, or other advice.

I’ll try to post a few pics now and then over on my Tumblr (that’s all I seem to use it for currently). I’ll post the really good ones after our return. While Kathleen has been to Hawaii a few times, I’ve never been. So this is all new to me. 🙂

In any case, it’ll be a good break that we’ll both enjoy.

Posted in travel | 11 Comments