Redis Sharding at Craigslist

Posted on February 26, 2011 by Jeremy Zawodny

After reading Salvatore’s Redis Presharding, I thought I’d write up the sharding technique we use at Craigslist to take advantage of multiple cores, multiple hosts, and allow for painless growth (up to a point).

Nodes

The primary building block of our Redis infrastructure is a node, which is simply a host:port pair that points to a single redis-server instance. Each node is given a symbolic node name to uniquely identify it in a way that doesn’t tie it to a specific host (or port).

The node name to node mappings are stored in a global configuration file that our Redis API uses behind the scenes.

Replication

We don’t use on-disk persistence very heavily because Redis primarily stores volitile or semi-volitile data and we prefer to have predictibly low latency. It is common to see calls complete in 1ms or less. Many of our calls have timeouts of 100ms to handle exceptional cases. Insetad of frequent disk writes, we replicate every node to a slave node on a different physical host and give it an otherwise identical configuration.

The master/slave mappings are stored in a global configuration file that our Redis API uses behind the scenes.

Configuration

The bulk of our code is Perl, so the configuration is expressed in Perl for easy consumption. Here’s a minimal example that uses 2 hosts, assumed to have 2 cores each. This provides 4 total redis nodes, two primaries (masters) and two secondaries (slaves). Losing a single host will result in virtually zero data loss.

our $redisConfig = {
    # node names
    'nodes' => {
        redis_1 => { address => '192.168.1.100:63790' },
        redis_2 => { address => '192.168.1.100:63791' },
        redis_3 => { address => '192.168.1.101:63790' },
        redis_4 => { address => '192.168.1.101:63791' },
    },

    # replication information
    'master_of' => {
        '192.168.1.100:63792' => '192.168.1.101:63790',
        '192.168.1.100:63793' => '192.168.1.101:63791',
        '192.168.1.101:63792' => '192.168.1.100:63790',
        '192.168.1.101:63793' => '192.168.1.100:63791',
    },
}

In reality, our clusters each have 10 physical hosts and 4 cores, which provides for 40 redis nodes (20 masters and 20 slaves).

Hashing

Our API uses a slightly different approach to hashing keys onto nodes. I wrote an API-compatible wrapper library around the CPAN Redis library that transparently provides consistent hashing and other features like connection timeouts, command timeouts, graceful handling of down servers with retries, etc.

For most operations, the hashing of a key is automatic:

$redis->get("foo");

That’ll automatically run “foo” through the hash function and map it to a node name, which is then mapped to the proper node (host:port pair). This is a very important detail. By mapping keys to node names instead of directly to nodes (host:port pairs), we have the freedom to relocate a node to a different server without disturbing the consistent hashing ring.

To allow the user to specify their own hash key (so that related keys can all land on a given node), you pass an array reference where you’d normally pass a scalar. The first element is the hash key and the second is the real key:

$redis->get(["userinfo", "foo"]);

In that case “userinfo” is the hash key but “foo” is still the name of the key that is fetched from the redis node that “userinfo” hashes to.

Room to Grow

Given this configuration, we have room to double the cluster in overall capacity twice (at a minimum) before we ever have to think about rehashing keys. First, we could simply double the available memory in our servers and increase the maxmemory setting in redis to match. Second, we could also double the number of hosts and move half our nodes to those new hosts. As long as we update the configuration to map the node names to the new nodes (host:port pairs), the consistent hashing ring is unaffected and things continue to run as expected (aside from a brief downtime during that re-configuration).

Real Numbers

As I alread mentioned, today we use 10 hosts, each running 4 redis nodes (2 masters, 2 slaves). Each node has a maxmemory setting of 6GB since these are 32GB boxes and we need to account for fragmentation and the possibility of doing an occasional SAVE or BGSAVE or having to re-attach a slave. That gives us an aggregate capacity of (4*10*6)/2 = 120GB of RAM for the keyspace when you take into account replication (it’s 240GB of RAW capacity).

If we replaced those 10 32GB boxes with 20 64GB boxes, we could choose between keeping the same number of overall nodes (40) and having only 2 per box. That’d mean using maxmemory of 24GB which works out to (2*20*24)/2 = 480GB of RAM for the keyspace (960GB RAW). No re-hashing would be necessary.

If we’re willing to incur a one-time re-hashing, we could double the number of redis nodes (to take advantage of all the cores), and run 4 per box with maxmemory of 12GB which works out to (4*20*12)/2 = 480GB of RAM for the keyspace (960GB RAW). The storage capacity is the same, but we’d have twice the CPU power available and an easy path to doubling in size again in the future (by doubling the number of hosts again).

Future Idea: Slave Reads

One item on my wishlist is to take advantage of the otherwise idle slaves for handling “expensive” reads. This would be accomplished by prepending a call with something like “slave_” such that this would automatically Do The Right Thing:

$redis->slave_zrevrange($key, $start, $stop, 'withscores');

The AUTOLOAD function in our wrapper would just strip the “slave_” prefix and send the command to the slave of the node that holds the give key. In theory, we could automatically re-route reads (but probably not writes) in to a slave node if the master node happens to become unresponsive or unavailable.

Future Idea: Async Operations

I have a branch of my wrapper API that uses AnyEvent::Redis instead of the blocking Redis client to provide for async access to redis. Since we have code that frequently makes 10-50 redis calls to gather up data (and the calls don’t depend on each other), we could see a substantially performance boost by moving to an async API.

Hopefully after a bit more polishing, cleanup, and testing we can start to take advantage of doing these calls in parallel.

Future Idea: Redis Cluster

Given that all our code is using the same API layer to talk to redis, if we move to redis-cluster someday it should be very straightforward to translate our style of hash keys to using the {hashkey} style that redis-cluster will likely use. That will give us even more flexability and remove some of the management overhead involved in (someday) adding nodes.

See Also: Hacker News discussion

About Jeremy Zawodny

I'm a software engineer and pilot. I work at craigslist by day, hacking on various bits of back-end software and data systems. As a pilot, I fly Glastar N97BM, Just AirCraft SuperSTOL N119AM, Bonanza N200TE, and high performance gliders in the northern California and Nevada area. I'm also the original author of "High Performance MySQL" published by O'Reilly Media. I still speak at conferences and user groups on occasion.

View all posts by Jeremy Zawodny →