Redis Sharding at Craigslist

After reading Salvatore’s Redis Presharding, I thought I’d write up the sharding technique we use at Craigslist to take advantage of multiple cores, multiple hosts, and allow for painless growth (up to a point).

Nodes

The primary building block of our Redis infrastructure is a node, which is simply a host:port pair that points to a single redis-server instance. Each node is given a symbolic node name to uniquely identify it in a way that doesn’t tie it to a specific host (or port).

The node name to node mappings are stored in a global configuration file that our Redis API uses behind the scenes.

Replication

We don’t use on-disk persistence very heavily because Redis primarily stores volitile or semi-volitile data and we prefer to have predictibly low latency. It is common to see calls complete in 1ms or less. Many of our calls have timeouts of 100ms to handle exceptional cases. Insetad of frequent disk writes, we replicate every node to a slave node on a different physical host and give it an otherwise identical configuration.

The master/slave mappings are stored in a global configuration file that our Redis API uses behind the scenes.

Configuration

The bulk of our code is Perl, so the configuration is expressed in Perl for easy consumption. Here’s a minimal example that uses 2 hosts, assumed to have 2 cores each. This provides 4 total redis nodes, two primaries (masters) and two secondaries (slaves). Losing a single host will result in virtually zero data loss.

our $redisConfig = {
    # node names
    'nodes' => {
        redis_1 => { address => '192.168.1.100:63790' },
        redis_2 => { address => '192.168.1.100:63791' },
        redis_3 => { address => '192.168.1.101:63790' },
        redis_4 => { address => '192.168.1.101:63791' },
    },

    # replication information
    'master_of' => {
        '192.168.1.100:63792' => '192.168.1.101:63790',
        '192.168.1.100:63793' => '192.168.1.101:63791',
        '192.168.1.101:63792' => '192.168.1.100:63790',
        '192.168.1.101:63793' => '192.168.1.100:63791',
    },
}

In reality, our clusters each have 10 physical hosts and 4 cores, which provides for 40 redis nodes (20 masters and 20 slaves).

Hashing

Our API uses a slightly different approach to hashing keys onto nodes. I wrote an API-compatible wrapper library around the CPAN Redis library that transparently provides consistent hashing and other features like connection timeouts, command timeouts, graceful handling of down servers with retries, etc.

For most operations, the hashing of a key is automatic:

$redis->get("foo");

That’ll automatically run “foo” through the hash function and map it to a node name, which is then mapped to the proper node (host:port pair). This is a very important detail. By mapping keys to node names instead of directly to nodes (host:port pairs), we have the freedom to relocate a node to a different server without disturbing the consistent hashing ring.

To allow the user to specify their own hash key (so that related keys can all land on a given node), you pass an array reference where you’d normally pass a scalar. The first element is the hash key and the second is the real key:

$redis->get(["userinfo", "foo"]);

In that case “userinfo” is the hash key but “foo” is still the name of the key that is fetched from the redis node that “userinfo” hashes to.

Room to Grow

Given this configuration, we have room to double the cluster in overall capacity twice (at a minimum) before we ever have to think about rehashing keys. First, we could simply double the available memory in our servers and increase the maxmemory setting in redis to match. Second, we could also double the number of hosts and move half our nodes to those new hosts. As long as we update the configuration to map the node names to the new nodes (host:port pairs), the consistent hashing ring is unaffected and things continue to run as expected (aside from a brief downtime during that re-configuration).

Real Numbers

As I alread mentioned, today we use 10 hosts, each running 4 redis nodes (2 masters, 2 slaves). Each node has a maxmemory setting of 6GB since these are 32GB boxes and we need to account for fragmentation and the possibility of doing an occasional SAVE or BGSAVE or having to re-attach a slave. That gives us an aggregate capacity of (4*10*6)/2 = 120GB of RAM for the keyspace when you take into account replication (it’s 240GB of RAW capacity).

If we replaced those 10 32GB boxes with 20 64GB boxes, we could choose between keeping the same number of overall nodes (40) and having only 2 per box. That’d mean using maxmemory of 24GB which works out to (2*20*24)/2 = 480GB of RAM for the keyspace (960GB RAW). No re-hashing would be necessary.

If we’re willing to incur a one-time re-hashing, we could double the number of redis nodes (to take advantage of all the cores), and run 4 per box with maxmemory of 12GB which works out to (4*20*12)/2 = 480GB of RAM for the keyspace (960GB RAW). The storage capacity is the same, but we’d have twice the CPU power available and an easy path to doubling in size again in the future (by doubling the number of hosts again).

Future Idea: Slave Reads

One item on my wishlist is to take advantage of the otherwise idle slaves for handling “expensive” reads. This would be accomplished by prepending a call with something like “slave_” such that this would automatically Do The Right Thing:

$redis->slave_zrevrange($key, $start, $stop, 'withscores');

The AUTOLOAD function in our wrapper would just strip the “slave_” prefix and send the command to the slave of the node that holds the give key. In theory, we could automatically re-route reads (but probably not writes) in to a slave node if the master node happens to become unresponsive or unavailable.

Future Idea: Async Operations

I have a branch of my wrapper API that uses AnyEvent::Redis instead of the blocking Redis client to provide for async access to redis. Since we have code that frequently makes 10-50 redis calls to gather up data (and the calls don’t depend on each other), we could see a substantially performance boost by moving to an async API.

Hopefully after a bit more polishing, cleanup, and testing we can start to take advantage of doing these calls in parallel.

Future Idea: Redis Cluster

Given that all our code is using the same API layer to talk to redis, if we move to redis-cluster someday it should be very straightforward to translate our style of hash keys to using the {hashkey} style that redis-cluster will likely use. That will give us even more flexability and remove some of the management overhead involved in (someday) adding nodes.

See Also: Hacker News discussion

About Jeremy Zawodny

I'm a software engineer and pilot. I work at craigslist by day, hacking on various bits of back-end software and data systems. As a pilot, I fly a Flight Design CTSW and high performance gliders in the northern California and Nevada area. I'm also the original author of "High Performance MySQL" published by O'Reilly Media. I still speak at conferences and user groups on occasion.
This entry was posted in craigslist, nosql, programming, redis. Bookmark the permalink.

25 Responses to Redis Sharding at Craigslist

  1. Pingback: Redis Sharding at Craigslist | Joseph Scott

  2. Pingback: Redis Sharding at Craigslist « Wobbits

  3. Pingback: Redis Sharding at Craigslist | Jeremy Zawodny’s blog « Netcrema – creme de la social news via digg + delicious + stumpleupon + reddit

  4. Jim B. says:

    Any chance you could tell us the number of req/s the cluster serves, and what the avg/med service times are like?

    Is this effectively taking over for memcache?

  5. Pingback: Sysadmin Sunday #20 « Boxed Ice Blog

  6. Pingback: Top Posts — WordPress.com

  7. I didn’t go big enough in the beginning and had to solve the problem from the other end.
    http://gamemakers.ngmoco.com/post/3603289160/saved-by-replicas

  8. Michael Fischer says:

    One of the things that concerns me about running multiple Redis instances on a host is the risk is significant I/O contention caused by background snapshots — the contention must increase as the size of the combined databases and the number of CPUs (and therefore processes per host) grows.

    Assume you’re doing snapshots every minute to obtain some reasonable durability, and assume your update velocity is so fast that using AOF is not viable. A host with 32GB RAM and 24GB worth of database would have to save 500MB/s to disk, on average. Even modern SSDs can’t do that.

    And if you are running multiple Redis processes on a host to scale across multiple cores, background snapshots will compete with one another for I/O. This can’t be good.

    The only way to mitigate this issue is to reduce the snapshot frequency, which sacrifices durability.

    • John says:

      The rdb file can be compressed so you can sacrifice some CPU to get the amount of data needed to be written down . Of course the compression ratio depends on what you stor, so selecting the right snap-freq will be tricky.
      And with multiple redis instances you can place the rdb files on separate disks

    • Mihai says:

      Michael, Jeremy says “We don’t use on-disk persistence very heavily” which can mean also that snapshots are done every couple of minutes or every hour.

  9. Pingback: Open-Source Redis Administration Console…? | JR on Web Dev

  10. Andre Voget says:

    Thank you for your insights. I’m learning about Redis and your article is very helpful. Small typo: Insetad -> Instead

  11. garfee says:

    hi,Jeremy.You mentioned in ‘Configure’ section,you have 4 redis nodes,2 masters,2 slaves.But in your perl code,there are 8 redis instant,4 masters,4 slaves.That confused me.

  12. Pingback: How does craigslist shard redis's pipleline command? - Quora

  13. This is beautiful dude! I have implemented similar featuresets in PHP around the PHPRedis module, but I still need to reach out the timeout and retry handling next, but it’s very comforting to see a big name using the same methods. Please post more about your Redis techniques soon :)

  14. Pingback: [발 번역] Redis Sharding at Craigslist | Charsyam’s Blog

  15. Pingback: Craigslist.org uses Redis Sharding and PERL to get fast NOSQL cluster database with 1 ms low latency. Note for HFT platforms! | QUANTLABS.NET

  16. Pingback: It’s time for true 21st century web dev: AppFog now supports Redis…AND RabbitMQ - Platform as a Service Magazine

  17. Vinay says:

    Hi Jeremy,
    Thanks for the insights!

    Could you describe how exactly you use the slaves? In other words, how do you deal with a scenario when one of the masters fails?

    You mention “…otherwise idle slaves…” – does this mean that in normal course, your application does not interact with the slaves at all?

    Thanks!
    Vinay

  18. Pingback: Introducing disredis: an open source client to automated sharding and failover | Engineering @ CustomMade

  19. Pingback: Consistent hashing with Memcached or Redis, and a patch to libketama | Wayfair Engineering

  20. Pingback: Увеличение производительности Redis с помощью простого кластера » CreativLabs

  21. Pingback: Who is Using Redis? | REDIS ONLINE

  22. Maybell says:

    I see a lot of interesting articles on your blog.
    You have to spend a lot of time writing, i know how to save you a lot of work,
    there is a tool that creates unique, SEO friendly posts
    in couple of seconds, just search in google – laranita’s free content source

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s