After reading Salvatore’s Redis Presharding, I thought I’d write up the sharding technique we use at Craigslist to take advantage of multiple cores, multiple hosts, and allow for painless growth (up to a point).
Nodes
The primary building block of our Redis infrastructure is a node, which is simply a host:port pair that points to a single redis-server instance. Each node is given a symbolic node name to uniquely identify it in a way that doesn’t tie it to a specific host (or port).
The node name to node mappings are stored in a global configuration file that our Redis API uses behind the scenes.
Replication
We don’t use on-disk persistence very heavily because Redis primarily stores volatile or semi-volatile data and we prefer predictably low latency. It is common to see calls complete in 1ms or less. Many of our calls have timeouts of 100ms to handle exceptional cases. Instead of frequent disk writes, we replicate every node to a slave node on a different physical host and give it an otherwise identical configuration.
The master/slave mappings are stored in a global configuration file that our Redis API uses behind the scenes.
Configuration
The bulk of our code is Perl, so the configuration is expressed in Perl for easy consumption. Here’s a minimal example that uses 2 hosts, assumed to have 2 cores each. This provides 4 total redis nodes, two primaries (masters) and two secondaries (slaves). Losing a single host will result in virtually zero data loss.
our $redisConfig = {
    # node names
    'nodes' => {
        redis_1 => { address => '192.168.1.100:63790' },
        redis_2 => { address => '192.168.1.100:63791' },
        redis_3 => { address => '192.168.1.101:63790' },
        redis_4 => { address => '192.168.1.101:63791' },
    },
    # replication information: each master's slave lives on the other host
    'master_of' => {
        '192.168.1.100:63790' => '192.168.1.101:63790',
        '192.168.1.100:63791' => '192.168.1.101:63791',
    },
};
In reality, our clusters each have 10 physical hosts with 4 cores each, which provides 40 redis nodes (20 masters and 20 slaves).
Hashing
Our API uses a slightly different approach to hashing keys onto nodes. I wrote an API-compatible wrapper library around the CPAN Redis library that transparently provides consistent hashing and other features like connection timeouts, command timeouts, graceful handling of down servers with retries, etc.
For most operations, the hashing of a key is automatic:
$redis->get("foo");
That’ll automatically run “foo” through the hash function and map it to a node name, which is then mapped to the proper node (host:port pair). This is a very important detail. By mapping keys to node names instead of directly to nodes (host:port pairs), we have the freedom to relocate a node to a different server without disturbing the consistent hashing ring.
To allow the user to specify their own hash key (so that related keys can all land on a given node), you pass an array reference where you’d normally pass a scalar. The first element is the hash key and the second is the real key:
$redis->get(["userinfo", "foo"]);
In that case “userinfo” is the hash key but “foo” is still the name of the key that is fetched from the redis node that “userinfo” hashes to.
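The two-step routing described above (key or explicit hash key → node name → host:port) can be sketched in Python. Our real wrapper is Perl and uses a proper consistent hashing ring; the node names, addresses, and the simple digest-modulo placement here are illustrative only:

```python
import hashlib

# Illustrative config mirroring the Perl example: node names -> addresses.
NODES = {
    "redis_1": "192.168.1.100:63790",
    "redis_2": "192.168.1.100:63791",
    "redis_3": "192.168.1.101:63790",
    "redis_4": "192.168.1.101:63791",
}
NODE_NAMES = sorted(NODES)

def node_for(hash_key):
    """Map a hash key to a node *name*, then look up its address."""
    # Stand-in for the consistent hashing ring: a stable digest keeps
    # the same key landing on the same node name every time.
    digest = int(hashlib.md5(hash_key.encode()).hexdigest(), 16)
    name = NODE_NAMES[digest % len(NODE_NAMES)]
    return name, NODES[name]

def resolve(key_or_pair):
    """Accept "foo" or ["userinfo", "foo"]: hash on the first element
    of a pair, but operate on the real key (the second element)."""
    if isinstance(key_or_pair, (list, tuple)):
        hash_key, real_key = key_or_pair
    else:
        hash_key = real_key = key_or_pair
    name, addr = node_for(hash_key)
    return name, addr, real_key
```

With this, `resolve(["userinfo", "foo"])` and `resolve(["userinfo", "bar"])` always land on the same node, because both hash only “userinfo”.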
Room to Grow
Given this configuration, we have room to double the cluster’s overall capacity at least twice before we ever have to think about rehashing keys. First, we could simply double the available memory in our servers and increase the maxmemory setting in redis to match. Second, we could double the number of hosts and move half of our nodes to them. As long as we update the configuration to map the node names to the new nodes (host:port pairs), the consistent hashing ring is unaffected and things continue to run as expected (aside from a brief downtime during that re-configuration).
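Because the ring is built from node names rather than addresses, relocating a node is purely a config edit. A small illustrative check (names, addresses, and the digest-modulo placement are assumptions, not our production code):

```python
import hashlib

NODE_NAMES = ["redis_1", "redis_2", "redis_3", "redis_4"]

def name_for(key):
    # Placement depends only on the key and the node *names*,
    # never on which host a node currently lives on.
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODE_NAMES[digest % len(NODE_NAMES)]

addresses = {"redis_1": "192.168.1.100:63790"}  # name -> host:port
before = name_for("foo")

# Relocate redis_1 to a brand-new host: only the address map changes.
addresses["redis_1"] = "192.168.1.200:63790"

assert name_for("foo") == before  # key placement is undisturbed
```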
Real Numbers
As I already mentioned, today we use 10 hosts, each running 4 redis nodes (2 masters, 2 slaves). Each node has a maxmemory setting of 6GB, since these are 32GB boxes and we need to account for fragmentation and the possibility of doing an occasional SAVE or BGSAVE or having to re-attach a slave. That gives us an aggregate capacity of (4*10*6)/2 = 120GB of RAM for the keyspace when you take into account replication (it’s 240GB of raw capacity).

If we replaced those 10 32GB boxes with 20 64GB boxes, we could keep the same overall number of nodes (40) and run only 2 per box. That’d mean a maxmemory of 24GB, which works out to (2*20*24)/2 = 480GB of RAM for the keyspace (960GB raw). No re-hashing would be necessary.

If we’re willing to incur a one-time re-hashing, we could double the number of redis nodes (to take advantage of all the cores) and run 4 per box with a maxmemory of 12GB, which works out to (4*20*12)/2 = 480GB of RAM for the keyspace (960GB raw). The storage capacity is the same, but we’d have twice the CPU power available and an easy path to doubling in size again in the future (by doubling the number of hosts again).
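The capacity arithmetic in the three scenarios above, made explicit as a quick sanity check:

```python
def keyspace_gb(nodes_per_host, hosts, maxmemory_gb):
    """Aggregate keyspace = total node memory halved, since every
    master is mirrored by one slave of equal size."""
    raw = nodes_per_host * hosts * maxmemory_gb
    return raw // 2

# Today: 10 hosts, 4 nodes each, 6GB maxmemory -> 120GB keyspace.
assert keyspace_gb(4, 10, 6) == 120

# 20 x 64GB hosts, same 40 nodes (2 per box) at 24GB: no re-hashing.
assert keyspace_gb(2, 20, 24) == 480

# 20 x 64GB hosts, 80 nodes (4 per box) at 12GB: one re-hash, 2x CPU.
assert keyspace_gb(4, 20, 12) == 480
```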
Future Idea: Slave Reads
One item on my wishlist is to take advantage of the otherwise idle slaves for handling “expensive” reads. This would be accomplished by prepending a call with something like “slave_” such that this would automatically Do The Right Thing:
$redis->slave_zrevrange($key, $start, $stop, 'withscores');
The AUTOLOAD function in our wrapper would just strip the “slave_” prefix and send the command to the slave of the node that holds the given key. In theory, we could automatically re-route reads (but probably not writes) to a slave node if the master node happens to become unresponsive or unavailable.
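In Python, the same trick could use `__getattr__` in place of Perl’s AUTOLOAD. This is a sketch with stubbed connections; the class names, the `slave_of` mapping, and the placement logic are all illustrative assumptions:

```python
class FakeConn:
    """Stub standing in for a real Redis connection."""
    def __init__(self, label):
        self.label = label
    def zrevrange(self, key, start, stop, *flags):
        return (self.label, key, start, stop)

class ShardedClient:
    def __init__(self, masters, slave_of):
        self.masters = masters      # node name -> master connection
        self.slave_of = slave_of    # node name -> that master's slave

    def _node_name(self, key):
        # Stand-in for the consistent-hash lookup.
        return sorted(self.masters)[hash(key) % len(self.masters)]

    def __getattr__(self, method):
        # Called only for unknown attributes, much like AUTOLOAD.
        if method.startswith("slave_"):
            real = method[len("slave_"):]
            def call(key, *args):
                name = self._node_name(key)
                conn = self.slave_of[name]  # read from the slave instead
                return getattr(conn, real)(key, *args)
            return call
        raise AttributeError(method)
```

A call like `client.slave_zrevrange("scores", 0, 9, "withscores")` then transparently runs ZREVRANGE against the slave of whichever node owns “scores”.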
Future Idea: Async Operations
I have a branch of my wrapper API that uses AnyEvent::Redis instead of the blocking Redis client to provide for async access to redis. Since we have code that frequently makes 10-50 redis calls to gather up data (and the calls don’t depend on each other), we could see a substantial performance boost by moving to an async API.
Hopefully after a bit more polishing, cleanup, and testing we can start to take advantage of doing these calls in parallel.
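The fan-out pattern looks like this in asyncio terms (the real branch uses AnyEvent::Redis in Perl; the fake client here just simulates independent network calls):

```python
import asyncio

class FakeAsyncRedis:
    """Stand-in for an async Redis client: each GET 'waits' on I/O."""
    async def get(self, key):
        await asyncio.sleep(0.01)   # simulated network round-trip
        return f"value:{key}"

async def gather_user_data(redis, keys):
    # Issue all calls at once instead of serially; total wall time is
    # roughly one round-trip instead of len(keys) round-trips.
    return await asyncio.gather(*(redis.get(k) for k in keys))

results = asyncio.run(
    gather_user_data(FakeAsyncRedis(), [f"k{i}" for i in range(20)])
)
```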
Future Idea: Redis Cluster
Given that all our code is using the same API layer to talk to redis, if we move to redis-cluster someday it should be very straightforward to translate our style of hash keys to using the {hashkey} style that redis-cluster will likely use. That will give us even more flexibility and remove some of the management overhead involved in (someday) adding nodes.
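That translation would be mechanical. A sketch, assuming redis-cluster ends up hashing only the text between the braces:

```python
def to_cluster_key(key_or_pair):
    """Convert our [hashkey, key] convention to a {hashkey} style key."""
    if isinstance(key_or_pair, (list, tuple)):
        hash_key, real_key = key_or_pair
        # e.g. ["userinfo", "foo"] -> "{userinfo}foo": the cluster would
        # hash only "userinfo", keeping related keys on one node.
        return "{%s}%s" % (hash_key, real_key)
    return key_or_pair
```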
See Also: Hacker News discussion
Pingback: Redis Sharding at Craigslist | Joseph Scott
Pingback: Redis Sharding at Craigslist « Wobbits
Pingback: Redis Sharding at Craigslist | Jeremy Zawodny’s blog « Netcrema – creme de la social news via digg + delicious + stumpleupon + reddit
Any chance you could tell us the number of req/s the cluster serves, and what the avg/med service times are like?
Is this effectively taking over for memcache?
Pingback: Sysadmin Sunday #20 « Boxed Ice Blog
Pingback: Top Posts — WordPress.com
I didn’t go big enough in the beginning and had to solve the problem from the other end.
http://gamemakers.ngmoco.com/post/3603289160/saved-by-replicas
One of the things that concerns me about running multiple Redis instances on a host is the risk of significant I/O contention caused by background snapshots — the contention must increase as the size of the combined databases and the number of CPUs (and therefore processes per host) grows.
Assume you’re doing snapshots every minute to obtain some reasonable durability, and assume your update velocity is so fast that using AOF is not viable. A host with 32GB RAM and 24GB worth of database would have to save roughly 400MB/s to disk, on average. Even modern SSDs can’t do that.
And if you are running multiple Redis processes on a host to scale across multiple cores, background snapshots will compete with one another for I/O. This can’t be good.
The only way to mitigate this issue is to reduce the snapshot frequency, which sacrifices durability.
The rdb file can be compressed, so you can sacrifice some CPU to reduce the amount of data that needs to be written. Of course the compression ratio depends on what you store, so selecting the right snapshot frequency will be tricky.
And with multiple redis instances you can place the rdb files on separate disks.
Michael, Jeremy says “We don’t use on-disk persistence very heavily” which can mean also that snapshots are done every couple of minutes or every hour.
Pingback: Open-Source Redis Administration Console…? | JR on Web Dev
Thank you for your insights. I’m learning about Redis and your article is very helpful. Small typo: Insetad -> Instead
Hi Jeremy. You mentioned in the ‘Configuration’ section that you have 4 redis nodes: 2 masters and 2 slaves. But in your Perl code there are 8 redis instances, 4 masters and 4 slaves. That confused me.
Pingback: How does craigslist shard redis's pipleline command? - Quora
This is beautiful dude! I have implemented similar feature sets in PHP around the PHPRedis module, but I still need to work out the timeout and retry handling next. It’s very comforting to see a big name using the same methods. Please post more about your Redis techniques soon 🙂
@Justin DeMaris Is it open source and available somewhere?
Pingback: [발 번역] Redis Sharding at Craigslist | Charsyam’s Blog
Pingback: Craigslist.org uses Redis Sharding and PERL to get fast NOSQL cluster database with 1 ms low latency. Note for HFT platforms! | QUANTLABS.NET
Pingback: It’s time for true 21st century web dev: AppFog now supports Redis…AND RabbitMQ - Platform as a Service Magazine
Hi Jeremy,
Thanks for the insights!
Could you describe how exactly you use the slaves? In other words, how do you deal with a scenario when one of the masters fails?
You mention “…otherwise idle slaves…” – does this mean that in normal course, your application does not interact with the slaves at all?
Thanks!
Vinay
Pingback: Introducing disredis: an open source client to automated sharding and failover | Engineering @ CustomMade
Pingback: Consistent hashing with Memcached or Redis, and a patch to libketama | Wayfair Engineering
Pingback: Увеличение производительности Redis с помощью простого кластера » CreativLabs
Pingback: Who is Using Redis? | REDIS ONLINE
Pingback: Увеличение производительности Redis с помощью простого кластера | FNIT.RU
Pingback: Redis: Links, News And Resources (3) | Angel "Java" Lopez on Blog
Pingback: The Absolutely Simplest Consistent Hashing Example - PyBloggers
Pingback: NoSQL vs. SQL: Data Management Solution - ProgramFault
Pingback: Real Python: How to Use Redis With Python – Cebu Scripts
Pingback: More Than 8,000 Unsecured Redis Instances Found In The Cloud – ThreatRavens
Pingback: Cómo usar Redis con Python - Python super fácil