There has been some interesting discussion online recently about how to handle database failover (MySQL in my case, but it really applies to other systems too). The discussion I’ve followed so far, in order, is:
- GitHub’s report on their automated failover and downtime issues
- Baron’s follow-up to that, Is automated failover the root of all evil? which generated a bit of discussion in the comments
- Peter’s follow-up to both, The Math of Automated Failover, which heads in the direction I was going when I realized I might want to toss my 2 cents into the mix
As Rick James (from Yahoo) notes in the comments on Baron’s posting, they take the same approach that I still advocate and that we use at Craigslist: no automated failover. Get a human involved. But try to make it as easy as possible for that human to do two very important things:
- Get a clear picture of the state of things
- Put things in motion once a choice has been made
It’s that simple.
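To make the first of those concrete: at its simplest, “a clear picture of the state of things” can be a script that walks the slaves and prints their replication state in one place. Here’s a minimal sketch of that idea (Python with the mysql-connector-python driver, purely for illustration; the host names and credentials are placeholders, not our actual setup):

```python
# Minimal sketch: snapshot replication state on every slave so the operator
# can see who's caught up before deciding anything. Hosts and credentials
# are placeholders.
import mysql.connector

SLAVES = ["db-slave1.example.com", "db-slave2.example.com", "db-slave3.example.com"]


def slave_status(host):
    conn = mysql.connector.connect(host=host, user="repl_monitor", password="xxxx")
    try:
        cur = conn.cursor(dictionary=True)
        cur.execute("SHOW SLAVE STATUS")
        row = cur.fetchone() or {}
        return {
            "host": host,
            "io_running": row.get("Slave_IO_Running"),
            "sql_running": row.get("Slave_SQL_Running"),
            "master_log_file": row.get("Relay_Master_Log_File"),
            "exec_master_log_pos": row.get("Exec_Master_Log_Pos"),
            "seconds_behind": row.get("Seconds_Behind_Master"),
        }
    finally:
        conn.close()


if __name__ == "__main__":
    for host in SLAVES:
        s = slave_status(host)
        print("{host}: IO={io_running} SQL={sql_running} "
              "executed={master_log_file}/{exec_master_log_pos} "
              "lag={seconds_behind}".format(**s))
```

The specific columns don’t matter much; the point is that a half-awake human gets the whole picture in one screenful instead of ssh’ing around to a pile of boxes.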
Peter’s posting gets at the heart of the matter for me. While it’d be fun (and scary) to try to build a great automated system to detect failures and Do The Right Thing, it’s also a really hard problem to solve. There are lots of little gotchas, and if you get it wrong, the amount of pain you can bring is potentially enormous.
At Craigslist, we share some similarities with Yahoo. We own our own hardware and it is installed in space that we manage. We try to select good hardware and take good care of it. And things still fail (of course). But the failures are not so frequent that we’re constantly worried about the next MySQL master that’s going to die in the middle of the night.
Rick pointed at MHA in his comment. I need to have a look at it and/or point some of my coworkers at it. I didn’t realize it existed and spent a couple of weeks creating a custom tool to help with the first item above. In the event of a master failure, it looks at all available slaves, finds the most suitable candidates, presents a list, and allows the operator to choose a new master. Once a choice is made, the script tries to automate as much of the switching as possible.
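To give a feel for the shape of it (this is a simplified sketch, not the actual tool), the candidate-selection step is basically: collect status from every slave you can reach, sort by how far each one has executed, and let the human pick. It could build on the hypothetical slave_status() helper from the sketch above:

```python
# Simplified sketch of the selection step (not the real tool): rank reachable
# slaves by how far they've executed the old master's binlog, show the list,
# and let the operator choose.
def choose_new_master(slaves):
    candidates = []
    for host in slaves:
        try:
            candidates.append(slave_status(host))  # helper from the earlier sketch
        except Exception as err:
            print("skipping {0}: {1}".format(host, err))

    # Most caught-up slave first. Default zero-padded binlog names
    # (mysql-bin.000123) sort correctly as strings.
    candidates.sort(key=lambda s: (s["master_log_file"] or "",
                                   s["exec_master_log_pos"] or 0),
                    reverse=True)

    for i, s in enumerate(candidates, 1):
        print("{0}: {host} at {master_log_file}/{exec_master_log_pos}".format(i, **s))

    choice = int(input("Promote which slave? "))
    return candidates[choice - 1]["host"]
```

The switching itself (stopping slaves, repointing the survivors with CHANGE MASTER TO, and so on) is the scarier part to automate, which is exactly why the human stays in the loop for the decision.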
Though I’ve stared at the code quite a bit, tried to reason about the ways it might fail, and feel pretty good about it, we’ve never actually used it. And that’s OK, really. We have a nicely documented playbook of what to do in that sort of situation already. It has served us well. And, as I said, it doesn’t happen that often. All the script does is try to automate existing practice so that we can turn 10-20 minutes of “read-only” time into less than 5 minutes.
There’s a point at which you start to wonder whether that savings is worth the risk of a tricky-to-spot bug finding its way in and turning 20 minutes into many hours of late-night pain. I’m not sure where I stand on that in this particular case. Something like Galera Cluster for MySQL is interesting too, but I kinda feel like it pays not to be an early adopter here too. If we had a lot of problems with master failures, I’d surely feel differently.
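Putting rough numbers on that savings-versus-risk question (the 20 and 5 minutes are from above; the failure rate and the cost of a bad run are made up purely for illustration):

```latex
% Back-of-the-envelope only; everything except the 20 and 5 minutes is made up.
\begin{align*}
\text{savings per failure}  &\approx 20 - 5 = 15\ \text{min}\\
\text{failures per year}    &\approx 1 \;\Rightarrow\; \text{savings} \approx 15\ \text{min/yr}\\
\text{one bug-induced mess} &\approx 4\ \text{hr} = 240\ \text{min} \approx 16\ \text{years of savings}
\end{align*}
```

With failures that rare, the tooling has to be very close to bug-free before it pays for itself, which is really the whole argument in one line.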