There has been some interesting discussion online recently about how to handle database failover (MySQL in my case, but it really applies to other systems too). The discussion I’ve followed so far, in order, is:
- GitHub’s report on their automated failover and downtime issues
- Baron’s follow-up to that, Is automated failover the root of all evil? which generated a bit of discussion in the comments
- Peter’s follow-up to both, The Math of Automated Failover, which heads in the direction I was going when I realized I might want to toss my 2 cents into the mix
As Rick James (from Yahoo) notes in the comments on Baron’s posting, they take the same approach that I still advocate and that we use at Craigslist: no automated failover. Get a human involved. But try to make it as easy as possible for that human to do two very important things:
- Get a clear picture of the state of things
- Put things in motion once a choice has been made
It’s that simple.
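To make the first of those concrete: at its simplest, “a clear picture of the state of things” can be a script that walks the slaves and prints their replication state in one place. Here’s a minimal sketch of that idea (Python with the mysql-connector-python driver, purely for illustration; the host names and credentials are placeholders, not our actual setup):

```python
# Minimal sketch: snapshot replication state on every slave so the operator
# can see who's caught up before deciding anything. Hosts and credentials
# are placeholders.
import mysql.connector

SLAVES = ["db-slave1.example.com", "db-slave2.example.com", "db-slave3.example.com"]


def slave_status(host):
    conn = mysql.connector.connect(host=host, user="repl_monitor", password="xxxx")
    try:
        cur = conn.cursor(dictionary=True)
        cur.execute("SHOW SLAVE STATUS")
        row = cur.fetchone() or {}
        return {
            "host": host,
            "io_running": row.get("Slave_IO_Running"),
            "sql_running": row.get("Slave_SQL_Running"),
            "master_log_file": row.get("Relay_Master_Log_File"),
            "exec_master_log_pos": row.get("Exec_Master_Log_Pos"),
            "seconds_behind": row.get("Seconds_Behind_Master"),
        }
    finally:
        conn.close()


if __name__ == "__main__":
    for host in SLAVES:
        s = slave_status(host)
        print("{host}: IO={io_running} SQL={sql_running} "
              "executed={master_log_file}/{exec_master_log_pos} "
              "lag={seconds_behind}".format(**s))
```

The specific columns don’t matter much; the point is that a half-awake human gets the whole picture in one screenful instead of ssh’ing around to a pile of boxes.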
Peter’s posting gets at the heart of the matter for me. While it’d be fun (and scary) to try to build a great automated system to detect failures and Do The Right Thing, it’s also a really hard problem to solve. There are lots of little gotchas, and if you get it wrong, the amount of pain you can bring is potentially enormous.
At Craigslist, we share some similarities with Yahoo. We own our own hardware and it is installed in space that we manage. We try to select good hardware and take good care of it. And things still fail (of course). But the failures are not so frequent that we’re constantly worried about the next MySQL master that’s going to die in the middle of the night.
Rick pointed at MHA in his comment. I need to have a look at it and/or point some of my coworkers at it. I didn’t realize it existed and spent a couple of weeks creating a custom tool to help with the first item above. In the event of a master failure, it looks at all available slaves, finds the most suitable candidates, presents a list, and allows the operator to choose a new master. Once a choice is made, the script tries to automate as much of the switching as possible.
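To give a feel for the shape of it (this is a simplified sketch, not the actual tool), the candidate-selection step is basically: collect status from every slave you can reach, sort by how far each one has executed, and let the human pick. It could build on the hypothetical slave_status() helper from the sketch above:

```python
# Simplified sketch of the selection step (not the real tool): rank reachable
# slaves by how far they've executed the old master's binlog, show the list,
# and let the operator choose.
def choose_new_master(slaves):
    candidates = []
    for host in slaves:
        try:
            candidates.append(slave_status(host))  # helper from the earlier sketch
        except Exception as err:
            print("skipping {0}: {1}".format(host, err))

    # Most caught-up slave first. Default zero-padded binlog names
    # (mysql-bin.000123) sort correctly as strings.
    candidates.sort(key=lambda s: (s["master_log_file"] or "",
                                   s["exec_master_log_pos"] or 0),
                    reverse=True)

    for i, s in enumerate(candidates, 1):
        print("{0}: {host} at {master_log_file}/{exec_master_log_pos}".format(i, **s))

    choice = int(input("Promote which slave? "))
    return candidates[choice - 1]["host"]
```

The switching itself (stopping slaves, repointing the survivors with CHANGE MASTER TO, and so on) is the scarier part to automate, which is exactly why the human stays in the loop for the decision.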
Though I’ve stared at the code quite a bit, tried to reason about the ways it might fail, and feel pretty good about it, we’ve never actually used it. And that’s OK, really. We have a nicely documented playbook of what to do in that sort of situation already. It has served us well. And, as I said, it doesn’t happen that often. All the script does is try to automate existing practice so that we can turn 10-20 minutes of “read-only” time into less than 5 minutes.
There’s a point at which you start to wonder whether that savings is worth the risk of a tricky-to-spot bug finding its way in and turning 20 minutes into many hours of late-night pain. I’m not sure where I stand on that in this particular case. Something like Galera Cluster for MySQL is interesting too, but I kinda feel like it pays not to be an early adopter here too. If we had a lot of problems with master failures, I’d surely feel differently.
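Putting rough numbers on that savings-versus-risk question (the 20 and 5 minutes are from above; the failure rate and the cost of a bad run are made up purely for illustration):

```latex
% Back-of-the-envelope only; everything except the 20 and 5 minutes is made up.
\begin{align*}
\text{savings per failure}  &\approx 20 - 5 = 15\ \text{min}\\
\text{failures per year}    &\approx 1 \;\Rightarrow\; \text{savings} \approx 15\ \text{min/yr}\\
\text{one bug-induced mess} &\approx 4\ \text{hr} = 240\ \text{min} \approx 16\ \text{years of savings}
\end{align*}
```

With failures that rare, the tooling has to be very close to bug-free before it pays for itself, which is really the whole argument in one line.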