database schema upgrades
Sooner or later, every database is going to need a schema change, usually to coincide with a software upgrade. Sometimes it’s simple, just adding a little more data. Sometimes it’s not so simple, things go wrong, and you need to make them right. Here’s my current best idea of how to go about minimizing the pain, assuming we have some software version X that we’re upgrading to version X+1, and that the installer or upgrade tool will need to upgrade the database from schema Y to schema Y+1. I’ve used plans like this pretty successfully, but came back to the idea after thinking more recently about the statistical component of the problem.
Lots of things can go wrong. You don’t know as much about the target database as you’d like to know. Especially if the user has direct access to it, you have no idea what manual changes may have been made, but even hosted databases get tweaked in ways that seem safe but cause trouble later. There will always be implicit constraints that the software assumes are in place but that the database doesn’t actually enforce. It is an irresistible force of nature.
failure
What does it mean for an upgrade to fail? In the best case, some SQL command failed, you (in the guise of the upgrade tool) roll back, and the user keeps going with the old version X. You still have a bug to fix, but you’re not completely dead in the water.
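Here’s a sketch of what that best case looks like in code. The table names and upgrade steps are hypothetical and the example uses Python’s sqlite3 module, but the pattern is the same for any DB-API driver: run every step of the Y to Y+1 migration inside one transaction and roll back on the first error, so the account stays on version X with schema Y intact.

    import sqlite3

    # Hypothetical schema Y -> Y+1 steps; the real list comes from the upgrade tool.
    UPGRADE_STEPS = [
        "ALTER TABLE accounts ADD COLUMN plan_tier TEXT",
        "UPDATE accounts SET plan_tier = 'standard' WHERE plan_tier IS NULL",
        "UPDATE schema_version SET version = version + 1",
    ]

    def upgrade(db_path):
        """Apply every step in one transaction; roll back on the first failure."""
        conn = sqlite3.connect(db_path, isolation_level=None)  # manage the transaction by hand
        try:
            conn.execute("BEGIN")
            for step in UPGRADE_STEPS:
                conn.execute(step)
            conn.commit()
            return True       # account is now on schema Y+1
        except sqlite3.Error as exc:
            conn.rollback()   # best case: nothing changed, version X keeps running
            print(f"upgrade failed, still on schema Y: {exc}")
            return False
        finally:
            conn.close()

One design note: this only helps if the database does transactional DDL (SQLite and PostgreSQL do; MySQL implicitly commits around most ALTER TABLE statements), which is exactly the kind of detail that turns a clean rollback into the not-so-best case below.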
In the not-so-best case, everything completed but produced wrong results. Now the data is either corrupt or causes version X+1 to crash. (It’s possible for the data to be right and the software to be wrong too, but it’s about the same end result.) It’s tempting to require that all upgrade steps be reversible, but I don’t feel too good about that in practice. If you couldn’t get the forward data transformation to work right, how confident are you that the less-tested reverse transformation is going to work?
fixes
Four things you can do, depending on the problem. Fix the code (either the product or the upgrader). Fix the data (by hand). Restore from backup. Apologize profusely. Or a combination.
Of the four, only the first scales to thousands of customers/users/accounts, but if the customer is down right now and they need to be up right now, then you probably go with the fix-it-twice strategy: fix their data by hand, then fix the code for future upgraders.
minimizing pain
How can we roll out the new version to thousands of customers without a support meltdown? Pick a pain threshold. Ten is a good number; I’ll use it. That’s how many customers you are willing to risk landing in the super-emergency, priority-0 fix category at any one time.
We aren’t sure, at the start, how smoothly the upgrade will go. So ten is the limit we can upgrade in the first batch. Yeah, sure, we’ve done some testing and migrated all the internal systems, but pulling the trigger with customer data is always different. Upgrade ten accounts. Any trouble? Fix it, then repeat until you get no trouble.
If we upgrade ten accounts without trouble, we can optimistically say that our failure rate must be below 10%, or else one of the ten would have failed. So, keeping our pain tolerance at 10, we can upgrade 100 accounts “knowing” that at most 10% will fail. If that works, we assume a failure rate of no greater than 1% and move on to 1000.
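Here’s that schedule as a sketch. The upgrade_account function and the account list are hypothetical stand-ins; a real rollout tool would also want logging, retries, and a human in the loop.

    PAIN_THRESHOLD = 10  # the most accounts we're willing to have broken at once

    def rollout(accounts, upgrade_account):
        """Escalate batch sizes 10 -> 100 -> 1000 as long as every batch is clean."""
        remaining = list(accounts)
        batch_size = PAIN_THRESHOLD       # the first batch risks exactly the threshold
        while remaining:
            batch, remaining = remaining[:batch_size], remaining[batch_size:]
            failed = [acct for acct in batch if not upgrade_account(acct)]
            if failed:
                # stop, fix the failed accounts by hand, fix the code, then repeat
                return failed
            # a clean batch of N suggests a failure rate below 1/N, so the next batch
            # can be N * PAIN_THRESHOLD accounts while risking at most PAIN_THRESHOLD
            batch_size *= PAIN_THRESHOLD
        return []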
This technique isn’t infallible. You can have rough patches where suddenly 20/100 fail. It’s a start though, and intuitively makes some sense.
the pudding
What if we wanted to really calculate our risk? Assume there is a chance p (0.0 to 1.0) that any given upgrade will complete successfully. (The previous section dealt with failure rates; this section uses success rates, as I think it’s easier to reason about it that way, but it’s all the same.) What do we know about p from our initial upgrade of ten accounts? All ten trials completed successfully, an outcome that happens with probability p^10, so set p^10 = X. What’s a good value for X? Let’s start with 0.5, assuming the outcome we did observe is at least the 50%-probable outcome. Solving, p = 0.9330. Hey, that’s even better than the naive 90% we assumed above. We can expect at most 7 failures moving to 100 trials.
Still pretty naive. Let’s look at a few more values for X. Assume we’re lucky to start. What’s the lowest p that’d still give us a 5% chance of getting 10 successes? p^10 = 0.05, so p = 0.7411. That corresponds to an expected 74 successes and a somewhat overwhelming 26 failures at the next level up. If we’re unlucky, it could be worse. So a lucky streak followed by an unlucky streak could result in a really bad day.
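For the record, here is the arithmetic for both values of X, using nothing beyond the numbers in the text:

    n = 10            # clean upgrades so far
    next_batch = 100

    for X in (0.5, 0.05):                 # how probable we insist our 10-for-10 streak was
        p = X ** (1 / n)                  # solve p**10 = X
        print(f"X={X}: p={p:.4f}, expected failures in {next_batch}: {next_batch * (1 - p):.1f}")
    # X=0.5:  p=0.9330, about 6.7 expected failures
    # X=0.05: p=0.7411, about 25.9 expected failures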
How confident can we be that our strategy won’t screw us over? Unfortunately, not very. Ten trials is just not enough to produce a good confidence interval. There are some statistical tests that can be run, but the results for 10/10 are just not inspiring. So what do we do? Maybe repeat the first step and upgrade ten accounts at a time once or twice more. Maybe just roll the dice. The good news is that 100 successful upgrades is much stronger evidence that we can reliably upgrade 1000.
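For the curious, the one number that is easy to compute exactly: after n straight successes, the one-sided lower confidence bound on p is the value where n successes in a row would only happen alpha of the time, i.e. alpha ** (1/n). A small sketch, which also shows why 100 clean upgrades is so much more reassuring than 10:

    def lower_bound(n_successes, alpha=0.05):
        """One-sided lower confidence bound on p after n straight successes.

        If the true p were any lower, a run of n successes would have
        probability below alpha."""
        return alpha ** (1 / n_successes)

    print(lower_bound(10))    # ~0.74: ten clean upgrades still allow a ~26% failure rate
    print(lower_bound(100))   # ~0.97: a hundred clean upgrades pin p down much tighter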
We started out with some intuition and a sensible approach. We applied a little math and things still looked good. Then we applied some serious math, considered some less likely scenarios, and got an answer we don’t like. What do we do? Trust our gut and ignore the math? We can also hope that our failures aren’t really distributed according to some fixed probability p: either they truly are one-off crazies, or they’re systemic failures that will show up quickly.
other
Be careful how you select which accounts to upgrade. Going from newest to oldest, or the reverse, means that all the databases with some funky artifact, left over from having run a particular version of the software at one point, will clump together.
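One cheap defense, assuming you control the schedule: pick each batch at random from the not-yet-upgraded pool, so the funky cohorts get spread across batches instead of arriving all at once. Something like:

    import random

    def pick_batch(pending_accounts, batch_size=10):
        """Choose the next upgrade batch uniformly at random instead of by account age."""
        pending = list(pending_accounts)
        return random.sample(pending, min(batch_size, len(pending)))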
Picking accounts to upgrade is easy if you’re hosting them all. This of course assumes it’s technically possible to handle them one at a time because they have isolated databases (the pros and cons of organizing databases like this are debatable and well beyond the scope of this post). If everybody’s data is in the same database, it’s either go or no go.
Getting off site customers to upgrade in a timely fashion is harder. The first ten picked may not want to or be able to upgrade right away, so you have to expand the pool, but now you’re increasing your exposure. I’ve been bitten a few times by assuming silence meant all good when it really meant not ready yet.