Recently, I wrote about maintenance windows and how they should not be used for modern, cloud-native applications. I talked about why maintenance windows are a bad idea and explained why your customers still consider maintenance windows to be downtime. Customers want to access your web application on their schedule, not your schedule.
No truly modern, cloud-native application should require downtime to perform a product upgrade or routine maintenance. A modern, cloud-native application should have no trouble operating 24/7, with no need for scheduled maintenance windows, ever.
Increasing Technical Debt
If your application requires maintenance windows due to some historical architectural issue, then you should address it for what it truly is: An architectural problem. This downtime requirement is actually technical debt imposed on your application by your application’s systems and processes. This technical debt is costing you customer satisfaction and it’s costing you money. After all, your customers don’t care why your application is down. They just care that it is down.
As your application grows and expands, it will be harder to justify having a regular downtime window. As with any other technical debt, paying it down makes your application easier to maintain, grow and scale, and that includes removing the need for maintenance windows to make changes.
You must be able to expand your application without impacting your customers.
But My Change Really Does Need Downtime to Deploy!
That’s rarely true. In most cases, bringing an application down to perform an upgrade isn’t a necessity; it’s a shortcut. It’s generally much easier to design a migration strategy if the application is down during the upgrade than to design one that doesn’t require downtime. Put another way, it’s much easier to bring the application down than to try to do the same work while it is still operating.
Designing zero-downtime migration strategies can be a challenge. They take more time to implement and may use more resources, so they cost more money. But, in the vast majority of cases, the additional effort and cost are worth it because:
- More Thought and Attention. Requiring zero-downtime migrations forces the team to think through the entire migration solution. This additional attention may surface issues or concerns that would not have come up in the simpler and quicker discussion of a migration that assumes downtime.
- Lost Customer Opportunity. There is a lost-opportunity cost caused by losing or inconveniencing customers when you bring your application down. The lost-opportunity cost is, for most successful companies’ applications, substantially greater than the development costs involved in using a better migration strategy.
The first point is worth emphasizing. We developers tend to get lazy. When we believe we can freely take down our application to perform an upgrade, we take advantage of that and build a simpler upgrade mechanism that assumes the application can be down. If, instead, we do not allow for the option of bringing down the application, then developers must focus on the upgrade process more than they otherwise would.
When developers are required to think about the operational impact of a change, they tend to produce fewer operational issues than when they simply “throw it into production” and do not consider such ramifications. When you depend on maintenance windows, overall quality and availability suffer.
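To make the idea of a migration that doesn’t need downtime more concrete, here is a minimal Python sketch of one common approach, often called “expand and contract” (or “parallel change”). It is only an illustration, not a prescription: the `users` table, the `full_name`, `first_name` and `last_name` columns, and every function name below are hypothetical, and it uses an in-memory SQLite database so it can run on its own. The schema is expanded first, the application writes both the old and new shapes, old rows are backfilled in small batches, and reads fall back to the old column until the backfill finishes.

```python
import sqlite3


def make_legacy_db():
    """Hypothetical starting point: a table that stores a single full_name column."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT NOT NULL)")
    conn.executemany(
        "INSERT INTO users (id, full_name) VALUES (?, ?)",
        [(1, "Ada Lovelace"), (2, "Alan Turing"), (3, "Grace Hopper")],
    )
    conn.commit()
    return conn


def split_name(full_name):
    """Split on the first space; a deliberately naive rule for this sketch."""
    first, _, last = full_name.partition(" ")
    return first, last


def expand(conn):
    """Step 1 (expand): add the new columns next to the old one.
    Code that only knows about full_name keeps working unchanged."""
    conn.execute("ALTER TABLE users ADD COLUMN first_name TEXT")
    conn.execute("ALTER TABLE users ADD COLUMN last_name TEXT")
    conn.commit()


def save_user(conn, user_id, full_name):
    """Step 2 (dual-write): new application code writes both the old and new shapes."""
    first, last = split_name(full_name)
    conn.execute(
        "INSERT OR REPLACE INTO users (id, full_name, first_name, last_name) "
        "VALUES (?, ?, ?, ?)",
        (user_id, full_name, first, last),
    )
    conn.commit()


def backfill(conn, batch_size=100):
    """Step 3 (backfill): migrate old rows in small batches while the app keeps serving.
    Returns the number of rows that were backfilled."""
    total = 0
    while True:
        rows = conn.execute(
            "SELECT id, full_name FROM users WHERE first_name IS NULL LIMIT ?",
            (batch_size,),
        ).fetchall()
        if not rows:
            return total
        for user_id, full_name in rows:
            first, last = split_name(full_name)
            conn.execute(
                "UPDATE users SET first_name = ?, last_name = ? WHERE id = ?",
                (first, last, user_id),
            )
        conn.commit()  # short transactions keep lock times small
        total += len(rows)


def read_user(conn, user_id):
    """Step 4 (switch reads): prefer the new columns, fall back to the old one
    for rows the backfill has not reached yet."""
    row = conn.execute(
        "SELECT full_name, first_name, last_name FROM users WHERE id = ?",
        (user_id,),
    ).fetchone()
    if row is None:
        return None
    full_name, first, last = row
    if first is None:
        first, last = split_name(full_name)
    return first, last
```

The final “contract” step, dropping `full_name` once nothing reads or writes it, is left out on purpose; it can happen in a later release. This is more work than stopping the application and rewriting the table in one pass, which is exactly the trade-off described above.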
Sometimes, You Don’t Have a Choice
But I recognize that, sometimes, you don’t have a choice. Unfortunately, many applications simply can’t be upgraded without impacting the availability of the application. Some applications, including relatively modern applications, simply must be brought down to perform some types of upgrades.
The majority of changes that require downtime involve data migrations. Changing the size, shape or structure of data is often difficult or nearly impossible to do without bringing the application—or parts of the application—down, even briefly.
Still, even in this case, don’t use predefined maintenance windows. A predefined, regularly scheduled maintenance window is simply an invitation for teams to use it. Even upgrades that don’t involve data manipulation and could, in fact, be done without any downtime are more likely to end up with a migration plan that requires downtime if your organization has predefined maintenance windows than if it doesn’t.
The default answer should always be “No downtime to perform an upgrade.” Upgrades that genuinely require downtime should be treated as anomalies and go through a special, much more rigorous approval process. The bar for allowing an upgrade that forces application downtime should be very high, and any request to clear that bar should be heavily scrutinized.
It should not be easier to “just bring the app down” than it is to think more deeply about the problem you are trying to solve.