Announcing maintenance, and avoiding clock confusion
I'll see you at 12! At 12am, 12pm, 00:00, or 24:00? Tomorrow or today?
If you’ve ever been responsible for a site or service with many users, you’ll be familiar with the problem of “I need to upgrade the server but don’t want to disrupt users”. If your users are mostly based in one country, there’s an obvious time window when you can do exactly that: during the night, when everyone is asleep.
There are some downsides to that, but I’d like to focus on a less obvious one first. Most people should be asleep by 23:00, and engineers love to do things on round numbers. The next big round number is midnight. This is where the problem lies.
If you’re in a country that doesn’t use the 24 hour clock, it’s 12am [1]. However, some people refer to it as 12pm, because 12 is where a.m. changes to p.m. or vice versa [2]. But switching to the 24 hour clock doesn’t exactly solve it, either. Let’s introduce a theoretical date and time, 20/01/2024 00:00.
Does this refer to:
1) One minute after 19/01/2024 23:59
2) One minute after 20/01/2024 23:59
The correct answer is 1), but it can be interpreted to mean 2), too. Things get more confusing when you introduce 20/01/2024 24:00, as that should mean “one minute after 20/01/2024 23:59”. But it is sometimes used as “one minute after 19/01/2024 23:59”.
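One way to check interpretation 1) is to let a date library do the arithmetic. The sketch below uses Python's standard `datetime` module; the variable names are only illustrative.

```python
from datetime import datetime, timedelta

# 23:59 on 19/01/2024, plus one minute.
before_midnight = datetime(2024, 1, 19, 23, 59)
after_midnight = before_midnight + timedelta(minutes=1)

# The result rolls over to the start of the 20th, written as 00:00.
print(after_midnight.strftime("%d/%m/%Y %H:%M"))  # 20/01/2024 00:00

# This constructor does not accept 24 as an hour: valid hours are 0 to 23.
try:
    datetime(2024, 1, 20, 24, 0)
except ValueError as error:
    print(f"rejected: {error}")
```

Tools tend to pick one convention and stick to it. People reading an announcement do not, which is the real problem.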
All this is pretty cultural. Beyond the divide of the 24 hour clock and the 12 hour clock, there’s also an element of language. Different languages refer to time in different ways. In Norway, “half 4” refers to 15:30, whereas in British English, it’s 16:30. It doesn’t stop there, though. Dialects differ even within a country. What would you refer to as the morning meal, the midday meal, and the evening meal [3]? Depending on where you’re from, these terms all apply:
Morning: breakfast
Midday: lunch, dinner
Evening: dinner, tea, supper
If you’re working in a diverse team, be it country-wise, language-wise, or even dialect-wise, it’s usually a good idea to stick to the 24 hour clock when referring to events in text, and remove ambiguity and possible misinterpretations.
Back to the 00:00 problem, though. The travel industry, such as flights or trains, usually avoids 00:00 [4]. It’s standard practice to schedule to one side, either at 23:59 or 00:01. This removes all uncertainty by firmly putting the time into one day or another. I go one further, and recommend +/- 5 minutes, just to be super clear. It’s entirely feasible for people to understand each other when talking about 00:00, but why risk it? 00:05 or 23:55 is clearer.
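If your maintenance windows are generated by a script, the same rule is easy to automate. This is a minimal sketch, not a standard utility: `avoid_midnight` and its parameters are hypothetical names chosen for this example.

```python
from datetime import datetime, time, timedelta


def avoid_midnight(start: datetime, margin_minutes: int = 5, later: bool = True) -> datetime:
    """Move a start time that falls exactly on midnight to one side of it."""
    if start.time() != time(0, 0):
        return start  # not on midnight, nothing to do
    shift = timedelta(minutes=margin_minutes)
    return start + shift if later else start - shift


midnight = datetime(2024, 1, 20, 0, 0)
print(avoid_midnight(midnight))               # 2024-01-20 00:05:00
print(avoid_midnight(midnight, later=False))  # 2024-01-19 23:55:00
```

Note that moving earlier also changes the date, which is exactly the point: the announcement now names the day that people will think of as "tonight".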
The best time to have downtime, for developers, is usually during the working day. Your team is around, and people can be called in to help. Everyone is usually at a computer. It becomes more obvious when something is wrong, because people are actually using or working with a service. If you’re performing a large migration that could take your service offline, you might not be able to, or might not want to, do it during the workday, and it often makes business sense to do it out of hours.
Regardless of when it’s scheduled to happen, there are some basic rules to follow:
Inform the teams that will be impacted ahead of time.
At least a week in advance is good, but don’t do it too far out - people will forget. Post a reminder when the time gets close.
Announce the maintenance in public channels, where people can easily find the message, even if you’re not aware of them being directly impacted. Direct messages or emails only reach the specific people you target, whereas public channel messages are easy to share and search, even for those you haven’t considered to be directly impacted.
The service documentation should make it very obvious where to find these messages, whether it’s in the GitHub repo, official documentation, or a service catalogue. Invite people to the public channel if you know they will be impacted.
Keep it short and sweet, but highlight:
The expected start date and time (with the timezone [5])
The expected duration (be pessimistic: it’s better to say 5 hours and finish in 2 than to say 2 hours and finish in 5)
The impact on users
Any expected error messages teams might see
The people involved in the maintenance, and how to contact them
Inform people when the migration starts, and when it has ended
Don’t be afraid to roll back if you run into unexpected problems that would take much longer to resolve than planned
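The announcement itself can come from a small template, so the start time always carries its timezone and the end time is calculated rather than typed by hand. This is a sketch under assumptions: `readable` and `announcement` are hypothetical helpers, the impact and contact strings are placeholders, and the timezone is a fixed UTC offset to keep the example dependency-free (a real script might use named zones from `zoneinfo` instead).

```python
from datetime import datetime, timedelta, timezone


def readable(moment: datetime) -> str:
    """Format an aware datetime as 'YYYY-MM-DD at HH:MM in UTC+HH:MM'."""
    offset = moment.utcoffset()
    if offset is None:
        raise ValueError("a timezone is required, naive datetimes are ambiguous")
    total_minutes = int(offset.total_seconds()) // 60
    sign = "-" if total_minutes < 0 else "+"
    hours, minutes = divmod(abs(total_minutes), 60)
    return f"{moment:%Y-%m-%d at %H:%M} in UTC{sign}{hours:02d}:{minutes:02d}"


def announcement(start: datetime, expected: timedelta, impact: str, contact: str) -> str:
    """Build a short maintenance notice with start, end, impact, and contact."""
    end = start + expected
    return "\n".join([
        f"Start: {readable(start)}",
        f"Expected end: {readable(end)} (pessimistic estimate)",
        f"Impact: {impact}",
        f"Contact: {contact}",
    ])


start = datetime(2024, 1, 20, 0, 5, tzinfo=timezone.utc)
print(announcement(start, timedelta(hours=5),
                   "The service will be unavailable.",  # placeholder text
                   "#maintenance channel"))             # placeholder contact
```

Refusing naive datetimes is a deliberate choice here: a time without a timezone is the machine equivalent of "see you at 12".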
Additionally, I really like having playbooks [6] for this type of stuff. If you’re going to be running a series of commands, write them all down somewhere and share them with everyone who will be taking part. Just like code reviews, you may find that someone spots a mistake or a problem. If you want to go further, give an estimated time for each step. The playbook should also document the steps required to restore a system to the last stable state, so that you don’t need to worry about it when something goes wrong. Do as much work as possible beforehand, so you can do as little as possible during the maintenance window.
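If you do add estimates, a few lines of code can turn the playbook into a timeline that shows when each step should start and when the whole thing should be over. The sketch below is only illustrative: the `Step` structure, the `timeline` helper, and every command in it are made-up placeholders, not real tools.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Step:
    description: str
    command: str   # placeholder command, not a real tool
    rollback: str  # how to undo this step
    minutes: int   # pessimistic estimate


def timeline(steps, start):
    """Return (planned start, step) pairs and the planned end, assuming back-to-back steps."""
    plan, current = [], start
    for step in steps:
        plan.append((current, step))
        current += timedelta(minutes=step.minutes)
    return plan, current


steps = [
    Step("Enable maintenance page", "./maintenance on", "./maintenance off", 5),
    Step("Back up the database", "./backup create", "nothing to undo", 30),
    Step("Run the migration", "./migrate up", "./migrate down", 40),
]

plan, planned_end = timeline(steps, datetime(2024, 1, 20, 0, 5))
for planned_start, step in plan:
    print(f"{planned_start:%H:%M}  {step.description} ({step.minutes} min)")
    print(f"       run: {step.command} | undo: {step.rollback}")
print(f"{planned_end:%H:%M}  Planned end")
```

If a step is running well past its planned start, that is a useful early signal that it may be time to roll back.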
If you have a shared calendar, adding an event makes it even clearer. Calendars will usually convert timezones for you automatically. Some calendars tend to make out-of-hours stuff less visible though, so this isn’t a replacement for an announcement.
Finally, and this is more of an employee/employer tip, if employees are working out of their usual hours, they should be compensated for the time worked, either through additional payment or time off. The compensation rates should be agreed to beforehand, and added to the contract. Depending on your company and the country, your union or government may have rules regarding this. To the employer: round up to the nearest hour. If someone works for 20 minutes at 23:59, then they’ve been mentally preparing since 23:20, and have had their sleep disrupted. So give them 1 hour of compensation, at least.
To sum up:
Avoid 00:00 or 12am/pm, prefer 00:05 or 23:55
Tell people what you’re doing, when, who, and what to expect
Write down each step beforehand, and how to recover
Pay workers fairly
[1] Unless everyone in your company is from a place that uses the 12 hour clock, you really should use the 24 hour one.
[2] This is even more confusing when people are used to the 24 hour clock and have to work with the 12 hour clock.
[3] Eating habits also differ, introducing new terms: brunch, supper, afternoon tea, etc.
[4] I think they also avoid 12am/pm in countries without a 24 hour clock, but I’m not certain about that.
[5] ISO 8601 is a good standard, though be sure to make it readable (i.e. not 2024-03-12T20:47:10+00:00, but 2024-03-12 at 20:47:10 in UTC+00:00). Use the timezone that your company usually uses during working hours, since employees in other timezones will usually use that to calculate the difference. UTC+00:00 is a good global default, though.
[6] It doesn’t need to be complicated: just a series of separated code blocks in a markdown file is better than nothing.