Systematic vs individualistic problems
Sometimes things go wrong for no particular reason other than coincidence. It isn't possible to account for every possible error, which is why resilience is generally a better goal than perfection: handling problems well is more pragmatic than trying to prevent every one of them. But if a problem keeps recurring across many different systems, be they people or code, it's a sign of an environmental issue.
Where I live, there are a lot of Airbnb-style rentals available. But instead of going through Airbnb, they’re listed on Booking.com. The company renting them out has many apartments in many different buildings, but they’ve put my building’s address as the one shown on Booking.com. About once a week, I see a group of lost, confused tourists trying to find out where they’re staying and how to get there. I’ll usually give them some rough pointers, but if I have time, I’ll show them where they can find the keys and instructions.
If it were one group of tourists over the last 4 years, it’d likely be a one-off problem. Except it’s a regular occurrence, faced by many different groups. They range from people who have travelled half the world, to those leaving their home country for the first time. A systematic problem is one where the system and environment lead to a problem frequently reoccurring.
Unlike a one-off problem, a systematic problem will keep happening over and over again. A small problem occurring once usually doesn't have much impact. Scale that problem up to 50 groups of people, and it becomes a significant time and energy sink for the users, for customer support, and for anyone unrelated who ends up helping (such as me).
Users expect Booking.com to contain all the information they need, but the pickup location is sent through other channels. The tourists will always show me the app, containing my street name, but not any emails or texts that inform them that the keys are in a completely different part of the city. Once the keys are collected, there's a physical printout with instructions and a map of where to go, but that's not much help until the keys are collected. I get why the company uses Booking.com, since it has a lot of reach. But they do not provide information in the way users expect from the platform.
The two solutions I could see are to either use a different platform, or change how information is delivered. The time investment to switch platforms is high, and could result in far fewer bookings. So it makes sense that they haven't changed anything. Yet still, every week there’s a new tourist group having a difficult few hours on their first visit to Oslo.
Systematic problems apply to code, too. If a problem occurs rarely, with low impact on the service, it will probably be de-prioritized. But if it occurs frequently, the cost of not fixing the problem increases.
I worked on a system that involved a good deal of manual testing, where one person needed to be on-site while another triggered the code and monitored it. The cost of testing was therefore quite high each time we wanted to do it. The system of testing was itself the problem. So I wrote a script that could instead be triggered by the on-site person. It had a high setup cost, but reduced the overhead of every test after that.
Starting small, with quick ways of testing a solution out, is great for small, infrequent problems. Realising when that approach doesn’t scale is important, though. Code is generally easier to scale than people: add more servers, or rewrite performance-critical hot paths. People do not scale. The cost of involving more people in a frequently needed manual fix is high, and takes valuable time away from those involved.
Mapping out the cost of fixing a recurring problem can be a great way to identify the point at which to invest time in preventive fixes. Finding the point where manual, reactive fixes no longer make sense is an experience-based skill that engineers get better at over time.
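One rough way to map out that cost is a break-even calculation: how many times does the problem need to happen before a one-off fix is cheaper than handling it by hand every time? The sketch below is only an illustration. The function name, its parameters, and the numbers in the usage example are all made up for this post, and real costs are rarely this tidy.

```python
import math


def break_even_occurrences(fix_cost_hours, manual_hours_per_occurrence,
                           automated_hours_per_occurrence=0.0):
    """Return how many occurrences it takes for a one-off fix to pay for itself.

    All arguments are hypothetical estimates in person-hours.
    Returns None if the fix does not save any time per occurrence.
    """
    saved_per_occurrence = manual_hours_per_occurrence - automated_hours_per_occurrence
    if saved_per_occurrence <= 0:
        # The "fix" is no cheaper than doing it by hand, so it never pays off.
        return None
    return math.ceil(fix_cost_hours / saved_per_occurrence)


# Illustrative numbers only: a 40-hour script that turns a 3 person-hour
# manual test into a 0.5 person-hour one.
occurrences = break_even_occurrences(40, 3, 0.5)
print(occurrences)  # 16
```

With those made-up numbers, a test that runs once a week pays back the script in about four months. A test that runs once a year never realistically does, which is roughly the difference between a systematic problem and a one-off.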