News sites have two main traffic patterns: day-to-day patterns, and breaking news patterns. Caches play a big role in both, but the behaviour needed is a little different.
Day-to-day patterns are predictable, and generally follow a curve. Some periods might have higher averages, but overall, it’s easy to identify when there’ll be organic traffic.
Breaking news is unpredictable. We don’t currently have the technology to predict when big disasters happen, and big disasters tend to be what causes unexpected large spikes in traffic. In a period when the usual traffic is x, breaking news might push traffic to 30x that amount.
There are some events each year where we get a heads-up that a breaking news story is coming: elections, big political debates, big sports matches. These events are a gift to the technical side of news sites. Not only can we prepare for these specific events, but the improvements made will help for the unexpected events too.
News sites rely heavily on caching. If a cache is working correctly, traffic spikes are easy to handle: hit the cache instead of the backend server. Caches scale well. Instead of scaling up backend services that consume a lot of compute time per request, the work reduces to looking up paths in a key-value store. While caches often aren’t truly just map lookups, they are still considerably cheaper. Generating a response might cost 30t, but looking up a value in the cache costs 1t.
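The cache-versus-backend trade-off above can be sketched in a few lines. This is a minimal, hypothetical illustration: a plain dict stands in for the cache layer, and `generate_response` stands in for the expensive backend work.

```python
# Hypothetical sketch: a dict as the cache, a function as the backend.
cache: dict[str, str] = {}

def generate_response(path: str) -> str:
    # Placeholder for the expensive part: databases, templating, etc.
    return f"<html>article at {path}</html>"

def handle_request(path: str) -> str:
    # Cache hit: a single key lookup, no backend work.
    if path in cache:
        return cache[path]
    # Cache miss: pay the full generation cost once, then store the result.
    response = generate_response(path)
    cache[path] = response
    return response

first = handle_request("/news/story-1")   # miss: generated by the backend
second = handle_request("/news/story-1")  # hit: served from the cache
```

The second request never touches `generate_response`; during a spike, that is the difference between one backend call and thousands.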
Backend services are rarely isolated from each other. A request might need multiple databases or other sources to generate the response. Responding at the cache layer isn’t solely about improving response time; it reduces the load on all the services involved in generating the response.
Unfortunately, because caches are very fast, a lot of logic ends up being put into them. It’s faster to do some things in the cache than in the backend service itself. Some of that logic might be necessary: varying on headers, or routing logged-in and logged-out requests differently.
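Varying on headers usually means the header values become part of the cache key, so that logged-in and logged-out users never see each other’s responses. A minimal sketch, assuming the cache keys on the raw header values (real caches normalise these much more aggressively):

```python
def cache_key(path: str, headers: dict[str, str], vary: list[str]) -> str:
    # Build the key from the path plus only the headers the response
    # varies on; different header values get separate cache entries.
    parts = [path]
    for name in vary:
        parts.append(f"{name}={headers.get(name, '')}")
    return "|".join(parts)

# A logged-out and a logged-in request to the same path get distinct keys.
anon = cache_key("/front-page", {}, ["Cookie"])
logged_in = cache_key("/front-page", {"Cookie": "session=abc"}, ["Cookie"])
```

Even this small amount of logic hints at the problem: every header added to the key fragments the cache and multiplies the code paths that need testing.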
Handling spikes in traffic requires the caches to do their job properly, and extra logic can make that harder. Clever solutions can result in complex code, and complex code can result in unexpected behaviour. The code paths become infrequently visited, and hard to trace.
During a spike, the desired outcome is for the cache to handle as many requests as possible without reaching the backend. But breaking news evolves. It’s important for the user to get the latest version of an article when they open it, so caches cannot indefinitely respond with whatever version of the article they have stored. An architecture that informs the cache when it needs to hit the backend for a new version is usually better than one that caches forever, or one that always hits the backend.
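One way to sketch that "inform the cache" architecture is publish-driven invalidation: the publishing system purges the cached path when an article changes, so the cache only revisits the backend when there is actually a new version. The names below (`publish_update`, `render`) are illustrative, not a real API.

```python
# Sketch of publish-driven invalidation.
cache: dict[str, str] = {}
backend_version = {"/news/live-story": 1}  # stand-in for the CMS state

def render(path: str) -> str:
    # Stand-in for the backend generating the current version.
    return f"version {backend_version[path]} of {path}"

def handle_request(path: str) -> str:
    if path not in cache:
        cache[path] = render(path)   # backend is hit only on a miss
    return cache[path]

def publish_update(path: str) -> None:
    backend_version[path] += 1
    cache.pop(path, None)            # purge: the next request refetches

handle_request("/news/live-story")           # caches version 1
publish_update("/news/live-story")           # editor publishes an update
latest = handle_request("/news/live-story")  # refetches: version 2
```

Between publishes, every request is a cheap cache hit; readers still see the new version on the first request after an update.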
Infrastructure behaviour should adapt to the needs of the product. This article talked about news sites, but other sites have different concerns: things like how long content is cached, and how much you can rely on stale content, differs greatly.
The business needs of news sites mean that we need to handle big spikes on top of otherwise predictable patterns. We need to make sure that the content is accurate and correct, and avoid serving disinformation.
I recommend:
- Identify the desired business behaviours of your product or API.
  - Are there hours with negligible traffic?
  - What causes spikes in traffic?
  - How much does the recency of the content matter?
  - Maybe it’s okay if the site is down for 10 minutes. Maybe it’s not.
- Keep caching logic small and simple.
  - Make sure caches are doing what they do best.
  - Make sure that as many code paths in the cache as possible are tested frequently, in production.
  - Have fallback behaviours, but only when they’re needed.
- Prioritise complexity within caching that enables the intended business behaviours.
  - Identify requests that can be cached for long periods.
  - Make sure they are returned from the cache, reducing the load on backend services.
- Logs, metrics, all that good stuff are important to figure out where time is being spent.
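The long-period caching and fallback points above often combine into serve-stale behaviour: entries get a TTL, and if the backend is down when an entry expires, the cache serves the stale copy rather than an error. A minimal sketch, with an illustrative TTL and a simulated backend outage:

```python
# Sketch: TTL-based caching with a serve-stale fallback.
TTL = 300.0  # seconds; illustrative, tune per route

cache: dict[str, tuple[float, str]] = {}  # path -> (stored_at, body)

def fetch_backend(path: str) -> str:
    raise ConnectionError("backend overloaded")  # simulate an outage

def handle_request(path: str, now: float) -> str:
    entry = cache.get(path)
    if entry and now - entry[0] < TTL:
        return entry[1]                  # fresh hit: no backend involved
    try:
        body = fetch_backend(path)
        cache[path] = (now, body)
        return body
    except ConnectionError:
        if entry:
            return entry[1]              # stale fallback: better than an error
        raise

cache["/sport/final"] = (0.0, "cached report")
fresh = handle_request("/sport/final", 100.0)   # within TTL: cached copy
stale = handle_request("/sport/final", 1000.0)  # expired, backend down: stale copy
```

Note the fallback only activates when it’s needed: while the backend is healthy, expired entries are refreshed normally, and the stale path stays out of the way.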