Drawing the Line(s)
Of course my post on postmortems was followed by a stretch of incidents, the worst of which stemmed from sudden influxes of traffic. When you wave your hands in the air saying “I’ve learned this one lesson!”, the universe makes sure you learn the next in short order.
Every system has critical thresholds. Some you can reason about statically, such as the number of connections a database will accept before it starts refusing them. Other thresholds only appear dynamically. What happens if we wake up this many servers all at once? You find out when it happens!
For the static thresholds you know about, you have the luxury of building so that you never cross those lines. If you set aggressive timeouts for database transactions, then connections can’t pile up behind a long-running transaction. You can also limit the number of machines that can connect to the database. What you’re essentially doing is trading latency for errors. Your requests are always guaranteed to return in a certain amount of time, but they may return with a timeout error!
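As a rough sketch of that latency-for-errors trade, here's a hypothetical bounded pool: callers that can't get a database slot within a deadline get an error immediately instead of queuing up behind a long-running transaction. (The class name, limits, and timeout are illustrative, not Replit's actual implementation.)

```python
import threading


class BoundedPool:
    """Caps concurrent database work. Callers that can't acquire a slot
    within the timeout get an error instead of piling up in a queue."""

    def __init__(self, max_connections: int, acquire_timeout: float):
        self._slots = threading.BoundedSemaphore(max_connections)
        self._timeout = acquire_timeout

    def run(self, query_fn):
        # Trade latency for errors: fail fast rather than wait forever.
        if not self._slots.acquire(timeout=self._timeout):
            raise TimeoutError("connection pool exhausted")
        try:
            return query_fn()
        finally:
            self._slots.release()
```

With this in place, a request is guaranteed to come back within roughly the acquire timeout plus the query's own deadline; the cost is that under load some of those responses are timeout errors rather than results.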
It sounds obvious that you’d want to guarantee that requests return within a certain amount of time or that your app can only use so much memory. In practice you never worry about it until usage forces you to. That’s probably the right tradeoff - most programs don’t exhaust memory and most services don’t see enough traffic to worry about long-running requests. Once you start hitting those thresholds though, you’ve got to aggressively tighten up. You’re not going to stop hitting them! In fact, if you’re growing consistently, then you’re going to start hitting them faster.
Thresholds that only appear out in the wild are trickier. You want to be able to measure whichever resource you’re exhausting, so you can see when it happens. You can reduce the odds that you walk over the line, for example by adding per-account limits, without providing a guarantee. And you most certainly can limit the damage that occurs when you cross the threshold. That’s what the Replit platform team is focused on right now: adding the concept of failure domains to our backend so that when the service gets into a bad state it only affects a subset of our community.
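One way to carve out failure domains is to map each account to a stable slice of the backend, so that when one slice gets into a bad state the blast radius is bounded. This is a minimal sketch of that idea; the domain count and hashing scheme are assumptions for illustration, not a description of Replit's backend.

```python
import hashlib

NUM_DOMAINS = 8  # hypothetical number of failure domains


def failure_domain(account_id: str, num_domains: int = NUM_DOMAINS) -> int:
    """Map an account to a stable failure domain.

    A bad state in one domain's backends then only affects the
    subset of accounts that hash into that domain.
    """
    digest = hashlib.sha256(account_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_domains
```

Using a hash (rather than, say, signup order) keeps the mapping stable across restarts and spreads accounts roughly evenly, so no single domain becomes a hot spot.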
I’m excited! Here at the start of the project, it appears to be Not That Bad (tm), which simply means we haven’t uncovered the nasty surprises yet. It’s making me rethink what I wrote about postmortems. Yes, you want to limit the action items you take out of the postmortem so that there’s a chance you actually complete them. But when a particular pattern keeps recurring (ahem, cascading failure from an overloaded resource), you want to start solving the larger problem instead of the immediate symptom.