just dans

24 Sep 2020

What fix-it-up projects look like at the beginning

In the last week and a half, we’ve set out to simplify Repl.it and make it more stable. It’s a good sign that the best use of our time right now is slimming down the product. It means people are using it! Peak usage tripled over the past month! We keep having to up our GCP quotas!

I’ve run a few of these projects at previous companies. In fact, now that I think about it, everywhere I’ve been has reached a point where people say “we absolutely have to fix this thing over here.” It’s the natural state of running a service for other humans. I’d like to write down what we’re doing as we do it, because you tend to discover what the real problems are along the way and it’s so easy in hindsight to say “we made these changes and then everyone loved us” when really the story goes something like “we started out fixing this over here, but that didn’t help, and then we saw something strange on this graph and that was the ticket.”

When you open a repl, the workspace in your browser opens a web socket connection to a random VM running in GCP. The workspace sends a token over that connection identifying which repl you’re opening. If the repl is already running on another VM, the web socket connection is piped over to that VM. If the repl is not running anywhere else, the randomly-chosen VM will start a container and pipe the web socket to a process we call pid1 that acts on behalf of the user inside the container. From there on out, the workspace and pid1 communicate by passing protobuf messages.
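The routing step above boils down to one decision: is this repl already running somewhere? Here’s a minimal sketch of that decision in Go. The names (`routeRepl`, `replRegistry`) and the in-memory map are illustrative assumptions, not Repl.it’s actual code; the real system tracks running repls across a fleet, not in one process.

```go
package main

import "fmt"

// replRegistry maps a repl ID to the VM currently running it.
// A hypothetical stand-in for whatever shared state the fleet consults.
type replRegistry map[string]string

// routeRepl decides what the randomly-chosen VM should do with an
// incoming web socket: pipe it over to the VM already running the repl,
// or start a fresh container locally and hand the socket to pid1.
func routeRepl(reg replRegistry, replID, localVM string) (targetVM string, startContainer bool) {
	if vm, ok := reg[replID]; ok {
		return vm, false // repl is live elsewhere: pipe the connection over
	}
	reg[replID] = localVM
	return localVM, true // not running anywhere: start a container here
}

func main() {
	reg := replRegistry{"repl-a": "vm-7"}

	vm, start := routeRepl(reg, "repl-a", "vm-3")
	fmt.Println(vm, start) // vm-7 false — attach to the existing container

	vm, start = routeRepl(reg, "repl-b", "vm-3")
	fmt.Println(vm, start) // vm-3 true — this VM starts the container
}
```

Either way, the workspace ends up talking to a pid1 process over the same protobuf protocol, so it doesn’t care which branch was taken.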

There are a lot of advantages to this approach! We’re essentially handing you a fresh computer that you can interact with using a well-defined protocol. But, as anyone who has tried to keep a process running indefinitely knows, things happen. VMs go down. Network latencies spike. So the workspace has to smoothly handle the container disappearing. Despite building a reconnection flow into the workspace (and keeping ourselves honest by running preemptible VMs that can shut down at any time), the community’s top complaint has been that reconnections happen too often and sometimes people lose what they just typed. That’s not good at all!

I won’t steal Faris’s thunder, since he’s been working on this for many weeks and has entered another plane of existence, but we recently reduced the number of times that the workspace cannot automatically reconcile local edits with what’s in the container by 50x. Meanwhile we’ve also started to measure the connection error rate: how often does the workspace attempt to connect to a container and not succeed? Bringing that rate down has pushed us to look at the lifecycle of a VM. What happens once a VM has decided to shut down? What happens right before a VM declares itself unhealthy?
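The connection error rate is a simple ratio: failed connection attempts over total attempts. A minimal sketch of tracking it, with atomic counters so concurrent connection attempts don’t race; the type and method names are made up for illustration, not our real metrics code (in practice this would feed a metrics system rather than print).

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// connStats counts connection attempts and failures.
// atomic.Int64 requires Go 1.19+.
type connStats struct {
	attempts atomic.Int64
	failures atomic.Int64
}

// record notes one connection attempt and whether it succeeded.
func (s *connStats) record(ok bool) {
	s.attempts.Add(1)
	if !ok {
		s.failures.Add(1)
	}
}

// errorRate returns failures/attempts, or 0 before any attempts.
func (s *connStats) errorRate() float64 {
	a := s.attempts.Load()
	if a == 0 {
		return 0
	}
	return float64(s.failures.Load()) / float64(a)
}

func main() {
	var s connStats
	for i := 0; i < 100; i++ {
		s.record(i%25 != 0) // pretend 4 of 100 attempts fail
	}
	fmt.Printf("connection error rate: %.2f%%\n", s.errorRate()*100)
	// connection error rate: 4.00%
}
```

The useful thing about a rate like this is that it gives the VM-lifecycle questions a scoreboard: if draining a shutting-down VM more gracefully works, the number moves.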

And that’s it! We’re busy improving two metrics right now. “Let’s fix this” projects are really, under the hood, “let’s focus on this” projects.