just dans

Making programming (everything) more accessible

09 Nov 2020

Data Processing in Repls

In Wednesday’s FISH, Spencer showed us how he could run Drone CI on Replit by forking repls through the API. He’d mentioned it on Slack a few weeks prior, and I knew it was technically possible (so much becomes possible once you can fork), but it still blew my mind to see it live. Amjad jokingly asked if you could run MapReduce on Replit, and yes: if you can spin up repls from a repl, then you can do it.

While the thought of running a MapReduce cluster open to the world alongside free compute and free hosting makes me nervous, it’s a good, excited nervous. The biggest pain points in large-scale data processing are provisioning, debugging jobs that run for hours and then crash right at the end, getting access to the data you need, and discovering all the ways a string field can be misformatted. With repls you don’t need to worry about provisioning, and you can tuck credentials into the .env file and not worry about access. And the secret to large-scale data processing is to start out not doing it: run jobs over a small slice of the dataset, or several small jobs you can combine quickly instead of one mega job that takes hours. Start with a CSV file right in the repl! It’s a time-honored tradition.
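To make that concrete, here’s roughly what starting small looks like in Go. It’s only a sketch: events.csv and the first-column grouping are stand-ins I made up, not anything from a real dataset.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"log"
	"os"
)

func main() {
	// events.csv is a stand-in name: a small slice of the real dataset,
	// checked right into the repl.
	f, err := os.Open("events.csv")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	r := csv.NewReader(f)
	if _, err := r.Read(); err != nil { // skip the header row
		log.Fatal(err)
	}

	// Count rows per value of the first column - the kind of quick
	// aggregation you'd otherwise be tempted to spin up a cluster for.
	counts := map[string]int{}
	for {
		rec, err := r.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err) // misformatted rows show up immediately at this scale
		}
		counts[rec[0]]++
	}
	for k, v := range counts {
		fmt.Println(k, v)
	}
}
```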

With repls, everything runs in the same environment. There’s no local environment, no dev, no prod. If you can get a repl to the point where it runs reliably, it’s good to go. And everything under the sun is available. I’ve written many jobs that would run just fine locally but fail due to quirks in the Hadoop environment. BigQuery is interactive (there’s a run button!) and fast, but your data has to reside in it; if you need to touch something outside, you’re out of luck. You’ve also got access to 50MB of key-value storage, so you have a prayer of restarting jobs that crash or get stuck.
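That key-value store is what makes restarts plausible. Here’s a sketch of how I’d checkpoint a long job against it, going through the store’s HTTP interface in the REPLIT_DB_URL environment variable; the last_row key and the row loop are invented for illustration, not code from an actual job.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
	"strconv"
	"strings"
)

// setKey and getKey are hypothetical helpers around the repl's key-value
// store, which the repl sees as a plain HTTP endpoint in REPLIT_DB_URL.
func setKey(key, value string) error {
	body := url.Values{key: {value}}.Encode()
	resp, err := http.Post(os.Getenv("REPLIT_DB_URL"),
		"application/x-www-form-urlencoded", strings.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return nil
}

func getKey(key string) (string, error) {
	resp, err := http.Get(os.Getenv("REPLIT_DB_URL") + "/" + url.PathEscape(key))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	b, err := io.ReadAll(resp.Body)
	return string(b), err
}

func main() {
	// Resume from the last checkpointed row, if there is one.
	start := 0
	if v, err := getKey("last_row"); err == nil && v != "" {
		start, _ = strconv.Atoi(v)
	}

	for row := start; row < 1_000_000; row++ {
		// ... process row ...

		// Checkpoint every 10k rows so a crashed or stuck job can
		// pick up roughly where it left off.
		if row%10_000 == 0 {
			if err := setKey("last_row", strconv.Itoa(row)); err != nil {
				fmt.Println("checkpoint failed:", err)
			}
		}
	}
}
```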

Last night I wrote a Go program that lists all the BigQuery datasets in a project. As if to prove my point that the hardest part of data processing is getting access, I spent close to an hour staring at invalid scope errors and googling for examples of how to do this. I even deleted a service account and created a new one with a different role in case that helped! Searching for bigquery credentialsfromjson example ended up being the ticket: it led me to working examples. You’ve got to put the scope on the credentials when you create them! Of course!
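The shape of the fix looks something like this. It’s a minimal sketch: GOOGLE_CREDENTIALS_JSON and PROJECT_ID are env var names I’m using for illustration, not necessarily what the repl’s .env actually calls them.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"os"

	"cloud.google.com/go/bigquery"
	"golang.org/x/oauth2/google"
	"google.golang.org/api/iterator"
	"google.golang.org/api/option"
)

func main() {
	ctx := context.Background()

	// Service-account JSON tucked into the repl's .env file.
	// The crucial part: the scope goes on the credentials when you create them.
	creds, err := google.CredentialsFromJSON(
		ctx,
		[]byte(os.Getenv("GOOGLE_CREDENTIALS_JSON")),
		bigquery.Scope,
	)
	if err != nil {
		log.Fatalf("building credentials: %v", err)
	}

	client, err := bigquery.NewClient(ctx, os.Getenv("PROJECT_ID"),
		option.WithCredentials(creds))
	if err != nil {
		log.Fatalf("creating client: %v", err)
	}
	defer client.Close()

	// Walk every dataset in the project and print its ID.
	it := client.Datasets(ctx)
	for {
		ds, err := it.Next()
		if err == iterator.Done {
			break
		}
		if err != nil {
			log.Fatalf("listing datasets: %v", err)
		}
		fmt.Println(ds.DatasetID)
	}
}
```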

Victory!

If you ever find yourself saying, “hey, I could really go for listing some BigQuery datasets right now,” then this repl is for you! I’m excited to have access to all our data from inside a repl; it’ll be great to move our data processing from the old world to the new.