Finding the OSS Backend Stack That Best Resembles Google Infrastructure

Callertree

Feb 4, 2022 • 4 min read

A senior Google engineer’s adventures selecting a software stack

If you’re a Google engineer who has grown up working on its famed ‘tech island’ (a metaphor for the invisible but sprawling internal software infrastructure that powers the company), working with open source can be daunting. If you’re selecting a software stack for your project that maps well to what the rest of the industry uses, read on.

Historically, many pieces of large-scale backend infrastructure trace back to systems pioneered at Google: Hadoop’s file system (HDFS) is based on the Google File System, HBase on Google Bigtable, and Kubernetes on Google’s Borg. Today, however, the landscape is a lot messier.

When we started building Callertree, I was overwhelmed by the number of possible options for my backend software stack. Issues that were previously no-brainers now needed a careful decision: Do I force myself to use Protocol Buffers (Google’s object serialization library) in my code, or just rely on vanilla JSON? Do I build apps that directly use the host and OS libraries like at Google, or do I use Docker as an abstraction layer?

Here are some of the technologies I chose and some notes about what I learned:

Software

Server Programming Language — Python

At Google, I worked primarily in C++ on high-performance databases where each millisecond mattered, but I didn’t really need that for Callertree. Rust was out for the same reason. I cared more about prototyping quickly, getting my features out on time and leveraging great library support, so it came down to Python or Go. We ended up using Python because I was more familiar with it, but I’d like to try out Go for my next project, especially given

App Server Software — Django

Now that we had picked Python, we debated between a lightweight framework like Flask and a ‘heavier’ framework like Django. I was more familiar with Flask, but for Callertree I really liked the idea of structuring the system around the MVC (model-view-controller) pattern, which Django calls MTV (model-template-view). I especially liked Models (Django’s ORM system), getting a free admin page, and not having to re-implement the boring stuff like authentication, form validation, management scripts, etc. Overall I’m happy with this decision except for one big thing: I hate Django migrations. I eventually learned to live with them, but if you’d like to collaborate on a Django migration system that doesn’t suck, reach out to me.

Message Passing Infrastructure — RabbitMQ

I didn’t need something super-inexpensive like Kafka and just wanted something simple and expressive with great completeness guarantees (basically, it finishes all the work it is supposed to). Many new engineers try to make their user-facing servers do things that are better handled by message passing infrastructure, and using RabbitMQ saved us a ton of time and money. We loved RabbitMQ and the Celery library for Django. A small nitpick is that it was a slight pain to maintain because they removed some flags, which caused old versions to crash, but nothing too bad.
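To make the point about user-facing servers concrete, here is a minimal sketch of the pattern: the request handler only enqueues a message and returns, while a separate worker does the slow work. It uses Python's standard-library `queue` and `threading` as a stand-in for a real broker, so it is not RabbitMQ or Celery code, and the names `handle_request`, `worker`, `run_demo` and the `send_alert` task are hypothetical.

```python
import queue
import threading


def handle_request(task_queue, user_id):
    """Hypothetical user-facing handler: enqueue the slow work, return at once."""
    task_queue.put({"task": "send_alert", "user_id": user_id})
    return "accepted"


def worker(task_queue, done):
    """Background consumer: pulls messages until it sees the None sentinel."""
    while True:
        message = task_queue.get()
        try:
            if message is None:  # sentinel: no more work is coming
                return
            # The slow work (email, SMS, report generation) would happen here.
            done.append(f"{message['task']} done for user {message['user_id']}")
        finally:
            task_queue.task_done()


def run_demo():
    # In-process stand-in for the broker; a real deployment would use RabbitMQ.
    task_queue = queue.Queue()
    done = []
    thread = threading.Thread(target=worker, args=(task_queue, done))
    thread.start()
    responses = [handle_request(task_queue, uid) for uid in (1, 2, 3)]
    task_queue.put(None)  # tell the worker to stop after draining the queue
    task_queue.join()     # block until every message has been processed
    thread.join()
    return responses, done
```

The handler's response does not depend on the work finishing, which is what keeps user-facing latency low. A real broker adds what this sketch lacks: persistence, acknowledgements and retries across processes and machines.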

Database — PostgreSQL

Working with Django + PostgreSQL is pretty easy. We didn’t really need a massive database at the scale of terabytes of data with massive write throughput, so we didn’t really look at NoSQL. A SQL database like Postgres worked great for us. Django integrates pretty neatly with Postgres and backups were easy to set up, so we liked it overall. Any problems we had mostly had to do with Django migrations rather than PostgreSQL itself.
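As a rough illustration of how little wiring the Django + Postgres integration needs, the sketch below builds the `DATABASES` setting that would live in a project's `settings.py`. The `ENGINE` value and the dictionary keys are Django's standard ones; the `build_databases` helper, the environment variable names and the default values are assumptions made for this example, not Callertree's actual configuration.

```python
import os


def build_databases(env=os.environ):
    """Return a Django-style DATABASES dict pointing at PostgreSQL.

    The environment variable names and defaults are hypothetical.
    """
    return {
        "default": {
            "ENGINE": "django.db.backends.postgresql",  # Django's built-in Postgres backend
            "NAME": env.get("POSTGRES_DB", "app_dev"),
            "USER": env.get("POSTGRES_USER", "app"),
            "PASSWORD": env.get("POSTGRES_PASSWORD", ""),
            "HOST": env.get("POSTGRES_HOST", "localhost"),
            "PORT": env.get("POSTGRES_PORT", "5432"),
        }
    }


# In settings.py, Django reads this module-level name.
DATABASES = build_databases()
```

Reading the connection details from the environment keeps credentials out of source control and lets the same settings file work on a dev machine and inside a container.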

DevOps

Containerization — Docker

Initially I didn’t even think we needed Docker and instead coupled my app tightly with my Linux kernel, but I was convinced to use it by a good friend who happens to be a DevOps expert. What I really liked about it was being able to use Docker recipes for my PostgreSQL database and RabbitMQ message passing infrastructure. Dealing with backups, updates and maintenance was super easy, and I felt comfortable being able to swap things out. What I really hate about Docker is how it treats your machine’s disk and pollutes it with build artifacts in hidden places. My dev machine started getting very slow when I added Docker builds to my workflow, and I couldn’t immediately figure out which folders were being polluted and why. Next time, I want to experiment with containerd as an alternative.

Compute Scheduling and Management — Kubernetes

Use Kubernetes for your backends, guys. We save 1000s of hours daily using Borg inside Google, and Kubernetes (the open-source system inspired by Borg) saved me a lot of time deploying and managing my infrastructure. Honestly, any project where you need more than 2 machines and expect to maintain it for more than a month would benefit from Kubernetes.

Source Control — GitHub

Believe it or not, our source control decisions were the most contentious. We argued about them a lot. One main question was whether to use GitLab, which had great features around team sizes for private repositories and release management. However, one engineer was convinced that GitLab had lost his commits and corrupted the repository, so we didn’t use it. The other issue was branch-based development. Google famously uses a monorepo with small, bite-sized commits added directly to the main ‘branch’, and initially we didn’t adopt branch-based development because we didn’t want to deal with difficult merges. However, having a ton of tiny commits made looking through ‘blame’ history very hard. If I were to start a new project today, I would support a branch-based model where engineers have an understanding to merge frequently.

A few of us who previously held engineering leadership positions at Google, Amazon and other companies noticed a reliability and alerting gap that companies today are grappling with, and built

https://callertree.com/

Companies with 1000s of employees love working with us to revolutionize their business continuity and reliability systems. We recently launched a free-to-use version of our system! Let us know what you think; we’d love to work with you.