How Google never goes down with World Class SRE and Dev Operations
Perspectives of a Google tech lead
I worked at Google straight out of college, and one of the things that blew my mind was how few outages the company had. With over 35 million commits and over 2B lines of code from 100K+ engineers behind the scenes, users got a largely seamless experience.
As I progressed through my career, I witnessed and spearheaded scrambles to resolve issues that cost the company several tens of thousands of dollars in revenue. To visualize this, imagine a flamethrower scorching a room full of money while you figure out how to turn it off.
Culture
From the first orientation session to the criteria for career growth, Google emphasized a culture of excellence in reliability. Engineers were taught from day 1 to plan thoroughly for failure modes, redundancy, observability, and alerting. Designs were thoroughly vetted for reliability, and promotion cases could be make-or-break based on it. Another key idea was sustained reliability and performance: with alerts and well-defined escalation pathways in place, problems were never swept under the rug and were addressed consistently over the long term. Finally, instead of passing blame, we assumed good faith and were expected to address underlying causes as part of our 'blameless culture'.
Site Reliability Engineering
Site reliability engineers are critical to the smooth functioning of Google's diverse services. To continue the analogy of setting money on fire, part of an SRE's job is to be a firefighter. Beyond that, they advise on and build reliability and automation capabilities, and they help developer teams come up with clear mitigation strategies and team-level ownership structures.
Alerting Infrastructure
Google has systems in place to observe erroneous behavior, crashes, and other anomalies and to automatically send alerts to the teams responsible for the affected systems. Teams and on-call engineers receive alerts (often on their phones) and immediately start debugging, escalating further to pull in more people to put out fires. Depending on the scale of the issue, escalation pathways can pull in senior leadership to devote extra resources to fixing the problem. The company believes in a policy of no heroes, i.e., systems should be in place to fix problems, not the heroics of individuals.
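To make that idea concrete, here is a minimal sketch of what routing an alert to an owning team and escalating it can look like. This is not Google's internal tooling; the service names, contacts, and five-minute acknowledgement deadline below are illustrative assumptions.

```python
# Minimal sketch (not Google's actual tooling): route an alert to the owning
# team's on-call engineer and escalate if it isn't acknowledged in time.
# Service names, contacts, and thresholds are illustrative assumptions.
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class Alert:
    service: str
    summary: str
    severity: str          # e.g. "page" for urgent issues vs. "ticket" for routine ones
    fired_at: datetime

# Hypothetical ownership map: each service has an owning team, an on-call
# contact, and an escalation contact (e.g. the team's lead or a senior SRE).
OWNERSHIP = {
    "checkout-api": {
        "team": "payments-sre",
        "oncall": "oncall@example.com",
        "escalation": "team-lead@example.com",
    },
}

ACK_DEADLINE = timedelta(minutes=5)   # escalate if not acknowledged in time

def page(recipient: str, alert: Alert) -> None:
    # Stand-in for a real paging integration (SMS, app push, etc.).
    print(f"PAGE {recipient}: [{alert.severity}] {alert.service}: {alert.summary}")

def route_alert(alert: Alert, acked_at: Optional[datetime] = None) -> None:
    owner = OWNERSHIP.get(alert.service)
    if owner is None:
        # No registered owner: that gap is itself worth alerting on.
        page("unowned-services@example.com", alert)
        return
    page(owner["oncall"], alert)
    # If the on-call engineer hasn't acknowledged within the deadline,
    # pull in the next rung of the escalation pathway.
    if acked_at is None or acked_at - alert.fired_at > ACK_DEADLINE:
        page(owner["escalation"], alert)

if __name__ == "__main__":
    alert = Alert("checkout-api", "error rate above 5% for 10 min", "page", datetime.now())
    route_alert(alert)  # no ack recorded: pages on-call, then escalates
```

The point of the sketch is the structure, not the code: alerts map to explicit team ownership, and escalation is automatic rather than dependent on someone noticing.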
Takeaways
During my time there, I came to understand the confluence of culture, systems, and engineering that kept Google running. Billions of people rely on Google daily for their livelihoods, and even a small outage lasting an hour can have devastating consequences. Yet every company of every size grapples with outages and reliability problems, and failing to address them properly hurts customer experience, the bottom line, and employee morale.
Companies should strongly consider adopting simple alerting infrastructure coupled with clear escalation systems and team ownership; doing so can significantly improve operations.
A few of us who previously held engineering leadership positions at Google, Amazon, and elsewhere noticed a reliability and alerting gap that companies today are grappling with, and we built a system to close it.
Companies with thousands of employees love working with us to revolutionize their business continuity and reliability systems. We recently launched a free-to-use version of our system! Let us know what you think; we'd love to work with you.