Even Cowboy Coders Get the Site Reliability Blues*
The principles behind building reliable distributed systems and gracefully managing changes in them turn out to look a lot like the principles behind building psychologically safe communities. In this talk, I'll explain some of the basic principles behind site reliability engineering and how they relate to feminist and social justice ideas. This talk assumes no prior knowledge of site reliability engineering.
Building more just societies is often framed as independent of, or a distraction from, building robust computer systems. But in fact, people who reject social justice and psychologically safe spaces also reject the principles needed to build reliable distributed systems.
Saying “be excellent to each other” doesn’t make a community safe, and building reliable systems isn’t that simple either. Both people and machines require structured processes, protocols, and unambiguous communication in order to work together smoothly. Site reliability engineers (SREs) say that “hope is not a plan.” Likewise, “be excellent to each other” is not a plan for a good community. While developing software can in theory be done by loosely coupled individuals (the much-vaunted “cowboy coders”), maintaining that software and keeping it running in production isn’t a job for cowboys. Keeping the lights on and the servers responding promptly isn’t glamorous work, but by taking care of the systems that we and our colleagues have built, we take care of each other.
Site reliability engineering is about safely making changes to running systems. As such, everything about it is diametrically opposed to cowboy coding. Cowboy coders imagine programming as a libertarian free-for-all in which there are no rules, only brilliant programmers rejoicing in intellectual freedom. But maintaining systems amounts to building safe space:
- SREs use canaries to test new changes: they deploy a change so that only a small percentage of end users see it at first, rather than all of them at once. Likewise, social justice activists support sensitive people as emotional canaries[*], who may notice problems in communities before less sensitive people do. Activists respect people who have feelings, just as SREs observe and notice the effects of a change on canary machines rather than blaming the machines.
- SREs accept that failure is inevitable and that they will make mistakes. They do blameless postmortems, in which teams work together to analyze the root causes of a failure with a focus on preventing it from happening again, not on assigning blame. Likewise, social-justice-minded people try to understand the root causes of social problems, finding flaws in social structures rather than blaming individuals for their poor character. In both cases, it’s about observation, experimentation, and learning from failures.
- SREs use service-level agreements (SLAs) to make explicit their commitments to the users of their systems. You can think of an SLA as a social contract for a networked entity. Likewise, social justice activists have been focusing on instituting codes of conduct for the same reason: to make promises and expectations explicit rather than implicit. In both cases, clearly declaring expectations is important for success.
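For readers who have never seen a canary rollout, here is a minimal sketch in Python of the first idea above: showing a change to only a small percentage of users. Everything named here (`CANARY_PERCENT`, `in_canary`, `pick_version`, the 5% figure) is a hypothetical illustration, not a description of any particular production system.

```python
import hashlib

# Hypothetical setting: share of users (0-100) who see the new change.
CANARY_PERCENT = 5


def in_canary(user_id: str, percent: int = CANARY_PERCENT) -> bool:
    """Deterministically decide whether a user belongs to the canary group.

    Hashing the user ID means the same user always lands in the same
    bucket, so nobody flips between old and new behavior mid-rollout.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # bucket in 0..99
    return bucket < percent


def pick_version(user_id: str) -> str:
    """Return which version of the service this user should be served."""
    return "new" if in_canary(user_id) else "stable"
```

If the canary group starts reporting errors, the percentage goes back to zero and the team investigates the change; nobody blames the canaries for noticing.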
Neither building reliable systems nor creating psychological safety is a matter of common sense. SREs can tell you the former, and social justice activists can tell you the latter. We aren’t born knowing how to coordinate with other people and have low-friction social interactions, any more than we are born knowing how to build scalable and reliable computer systems. Running production systems is hard because of the need to coordinate and handle changes in a non-uniform system where failures happen constantly. Social change is hard for the same reason. When either a system remains reliable or a social movement succeeds, it’s not an accident; it takes hard work and planning.
In this talk, I will introduce you to the basic principles behind site reliability engineering, and show how each technical principle relates to feminist or social justice principles. Pushing software to the limit also requires pushing your empathy to the limit; I will show how the distinction between hard and soft skills is mostly notional.
[*] Thanks to users quartzpebble and kaberett at dreamwidth.org for the term “emotional canary.”
social justice, site reliability engineering, sre, distributed systems
I spoke at Open Source Bridge 2013 (audio: http://opensourcebridge.org/sessions/970 ). I have also spoken at the Haskell Implementors' Workshop and at the Portland Functional Programming Study Group (pdxfunc) and have been a guest lecturer at Mills College.
I’m a site reliability engineer at Google with a previous life as a functional programming and compiler hacker. I studied computer science at Portland State University; the University of California, Berkeley; and Wellesley College. I’ve been contributing to open-source projects for over a decade and am a former contributor to geekfeminism.org and the Geek Feminism Wiki. In my copious free time, I like to write, ride my bicycle, and play with my cats.