In this interview, Netflix’s Casey Rosenthal explains how to engineer trust within complex systems. He describes what it’s like to work at Netflix, how the company maintains complexity without sacrificing speed, and why not all of its teams follow agile practices.
Josiah Renaudin: Welcome back to another TechWell interview. Today I'm joined by Casey Rosenthal, the engineering manager for the traffic team and chaos team at Netflix, and a keynote speaker at this year's STARWEST. Casey, thank you very much for joining us today.
Casey Rosenthal: Thank you, I'm glad to be able to keynote.
Josiah Renaudin: Absolutely, and before we do actually cover the keynote material, can you explain what it's like to work at Netflix? I think it's something that a lot of people would want to know. Can you describe what those two teams are like that you work for?
Casey Rosenthal: Sure. Working at Netflix, just in general, is great. There's a culture deck out on the internet that Netflix put out a while back, that some of the listeners might be familiar with, that outlines Netflix's culture. I think that highlights a lot of things that make working at Netflix much different from working at another tech company. Certainly different from working at another company in this area. Things like, we're very averse to process. We emphasize freedom and responsibility. We don't have budgets. We only hire senior engineers. That sort of thing. That makes it a really dynamic environment that moves very quickly and does things very well.
The chaos team is responsible for overseeing the chaos engineering that we do here at Netflix. We defined chaos engineering and we published some principles behind it, if anybody's interested in reading up on the formal principles. Basically, we see chaos engineering as a new discipline within software engineering, designed to surface systemic effects in distributed systems, particularly ones like ours, where we have so many subscribers and are running at scale. It's really about identifying the systemic effects that a complicated application and infrastructure like ours tend to have.
On the flip side, the traffic team is responsible for remediating failure when we do run into some of the systemic effects that could potentially bring a service down. On the traffic side, we're actually frequently, or on a not infrequent basis, I should say, called upon to shift traffic around the globe when one of the regions that we're operating in suffers a failure, either self-inflicted from a bad code push, or from an infrastructure problem with our IaaS provider, our cloud provider, AWS, or with the Internet in general. Sometimes there are big problems with the ISPs. I kid you not, people shooting at internet connects in the Midwest of this country. For fun, apparently. Backhoes cutting cables, those types of things. The traffic team's responsible for moving our control plane traffic around the globe to get around infrastructure or software problems within a particular region. We actually see domain failures in that respect on a fairly regular basis.
Josiah Renaudin: Beyond the actual chaos team you work with at Netflix, your keynote tackles chaos engineering and intuition engineering. What do these terms mean in the context of software and test engineering?
Casey Rosenthal: Sure. Chaos engineering, like you said, is really about surfacing these systemic issues. We're moving towards a place in software where it doesn't make a lot of sense, or it's not practical, to build applications as one large thing, where all of the pieces were planned out by an individual or a small group of individuals. Instead, we're seeing systems where people or small teams are building components that have well-defined contracts between them. As long as the contracts are reasonable and adhered to, people can focus on just building their smaller component. For Netflix, we call this a microservice architecture. We've got many microservices running within the larger product that our subscribers see.