How Netflix Embraces Complexity without Sacrificing Speed: An Interview with Casey Rosenthal

interview

October 7, 2016

Summary

In this interview, Netflix’s Casey Rosenthal explains how to engineer trust within complex systems. He describes what it’s like to work at Netflix, how the company embraces complexity without sacrificing speed, and why not all of its teams follow agile practices.

Josiah Renaudin: Welcome back to another TechWell interview. Today I'm joined by Casey Rosenthal, the engineering manager for the traffic team and chaos team at Netflix, and a keynote speaker at this year's STARWEST. Casey, thank you very much for joining us today.

Casey Rosenthal: Thank you, I'm glad to be able to keynote.

Josiah Renaudin: Absolutely, and before we do actually cover the keynote material, can you explain what it's like to work at Netflix? I think it's something that a lot of people would want to know. Can you describe what those two teams are like that you work for?

Casey Rosenthal: Sure. Working at Netflix, just in general, is great. There's a culture deck out on the internet that Netflix put out a while back, that some of the listeners might be familiar with, that outlines Netflix's culture. I think that highlights a lot of things that make working at Netflix much different from working at another tech company. Certainly different from working at another company in this area. Things like, we're very averse to process. We emphasize freedom and responsibility. We don't have budgets. We only hire senior engineers. That sort of thing. That makes it a really dynamic environment that moves very quickly and does things very well.

The chaos team is responsible for overseeing the chaos engineering that we do here at Netflix. We defined chaos engineering and we published some principles behind it at principlesofchaos.org, if anybody's interested in reading up on the formal principles. We basically see chaos engineering as a new discipline within software engineering, designed to surface systemic effects in distributed systems, particularly ones like ours, where we have so many subscribers and are running at scale. It's really about identifying the systemic effects that a complicated application and infrastructure like ours tend to have.

On the flip side, the traffic team is responsible for remediating failure when we do run into some of the systemic effects that could potentially bring a service down. On the traffic side, we're actually frequently, on some not infrequent basis I should say, called on to shift traffic around the globe when one of the regions that we're operating in suffers a failure, either self-inflicted from a bad code push, or from an infrastructure problem with our IaaS provider, our cloud provider, AWS, or with the Internet in general. Sometimes there are big problems with the ISPs. I kid you not, people shooting at internet connects in the Midwest of this country. For fun, apparently. Backhoes cutting cables, those types of things. The traffic team's responsible for moving our control plane traffic around the globe to get around infrastructure or software problems within a particular region. We actually see domain failures in that respect on a fairly regular basis.

Josiah Renaudin: Beyond the actual chaos team you work with at Netflix, your keynote tackles chaos engineering and intuition engineering. What do these terms mean in the context of software and test engineering?

Casey Rosenthal: Sure. Chaos engineering, like you said, is really about surfacing these systemic issues. We're moving towards a place in software where it doesn't make a lot of sense, or it's not practical, to build applications as one large thing, where all of the pieces were planned out by an individual or a small group of individuals. Instead, we're seeing systems where people or small teams are building components that have well-defined contracts between them. As long as the contracts are reasonable and adhered to, people can focus on just building their smaller component. At Netflix, we call this a microservice architecture. We've got many microservices running within the larger product that our subscribers see.

Chaos engineering is specifically a way of finding problems with that architecture, preferably before they affect our subscribers, our customers. Intuition engineering is also focused on these systemic effects and on helping us to understand the behavior of a system as a whole. Intuition engineering, though, is focused on building tools that humans can interact with in a way that's very natural to them, yet that communicate a large amount of complexity to them in a very short period of time. We are essentially trying to build tools that give a human an intuition about the behavior of a very complex system. You can contrast that with the way a lot of NOCs or data centers or large operations currently work, where they'll present the people responsible for availability, for uptime, with a lot of charts and graphs and numerical readings, things that have to be parsed and really well understood in order to make sense of them.

Josiah Renaudin: We were talking about load a bit earlier. I think, I've always wondered—do you prepare in a certain way, let's say when a new show hits Netflix that you know it’s going to be popular, like Narcos season two or something like that. Do any of the teams you work for have to have contingency plans for if too many people are accessing the service at once, or does that really matter?

Casey Rosenthal: There are two components to our scale. One is our CDN, which handles all of the actual video content. It's by far the largest CDN in the world. We are over a third of the bits on the internet at peak. That's quite a lot of traffic. The other component is our control plane, which handles all of our subscribers' interactions with our service on thousands of devices. Everything from Blu-ray players from 2007 to your smartphone is hitting this control plane. Those run in two places. The control plane all runs in AWS, and the CDN all runs on our own hardware that we have installed in POPs around the globe.

For the control plane, even when we have a very popular show release, our control plane is generally scaled well enough to handle a burst in traffic. Because we have so many users across so many regions around the globe, a large burst for us tends to not be as bursty as for other businesses. Even if we release Narcos season two at nine o'clock p.m. on a Friday, a lot of people aren't going to start viewing it right then. That does get spread out for us, partially due to the fact that we're global. On the CDN side, I don't want to speak too much for that part, but those devices, our CDN appliances, have a system of peering traffic that does a really good job of getting hot files distributed quite evenly. So far, we haven't run into that as a problem.

Josiah Renaudin: Gotcha. Netflix is forward-thinking, to put it lightly, you've always been ahead of the curve. I feel like just now, a lot of different software teams in the business I work in are adopting agile, and really getting into agile. Is the agile methodology something that Netflix has been tied to since the beginning, that iterative thought process?

Casey Rosenthal: No, not really. We only hire senior engineers, and because we don't have process, because we actually go out of our way to avoid process, there's no mechanism for us to even evaluate whether everyone is following agile, let alone recommend or enforce it on engineering teams. It's a lot easier to think of Netflix as eighty small engineering teams than as one large engineering team. We don't have a chief architect. We don't have a VP of engineering who oversees all of our engineering teams. That infrastructure just doesn't exist here. There's nobody who can say, "Oh well, we think we're going to see a benefit from agile, so everyone will use agile now." It doesn't happen; there's no mechanism for that to happen.

While I'm sure that there are some teams here that subscribe to agile methodologies, I would say that in a lot of cases it doesn't even make sense for some teams to adhere to that. For example, one of the agile principles is that you have daily interactions with business people, with stakeholders. I can think of several services here that understand their domain space well enough that they don't have to interact with the stakeholders on that frequent a basis. There are other services here that are building things that will improve Netflix generally, but they might not have stakeholders. They might not have business counterparts. It wouldn't make sense for them to have that kind of structure. I'd say generally, Netflix moves very quickly, and most teams that I've seen iterate in production very regularly, so the principle that the best measure of success is working software is definitely something that I see a lot of. We certainly don't explicitly adhere to agile.

Josiah Renaudin: That makes sense. You mentioned that at Netflix you need to move very quickly, and more often than not, it feels like we almost need to compromise speed in order to create more complex services and software, and vice versa. How does Netflix, and how can other companies, too, if you want to give advice for other people, embrace complexity without really sacrificing the necessary velocity?

Casey Rosenthal: That's a really good question. Chaos engineering and intuition engineering, and traffic engineering for that matter, are all services that are designed to help embrace that complexity and allow the organization to still move quickly. Chaos engineering does that by building tools to find problems in that complexity that humans won't normally find. Intuition engineering allows humans to have a better understanding of very complex systems without necessarily having to know how things work under the hood. Traffic engineering is a way to remediate problems from our loosely coupled development model, which gives some assurance that teams can move quickly: if the worst-case scenario is that they bring a region down, they can rely on the traffic team to evacuate customers from that region, so that it doesn't affect our availability.

Having teams like that in place certainly helps, but I would also say Netflix benefits, in that respect, from only hiring senior engineers and really focusing on always raising the bar in our talent. I understand that financially, not every company can do that. In cases like ours, where iterating and innovating rapidly are critical to our success and our survival, it really isn't a choice for us. It's a necessity. Many engineers can move quickly and write a lot of code, but really knowing that you're doing the right thing, doing the correct thing quickly, comes with experience. While junior engineers may be great engineers and great at writing code, and they might also be moving very quickly and writing a lot of software, because they don't have the benefit of a lot of experience, they might be doing the wrong thing. They might be adding complexity in ways that don't reflect high feature velocity.

Josiah Renaudin: Speaking of necessity, something you guys have to think about is security. Of course, you have so many users, and there's so much personal data involved with something like Netflix. There are a lot of other companies that have dealt with major security breaches, and with how they respond after the breach and what they do for their users. What resources does Netflix allocate to security since you, once again like I said, carry so many people's personal data?

Casey Rosenthal: We take security very seriously, and we have a very large security team that focuses on all of the right pieces for ensuring that our subscribers are protected. I can't comment too much beyond that, because I'm not on the security team. It's certainly our top priority to make sure that our subscribers' personal data is kept safe. Fortunately, aside from the billing information, we are an entertainment company, so there's not too much information that we actually need to have on our subscribers. I don't think people would find it too useful to know the viewing patterns of others. We still take it very seriously.

Josiah Renaudin: I think we did touch on this subject a little bit earlier. I'll ask this specific question to get a more specific answer. Why does Netflix choose to optimize for development velocity instead of performance or availability?

Casey Rosenthal: It's not exclusionary; we optimize for all of those things. I think one of the interesting things, coming from years as a consultant, is that you see a lot of companies, when they start out on a project or a division or developing their engineering culture, optimize for one of three things: performance, availability, or fault tolerance. It usually takes experience, a more nuanced approach and a more experienced team, to know how to optimize for all three things simultaneously, without over-optimizing for performance, for example, and not paying attention to what that means for the availability and fault tolerance of the system that they're building.

Without citing specific examples, you can see this a lot in fintech, particularly at older banking companies, for example, where they will either optimize for fault tolerance, making sure they never lose the records, but not have good availability, or they optimize for performance, think high-speed trading and that sort of thing, but don't have a good availability or fault tolerance story. One sign of maturity in the engineering team and the culture and even the project itself is if they can optimize for all three of those things simultaneously. Netflix explicitly adds a fourth component to that, which is feature velocity. That's something that pervades our culture. When we make decisions, we're being thoughtful about trade-offs to availability, and potentially fault tolerance and performance, when we decide specifically to get features out faster.

We also, by being thoughtful about how those things interact, are able to make improvements in all four of those areas simultaneously. I think that's a really important part of our engineering culture, and it's not something that I see too often at other engineering companies, where they're trying to balance out their investment in their engineering culture for feature velocity.

Josiah Renaudin: All right, great, and I don't want to give away the entirety of your keynote, Casey. I do appreciate the time, again, but just to summarize, more than anything, what central message do you want to really leave with your audience in Anaheim, after you give your keynote?

Casey Rosenthal: I hope they come away with an appreciation for chaos engineering and intuition engineering and these new disciplines that we're starting to explore for better understanding complex systems. I think that appreciation's really important, because we're at a point where, as engineers, we're going to be more and more removed from having insight into the internal operation of systems. AI is a good example of this, where in a neural network, you can't crack it open and see why it reasoned the way that it did.

As the systems we work with become more and more like black boxes to us, it becomes more important for quality assurance people and testing engineers to have tools that they can use to describe the properties of the black boxes and give us confidence in the operation of the black box systems. I really see that as a strong area for the field of test engineering to pioneer, and that's how chaos engineering and intuition engineering play into that, by giving us insight into these systems and their properties without having to understand how they work internally.

Josiah Renaudin: Great. I'm looking forward to hearing the full thing. I'll be in Anaheim and hopefully a lot of people will be there, too. Thank you again, Casey. Appreciate the time and look forward to once again, hearing the full version of the keynote at STARWEST.

Casey Rosenthal: Yeah, thank you so much. I look forward to meeting you there.

Casey Rosenthal is the engineering manager for the Traffic team and the Chaos team at Netflix. Previously an executive manager and senior architect, Casey has managed teams to tackle Big Data, architect solutions to difficult problems, and train others to do the same. He finds opportunities to leverage his experience with distributed systems, artificial intelligence, translating novel algorithms and academia into working models, and selling a vision of the possible to clients and colleagues alike. For fun, Casey models human behavior using personality profiles in Ruby, Erlang, Elixir, Prolog, and Scala. Follow Casey on Twitter or on LinkedIn.

Topics:

agile development, performance monitoring, performance testing, quality assurance, teams, test design

About The Author

Josiah Renaudin

A long-time freelancer in the tech industry, Josiah Renaudin is now a web content producer and writer for TechWell, StickyMinds, and Better Software magazine. Previously, he wrote for popular video game journalism websites like GameSpot, IGN, and Paste Magazine, where he published reviews, interviews, and long-form features. Josiah has been immersed in games since he was young, but more than anything, he enjoys covering the tech industry at large.