This article aims to give a simple picture and a high-level view of how one of the world’s leading OTT platforms, Netflix, works. As a Netflix user, I enjoy most of its content as well as the smooth experience, which makes it my preferred choice over Amazon Prime. Netflix has over 110 million subscribers, operates in more than 190 countries, and plays more than 1 billion hours of video each week. This is one of the reasons I love to peek behind the scenes of its operations; it is quite fascinating.
Netflix was launched in 1998, but it was only in the mid-2000s that the internet became fast enough, and bandwidth costs dropped sufficiently, for customers to download movies over the net. YouTube was gaining popularity as a streaming service, and in 2007 Netflix scrapped its hardware device concept and launched its video-on-demand service.
One of the most important considerations behind designing a system is scalability. In 2007, EC2 had only just launched. Netflix had started its own streaming service to cater to high on-demand streaming, so it built its own datacenters and adopted vertical scaling. But membership was growing rapidly, and that demands a reliable infrastructure with no single point of failure. An outage in 2008 in one of its datacenters made Netflix realize that its primary focus must be delivering video rather than building datacenters, and it shifted to horizontal scaling.
Netflix chose AWS, as it lets you focus on your customers rather than on the heavy lifting of racking, stacking and powering servers.
Netflix operates on two cloud computing services: AWS and Open Connect (OC). Netflix comprises three main components: the client, the backend and OC.
The client is any user interface device on which you can play Netflix videos and enjoy the streaming. Netflix develops its own iOS and Android apps to provide the best viewing experience on each and every device, and through its SDK it has control over the smart-TV apps as well. Every Netflix app makes requests to AWS and plays video using the SDK.
So what happens when we press play on a video in Netflix? The whole flow can be divided into two parts:
- The architecture that serves the video itself, i.e. Open Connect
- The architecture that handles everything except video delivery, and must scale, i.e. AWS
When the client clicks the play button, the request goes from the device to AWS and then to an OCA on the Netflix CDN to stream the video. The Netflix CDN? That is Open Connect. When production houses and studios send video to Netflix, it converts the video into the formats best suited to each device. This process is called transcoding, or encoding. Why is it required? A source movie file can be around 50 GB; streaming such a big file to every customer would be difficult, consuming a lot of bandwidth and storage. Netflix supports around 2,200 different devices, each with its own supported video formats, so the source video must be converted to different resolutions and formats to support them all.
Hence, Netflix creates optimized files for different network speeds and clients. Steps involved in transcoding:
- Netflix spends a lot of time validating the video, looking for missing frames, color changes and other defects; if any are found, the file is rejected.
- Once validation is done, the file is fed into the media pipeline, a series of steps that makes it ready for streaming. The file is broken into multiple chunks, and the individual encoding tasks are put in a queue. The chunks are encoded in parallel, meaning they are processed at the same time, using many EC2 servers to achieve the parallelism. Once encoded, the chunks are validated and uploaded to Amazon's scalable storage, S3.
- As a result, a lot of files are created to support every internet-connected device. For The Crown, Netflix stores around 1,200 files. In fact, it took around 190,000 CPU hours to encode just one season.
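The chunk-and-encode-in-parallel flow above can be sketched as follows. The chunk length, profile names and the stub `encode_chunk` are illustrative assumptions, not Netflix's actual pipeline.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_SECONDS = 30  # assumed chunk length; the real value differs

def split_into_chunks(duration_s, chunk_s=CHUNK_SECONDS):
    """Break a source video into (start, end) second ranges."""
    return [(t, min(t + chunk_s, duration_s))
            for t in range(0, duration_s, chunk_s)]

def encode_chunk(chunk, profile):
    """Stand-in for the real per-chunk encode job run on an EC2 worker."""
    start, end = chunk
    return f"{profile}:{start}-{end}.mp4"

def encode_title(duration_s, profiles):
    """Encode every chunk for every device/bitrate profile in parallel."""
    chunks = split_into_chunks(duration_s)
    jobs = [(c, p) for p in profiles for c in chunks]  # the work queue
    with ThreadPoolExecutor(max_workers=8) as pool:    # parallel workers
        return list(pool.map(lambda j: encode_chunk(*j), jobs))

files = encode_title(95, ["1080p-high", "720p-mid", "480p-low"])
print(len(files))  # 4 chunks x 3 profiles = 12 encoded files
```

Multiplying a handful of chunks by dozens of device profiles is what turns one source file into the ~1,200 files mentioned above.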
So after all these files are created, how does Netflix play the video? Here enters Open Connect.
Open Connect is a global content delivery network (CDN) responsible for storing and delivering Netflix TV shows and movies to subscribers worldwide. The idea behind a CDN is simple: achieve low latency, another important goal in system design. Imagine you are watching a video being streamed from New York; the stream must pass through many networks, which can make it slow and unreliable. The simple solution is to bring the content as close to the viewer as possible. Each location with computers storing video content is called a PoP, or point of presence: a physical location that provides access to the internet and houses servers, routers, and other telecommunications equipment.
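The "bring content close" idea boils down to steering each viewer to the PoP with the lowest latency. A minimal sketch, with made-up PoP names and latencies:

```python
def rank_pops(latency_ms):
    """Order PoPs by measured round-trip latency, best first,
    so the client has fallbacks if the nearest PoP is unavailable."""
    return sorted(latency_ms, key=latency_ms.get)

# Hypothetical round-trip times from one viewer to three PoPs.
measured = {"new-york": 95.0, "frankfurt": 12.0, "mumbai": 140.0}
print(rank_pops(measured)[0])  # frankfurt
```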
But why build its own CDN rather than use a third-party one, which could have saved a lot of time?
- Netflix initially did use third-party CDNs, since their pricing was low around 2009. This bought it time, which it invested in creating algorithms that adapt to changing network conditions, and in the AWS-hosted services known as the control plane. After a couple of years, it realized its scale justified a dedicated CDN.
- Setting up its own CDN, and knowing where to place it thanks to the viewing data it holds, turned out to be cheaper than buying from third parties. This way, Netflix controls the whole video path: transcoding, CDN and client.
Open Connect servers are called OCAs (Open Connect Appliances), computer systems developed by Netflix. So when you press play, you are watching the video from one of the nearest OCAs.
Where does Netflix put the OCAs? Streaming services like Amazon Prime and YouTube took an expensive approach, building their own global networks for delivering content.
In order to localize Netflix viewing traffic to the customer's own network, Netflix instead partnered with ISPs and IXPs around the world, deploying its OCAs for free inside their networks. It was a genius move: Netflix got the benefits of data centers without having to build its own.
Through ISPs and IXPs, Netflix came close to its customers and localized the traffic. This reduces costs by relieving internet congestion for ISPs, Netflix members get a high-quality viewing experience, and network performance improves for everyone.
So the video sits in S3; now we want to send it to the client through an OCA. Each OCA is a video cache of what you will most likely want to watch. Netflix uses popularity data to predict which videos members in each location will probably want to watch tomorrow, and copies those videos to one or more OCAs at each location; this is called prepositioning. Netflix does not copy its entire catalog to every OCA in the world, because the catalog is too large to store; it caches video by predicting what you will want to watch. From the list of 10 OCA servers returned by the Playback Apps service, the client tests the quality of the network connection to each and selects the fastest, most reliable OCA to request video files from for streaming. The Playback Apps, Steering and Cache Control services run entirely in the AWS cloud on a microservices architecture.
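Prepositioning can be sketched as a simple popularity model: count what each region watched, and cache the top titles there overnight. The log format and cache size here are illustrative assumptions; Netflix's real predictor is far more sophisticated.

```python
from collections import Counter

def preposition(view_log, cache_size):
    """Given (region, title) view events, choose which titles the
    OCAs in each region should cache before tomorrow's peak."""
    per_region = {}
    for region, title in view_log:
        per_region.setdefault(region, Counter())[title] += 1
    return {region: [title for title, _ in counts.most_common(cache_size)]
            for region, counts in per_region.items()}

log = [("us", "The Crown"), ("us", "The Crown"), ("us", "Dark"),
       ("de", "Dark"), ("de", "Dark"), ("de", "The Crown")]
print(preposition(log, cache_size=1))  # {'us': ['The Crown'], 'de': ['Dark']}
```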
- Netflix uses Amazon's Elastic Load Balancer service to route traffic to the different front-end services. It implements a two-tier load-balancing scheme: first the load is balanced across availability zones (tier 1), then within each zone an array of load-balancer instances (tier 2) does round-robin balancing over the service instances.
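The two-tier scheme can be modeled as two nested round-robins, one over zones and one over the instances inside the chosen zone. This is an illustrative toy, not ELB's actual algorithm, and the zone and instance names are made up.

```python
import itertools

class TwoTierBalancer:
    """Tier 1 spreads requests across zones; tier 2 round-robins
    over the instances within the chosen zone."""
    def __init__(self, zones):
        # zones: {"zone-a": ["a1", "a2"], "zone-b": ["b1"], ...}
        self._zone_cycle = itertools.cycle(sorted(zones))        # tier 1
        self._instance_cycles = {zone: itertools.cycle(insts)   # tier 2
                                 for zone, insts in zones.items()}
    def route(self):
        zone = next(self._zone_cycle)
        return next(self._instance_cycles[zone])

lb = TwoTierBalancer({"zone-a": ["a1", "a2"], "zone-b": ["b1"]})
print([lb.route() for _ in range(4)])  # ['a1', 'b1', 'a2', 'b1']
```

Note how the smaller zone's single instance absorbs every second request; real schemes weight zones by capacity to avoid exactly this.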
- The load balancer communicates with Zuul, a layer-7 API gateway service designed by the Netflix team. This component can be deployed on multiple AWS EC2 instances across different regions to increase the availability of the Netflix service. For a detailed approach, here is the link.
- Now, what happens if one or more services go down because of a network failure, a timeout, or an exception thrown inside that service? Since service calls span multiple layers, it is common in a distributed system for a remote call to fail, and such a failure may cascade up the layers until it reaches the user. To handle failures gracefully, we need a mechanism that falls back to another service call or a default response, so the cascade stops and the user never experiences a system failure.
To prevent such cascading failures, Netflix designed a library called Hystrix. It implements the Circuit Breaker design pattern: if calls to a method fail and the failures build up past a threshold, Hystrix opens the circuit so that subsequent calls fail fast instead of piling up. Hystrix also collects metrics on latency and on how the microservices are performing, and surfaces them in a dashboard.
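The circuit-breaker idea can be shown in a few lines. This is a minimal sketch in the spirit of Hystrix, not its actual Java API; the threshold, timeout and fallback values are arbitrary.

```python
import time

class CircuitBreaker:
    """Open the circuit after `threshold` consecutive failures;
    while open, return the fallback immediately (fail fast) until
    `reset_after` seconds pass, then allow one trial call."""
    def __init__(self, threshold=3, reset_after=30.0, fallback=lambda: "fallback"):
        self.threshold, self.reset_after, self.fallback = threshold, reset_after, fallback
        self.failures, self.opened_at = 0, None
    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return self.fallback()                 # circuit open: no remote call
            self.opened_at, self.failures = None, 0    # half-open: try again
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()      # trip the breaker
            return self.fallback()
        self.failures = 0                              # success resets the count
        return result

def flaky():
    raise TimeoutError("remote service timed out")

cb = CircuitBreaker(threshold=2, fallback=lambda: "cached default")
print([cb.call(flaky) for _ in range(3)])  # ['cached default', 'cached default', 'cached default']
```

After the second failure the breaker opens, so the third call never touches the remote service; that is what stops a slow dependency from exhausting the caller's threads.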
Netflix uses a microservices architecture to power all the APIs needed by its applications and web apps. Critical microservices are designed as stateless services to provide resiliency. Each microservice can have its own data store and in-memory caches of recent results. EVCache is the primary choice for caching microservice results at Netflix.
There are certain APIs whose data can be cached, relieving pressure on the origin servers, and others where the data must always come fresh from the server. Netflix handles the cacheable case with EVCache, a wrapper around memcached. Netflix deploys many clusters on a large number of EC2 instances, each cluster containing many memcached nodes. Whenever a write happens, every available cluster is updated; a read is served from the nearest cluster, achieving a distributed cache.
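The write-to-all, read-from-nearest pattern can be sketched with plain dictionaries standing in for memcached clusters. This is a toy model of the pattern described above, not the real EVCache client, and the cluster names and key are invented.

```python
class ReplicatedCacheClient:
    """Writes fan out to every replica cluster; reads try the
    nearest clusters first and fall back to farther ones."""
    def __init__(self, cluster_names):
        self.clusters = {name: {} for name in cluster_names}
    def set(self, key, value):
        for store in self.clusters.values():   # update every cluster
            store[key] = value
    def get(self, key, nearest_first):
        for name in nearest_first:             # read from the nearest copy
            if key in self.clusters[name]:
                return self.clusters[name][key]
        return None                            # cache miss: go to the origin

cache = ReplicatedCacheClient(["us-east", "us-west", "eu-west"])
cache.set("user:42:homepage", ["Dark", "The Crown"])
print(cache.get("user:42:homepage", nearest_first=["eu-west", "us-east"]))
```

Writing everywhere makes writes expensive but keeps reads local, which suits read-heavy workloads like rendering a homepage.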
When migrating its infrastructure to the AWS cloud, Netflix made use of different data stores, both SQL and NoSQL, for different purposes.
Netflix Data Stores deployed on AWS
- MySQL databases are used for movie title management and transactional/billing purposes.
- Hadoop is used for big data processing based on user logs.
- Elasticsearch powers title search in the Netflix apps.
- Cassandra is a distributed column-based NoSQL data store to handle large amounts of read requests with no single point of failure.
When you browse for series or other content in Netflix, you must have noticed there is an image displayed for each video. This is the header image, chosen to draw a person's attention to the video. Every person has different tastes and preferences, and Netflix knows this too. Based on the data it has gathered, Netflix learns which kinds of movies, genres and actors you like best, and personalizes the header image accordingly. Netflix implements collaborative filtering and content-based filtering techniques to build its recommendation engine, which is highly driven by data and many other metrics.
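A bare-bones user-user collaborative filter gives the flavor of the technique: find the user whose ratings look most like yours, and recommend what they watched that you have not. The ratings, titles and cosine-similarity choice are illustrative; Netflix's real engine blends many models and signals.

```python
import math

def cosine(a, b):
    """Cosine similarity between two {title: rating} dicts."""
    common = set(a) & set(b)
    num = sum(a[t] * b[t] for t in common)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def recommend(target, others):
    """Recommend unseen titles from the most similar user's history."""
    best = max(others, key=lambda user: cosine(target, user))
    return sorted(title for title in best if title not in target)

me = {"Dark": 5, "The Crown": 4}
others = [{"Dark": 5, "Mindhunter": 5}, {"Bridgerton": 4}]
print(recommend(me, others))  # ['Mindhunter']
```

Content-based filtering would instead compare titles by their attributes (genre, cast) to what you already rated highly; production systems combine both.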
In the discussion above, we have seen how Netflix built an architecture that meets every design goal: high availability and low latency through Open Connect and EVCache, scalability through horizontal scaling of EC2 instances via the AWS auto-scaling service, and resilience through the Hystrix library. I hope this has given a simple, high-level understanding of the design behind the scenes at Netflix.
For a deeper understanding of each component, I would recommend the references below: