How Big Companies Like Netflix Store Their Data. What Is Big Data?

Parth Patel
6 min read · Sep 17, 2020

“It’s amazing how much data is out there. The question is how do we put it in a form that’s usable?”

-Bill Ford

Data are characteristics or information, usually numerical, that are collected through observation. In a more technical sense, data are a set of values of qualitative or quantitative variables about one or more persons or objects, while a datum is a single value of a single variable. In everyday terms, data includes things like files, audio, video, and messages.

What is Big Data?

Big data is data that contains greater variety, arriving in increasing volumes and with ever-higher velocity. These are known as the three Vs: variety, volume, and velocity.

Put simply, big data is larger, more complex data sets, especially from new data sources. These data sets are so voluminous that traditional data processing software just can’t manage them. But these massive volumes of data can be used to address business problems you wouldn’t have been able to tackle before.

What is Hadoop?

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.

The core of Apache Hadoop consists of a storage part, known as Hadoop Distributed File System (HDFS), and a processing part which is a MapReduce programming model. Hadoop splits files into large blocks and distributes them across nodes in a cluster. It then transfers packaged code into nodes to process the data in parallel. This approach takes advantage of data locality, where nodes manipulate the data they have access to. This allows the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking.
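The map and reduce phases described above can be mimicked with a tiny, single-machine Python sketch. A real Hadoop job would be written against the Hadoop MapReduce API (typically in Java) and run across a cluster; this toy word count only simulates the map, shuffle, and reduce steps in memory:

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split.
    for word in document.split():
        yield word.lower(), 1

def shuffle(mapped_pairs):
    # Shuffle: group all emitted values by key, as Hadoop does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: sum the counts for each word.
    return key, sum(values)

documents = ["big data is big", "data is everywhere"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

In a real cluster, each `map_phase` call would run on the node that already holds that input split (data locality), and the shuffle would move data over the network only between the two phases.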

The base Apache Hadoop framework is composed of the following modules:

  • Hadoop Common — contains libraries and utilities needed by other Hadoop modules;
  • Hadoop Distributed File System (HDFS) — a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster;
  • Hadoop YARN — (introduced in 2012) a platform responsible for managing computing resources in clusters and using them for scheduling users’ applications;
  • Hadoop MapReduce — an implementation of the MapReduce programming model for large-scale data processing.

Distributed Storage

A distributed storage system is infrastructure that can split data across multiple physical servers, and often across more than one data center. It typically takes the form of a cluster of storage units, with a mechanism for data synchronization and coordination between cluster nodes. A distributed object store is made up of many individual object stores, each normally consisting of one or a small number of physical disks. These object stores run on commodity server hardware, which might be the compute nodes or might be separate servers configured solely for providing storage services. As such, the hardware is relatively inexpensive.

The disk of each virtual machine is broken up into a large number of small segments, typically a few megabytes each, and each segment is stored several times (often three) on different object stores. Each copy of a segment is called a replica.

The system is designed to tolerate failure. Because relatively inexpensive hardware is used, failure of individual object stores is comparatively frequent; indeed, with enough object stores, failure becomes inevitable. However, since every replica would have to become unavailable for data to be lost, the failure of an individual object store is not an ‘emergency event’ requiring a call-out of storage engineers, but something handled through routine maintenance. Performance does not noticeably degrade, and the under-replicated data is gradually and automatically re-replicated from the remaining replicas. There is no ‘re-silvering’ operation to perform when the defective object store is replaced, as there would be with a replacement RAID disk.
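The replica scheme above can be sketched in a few lines of Python. The function names and the three-way replication factor here are illustrative assumptions, not any real storage system's API:

```python
import random

REPLICATION_FACTOR = 3  # each segment is stored on three different object stores

def place_replicas(segments, object_stores, k=REPLICATION_FACTOR):
    # Assign each segment to k *distinct* object stores,
    # so no store holds two replicas of the same segment.
    placement = {}
    for segment in segments:
        placement[segment] = random.sample(object_stores, k)
    return placement

def readable(segment, placement, failed_stores):
    # A segment is lost only if *every* one of its replicas
    # sits on a failed store.
    return any(store not in failed_stores for store in placement[segment])

stores = [f"store-{i}" for i in range(10)]
placement = place_replicas(["seg-0", "seg-1", "seg-2"], stores)

# Even with one store down, every segment still has surviving replicas,
# which is why a single failure is routine maintenance, not an emergency.
assert all(readable(s, placement, {"store-4"}) for s in placement)
```

A real system would also re-replicate the segments that lost a copy back up to three replicas in the background, as the paragraph above describes.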

Distributed storage systems can store several types of data:

  • Files — a distributed file system allows devices to mount a virtual drive, with the actual files distributed across several machines.
  • Block storage — a block storage system stores data in volumes known as blocks. This is an alternative to a file-based structure and can provide higher performance. A common distributed block storage system is a Storage Area Network (SAN).
  • Objects — a distributed object storage system wraps data into objects, identified by a unique ID or hash.
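The last bullet, objects addressed by a unique ID or hash, can be illustrated with Python's `hashlib`, using a content hash as the object key. This is a simplified, in-memory stand-in for a real object store, not how any particular product is implemented:

```python
import hashlib

class ToyObjectStore:
    """In-memory sketch of a content-addressed object store."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        # The object's ID is the SHA-256 hash of its contents, so
        # identical data always maps to the same ID (deduplication).
        object_id = hashlib.sha256(data).hexdigest()
        self._objects[object_id] = data
        return object_id

    def get(self, object_id: str) -> bytes:
        return self._objects[object_id]

store = ToyObjectStore()
oid = store.put(b"hello big data")
assert store.get(oid) == b"hello big data"
assert store.put(b"hello big data") == oid  # same content, same ID
```

In a distributed deployment, the ID would also determine which cluster nodes hold the object's replicas.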

Distributed storage systems have several advantages:

  • Scalability — the primary motivation for distributing storage is to scale horizontally, adding more storage space by adding more storage nodes to the cluster.
  • Redundancy — distributed storage systems can store more than one copy of the same data, for high availability, backup, and disaster recovery purposes.
  • Cost — distributed storage makes it possible to use cheaper, commodity hardware to store large volumes of data at low cost.
  • Performance — distributed storage can offer better performance than a single server in some scenarios, for example, it can store data closer to its consumers, or enable massively parallel access to large files.

Netflix :

Netflix has over 100 million subscribers, and that number grows daily. With those subscribers comes a wealth of data that can be analyzed to improve the user experience.

An estimated 75% to 85% of the content we watch is driven by the recommendation system.

Signals that feed Netflix’s recommendations include:

  1. Ratings given to content
  2. The device used
  3. The number of searches, and many more
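Netflix's actual recommendation models are proprietary, but the general idea of combining signals like the ones above can be sketched with a toy weighted score. The signal names and weights below are made up purely for illustration:

```python
# Hypothetical signal weights -- purely illustrative, not Netflix's model.
WEIGHTS = {"rating": 0.5, "device_match": 0.2, "search_hits": 0.3}

def score(title_signals):
    # Combine normalized signals (0.0 to 1.0) into one relevance score.
    return sum(WEIGHTS[name] * value for name, value in title_signals.items())

candidates = {
    "Show A": {"rating": 0.9, "device_match": 1.0, "search_hits": 0.4},
    "Show B": {"rating": 0.6, "device_match": 1.0, "search_hits": 0.5},
}

# Rank candidate titles by their combined score, highest first.
ranked = sorted(candidates, key=lambda t: score(candidates[t]), reverse=True)
print(ranked)  # ['Show A', 'Show B']
```

Real systems use far richer models (collaborative filtering, learned rankers) trained on the massive data sets the rest of this article describes, but the input signals are of the same kind.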

How Does Netflix Manage This Big Data?

Netflix uses data processing software and traditional business intelligence tools such as Hadoop and Teradata, as well as its own open-source solutions such as Lipstick and Genie, to gather, store, and process massive amounts of information. These platforms influence its decisions on what content to create and promote to viewers.

Netflix doesn’t use a traditional data-center-based Hadoop data warehouse. To store and process a rapidly growing data set, it warehouses its data in Amazon S3, which lets it spin up multiple Hadoop clusters for different workloads, all accessing the same data. Within the Hadoop ecosystem, it uses Hive for ad hoc queries and analytics, and Pig for ETL (extract, transform, load) and algorithms.

It then created its own Genie project to help handle increasingly massive data volumes as it scales. All this points to one thing: Netflix is very particular about having a lot of data and being able to process this data to ensure it understands exactly what its users want.

The result has been nothing short of amazing. Netflix has been able to ensure a high engagement rate with its original content, such that 90 percent of Netflix users have engaged with its original content.

Netflix’s big data approach to content is so successful that, compared to the TV industry, where just 35 percent of shows are renewed past their first season, Netflix renews 93 percent of its original series.

Conclusion:

Without getting bogged down in technicalities, Netflix is clearly a great example of the power of big data. While you might not have the resources to build something like Netflix’s Genie project, the big data industry is evolving rapidly, and many open-source tools exist to help you collect and process the data you need to understand exactly what your users want.


Parth Patel

ARTH Learner | Bigdata — Hadoop | Linux | Front-End Developer | Flask | Coding Enthusiast | Python | AWS | Ansible | Kubernetes