How YouTube and Netflix Store, Manage and Analyze Their Data


By Soledad O.

An estimated 300 hours of video content are uploaded to YouTube from various sources every minute, and an estimated 5 billion videos are watched every day. On Netflix, an average of 33.3 million subscribers stream an average of 2 hours of movies and/or TV shows a day. From a data analyst’s point of view, the data being exchanged seems limitless. So where do these top media sites store and retrieve their content? How do they know what to recommend to a specific user? How do they know what you are watching? The answer lies in the complex and diverse ways each company manages its databases.

A database is a collection of stored data, managed and organized for fast and efficient retrieval. Media and content industries use database methods and data mining tools to optimize their services for sharing and streaming various data types such as images, audio files and videos. Their success lies in the software they incorporate to support their databases, although each company and each piece of software faces its own challenges. This can be seen in two of the most popular media sites on the web: Netflix and YouTube.

Netflix was previously known for providing DVDs on demand, either by mail or through a self-checkout machine. More recently, the company has shifted to online streaming, offering movies and TV shows on demand. It currently holds a little over 8,000 movies and TV shows in its catalog. Users can simply click on any title and stream it instantly. However, a great deal of work lies behind this simple click. The company currently uses a mix of Amazon Web Services (AWS) and its own CDN (content distribution network) called Open Connect, managing the equivalent of “tens of thousands of servers” in the cloud. The actual video content, however, is stored on servers located at Internet exchange points and inside Internet service providers’ networks (Brodkin, 2016). Customized storage hardware is optimized for streaming video. There are two types of servers: hard-drive-based servers, which store about 100 TB of data each, and flash-based servers, which are used less commonly. AWS and the CDN together provide the applications that stream the content and collect the data flowing in and out of these locations. Essentially, the applications decide which server is closest to your point of interaction (TV, laptop, phone, etc.) and then stream from that location. This setup is relatively recent: Netflix used to rely on Oracle and SQL databases as its base, which left it with a single point of service, a huge risk. The move to AWS was completed only recently, and fully migrating to and integrating an all-new system took 7 years, which illustrates one of the challenges of handling such a large amount of data. The new setup gives Netflix more security in case part of the cloud were to fail, and lets it reach a larger number of people worldwide with almost equivalent quality. Netflix also uses the NoSQL database Cassandra to store customer data, which it analyzes to determine which shows would be best to license and to personalize the user experience.
Customer data includes how many views certain shows receive, which categories are most popular, which shows can be recommended to an individual user and even what is popular in certain regions. Data visualization (DataViz) is a commonly used analytical approach that displays data in graphs for analysis (O’Neill, 2014). By analyzing this data, analysts know what types of shows and movies they should license and stream. Both the company and the customer benefit.
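
As a rough illustration of the kinds of tallies described above (views per show, views per category, and what is popular in each region), here is a minimal Python sketch. The event records, field layout and function names are all hypothetical and made up for this example; nothing here reflects Netflix’s actual schema or tooling.

```python
from collections import Counter, defaultdict

# Hypothetical viewing events: (user_id, title, category, region).
# These records are invented for illustration only.
VIEW_EVENTS = [
    ("u1", "Show A", "Drama", "US"),
    ("u2", "Show A", "Drama", "US"),
    ("u3", "Show B", "Comedy", "BR"),
    ("u1", "Show C", "Drama", "BR"),
    ("u4", "Show B", "Comedy", "BR"),
    ("u2", "Show B", "Comedy", "US"),
]


def views_per_title(events):
    """Count how many views each title received."""
    return Counter(title for _, title, _, _ in events)


def views_per_category(events):
    """Count how many views each category received."""
    return Counter(category for _, _, category, _ in events)


def top_title_by_region(events):
    """Return the most-viewed title in each region."""
    per_region = defaultdict(Counter)
    for _, title, _, region in events:
        per_region[region][title] += 1
    return {
        region: counts.most_common(1)[0][0]
        for region, counts in per_region.items()
    }


if __name__ == "__main__":
    print(views_per_title(VIEW_EVENTS))
    print(views_per_category(VIEW_EVENTS))
    print(top_title_by_region(VIEW_EVENTS))
```

Counts like these are the sort of figures an analyst might then chart to compare shows, categories or regions.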

YouTube has been around since 2005 and has been a central source for video sharing ever since. Any individual can upload their own content simply by creating an account, and media corporations can share their material as well. With so many upload sources and over 800 million hours of video content, YouTube relies heavily on MySQL and various data management systems to keep the site up and running. However, it all begins at the Google Modular Data Center, where Google stores some of its servers and, consequently, much of its data. YouTube has been owned by Google since 2006, so it is well integrated with Google’s other services. YouTube uses a total of 5 or 6 data centers along with its own CDN. The company is well aware that one of the main risks is a power outage or a natural disaster, so the data centers are equipped with emergency generators for backup. Beyond that, cloud storage is their only other fallback. Unlike Netflix, YouTube can serve a video from any data center, not necessarily the one closest to the viewer’s region (High Scalability, 2008). Much of the hardware is customized, built to store and serve high-quality videos. In conjunction with this hardware, MySQL handles most of the site’s data storage functions, from videos to user information. The varbinary data type allows it to store both video content and images (such as thumbnails). However, MySQL has been a challenge, as it offers little in the way of scalability for an ever-growing company, but YouTube is not ready to let go of it so easily. Instead, Vitess assists MySQL with the management duties it cannot handle on its own (Jackson, 2012). An earlier version of it, called Vtocc, would parse incoming queries and consolidate them into smaller batches, making them easier to handle and execute.
Vitess does this as well, and also handles replication for backups and automatically scales across servers without manual intervention. Having all their data centered in one place gives YouTube an even greater advantage when working on research projects. One of these projects is YouTube-8M, a large public database of over 500,000 hours of video used to analyze and improve search queries across the web (Weiss, 2016). Another use of their database is to collect information for advertising. Google collects the user’s browser and search history, using algorithms to extract information such as what products or services the user is interested in, what region they live in, and even what time they search for these things. Companies can then pay for advertising on YouTube, essentially buying space for a commercial or a simple advertisement image through AdWords and AdSense, which keep track of clicks on the advertisement. For example, if a user searches for shoes in their browser, companies that have an advertising deal with YouTube can review their reports and have a chance to put their advertisement on that user’s screen when the user logs into YouTube (Handley, 2017). The data collected and analyzed gives the user a satisfying experience with plenty of content to search through, the advertisers a widely used platform for their products, and the company international recognition.
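
Returning to the query handling described earlier in this section, the following Python sketch shows one possible reading of “consolidating” incoming queries: identical queries that arrive together are grouped, each distinct query is executed once, and the result is shared with every request that asked for it. This is an assumption-laden illustration, not Vitess’s or Vtocc’s actual implementation (which is written in Go); the function names, the whitespace-only normalization and the videos table in the usage example are all hypothetical.

```python
def normalize(query):
    """Collapse runs of whitespace so trivially different queries compare equal.

    This is a deliberate simplification; a real system would parse the SQL.
    """
    return " ".join(query.split())


def consolidate(queries):
    """Group identical queries.

    Returns a dict mapping each normalized query to the list of positions
    (indices into `queries`) that requested it, in arrival order.
    """
    groups = {}
    for index, query in enumerate(queries):
        groups.setdefault(normalize(query), []).append(index)
    return groups


def run_consolidated(queries, execute):
    """Execute each distinct query once and fan the result out to every request."""
    results = [None] * len(queries)
    for query, indices in consolidate(queries).items():
        result = execute(query)  # one database round trip per distinct query
        for i in indices:
            results[i] = result
    return results


if __name__ == "__main__":
    executed = []

    def fake_execute(query):
        # Stand-in for a real database call; records what was actually run.
        executed.append(query)
        return f"rows for: {query}"

    incoming = [
        "SELECT title FROM videos WHERE id = 1",
        "SELECT  title FROM videos WHERE id = 1 ",
        "SELECT title FROM videos WHERE id = 2",
    ]
    answers = run_consolidated(incoming, fake_execute)
    print(len(incoming), "requests,", len(executed), "queries executed")
```

The design choice in this sketch is to trade a small amount of bookkeeping for fewer round trips to the database when many users request the same data at once.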

Databases and data mining tools have a huge impact on the media and content industries, for both the companies and their users. These industries take advantage of being able to manage and store their content efficiently, and they apply the same tools to financial and analytical purposes such as advertising and licensing, as seen with YouTube and Netflix respectively. Both the ways the databases are managed and the tools used have advantages and disadvantages, and some of the risks they create could be irreversible. Nonetheless, databases offer the industry a competitive edge that will keep you watching and listening.

References

Brodkin, J. (2016, February 11). Netflix finishes its massive migration to the Amazon cloud. Retrieved February 12, 2017, from https://arstechnica.com/information-technology/2016/02/netflix-finishes-its-massive-migration-to-the-amazon-cloud/
Handley, L. (2017, January 24). What you do on Google can now determine the ads you’re served on YouTube. Retrieved February 12, 2017, from http://www.cnbc.com/2017/01/24/what-you-do-on-google-can-now-determine-the-ads-you-see-on-youtube.html
Jackson, J. (2012, December 14). YouTube scales MySQL with Go code. Retrieved February 12, 2017, from http://www.computerworld.com/article/2493815/database-administration/youtube-scales-mysql-with-go-code.html
O’Neill, S. (2014, February 10). Big Data Is Nothing If Not Visual. Retrieved February 12, 2017, from http://www.informationweek.com/big-data/big-data-analytics/big-data-is-nothing-if-not-visual/d/d-id/1113749
Weiss, G. (2016, September 29). Google Unveils YouTube-8M, A Research Database of 8 Million Videos to Help Improve Search. Retrieved February 12, 2017, from http://www.tubefilter.com/2016/09/29/google-youtube-8m-labeled-research-database/
YouTube Architecture. (2008, March 08). Retrieved February 12, 2017, from http://highscalability.com/youtube-architecture