Managing Big Data


The exponential growth of digital data has become increasingly difficult to manage.  Big data refers to collections of data so large, complex, and variable that traditional data management techniques cannot handle them reliably.  The average person today generates large amounts of data daily through mobile devices, computers, and Internet activity, and all of these bits and details add up to a significant amount.  Big data challenges are commonly characterized by three traits: volume, velocity, and variety.  Volume is big data's greatest challenge, because even when companies can store vast amounts of data, they often cannot process it into meaningful information due to its sheer size.  Velocity is the speed at which data flows, and some organizations' servers cannot handle the increasing demand.  In addition, large amounts of unstructured data such as photos, audio, and video now flood in from social networking outlets like Facebook, Twitter, and YouTube, streamed in real time to billions of users, and the proliferation of mobile devices such as cell phones and tablets over the past decade has further increased demand for data services.  Lastly, the immense variety of data types makes organizing and interpreting such data cumbersome.  Cisco forecasts that by 2017, annual global data center IP traffic will reach 7.7 zettabytes, or roughly 7.7 billion terabytes.


One of the main problems with big data is that many organizations simply cannot capture and process their data into useful information because of its complexity or volume, and these missed opportunities to analyze potentially valuable data can be costly.  A report by the McKinsey Global Institute (MGI) and McKinsey & Company states, "If US healthcare were to use big data creatively and effectively to drive efficiency and quality, the sector could create more than $300 billion in value every year."  In 2013, big data vendors as a whole generated nearly $18.6 billion in revenue from related services, hardware, and software, and Wikibon estimates that this revenue will reach nearly $50 billion by 2017.  Palantir, a pure-play big data vendor, grew from $191 million in revenue in 2012 to $418 million in 2013.  Amazon Web Services (AWS) offers its own big data management services, allowing companies to store and process their data in the cloud; prominent customers include NASA/JPL, GE, Reddit, and Netflix, and in 2013 AWS generated an estimated $3.8 billion in revenue.  As big data management solutions from providers like AWS become more popular, more businesses will be able to benefit from collecting and processing big data.


Hadoop, developed by the Apache Software Foundation, is an open-source, scalable, cluster-based software framework for processing data. Hadoop builds on technologies originally developed by Google for indexing search and structural information. Two of Hadoop's key modules are MapReduce and the Hadoop Distributed File System (HDFS). HDFS manages data by replicating it across the nodes of a cluster, and MapReduce coordinates computations running on many nodes at once. By spreading the computation across many nodes, data can be processed and retrieved much faster, and because the data is replicated on multiple nodes, a high level of robustness and reliability can be achieved. Hadoop can also detect and compensate for hardware failures on individual nodes. Additionally, Hadoop's software allows nodes to check each other's data for inconsistencies and errors to maintain data integrity.
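
To make the division of labor concrete, the sketch below follows the canonical word-count example from the Apache Hadoop tutorial, written in Java against the org.apache.hadoop.mapreduce API: the mapper emits the pair (word, 1) for each word it reads from its share of the input, and the reducer sums those counts per word after Hadoop shuffles the intermediate pairs between nodes. The class names (WordCount, TokenizerMapper, IntSumReducer) are illustrative, not part of the Hadoop API.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: runs in parallel on each input split, ideally on the
      // node that already holds that block of data in HDFS.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);   // emit (word, 1)
          }
        }
      }

      // Reduce phase: receives all counts for a given word, regardless of
      // which node produced them, and sums them.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);   // emit (word, total)
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // pre-sum locally to cut network traffic
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Packaged into a jar, a job like this would typically be launched with hadoop jar wordcount.jar WordCount /input /output, where both paths refer to directories in HDFS; HDFS supplies the replicated input blocks, and MapReduce schedules each map task close to wherever those blocks physically live.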
Today, companies like LinkedIn, Twitter, Facebook, eBay, and Yahoo! all use Hadoop to manage their big data and to provide their users with content derived from it. In 2011, Facebook needed to migrate its 30 petabytes of data (over 30,000 terabytes) to a larger data center in Prineville, Oregon. By using Hadoop's HDFS and MapReduce, this immense migration was completed quickly and without significant downtime or data loss.

Companies and organizations should do their best to manage and interpret big data. By processing it, they can take advantage of price optimization, staffing analysis, historical data analysis, and much more. Using technologies such as Hadoop to process the flood of multi-structured data can help reduce the time wasted on processing that data with traditional database methods.


Sources

——————————————————————
Kelly, J. “Big Data Vendor Revenue And Market Forecast 2013-2017.” Wikibon. N.p., 11 Feb. 2014. Web. 12 Feb. 2014. http://wikibon.org/wiki/v/Big_Data_Vendor_Revenue_and_Market_Forecast_2013-2017.
Laney, Doug. "Deja VVVu: Others Claiming Gartner's Volume-Velocity-Variety Construct for Big Data." Web log post. Gartner. N.p., 14 Jan. 2012. Web. 11 Feb. 2014. http://blogs.gartner.com/doug-laney/deja-vvvue-others-claiming-gartners-volume-velocity-variety-construct-for-big-data/.
Big Data: The Next Frontier for Innovation, Competition, and Productivity. Rep. MGI, 11 May 2011. Web. 11 Feb. 2014. http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation.
McKendrick, Joe. “Hadoop Enters the Enterprise Mainstream, and Big Data Will Never Be the Same.” Database Trends and Applications 26.1 (2012): 4-8. ABI/INFORM. Web. 11 Feb. 2014.