Hadoop

Managing Big Data

by Alex L

The exponential growth of computer-generated data has become increasingly difficult to manage. Big data refers to collections of data so large and complex that traditional data management techniques cannot handle them consistently, due in part to unpredictable variation in the data. The average person today generates large amounts of data daily through mobile devices, computers, and Internet activity, and all of those bits and details add up to a significant amount.

Big data challenges are characterized by volume, velocity, and variety. Volume is big data's greatest challenge: even when companies can store vast amounts of data, they often cannot process it into meaningful information because of its sheer size. Velocity is the speed at which the data flows, and some organizations' servers cannot handle the increasing demand. Large amounts of unstructured data such as photos, audio, and video have also begun to flood in from social networking outlets like Facebook, Twitter, and YouTube, constantly streamed in real time to billions of users, while the proliferation of mobile products such as cell phones and tablets over the past decade has further increased the demand for data services. Lastly, the immense variety of data types makes organizing and interpreting such data cumbersome. Cisco forecasts that by 2017, annual global data center IP traffic will reach 7.7 zettabytes, or 8.26 billion terabytes.

Hadoop or EDW

by Brian B
The article I picked this week is called "Big Data Debate: End Near for Data Warehousing?" by Doug Henschen. The article starts off by giving some background on the EDW (Enterprise Data Warehouse). It says that while the technology behind the EDW is time-tested and thoroughly developed, it remains rigid and inflexible when you have to go back and make changes to your data model. This is often a very time-consuming process that costs a lot of money to plan and implement, if you ever actually finish modeling and developing the system. It then talks about Hadoop, "which lets you store data on a massive scale at low cost (compared with similarly scaled commercial databases)" (Henschen, 2012). The author says this is an improvement over the traditional EDW because it allows more flexibility when it comes to making changes down the road. The problem is that Hadoop is not as mature as the EDW, so it can be difficult to find people who have an intimate knowledge of the software. The article then opens up into a debate between Ben Werther (pro-Hadoop) and Scott Gnau (pro-EDW). Werther essentially says that the EDW is a dated technology: by the time you push out the model and get everything implemented, you have what amounts to a view of the world from a year or more ago, which may or may not be applicable to your business needs today, wasting your company's time and resources. Gnau's argument boils down to the fact that while Hadoop may be more flexible, it does not give you very good control over the data you have collected. He says that with all of that data being un-modeled, analysts will have trouble viewing and sorting it, which is why the EDW will stick around to make their jobs more manageable.
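Werther's flexibility point is essentially the schema-on-read idea: store raw, un-modeled data now, and impose structure only when you query it. Here is a minimal sketch of what that looks like in a Hadoop mapper, assuming a hypothetical tab-delimited clickstream file; the field layout and class name are my own illustration, not anything from the article.

```java
// Sketch of schema-on-read: raw log lines are stored as-is, and the
// "model" is applied only when a job reads them. Field positions are
// hypothetical assumptions for illustration.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ClickstreamMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text page = new Text();

    @Override
    protected void map(LongWritable offset, Text rawLine, Context context)
            throws IOException, InterruptedException {
        // Interpret the raw line only now, at read time.
        String[] fields = rawLine.toString().split("\t");
        if (fields.length >= 3) {    // skip malformed lines
            page.set(fields[2]);     // assume the third column holds the URL
            context.write(page, ONE);
        }
    }
}
```

If the business question changes next quarter, only this parsing logic changes; the stored data never has to be re-modeled, which is exactly the contrast Werther draws with a warehouse's up-front schema.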

Distributed Data

by CyberChic
In Michael Miller’s column Forward Thinking, he talks about the many new systems that have been emerging to cope with the huge amounts of data generated by Internet companies. He refers to these systems as distributed file systems; because this type of data is different from the typical data used in RDBMSs, these companies have had to develop their own software to handle and mine it, including log data, click data, and Web traffic. Some of the organizations that have created these new databases are the Apache Software Foundation, which created Hadoop in conjunction with the Apache Hive data warehousing tools and the Pig platform; Google, which developed the Google File System and BigTable; and Amazon, which created Dynamo.
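To get a feel for what one of these distributed file systems looks like from a programmer's point of view, here is a minimal sketch against Hadoop's FileSystem API. The cluster address and file path are assumptions for illustration, not details from Miller's column.

```java
// Minimal sketch: write a log record into HDFS and read it back. The
// client never needs to know which machines hold the file's blocks.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsLogExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed cluster address, for illustration only.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
        FileSystem fs = FileSystem.get(conf);

        Path logFile = new Path("/logs/clicks/day1.log"); // hypothetical path
        // Write a record; HDFS replicates the blocks across the cluster.
        try (FSDataOutputStream out = fs.create(logFile, true)) {
            out.writeBytes("2012-01-01T12:00:00\tuser42\t/products/widget\n");
        }

        // Read it back through the same location-transparent interface.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(logFile)))) {
            System.out.println(in.readLine());
        }
    }
}
```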

Trending in IT

by Anthony T
The article focuses on Hadoop and MapReduce and their place in the market. Hadoop is basically software that allows applications to work collaboratively across hundreds or even thousands of independent computers. Another selling point of this software is the fact that it can handle data in the range of petabytes. MapReduce is a framework whose practical use is processing data on clusters or grids; the data being processed can be either unstructured (file system) or structured (database). The author reports that there will be a significant jump in growth for Hadoop and MapReduce, forecasting that growth at 60% in 2016. The idea behind the software and framework is to split big amounts of data and process them across many different nodes, most of the data being unstructured and coming from Internet sites and social networking apps.
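The split-and-process idea the article describes is easiest to see in the canonical word-count example: each mapper handles one split of the input, and the reducers collect the partial counts from every node. This is a minimal sketch of that pattern, not code from the article.

```java
// Canonical word count: mappers emit (word, 1) for their input split;
// the framework groups by word and reducers sum the counts.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each mapper sees only its split of the (possibly huge) input.
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            // Sum the counts for a word, wherever they were produced.
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```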

Hadoop Making the Headlines Once Again

by Asbed P
Since IT executives declared Hadoop ready for enterprise use at the Hadoop World conference in New York about a month ago, big companies have already begun to switch over. JPMorgan Chase, for example, still uses relational database systems for transaction processing, but it has now begun using Hadoop technology for more and more services, including fraud detection, IT risk management, and self-service. Larry Feinsmith, managing director of IT, says, "Hadoop allows us to store data that we never stored before," meaning the vast amount of unstructured data like web logs, transaction data, and social media data. eBay has also jumped on the bandwagon and is using Hadoop along with the HBase database. HBase supports real-time analysis of Hadoop data and is a great companion to bring along for your business. Chase's Feinsmith, however, warns of potential security issues that can arise from using Hadoop, and Hugh Williams of eBay pointed out that "related technology like Hbase, are still somewhat immature," raising questions about overall system stability.
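As a rough illustration of the kind of real-time lookup HBase layers on top of Hadoop data, here is a minimal sketch using the HBase Java client (the newer connection-based API, which postdates this article). The table name, column family, row key, and fraud-score use case are hypothetical, loosely borrowed from the Chase example above.

```java
// Hypothetical sketch: store and fetch a per-transaction fraud score in
// HBase with low latency, rather than waiting for a batch job.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class FraudFlagLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("transactions"))) {

            // Write a fraud score for one transaction (row key is made up).
            Put put = new Put(Bytes.toBytes("txn-0001"));
            put.addColumn(Bytes.toBytes("risk"), Bytes.toBytes("score"),
                          Bytes.toBytes("0.87"));
            table.put(put);

            // Read it back as a point lookup, not a full scan.
            Result result = table.get(new Get(Bytes.toBytes("txn-0001")));
            byte[] score = result.getValue(Bytes.toBytes("risk"),
                                           Bytes.toBytes("score"));
            System.out.println("risk score: " + Bytes.toString(score));
        }
    }
}
```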

Jumping Through Hadoopla’s

by James C
Summary:

Apache Hadoop is a “Java-based software framework for distributed processing of data intensive transformations and analyses.” Basically, the software takes a big processing job, distributes it across many machines, and collects the output into a small, easy-to-understand result. The cost of running a Hadoop Distributed File System (HDFS) is considerably lower than commissioning a comparable relational database counterpart. The benefits of deploying such a framework have been so significant that many companies have adopted the open source platform, companies like Microsoft, IBM, and even the database giant Oracle, just to name a few.
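To see how such a "big processing job" is actually submitted, distributed, and collected, here is a minimal driver sketch that wires up the word-count classes sketched earlier; the input and output paths are assumptions for illustration.

```java
// Minimal job driver: Hadoop splits the input across the cluster, runs
// the mapper and reducer, and writes the collected result to HDFS.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/input"));    // hypothetical
        FileOutputFormat.setOutputPath(job, new Path("/data/output")); // hypothetical
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The framework schedules mappers near the data in /data/input and leaves the small, summarized result in /data/output, which is the distribute-then-collect shape described above.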

MarkLogic and Hadoop

by Asbed P
MarkLogic’s database, named MarkLogic 5, is another database that will use the open source Hadoop programming framework. It was released just a few weeks ago and has a Hadoop connector that allows its users to “aggregate data inside MarkLogic for richer analytics, while maintaining the advantages of MarkLogic indexes for performance and accuracy.” The company describes it as an enterprise-class database that does not use SQL; instead it uses both XML and XQuery, which makes it better suited for certain classes of applications. Its main appeal so far is its ability to manage, index, and handle unstructured information, from text documents to media files. A great use of this database would be, for example, an insurance company that has a great number of documents from which information needs to be pulled and sorted into a database. This combination will allow MarkLogic to pull the information and Hadoop to sort and analyze it.

Microsoft SQL Server 2012

by Chris S

Microsoft has long been developing its latest version of SQL Server under the code name “Denali.” In a recent press release, the company announced that the code name was being dropped and the product would be titled simply “Microsoft SQL Server 2012.” Microsoft also announced that it is looking to contribute to the Apache Hadoop project and to big data. Apache Hadoop is a project that develops open-source software for distributed computing. Using a simple programming model, the software allows for distributed processing of large data sets across multiple nodes. The software relies on its library to detect and handle failures at the application layer, allowing those who use it to capture, store, and process big data efficiently. Many companies use different forms of this project to access data from servers over the Internet; instead of data being held on one computer, it is spread over a cluster of computers.
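The failure handling mentioned above happens largely in the framework rather than in application code. The sketch below shows the relevant configuration knobs using YARN-era (MRv2) property names; the values shown match the usual defaults and are illustrative, not recommendations.

```java
// Sketch of Hadoop's built-in fault tolerance: the framework, not the
// application, re-runs tasks that fail or stall on a node.
import org.apache.hadoop.conf.Configuration;

public class FaultToleranceSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Re-run a failed map or reduce task up to 4 times before
        // failing the whole job.
        conf.setInt("mapreduce.map.maxattempts", 4);
        conf.setInt("mapreduce.reduce.maxattempts", 4);
        // Speculatively launch duplicate copies of slow tasks on other
        // nodes and keep whichever finishes first.
        conf.setBoolean("mapreduce.map.speculative", true);
        conf.setBoolean("mapreduce.reduce.speculative", true);

        System.out.println("map attempts allowed: "
                + conf.get("mapreduce.map.maxattempts"));
    }
}
```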