Data

Chemical Storage {5}

Computers and technology are advancing at an astronomical rate; according to Moore's law, computing power roughly doubles every 18 months, and we see the effects everywhere. This advancement calls for ever more storage, because nearly everything we do happens on devices that use and store information. The rise of social networking and other business-related needs is what is fueling this growth in information consumption. How do we, as consumers in this extraordinarily large technology market, keep it growing? The answer is to keep developing. As stated, everything we develop needs to be stored, which could become a problem if growth continues. Because of this rising need for information storage capacity, researchers at MIT and around the world are looking into a new form of storage with the potential for enormous data storage capabilities. This new medium is not a conventional device at all; it is a micro-structure: DNA.
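
To get a feel for the core idea, here is a toy sketch that maps each two-bit pair to one of the four nucleotides. This mapping is purely illustrative; the schemes under research add error correction and avoid long runs of repeated bases.

```python
# Toy DNA-storage codec: 2 bits per nucleotide (illustrative only).
BASE_FOR_BITS = {"00": "A", "01": "C", "10": "G", "11": "T"}
BITS_FOR_BASE = {v: k for k, v in BASE_FOR_BITS.items()}

def encode(data: bytes) -> str:
    """Turn bytes into a strand of A/C/G/T, 4 bases per byte."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BASE_FOR_BITS[bits[i:i + 2]] for i in range(0, len(bits), 2))

def decode(strand: str) -> bytes:
    """Invert the mapping back to bytes."""
    bits = "".join(BITS_FOR_BASE[b] for b in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

strand = encode(b"Hi")
print(strand)                  # CAGACGGC
assert decode(strand) == b"Hi"
```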

read more...

Managing Big Data {2}

The exponential growth of computed data has become more and more difficult to manage. Big data refers to a collection of data so large and complex that traditional data management techniques cannot handle it consistently, due to its unpredictable variation. The average person today generates large amounts of data daily through mobile devices, computers, and Internet activity, and all of these bits and details add up to a significant amount. Big data challenges are characterized by volume, velocity, and variety. Volume is big data's greatest challenge: even though some companies can store vast amounts of data, they cannot process it into meaningful information because of its sheer size. Velocity is the speed at which the data flows; some organizations' servers cannot handle the increasing demand. In addition, large amounts of unstructured data such as photos, audio, and video have begun to flood in from the multitude of social networking outlets such as Facebook, Twitter, and YouTube, and are constantly streamed in real time to billions of users. Within the past decade, the significant increase in mobile products such as cell phones and tablets has also greatly increased the demand for data services. Lastly, the immense variety in data types makes organizing and interpreting such data cumbersome. Cisco forecasts that by 2017, annual global data center IP traffic will reach 7.7 zettabytes, or 8.26 billion terabytes.
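
As a quick sanity check, the zettabyte-to-terabyte figure lines up if binary prefixes are assumed (1 ZB = 2^70 bytes and 1 TB = 2^40 bytes, so 1 ZB = 2^30 TB):

```python
# Sanity-check the Cisco figure, assuming binary prefixes.
TB_PER_ZB = 2 ** 30                 # ~1.07 billion TB per ZB
annual_traffic_tb = 7.7 * TB_PER_ZB
print(f"{annual_traffic_tb / 1e9:.2f} billion TB")  # -> 8.27 billion TB (article rounds to 8.26)
```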

read more...

Big Big Data {1}

Big Data is a loosely used term in the database industry. At first the term seems so simple that it becomes complicated, or it is used so generally that it becomes ineffective. Analyzing Big Data is essential in well-established businesses to help grow and mold the company. Big Data, as described by Forbes, is "a collection of data from traditional and digital sources inside and outside your company that represents a source for ongoing discovery and analysis." (What is Big Data) With the expansion of the information and technology age, this becomes more relevant every day. There are even organizations that companies can outsource this type of analysis to. Big Data is a relatively new idea and has been becoming more efficient, but there is still room for improvement.

read more...

Dimensions of Data Quality {1}

The author of this article starts off by introducing the idea of "dimensions" of data quality, such as accuracy, consistency, and timeliness, and asks whether these "dimensions" actually exist as intelligible concepts. The author believes a strong case can be made that we are not thinking as clearly as we could in this area, and that there is room for improvement. He then asks where the term "dimension" comes from when talking about data quality. In this context, "dimension" is used as an analogy; the term gives the impression that data quality is as concrete as a solid object and that the dimensions of data quality can be measured. In data quality, the term could be used interchangeably with "criterion," a standard of judgment. Since data is immaterial, stating that the dimensions can be measured is an astonishing claim. The author then asks whether the dimensions are credible. Consider "duplication" appearing in a list alongside "completeness" and "consistency": the more duplication there is, the lower the data quality, while the more completeness there is, the higher the data quality. The two scores point in opposite directions, so the inclusion of "duplication" in a list of dimensions of data quality immediately creates a lack of consistency in the list. A much more serious problem is that there seems to be no common agreement on what the dimensions of data quality actually are. Lastly, the author asks whether the dimensions are over-abstractions; the worry is that each dimension is not a single concept, but either a collection of disparate concepts or a generalization.
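
To make the direction problem concrete, here is a minimal sketch (the toy records are invented, not from the article) that measures duplication and completeness on the same data; one score improves as it falls, the other as it rises:

```python
# Two hypothetical "dimensions" measured on toy records.
records = [
    {"name": "Ann",  "email": "ann@example.com"},
    {"name": "Ann",  "email": "ann@example.com"},   # duplicate row
    {"name": "Bob",  "email": None},                # incomplete row
    {"name": "Cara", "email": "cara@example.com"},
]

unique = {tuple(sorted(r.items())) for r in records}
duplication = 1 - len(unique) / len(records)        # fraction of duplicate rows

values = [v for r in records for v in r.values()]
completeness = sum(v is not None for v in values) / len(values)

print(f"duplication:  {duplication:.2f}")   # 0.25 -> lower is better
print(f"completeness: {completeness:.2f}")  # 0.88 -> higher is better
```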

read more...

Data Integration – Mashup Service {1}

The article I read was entitled "Service-Oriented Architecture for High-Dimensional Private Data Mashup" by B. Fung, T. Trojer, P. Hung, L. Xiong, K. Al-Hussaeni, and R. Dssouli. The article discusses the integration of data for a service known as a mashup, which combines and integrates data from multiple sources and databases for web applications. The authors use a social media website such as Facebook as an example. Facebook uses mashup techniques in the sense that it collects data from the user in the form of status updates, Instagram photos, Spotify songs, and check-ins at locations; all of these are examples of data sent in from multiple sources and combined together. Throughout the article, the authors discuss the multiple issues faced in protecting users' data in this type of service. While users usually never give out phone numbers, addresses, or Social Security numbers, much important information is still given out. For example, when a person checks in at a location, they are broadcasting to everyone that they are at that location, which gives potentially unwanted people a reference to where the user is from.
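
As a rough illustration of the mashup idea, here is a minimal sketch that merges three hypothetical source feeds into a single timeline; real services integrate many more sources and, as the authors argue, need privacy controls on top:

```python
# Minimal mashup sketch: integrate events from multiple sources.
from itertools import chain
from operator import itemgetter

status_updates = [{"user": "alice", "time": 3, "type": "status",  "text": "Hello!"}]
photos         = [{"user": "alice", "time": 1, "type": "photo",   "url": "img.jpg"}]
checkins       = [{"user": "alice", "time": 2, "type": "checkin", "place": "Cafe"}]  # leaks location

def mashup(*sources):
    """Combine events from all sources into one newest-first timeline."""
    return sorted(chain(*sources), key=itemgetter("time"), reverse=True)

for event in mashup(status_updates, photos, checkins):
    print(event["type"], event)
```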

read more...

Google’s Solution to Unify Their Databases {4}

The article I chose this week is named "Google Spans Entire Planet With GPS-Powered Database" by Cade Metz. The article starts off by describing a question posed to Google engineer Vijay Gill at a conference: how would he change Google's data centers if he had a magic wand? His answer was that "he would use that magic wand to build a single system that could automatically and instantly juggle information across all of Google's data centers" (Metz, 2012). The interesting part of this article is that Google has done just that. The solution he described is called Spanner, a system that lets Google "juggle data across as many as 10 million servers sitting in 'hundreds to thousands' of data centers across the globe" (Metz, 2012). The power of Spanner is that it lets many people handle the data around the world while "all users see the same collection of information at all times" (Metz, 2012). Spanner accomplishes this with its TrueTime API. Alongside this API, Google has also gone to the trouble of setting up master servers with built-in atomic clocks coupled with GPS to ensure accurate server times, which keeps all the different parts of Google's data infrastructure roughly synchronized. The article goes on to say that companies usually just use a third party as their clock instead of installing their own. It ends on the point that this kind of approach would cost too much for most companies to implement, but Google tends to be ahead of the curve.
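
The core TrueTime idea can be sketched in a few lines. This is a conceptual toy, not Google's actual API, and the ±7 ms uncertainty bound is an assumed figure: instead of one timestamp, the clock returns an interval guaranteed to contain the true time, and a commit waits out that interval so ordering stays safe across servers.

```python
# Conceptual TrueTime-style sketch (not Google's real API).
import time

CLOCK_UNCERTAINTY = 0.007  # assumed +/- 7 ms bound from GPS/atomic-clock masters

def tt_now():
    """Return an interval [earliest, latest] that bounds the true time."""
    t = time.time()
    return (t - CLOCK_UNCERTAINTY, t + CLOCK_UNCERTAINTY)

def commit(txn_id):
    """Commit-wait: sleep past the interval so no other server can
    later be assigned an earlier timestamp for a conflicting write."""
    _, latest = tt_now()
    time.sleep(max(0.0, latest - time.time()))
    print(f"{txn_id} committed at {latest:.3f}")

commit("txn-42")
```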

read more...

Data as a New Market {3}

In the article "Data Markets: The Emerging Data Economy," written by Gil Elbaz, the author talks about how people turn data into a new market where they collect, analyze, and sell it. There are advantages for both parties: one can make money from the data, and the other can use it without having to maintain it. The author also gives two examples of data markets, Jigsaw and Kaggle. Jigsaw is a collection of contact information collected from individuals and organizations. Kaggle, on the other hand, is more of a community where companies provide the data and people from around the world join to analyze it, make predictions, find patterns, or pursue whatever the goal of the project is. In return, these contributors receive a reward.

read more...

A Cloud-Based Data Warehouse Service from Treasure Data {2}

The author of this article focused on the cloud-based data warehouse company Treasure Data. The company received $1.5 million in funding, including an investment from Yukihiro "Matz" Matsumoto, the creator of the Ruby programming language. Treasure Data has developed a service that brings high-end analysis to businesses that don't have the resources to afford solutions from major companies like IBM, Oracle, or Teradata. According to the CEO of Treasure Data, Hiro Yoshikawa, the total cost of ownership for a data warehouse suite from one of the enterprise players can run as high as $5 million. Treasure Data is a subscription service that, at the low end, costs $1,500 per month, or $1,200 per month with a 12-month commitment. Yoshikawa says that on average the cost over time is more than 10 times less than what an enterprise data warehouse offering would cost. Treasure Data has more than 10 customers, including Fortune 500 companies, and it has more than 100 billion records stored and is processing 10,000 messages per second. Also, Treasure Data borrows from Hadoop, but with a twist: unlike Hadoop, it does not require an infrastructure investment.
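
The article's own figures support the CEO's claim. Here is a back-of-the-envelope comparison, assuming a hypothetical five-year horizon, since the article names no exact period:

```python
# Rough cost comparison using the figures quoted in the article.
enterprise_tco = 5_000_000          # high-end enterprise suite (CEO's estimate)
monthly = 1_200                     # low-end plan with a 12-month commitment
years = 5                           # assumed horizon for "cost over time"
subscription_total = monthly * 12 * years
print(subscription_total)                    # 72000
print(enterprise_tco / subscription_total)   # ~69x cheaper at this horizon
```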

read more...

Data Warehouse Maturity Model {Comments Off}

The article I read this week was entitled "A Model of Data Warehousing Process Maturity" by Arun Sen, K. Ramamurthy, and Atish Sinha. The point of this article was to discuss some of the issues that businesses experience when dealing with data warehouses, such as metadata management, data changes, and results that are not relevant to the end user. The authors attribute these problems to the lack of experience of some data warehouse engineers, and to the fact that data warehouses are hard to create and keep up. The authors suggest using a maturity model, which would help with upkeep and design and would also show interested parties what the lifetime of the data warehouse will be.
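
As a rough illustration of what a maturity model provides, here is a hypothetical CMM-style ladder; the paper defines its own stages, which may differ. The value is in locating a warehouse on the progression and naming the next improvement:

```python
# Hypothetical CMM-style maturity ladder (illustrative stage names).
MATURITY_LEVELS = [
    "initial",      # ad hoc processes, success depends on heroics
    "repeatable",   # basic project management in place
    "defined",      # documented, standardized processes
    "managed",      # quantitatively measured quality
    "optimizing",   # continuous process improvement
]

def next_step(current: str) -> str:
    """Name the next stage to work toward from the current one."""
    i = MATURITY_LEVELS.index(current)
    return MATURITY_LEVELS[min(i + 1, len(MATURITY_LEVELS) - 1)]

print(next_step("repeatable"))  # defined
```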

read more...