by Garcello D
The article I decided to blog about this week is called “Data Cleansing for Data Warehousing,” it was written by Ari Baumgarten on February 27, 2007. The author opens up with an analogy comparing a politician to data cleansing by stating how Politicians raise money can be compared to data cleansing a warehouse in the sense that one cannot exist without the other. Data Cleansing is also known to be the most time intensive and contentious process for data warehousing projects but what really is data cleaning? Well I’m about to break it down for you.
By definition Data Cleansing ensures that undecipherable data does not enter the data warehouse, because it will affect reports generated from the data warehouse via OLAP, Data mining and KPI’s. To show you what the author meant I’ll give you a very simple example. Data cleansing would utilize how dates are stored in separate applications so if you had 2 different applications the date 11th March 2007 could be stored in different formats which are 03/11/07 or 11/03/07 etc. A data warehousing project would require these dates to of course be in one similar format or else it wouldn’t make sense so data cleansing pretty much makes it all uniform and match, just by seeing this simple example I can already tell how tedious of a task this would truly be.
The next part the article talks about is ETL, Which stands for Extract, Transform and Load. ETL refers to a category of tools that can assist in ensuring that data is cleansed, for example it conforms to one standard before the information is allowed to be entered into the warehouse. ETL usually sits in the front of the data warehouse listening for new incoming data, if data has been programmed to transform then it will make those changes before loading it into the data warehouse.
There are a bunch of different ETL tools which can be used to extract data from remote databases manually or through automatic scheduled events, but alternative tools also exist, choosing what is best only depends on the users budget and the project. Data base administrators can write scripts to perform these ETL tasks but usually only for smaller projects. Even Microsoft SQL server, the program we used for our very Project 2 comes with a free ETL tool called Data Transforming Service which is known to be pretty good but has limitations when it comes to data cleansing.
Here is a picture showing ETL sitting in front of a Data Warehouse.
I did my blog on Data Cleansing because our in class topic for the week is Data warehousing, After checking the power point it seemed that data cleansing has not been mentioned and may not be mentioned in class, so i wanted to bring something new to the table to ensure that more knowledge from an external source is shared with my fellow classmates.
Baumgarten, Ari (February 7, 2007) “Data Cleansing for Data Warehousing”
Retrieved From: http://www.dwreview.com/Articles/Data_Cleansing.html