Data Cleansing

by Garcello D
 

The article I decided to blog about this week is called “Data Cleansing for Data Warehousing,” written by Ari Baumgarten on February 27, 2007. The author opens with an analogy comparing politicians raising money to data cleansing for a data warehouse, in the sense that one cannot exist without the other. Data cleansing is also known to be the most time-intensive and contentious process in data warehousing projects, but what really is data cleansing? Well, I’m about to break it down for you.

By definition, data cleansing ensures that undecipherable data does not enter the data warehouse, because bad data would affect the reports generated from the warehouse via OLAP, data mining, and KPIs. To show you what the author meant, I’ll give you a very simple example involving how dates are stored in separate applications. If you had two different applications, the date 11th March 2007 could be stored in different formats, such as 03/11/07 or 11/03/07. A data warehousing project would require these dates to be in one common format or the reports wouldn’t make sense, so data cleansing pretty much makes it all uniform and consistent. Just from seeing this simple example, I can already tell how tedious a task this would truly be.
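To make the date example concrete, here is a minimal sketch in Python of how dates arriving from two sources could be normalized into one format before loading. The source names (“app_a”, “app_b”) and format strings are my own assumptions for illustration; the article doesn’t give any code.

```python
from datetime import datetime

# Assumed: each source application is known to use one date format.
# "app_a" writes US-style MM/DD/YY, "app_b" writes DD/MM/YY.
SOURCE_FORMATS = {
    "app_a": "%m/%d/%y",
    "app_b": "%d/%m/%y",
}

def normalize_date(raw: str, source: str) -> str:
    """Parse a date string from a known source and return it as ISO 8601 (YYYY-MM-DD)."""
    fmt = SOURCE_FORMATS[source]
    return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")

# 11th March 2007 arrives in two different shapes but ends up as one value.
print(normalize_date("03/11/07", "app_a"))  # 2007-03-11
print(normalize_date("11/03/07", "app_b"))  # 2007-03-11
```

The key point is that the string alone (03/11/07 vs. 11/03/07) is ambiguous; you have to know which application produced it, which is part of what makes cleansing so tedious.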

The next part of the article talks about ETL, which stands for Extract, Transform, and Load. ETL refers to a category of tools that help ensure data is cleansed, meaning it conforms to one standard before the information is allowed into the warehouse. ETL usually sits in front of the data warehouse listening for new incoming data; if the data has been programmed to be transformed, ETL makes those changes before loading it into the data warehouse.
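As a rough sketch of that extract, transform, and load flow (the record fields and the in-memory “warehouse” list are made up for illustration and don’t come from the article):

```python
# Minimal ETL sketch: extract raw records, transform (cleanse) them,
# then load only the cleansed rows into the warehouse.

def extract(source_rows):
    """Extract: pull raw records from a source system (here, just a list)."""
    return list(source_rows)

def transform(rows):
    """Transform: apply cleansing rules before anything reaches the warehouse."""
    cleansed = []
    for row in rows:
        name = row.get("customer_name", "").strip().title()
        if not name:  # undecipherable record: do not let it through
            continue
        cleansed.append({"customer_name": name, "sale_date": row["sale_date"]})
    return cleansed

def load(rows, warehouse):
    """Load: append the cleansed rows into the warehouse table."""
    warehouse.extend(rows)

warehouse_table = []
raw = [{"customer_name": "  alice SMITH ", "sale_date": "2007-03-11"},
       {"customer_name": "", "sale_date": "2007-03-12"}]
load(transform(extract(raw)), warehouse_table)
print(warehouse_table)  # only the cleansed "Alice Smith" row makes it in
```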

There are a bunch of different ETL tools that can extract data from remote databases manually or through automatically scheduled events, and alternative tools also exist; choosing what is best depends on the user’s budget and the project. Database administrators can also write scripts to perform these ETL tasks, but usually only for smaller projects. Even Microsoft SQL Server, the program we used for our very own Project 2, comes with a free ETL tool called Data Transformation Services, which is known to be pretty good but has limitations when it comes to data cleansing.
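To give an idea of the kind of hand-written script a DBA might use on a small project, here is a hypothetical sketch using Python’s built-in sqlite3 module. The table names, columns, and cleansing rule are all my own assumptions, and the databases are created in memory just so the sketch runs on its own; a real script would connect to actual database files or servers.

```python
import sqlite3

# Hypothetical small-project ETL script: copy cleansed rows from a source
# database into a warehouse database.
src = sqlite3.connect(":memory:")  # stands in for an application database
dst = sqlite3.connect(":memory:")  # stands in for the warehouse database

# Stand-in for an existing application table.
src.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
src.executemany("INSERT INTO orders VALUES (?, ?)",
                [(1, 19.99), (2, None), (3, 45.00)])

dst.execute("CREATE TABLE fact_orders (order_id INTEGER, amount REAL)")

# Extract only rows that pass a basic cleansing rule (non-null, positive amount).
rows = src.execute(
    "SELECT order_id, amount FROM orders WHERE amount IS NOT NULL AND amount > 0"
).fetchall()

# Load the cleansed rows into the warehouse table.
dst.executemany("INSERT INTO fact_orders VALUES (?, ?)", rows)
dst.commit()

print(dst.execute("SELECT * FROM fact_orders").fetchall())  # [(1, 19.99), (3, 45.0)]
```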

Here is a picture showing ETL sitting in front of a Data Warehouse.

I did my blog on data cleansing because our in-class topic for the week is data warehousing. After checking the PowerPoint, it seemed that data cleansing had not been mentioned and might not be mentioned in class, so I wanted to bring something new to the table and share some knowledge from an external source with my fellow classmates.

Baumgarten, Ari. (February 7, 2007). “Data Cleansing for Data Warehousing.” Retrieved from: http://www.dwreview.com/Articles/Data_Cleansing.html

4 thoughts on “Data Cleansing”

  • November 25, 2012 at 8:14 pm

    Nice article. I would have never thought that data cleansing would be such a demanding procedure. It’s definitely much more complicated than just hitting the delete key. The picture you provided made it easier to understand how ETL fits in with data warehouses. Thanks for the post.

  • November 25, 2012 at 8:39 pm

    Nice article! As you said, we have not talked in depth about data cleansing, and as we all know it is a very important part of data warehousing. All of the data needs to be extracted, transformed, and loaded so that it is stored properly and no useless data ends up in the warehouse. As a result, all of the reports should be clean and easy to create and read. I also liked the picture that you posted, as it helped me understand where ETL sits. Nice post!

  • November 26, 2012 at 12:41 am

    Very insightful article on data cleansing. Since integration is one of the four characteristics of a data warehouse, I believe that data cleansing is the most important task. I didn’t know that ETL would be done from a separate server for incoming data. Considering that data would be coming from different RDBMSs, each of them would have different formats that need to be standardized before being stored in the data warehouse/marts.

  • November 26, 2012 at 11:42 pm

    I did not think data cleansing would be such a resource-intensive procedure. It makes sense, though, given the amount of data each data warehouse holds and the need to decide which data belongs in the warehouse and which does not.
