Data Warehousing and Data Quality

by Asim K
In their peer-reviewed paper, published in the proceedings of the Hawaii International Conference on System Sciences, Rudra and Yeo explore the key factors that make data in a data warehouse inefficient or lacking in quality. They begin with a basic introduction to the concept of data warehousing, its history, and its purpose, then move on to the aim of the study (which is mainly geared toward data warehousing for companies and industries in Australia). Data quality is then explained in quick-and-dirty detail, bullet-pointed in a very direct manner, with the note that the quality of data refers to “how relevant, precise, useful, and timely data is”. Rudra and Yeo explain that many end users, such as managers, are unaware of the quality of the data they use in a data warehouse, so ineffective planning leads to many setbacks. They then explain data inconsistency, in which different versions of the same data exist in a database; this section concludes by noting that there is a direct relationship between data consistency and data integrity (chart provided in citation). After giving the background of the research, the authors go into their findings, in which they see that the quality of data is measured by: completeness of data, consistency of entries, accuracy of data, uniqueness of account numbers, and durability of the business rules that pin down the data. Rudra and Yeo conclude that the most common ways data gets polluted in a data warehouse are that data is never fully captured, “heterogeneous” systems are integrated incorrectly, and there is a lack of planning on the part of management.
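To make a few of the quality measures above concrete, here is a minimal Python sketch of how completeness, uniqueness of account numbers, and consistency of entries might be checked on a handful of records. It is not taken from Rudra and Yeo's paper: the field names (`account_no`, `name`, `state`), the sample records, and the three helper functions are all made up for illustration, and a real warehouse would run checks like these in its loading process rather than on a Python list.

```python
from collections import Counter

# Hypothetical required fields; these names are illustrative, not from the paper.
REQUIRED_FIELDS = ("account_no", "name", "state")


def completeness(records):
    """Fraction of records in which every required field is present and non-empty."""
    if not records:
        return 1.0
    complete = sum(
        all(r.get(f) not in (None, "") for f in REQUIRED_FIELDS) for r in records
    )
    return complete / len(records)


def duplicate_account_numbers(records):
    """Account numbers that appear on more than one record."""
    counts = Counter(r.get("account_no") for r in records if r.get("account_no"))
    return sorted(acct for acct, n in counts.items() if n > 1)


def inconsistent_accounts(records):
    """Account numbers stored with more than one distinct name."""
    names = {}
    for r in records:
        acct, name = r.get("account_no"), r.get("name")
        if acct and name:
            names.setdefault(acct, set()).add(name.strip().lower())
    return sorted(acct for acct, seen in names.items() if len(seen) > 1)


# Made-up sample data: A1 is duplicated with two spellings, B2 is missing a state.
records = [
    {"account_no": "A1", "name": "Acme Pty Ltd", "state": "WA"},
    {"account_no": "A1", "name": "ACME Pty Limited", "state": "WA"},
    {"account_no": "B2", "name": "Bondi Traders", "state": ""},
    {"account_no": "C3", "name": "Coral Bay Co", "state": "QLD"},
]

print("completeness:", completeness(records))
print("duplicate account numbers:", duplicate_account_numbers(records))
print("inconsistent accounts:", inconsistent_accounts(records))
```

Accuracy and business rules are left out on purpose, since checking them needs a trusted reference source or the organisation's own rules, neither of which a generic sketch can assume.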

This research by Amit Rudra and Emilie Yeo is very well done and highly informative for students who are still learning about data warehousing or, as in our case, just getting introduced to it. My article last week argued that 50% of the responsibility for controlling a database lies with the end user; this paper reinforces that point by bringing yet another aspect of data control to the table, in which end user error is at fault for most issues relating to inconsistent data. This type of research is needed in the DBMS community to point out errors that might otherwise be pinned on the database system; as history can tell us, human pride may often get in the way of efficient computing (google: failed monopolies). Because of research like this, database administrators can now point to solid evidence when management is not doing its job efficiently, and can recheck what they themselves have been doing wrong as well. I’m ending this blog with the definition of insanity: doing the same thing over and over and expecting different results. Checking and rechecking data, it seems, is key.

Rudra, A., & Yeo, E. (1999). Key Issues in Achieving Data Quality and Consistency in Data Warehousing among Large Organisations in Australia. Proceedings of the 32nd Hawaii International Conference on System Sciences, 7, 1-8. Retrieved from