Tuesday, November 10, 2009

Data Warehouse Concepts : Part 1

Now that we have decided to work on data warehousing, let us try to have an understanding on what exactly is a data warehouse, how different it is from the existing other applications and what is the main purpose of any business to go for a data warehouse.

What is a Data Warehouse?

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but can include data from other sources like EBS, SAP, Siebel, apart from your existing OLTP system in use. Data warehouses is mainly used for analysis and is separate from transaction workload, it enables an organization to consolidate data from several sources to come up for a centralized place where analysis can be done faster and in a better way. This helps in:

1. Maintaining historical data
2. Analyzing the data to gain a better understanding of the business and to improve the existing business.

In addition to a relational database, a data warehouse environment can include an extraction, transportation, transformation, and loading (ETL) solution (depending on the type/tool it can be ETL,ELT or ETL solutions), statistical analysis, reporting, data mining , client analysis , and many other applications that manage the process of gathering data, transforming it into useful information and finally delivering it to business users.

Why is it different from the existing Transaction system - OLTP or transactional database is mainly used for the purpose of daily activity i.e. Insert/update/delete i.e. to store data for daily operational purpose. Data warehouse or OLAP system is on top of this - ie it takes data from OLTP and transforms it for the management to come up with analysis on it, reporting, ad hoc reports so that better decisions can be made, where as end user of an OLTP system is the Operational team who maintains the day to day activities of the business.

Lets take for example a railway ticket booking system - when I enter in irctc.co.in to book my ticket then the main purpose is to get a seat booked – Operational. For this the backend data model/tables/ physical objects have to be designed in such a way that it helps in easy and fast processing of Inert/Update and Deleting of data. The best way of doing this is to have a Normalized database model which is the OLTP system (On line transaction processing) which stores data in the 3NF form making the data transfer very fast. Now when I enter my information it goes at the back end in numerous tables which are in the 3NF form enabling my processing faster for a ticket booking, same is the case of ATM transaction, shopping anything related to Operational Data i.e. day to day transactions. After I have booked my ticket, my work is done. The railway helpdesk maintaining this – his work is done. We have a vast database which maintains daily data.

After this lets talk on the other side of the Railway department – the ministry now wants to come up with which routes are doing good, which are less profit making, which services are running in profit and where there a need for improvement. Lets take for example, the management wants reports like –
a. Most profit making route in terms of day/train/passenger wise
b. Least profit making route in terms of day/train/passenger wise
c. Monthly report on same line
d. Quarterly report on same line
e. Yearly report on same line
f. Comparison of this day to same day in history, this month to same month last year, this quarter to last quarter, to last year etc.
g. Comparison of two routes in terms of profit, maintenances , time lines

There can be numerous reports that the management might want to look at for analysis and decision making at any given point of time. There can be a need of ad hoc reports also which might be required for some important presentations, decisions.

For all of this to happen, imagine a team of reporting people struggling hard to make different joins, different source data, consolidating data, cleansing operational data, integrating them, making them as per the required subject areas on top of huge and vast set of tables, a network of tables which are in the 3NF form making it a non user friendly environment for understanding and working on. Even if there is an expert in Oracle lets say, but data is coming from Oracle, Teradata, db2 and sap for different lines, different regions. Employee data is in different source, different format, trains information is in different format. Imagine the amount of work required to consolidate this information and come up with joins etc to make simple reports. It will take years to implement this reporting requirement on an OLTP system and an expert team of each area. More work, more people, more time required at each level making it a very difficult and cumbersome job that too question to reliability and stability. How to get historical data from archive files, how to centralize data from different places, how to make meaning to all the tables

Now imagine a data warehouse on top of this - a denormalized model which will help in maintaining the historical data from different sources at a centralized place to make the reporting a very easy job. Lets take the OLTP data and on top of that build our data warehouse model – we can have a Star, Snow flake schema, or a hybrid schema to facilitate this. We will need to come up with a Logical data model and then Physical data model. All this will be done to come up with a model that will help in storing data in such a way that it is easily retrievable and gives fast report building and query execution state to make it the best way of reporting for the management which in turn can come up with better decision making for any organization.

So coming back to our example of Railway system – what all do we need. First is requirement gathering , assume completed. Then to make a data warehouse we need to consider its characteristic like –

1. Subject Oriented
2. Integrated
3. NonVolatile
4. Time Variant

Data Warehouse Architectures
Now that we have decided to have a data warehouse, we need to decide how our architecture will be out of the commonly used three forms

1. Basic – End users directly access data derived from several source systems through the data warehouse



2. With Staging Area – We need to clean and process our operational data before putting it into the warehouse. A staging area simplifies building summaries and general warehouse management



3. With Staging Area and Data Marts – we may want to customize our warehouse’s architecture for different groups within our organization. We can do this by adding data marts,
which are systems designed for a particular line of business





Next : Logical and Physical Design

No comments: