Understanding data warehousing
A data warehouse allows investigation of corporate data from several perspectives, uncovering key insights that drive corporate decisions, says Jay Shah, associate VP at Nihilent Technologies.
In the previous article on business intelligence in the digital era, I introduced the three primary components of a BI solution: extract, transform, load (ETL); data warehouse; and visualisation. Now let's zoom in a little and try to understand the data warehouse in detail, says Jay Shah, Associate Vice-President at Nihilent Technologies.
As the name suggests, the data warehouse is a store for data. It is, in general, a database that is separate from the operational systems (ERP, CRM, SRM, among others). This separation carries two advantages. First off, any analytical processing does not consume the processing resources of the operational systems, and secondly, any inadvertent changes to the data in the data warehouse do not impact operations.
Typically, the data warehouse is known to perform three primary functions.
1. Storing the data extracted from the source systems in its native format;
2. Transforming/reorganising the data into formats conducive to analysis; and
3. Data querying or filtering of the data as per the reporting needs and presenting to the visualisation tools.
The area of the data warehouse where the data from operational systems is stored in its native format is known as the staging area. Staging areas are often transient in nature. The data in the staging area is deleted once it is updated into the data models (the next stage).
The staging area serves to increase the efficiency of the ETL process as it helps to consolidate, cleanse, and align the data before feeding it to the data models.
It might be pertinent to note that the first time around, all the relevant data from the source system would be loaded into the staging area. Subsequently, only the incremental data will need to be brought in. Hence the loads are typically scheduled to run once a day to bring data pertaining to transactions recorded during that day.
As stated in the earlier article, the primary driver for the data warehouse was to reorganise data in formats making it conducive for query and analysis rather than transaction. Several formats to model the data have been proposed over time, of which, the star schema has been found to be most suitable for analytics.
The star schema architecture comprises a central table known as the fact table. The fact table is surrounded by numerous dimension tables. Fact tables contain data that are measurable and which we want to report on. For instance, if we are creating a schema for sales, the fact table may contain invoice data detailing the sales figures. The dimension tables contain attributes used to filter or search the fact table.
Typical attributes for dimension tables include time, geography, product, sales personnel and so on. Surrogate keys are created for each dimension table as the unique identification key and is the substitute for the natural primary key of that dimension table. The fact table, in turn, contains a foreign key column for the primary keys of each dimension. The combination of these foreign keys defines the primary key for the fact table. This establishes the link between the dimension table and fact tables.
Another model, similar to the star schema, is the snowflake schema. In a snowflake schema, there could be one or more dimension tables that do not connect with the fact table directly. They join through other dimension tables. This, when represented, creates the image of a snowflake. Considerations of whether to model as star or snowflake include frequency of change of the dimension, and ease of use versus ease of maintenance.
Another component of the data model is the metadata. Metadata is commonly described as data about data. It provides a descriptive view, or "blueprint", of how data is mapped from the operational level to the data warehouse.
Once this schema is set up, the data from the staging table is appropriately populated into the fact and dimension tables. Initially, the entire data till date will have to be loaded into the fact and dimension tables. Thereafter, incremental loads will have to be scheduled on a regular basis for the additional transactions that take place in the transactional system.
Querying the data models
Once the data models are set up, one can query the fact table using one or more of the dimensions. A query helps extract select data from the fact tables, filtered by the dimensions and with some processing/calculations along the way. As there are several dimensions, the same data can be queried using any or a combination of dimensions. Thus, it is possible to have multiple perspectives of the same data or, in other words, perform multidimensional analysis. For instance, the sales data can be queried by geography as well as by product and by the sales personnel. Also, since the fact table contains all historical data, the analysis can be a year-on-year analysis or even a multi-year analysis. This allows executives to analyse the data from multiple perspectives, gaining key insights which could facilitate key decisions.
The data extracted using queries is fed to a visualisation tool that presents the data in the form of tables, charts, etc.
A data warehouse allows investigation of corporate data from several perspectives, thereby uncovering key insights that drive corporate decisions. Thus, data warehousing has emerged as a competitive advantage for an informed, efficient and a profitable organisation.