The term “Data Warehouse” is currently trending topic in the data area. You have at least heard of it once, but maybe you do not know what it really is. Or you know what it is, but do not know how to build one. Do not worry, I will be covering what a Data Warehouse is and how it is created from a SQL script.
From OLTP to OLAP
Due to the concurrent activities and the large amount of data managed by businesses, the databases have come to be distributed according the business needs. When you are working with transactional systems, it is normal to hear about OLTP databases. But what OLTP does stand for? OLTP refers to OnLine Transaction Processing, and it is used to depict those databases which are strictly normalized. The typical operations in this kind of databases are insert, update and delete statements, so the primary focus of these databases are current business operations. However, if you need to do some deep analysis of your data, then querying your OLTP database would be complex and time consuming due to normalization of the data in the OLTP databases – this means you need to create multiple joins to retrieve descriptive data. That is why OLAP databases exist.
An OLAP database stands for OnLine Analytical Processing, and it is totally focused on storing historical data for analysis. One of its distinctive features is a completely denormalized schema – which is created by adding redundant data or grouping data. Denormalization significantly increases the OLAP read performance, however, it decreases the write performance as compared to the OLTP model. Data Warehouse, as you probably have already assumed, uses the OLAP model. The purpose of the Data Warehouse is to aggregate wide-ranging business data from one or multiple sources to provide insights for comparison and analysis of business operations.
Figure 1. Specific Differences between OLTP and OLAP databases.
So… what’s the role of an ETL process in this situation?
Before getting deep into this, let me explain to you what ETL is. The ETL is used to refer the process of Extract, Transform and Load data into a Data Warehouse from transactional databases. When you use an ETL tool, such as SQL Server Integration Services (SSIS) you produce a package – called ETL package – which can be deployed into a server.
The data in the Data Warehouse populated by running the ETL process. First, you have to define your data source or sources – such as OLTP databases, flat files, Excel files, and so on – which becomes input of the package. Second, you need to create a target data model in the Data Warehouse, such as Star Schema or Snowflake Schema – we will talk about this below. And finally, transform the data from the OLTP database into the OLAP database with some level of aggregation. This OLAP database will be actually our Data Warehouse.
Figure 2. You can set more than one data source to build a Data Warehouse and each one can have a distinct source, such as an OLTP database, CSV files, ERP, and so on.
Set up the environment
If you want follow me this along this article, ensure you are running SQL Server on your machine and have the AdventureWorks database. In case you don’t have one of them, do not worry, go to the bottom of this article, and you will see the links to download them.
Data Source: Point Sales Database
First, we need to define our data source. We will be using a set of modified tables from the Adventure Works database (Figure 3).
Figure 3. Point Sales Entity Relationship Diagram.
I decided to model the set of tables by using an ER diagram tool – the one which I used was SqlDBM. I am going to take advantage of one of their features – SQL Generation. This feature will generate SQL statements for database objects creation, such as tables, columns, keys, indexes, and so on.
Run the script from the link below for creating the PointSales database and fill it with the AdventureWorks tables. I recommend to take a look of the data after create the database.
Modeling your schema: Snowflake Schema
Asides the distinct purposes of an OLTP and OLAP database, their main difference is their data structure along the tables. The OLTP tables are completely normalized, however the OLAP tables are denormalized. This means there is a weak relationship between tables and there is redundant data. In this case, it is beneficial to have redundant data because there will be less joins in the query, making the query a lot faster.
The schema used to build this data warehouse was the Snowflake schema (see the figure 4). The dimension’s tables from this schema can have their own category, so it makes the data more descriptive.
Figure 4. Snowflake schema. Notice that a dimension could have a subdimension called “category”.
Running the T – SQL Script
Just run the script from the link below to create the data warehouse. Look carefully how the data is extracted to be stored in each dimension and fact table.
By doing this, now you can analyze your data with much freedom than making complex queries in a OLTP database. You can also connect a reporting/data visualization tool to this database, such as SSRS, PowerBI or Tableau. The figure below was the result of this script, it was generated by using the “Import Existing Database” feature of SqlDBM.
Figure 5. This is the result of running the above script.
This was a simple tutorial of a T – SQL script for creating a data warehouse. I hope this may give to you a clear idea of how data warehouses are created.
Author: Alfonso Hernandez
- SQL Server 2017 Developer Edition: https://www.microsoft.com/en-us/sql-server/sql-server-downloads
- AdventureWorks database: https://github.com/Microsoft/sql-server-samples/tree/master/samples/databases/adventure-works