Do you know that feeling when you’re starting a completely new project? You’ll meet new people and new business models. There is a great chance you’ll encounter problems you’ve never heard of before. You’ll boldly go where you’ve never gone before. But before you boldly do so, give yourself a moment or two to build a solid data model. In today’s article, we’ll share a few tips on how to do it.
A few general tips
Before you dive into any project, you should answer a few questions first. Although we’re focused on data-related projects, this advice also applies to other IT-related projects, and even to non-IT projects. The questions you should ask yourself are:
- Do you completely understand what the client needs?
- Are you ready to listen to the client even if you think it should be done the other way?
- Are you ready to go into constructive discussions and propose a better solution for this problem?
- Are you aware of how much time and money this project will cost, both for you and for the client?
- Do you have all the skills and expertise needed to complete the project successfully and on time?
These questions might sound trivial, but if you can’t answer them, it’s probably better not to start working on that project.
OK, we’ve answered all the questions mentioned above and now we’re ready for the next step. Before you start coding, you’ll need to spend some time with your client. The more time you spend with the client and the more information you get from them, the less work you’ll have later. It could be frustrating at the beginning because you won’t produce anything for days or even weeks. Still, that’s a great investment in the future, and it greatly increases the chance that you’ll complete the project successfully.
Maybe the most important thing to say here is: “If you do not understand the data behind the requirements, you’re doomed to fail.” Maybe I’m overstating, but you’ll definitely have more problems in the later stages of the project. These problems will lead to more unpaid work or maybe even to canceling the whole project. In any case, they will lead to financial loss, and you’ll devote (unplanned, unpaid, and unwanted) time to that project.
How can DB modeling help us here?
For any data-related project, DB modeling should be the first step. Although there will be changes in the model along the way (fewer if you have discussed the project with the client in more detail), caused by unexpected situations or new ideas, the core of the model will stay the same. Therefore, building an initial version of the model in the proper manner is a great way to start.
In the rest of the article, I’ll focus on one well-known problem: data quality. There are some rules we should follow to keep our data consistent, and some well-known data-related issues to avoid.
Data integrity can be divided into physical and logical integrity. Physical integrity is closely related to hardware and will not be our focus in this article, while logical integrity is closely related to the data and the relations in the database.
Logical integrity implies the correctness of the data in a given context. Among the rules usually applied are entity integrity and referential integrity. To understand the positive aspects, let’s take a look at the negative ones first.
Data redundancy and inconsistent data
Data redundancy refers to the state when the same data is stored multiple times in different locations instead of being stored only once. While storing redundant data will definitely require more disk space, that is neither the only problem nor the worst one. Maybe the greatest threat is that we could run into problems with data consistency. Let’s take a look at one such example.
At first, the data structure in the picture above could seem completely OK. Still, there is at least one problem (for a start). For each location, we unnecessarily repeat the same data for city and country. While this doesn’t look like a big deal, it is. This example contains only 8 rows. Now imagine that we have 8 million rows instead of 8. We would repeat the data for cities and countries unnecessarily 8 million times. That is not only unwise, but it also requires additional disk space and will affect performance.
Maybe an even bigger problem is data inconsistency. Imagine that someone changes “USA” to “USB”. While this shouldn’t happen, always keep in mind that everything you can think of, no matter how unlikely it seems, will happen. Databases are a game of big numbers, and if you leave room for an error, it will occur sooner or later. With this organization, we could now assume we have 3 different countries: “UK”, “USA”, and “USB”.
The only logical thing to do is to separate real-world entities into separate tables. Therefore, we’ll have separate tables for locations, cities, and countries, and we’ll establish relations between these tables using primary keys and foreign keys. If we change “USA” to “USB” now, we’ll change it in only one place, and that change will affect all records (that change is still a mistake, but notice that correcting it, i.e. changing “USB” back to “USA”, will also happen in only one place, and that’s great 😊).
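A minimal sketch of that separation, using SQLite from Python (the table and column names here are illustrative, not taken from the example above):

```python
import sqlite3

# In-memory database; foreign keys must be enabled explicitly in SQLite.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE country (
    id   INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE city (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    name       TEXT NOT NULL,
    country_id INTEGER NOT NULL REFERENCES country (id)
);
CREATE TABLE location (
    id      INTEGER PRIMARY KEY AUTOINCREMENT,
    address TEXT NOT NULL,
    city_id INTEGER NOT NULL REFERENCES city (id)
);
""")

# "USA" is stored exactly once ...
conn.execute("INSERT INTO country (name) VALUES ('USA')")
conn.execute("INSERT INTO city (name, country_id) VALUES ('New York', 1)")
conn.execute("INSERT INTO location (address, city_id) VALUES ('5th Avenue 1', 1)")
conn.execute("INSERT INTO location (address, city_id) VALUES ('Broadway 10', 1)")

# ... so fixing a typo like "USB" -> "USA" would be a single-row UPDATE,
# and every location picks up the correction through the joins below.
rows = conn.execute("""
SELECT l.address, ci.name, co.name
FROM location l
JOIN city    ci ON ci.id = l.city_id
JOIN country co ON co.id = ci.country_id
""").fetchall()
print(rows)
```

Each country and city name now lives in exactly one row, no matter how many locations reference it.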
In some cases, we do want some redundant data. For example, we’ll store the result of a complex calculation used for regular reports to avoid making that same calculation over and over again. This improves the performance of the whole system, so we have a good reason for storing such redundant data.
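As a sketch of such intentional redundancy (again with illustrative names): an order header caches the sum of its items, so reports read one column instead of aggregating every row each time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE "order" (
    id           INTEGER PRIMARY KEY,
    cached_total REAL NOT NULL DEFAULT 0   -- redundant, but saves recomputation
);
CREATE TABLE order_item (
    id       INTEGER PRIMARY KEY,
    order_id INTEGER NOT NULL REFERENCES "order" (id),
    amount   REAL NOT NULL
);
""")

conn.execute('INSERT INTO "order" (id) VALUES (1)')
conn.executemany(
    "INSERT INTO order_item (order_id, amount) VALUES (?, ?)",
    [(1, 10.0), (1, 25.5), (1, 4.5)],
)

# Refresh the cached value whenever items change (in application code or,
# in a real system, via a trigger); it is redundant by design.
conn.execute("""
UPDATE "order"
SET cached_total = (SELECT SUM(amount)
                    FROM order_item
                    WHERE order_id = "order".id)
""")

total = conn.execute('SELECT cached_total FROM "order" WHERE id = 1').fetchone()[0]
print(total)
```

The trade-off is that you now own the job of keeping the cached value in sync with the detail rows.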
How to build a normalized DB?
Normalized data structure
Notice that each table contains only one type of real-world data, plus relations to other tables.
Normalization is a set of rules that leads us to an optimal data design. Still, you could build a database directly in 3NF, without first putting the data in 1NF, then 2NF, and finally 3NF.
The design process is pretty straightforward:
- Use a naming convention for all objects in your database to increase readability.
- Each table should represent one entity from the real world; when you identify such an entity, add it to your model.
- Start with entities you surely know you’ll need, like “client”, “task”, “service” etc.
- Each table should have a single-column primary key defined (an integer with the auto-increment property set).
- Relations between tables (one-to-one, one-to-many, many-to-many) should follow the nature of these relations in the real world. E.g., if a pilot can fly different aircraft over time and an aircraft can be piloted by different pilots over time, this is obviously a many-to-many relationship and should be modeled in our database accordingly (we’ll have an article related to that in the near future).
- After you’ve finished the DB design, revise it multiple times. Imagine that you’re starting from scratch and ask yourself whether you would do it the same way.
- When you’ve finished the backbone of the model, you can start coding. Add tables when you run into new requirements. If you’ve done everything right, you should need almost no changes to the backbone of your model.
- Always remember that data-related projects are iterative processes that will require revising your code, but also your data structure.
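The many-to-many case from the list above (pilot/aircraft) is usually solved with a junction table. Here is a minimal sketch in SQLite; the table and column names are assumed for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE pilot (
    id   INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL
);
CREATE TABLE aircraft (
    id    INTEGER PRIMARY KEY AUTOINCREMENT,
    model TEXT NOT NULL
);
-- The junction table turns one many-to-many relation
-- into two one-to-many relations.
CREATE TABLE pilot_aircraft (
    pilot_id    INTEGER NOT NULL REFERENCES pilot (id),
    aircraft_id INTEGER NOT NULL REFERENCES aircraft (id),
    PRIMARY KEY (pilot_id, aircraft_id)
);
""")

conn.executemany("INSERT INTO pilot (name) VALUES (?)", [("Ann",), ("Bob",)])
conn.executemany("INSERT INTO aircraft (model) VALUES (?)", [("A320",), ("B737",)])
# Ann flies both models; Bob flies only the B737.
conn.executemany(
    "INSERT INTO pilot_aircraft (pilot_id, aircraft_id) VALUES (?, ?)",
    [(1, 1), (1, 2), (2, 2)],
)

pairs = conn.execute("""
SELECT p.name, a.model
FROM pilot_aircraft pa
JOIN pilot    p ON p.id = pa.pilot_id
JOIN aircraft a ON a.id = pa.aircraft_id
ORDER BY p.name, a.model
""").fetchall()
print(pairs)
```

Note that the junction table uses a composite primary key, a common exception to the single-column-key rule above, since each pilot/aircraft pair should appear only once.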
Nothing says it better than the ironic saying: “Weeks of coding can save you hours of planning.” I guess we’ve all had at least one situation where we left something to be solved later, and then it erupted when we had no time to deal with it (and, of course, we couldn’t believe what we had done). I know I have. Have you? If you’d like, please share your experience in the comments.