The 3 Simple Steps to Prep Data for Visualization (2022)
In my career I have helped build out multiple data environments for reporting and analytics teams. Over time I’ve learned a simple 3-step process for taking raw data and prepping it for data visualization. In this post I’ll introduce these 3 steps for those of you who are doing your own data prep.
The 3 steps of preparing data for visualization are: 1) import the raw data, 2) normalize/transform it, and 3) prep it into dashboard-consumable objects. Let’s go through these one by one.
STEP 1: IMPORT
Our first step is bringing the raw data into your reporting environment, such as a SQL database. There are a few key points to highlight when doing this.
Retain the Structure
You will want to import the data structure exactly as it is in the source. If it’s a data object such as a SQL table, that means keeping the schema for all the fields intact: keep text fields as text, dates as dates, numbers as numbers, retain primary keys, etc.
It also means keeping any perceived discrepancies in the data as-is. For example, you might see a number stored as text and be tempted to convert it to a number on import. However, that could very well cause you more work down the road. If the source application ever writes a character string to that field based on user input, your import process will fail.
Remember, the code for the imports is often written differently from how you might code the remaining data prep. An import might be written in Python or PowerShell, while your data transformation might be in SQL. The import is also typically maintained by a different team or person than the analyst. You will save yourself time by keeping the data as-is on import so that you can manage the use of the data independently.
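To make this concrete, here is a minimal sketch of an import that keeps everything as-is. It uses Python with SQLite and a small inline CSV; the table and field names (and the stray “N/A” value) are hypothetical. Because every field lands as TEXT, the odd value imports cleanly instead of breaking the load, and type cleanup is deferred to the normalize step.

```python
import csv
import io
import sqlite3

# Hypothetical raw extract; in practice this would come from the source system.
raw_csv = """employee_id,hire_date,salary
1001,2021-03-15,85000
1002,2022-01-10,N/A
"""

conn = sqlite3.connect(":memory:")

# Land every field as TEXT so a stray value like "N/A" never fails the import.
conn.execute(
    "CREATE TABLE raw_employees (employee_id TEXT, hire_date TEXT, salary TEXT)"
)

reader = csv.DictReader(io.StringIO(raw_csv))
rows = [(r["employee_id"], r["hire_date"], r["salary"]) for r in reader]
conn.executemany("INSERT INTO raw_employees VALUES (?, ?, ?)", rows)

# Both rows import, including the one a typed salary column would have rejected.
count = conn.execute("SELECT COUNT(*) FROM raw_employees").fetchone()[0]
```

If the salary column had been typed as a number on import, the second row would have forced someone to go fix the import code; as text, the problem waits quietly for the normalize step, which the analyst controls.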
Bring One, Bring All (BOBA)
BOBA means to import ALL the fields and records in a data object, and, in some cases, all the different data objects related to the same source. There are pros and cons to this strategy of course, such as data storage cost vs. velocity of change. Overall though, the advantages of BOBA outweigh the disadvantages in most scenarios.
Let’s go through each of the options in BOBA.
Fields: Importing all the fields (columns) in a data object means bringing everything, not just the data points currently needed. It’s like bringing your entire toolbox when all you think you need is a screwdriver. This method ensures that you have what you might need in the future to make enhancements as business needs evolve.
Records: Seemingly pretty simple at first glance, this means importing all records in the data set. It’s easy to assume that you don’t need the records marked deleted or inactive, or to exclude subject records outside the population you currently need. However, being able to see everything is critical for data validation and visualization. Imagine the case where a user proclaims a record is missing from your dashboard. If you can’t trace back to the raw data, you won’t be able to determine where it was filtered out.
Objects: Ideally you want to bring in all objects related to your subject matter so that, once again, you have the data available for your visualizations as they grow with the business needs. There might be additional objects related to code tables, change history tables, subject matter attributes, etc. You want to import as many related objects as reasonably possible, and it is also a good idea to take space and authorized use into consideration.
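The Records point above is worth a quick sketch. In this hypothetical SQLite example, the raw layer keeps a soft-deleted order that downstream layers filter out, so a “missing record” complaint can be traced back to the raw data. Table and field names are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_orders (order_id TEXT, status TEXT, is_deleted TEXT)"
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [
        ("A1", "shipped", "0"),
        ("A2", "shipped", "1"),  # soft-deleted in the source system
    ],
)

# Downstream layers apply the filter; the raw layer keeps everything.
active = conn.execute(
    "SELECT COUNT(*) FROM raw_orders WHERE is_deleted = '0'"
).fetchone()[0]
total = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
```

When a user asks why order A2 is missing from the dashboard, the answer (“it was soft-deleted in the source”) is one query against the raw table away.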
STEP 2: NORMALIZE AND TRANSFORM
Now that you have all your raw data in one place, you need to start the process of cleaning it up and standardizing the structure for use. You will want to stage the data in a way that can be used for your current visualization as well as future visualizations, analytics, data science, etc.
I’ve combined Normalize and Transform into one step because they typically happen at the same time, and often in the same block of code. Let’s define each one:
Normalize
This refers to the process of moving the raw data into a database model and cleaning it up along the way. The best way to explain is to go through some examples.
- To improve query performance and reduce space, take fields with many duplicate text values and convert them to codes. Let’s say we have a transaction-level table with a Transaction Type field containing 3 distinct values. Convert it to a code and create a separate lookup table to pull in the text value as needed.
- Create one or more “Base Tables” that will be your primary sources of data. For example, if working with employee data, I will want a Base Table that has one record per employee. Then I can append all the primary data points related to each person in that table.
- We notice a number field stored as text, so we convert that to a number to improve query performance. Later, if the raw data ever includes a text value, the Base Table will break on refresh (not the import). It’s much faster to fix Base Table logic than import code.
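The examples above can be sketched in one pass. This is a minimal illustration using SQLite from Python, with hypothetical table and field names: a Transaction Type text field becomes a code plus a lookup table, and a Base Table casts the text amount to a real number.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE raw_transactions (txn_id TEXT, txn_type TEXT, amount TEXT);
INSERT INTO raw_transactions VALUES
  ('T1', 'Purchase', '19.99'),
  ('T2', 'Refund',   '5.00'),
  ('T3', 'Purchase', '42.50');

-- Lookup table: repeated text values become small codes.
CREATE TABLE txn_type_lookup (txn_type_code INTEGER PRIMARY KEY, txn_type TEXT);
INSERT INTO txn_type_lookup VALUES (1, 'Purchase'), (2, 'Refund');

-- Base Table: one record per transaction, codes instead of text,
-- and the amount converted from text to a real number.
CREATE TABLE base_transactions AS
SELECT r.txn_id,
       l.txn_type_code,
       CAST(r.amount AS REAL) AS amount
FROM raw_transactions r
JOIN txn_type_lookup l ON l.txn_type = r.txn_type;
""")

# Querying the Base Table, pulling the text label back in via the lookup.
row = conn.execute("""
    SELECT l.txn_type, SUM(b.amount)
    FROM base_transactions b
    JOIN txn_type_lookup l USING (txn_type_code)
    WHERE l.txn_type = 'Purchase'
""").fetchone()
```

If the source ever writes a non-numeric amount, it is this Base Table logic that needs adjusting, not the import, which is exactly the division of labor described above.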
Transform
The process of transformation is primarily to create new data points that don’t already exist in the raw data. These are going to be fields that can be used for multiple visualizations, analytics, etc. Don’t worry about adding fields that are specific only to your visualization as that will happen in the final step of data prep. Here are a few examples:
- Performing age calculations.
- Adding groupings and/or detail breakdowns of one of the raw fields.
- Creating brand new data points, such as adding a Country field when the raw data only had a City field.
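All three transform examples can be shown in one small sketch. This assumes a hypothetical Base Table of employees and an illustrative city-to-country mapping; the region grouping and the fixed reference date are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE base_employees (employee_id TEXT, birth_date TEXT, city TEXT);
INSERT INTO base_employees VALUES
  ('1001', '1990-06-01', 'Toronto'),
  ('1002', '1985-11-20', 'Chicago');

-- Hypothetical mapping used to create a brand new Country field.
CREATE TABLE city_country (city TEXT, country TEXT);
INSERT INTO city_country VALUES ('Toronto', 'Canada'), ('Chicago', 'USA');

CREATE TABLE base_employees_enriched AS
SELECT e.employee_id,
       -- Age calculation: whole years as of a fixed reference date.
       CAST((JULIANDAY('2022-01-01') - JULIANDAY(e.birth_date)) / 365.25
            AS INTEGER) AS age,
       -- A grouping derived from a raw field (labels are illustrative).
       CASE WHEN e.city = 'Toronto' THEN 'East' ELSE 'Central' END AS region,
       -- A brand new data point the raw data never had.
       c.country
FROM base_employees e
JOIN city_country c ON c.city = e.city;
""")

row = conn.execute(
    "SELECT age, country FROM base_employees_enriched WHERE employee_id = '1001'"
).fetchone()
```

Note that none of these fields are dashboard-specific; they are general-purpose data points that any future visualization or analysis can reuse.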
STEP 3: VIZ TABLES
With your Base Tables in place, you now need to build objects that your visualization can connect to and consume directly. You could connect straight to your Base Tables, but there are many disadvantages to that, performance being a big one. Here I’ll share the top 4 advantages of having viz tables.
- Performance. It allows you to increase viz performance by streamlining objects to contain just the fields and records you need.
- Structure. The data structure preferred by visualization software is often different from the normalized data model used for other disciplines like analytics.
- Dedicated Fields. Calculations used solely by your visualization can be placed here and do not clutter up your base tables used by others. (For example, you might need a field for a custom 1 or 0 filter in your dashboard).
- Housekeeping. When looking at all your data objects it is easy to identify those objects which are used by a visualization. This is important in a team environment with multiple people troubleshooting and maintaining dashboards. (Best Practice: Try to integrate the name of the viz into the name of the viz table).
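Pulling those four points together, here is a minimal sketch of a viz table built from a hypothetical Base Table, again using SQLite from Python. The dashboard name embedded in the table name, the field list, and the 1/0 filter flag are all illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE base_employees (employee_id TEXT, name TEXT, dept TEXT,
                             status TEXT, salary REAL);
INSERT INTO base_employees VALUES
  ('1001', 'Ana',  'Sales',   'Active',   85000),
  ('1002', 'Ben',  'Sales',   'Inactive', 70000),
  ('1003', 'Cara', 'Finance', 'Active',   95000);

-- Viz table named after the dashboard it serves (housekeeping),
-- trimmed to just the fields and records that dashboard needs (performance),
-- with a dedicated 1/0 filter flag that stays out of the Base Table.
CREATE TABLE viz_headcount_dashboard AS
SELECT employee_id,
       dept,
       CASE WHEN salary >= 90000 THEN 1 ELSE 0 END AS high_earner_flag
FROM base_employees
WHERE status = 'Active';
""")

rows = conn.execute(
    "SELECT dept, high_earner_flag FROM viz_headcount_dashboard "
    "ORDER BY employee_id"
).fetchall()
```

The dashboard connects only to `viz_headcount_dashboard`, so the Base Table stays clean for other consumers, and anyone scanning the schema can tell at a glance which dashboard this object belongs to.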
So, there you have it, the 3 Steps of Data Prep! This was a very high-level overview, and there are of course many considerations to make when implementing this yourself. If you are just starting out, try to map out your process and structure in advance so that any special considerations can be addressed. And because data is such a complicated space, I’d love to hear from you if you have a different approach based on your own situation. Thanks!