How to Select the Best Data Prep Tool: Part 1
Choosing a data prep tool for your company or team can be pretty stressful! It can mean months of research and testing, writing big checks, rollouts, and more. And what if you pick the wrong one!? I’ve been fortunate to learn multiple data prep tools without being forced to migrate to them, which has given me good insight into the key features to look for. In this first of a two-part post, I’ll cover the things you need to know before you even start comparing tools.
The first phase of selecting a data prep tool includes 1) understanding the purpose and basic functionalities, 2) knowing what code vs. no-code tools are, 3) assessing your needs, 4) determining a test plan, and 5) assembling your selection team.
What is a Data Prep Tool?
For our purposes, a data prep tool is software that takes raw data and transforms it into something usable for visualization. We are not including data pipeline management here, which covers scheduling and dependency management.
Basic Functionalities of Data Prep
There are two essential features that nearly every data prep solution will provide. What you’ll find is that each tool you look at performs these functions differently. Your focus on these items shouldn’t be whether the functions can be performed, but how well they can be performed (both are illustrated in the sketch after the list below).
- Data Blending – The combining of multiple, disparate data sets.
- Data Transformation – Creating new data elements for visualization that do not exist in the raw data.
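To make these two functions concrete, here is a minimal sketch in Python using pandas. The data sets and column names (sales, regions, margin_pct) are hypothetical stand-ins, and a no-code tool would accomplish the same result through its UI:

```python
import pandas as pd

# Data blending: combining multiple, disparate data sets on a shared key.
# "sales" and "regions" are hypothetical stand-ins for your raw sources.
sales = pd.DataFrame({
    "store_id": [1, 2, 3],
    "revenue": [12000.0, 8500.0, 15200.0],
    "cost": [9000.0, 7000.0, 11000.0],
})
regions = pd.DataFrame({
    "store_id": [1, 2, 3],
    "region": ["East", "West", "Central"],
})
blended = sales.merge(regions, on="store_id", how="left")

# Data transformation: creating a new element that doesn't exist in the raw data.
blended["margin_pct"] = (blended["revenue"] - blended["cost"]) / blended["revenue"] * 100
print(blended)
```

Every tool can do this join and this calculated field; the evaluation question is how smoothly it handles them at your scale.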
Understand Code vs. No-Code (Low-Code) Tools
The first decision is selecting a code-based tool vs. one that requires no, or very minimal, coding skills. A code-based tool performs data prep through writing code, such as SQL, Python, or HiveQL. Learning to use a code-based tool is essentially an exercise in learning that coding language. Which one you select is also largely driven by the data platform you are using.
For example, if you have Microsoft (MS) SQL Server, then you will use MS SQL as a coding language. From there you might look at tools such as MS SQL Server Management Studio (SSMS) or TOAD, both of which accommodate MS SQL code. For more, check out my post on 2 Proven Coding Languages You Need to Use for Epic Data Visualizations.
A no-code (or low-code) tool is one with a point-and-click interface. These tools perform many of the same functions, leveraging a UI that doesn’t require coding knowledge. This type of tool may also be called “low-code” if it lets you paste in code blocks for complex actions that the tool cannot handle natively.
No-code tools have been increasing in popularity primarily because of their accessibility to people who don’t know how to code. There is more data, a greater demand for decisions driven by that data, and more people needed for data prep. This trend is not unique to data careers either. I remember building my first website and being forced to learn HTML to do it. Now you can create websites easily through point-and-click solutions like BlueHost, Wix, Squarespace, and many others, no coding required!
To get a sense of the different no-code tools on the market, below is a snapshot from Matt Turck’s Machine Learning, Artificial Intelligence, and Data (MAD) Landscape for 2021. The highlighted section includes the big names, but it certainly isn’t a complete list. And if you enjoy a good game of Where’s Waldo for tools and platforms, be sure to visit his full map at the link above!
Assess Your Data Prep Needs
Next, you will need to evaluate your needs before diving into the sea of available options. This helps ensure that you don’t get sidetracked by features that look cool yet have no impact on your ability to prep your data effectively. Here are the considerations that often come up:
1) Associate Skill Level
I’m including this one as a caution, not a criterion! It’s tempting to choose a tool whose complexity sits within your team’s current skill level. However, that would be the wrong approach to take. The right tool will be the one that best meets the data prep needs of your platform. If you find that your team’s skills are not adequate, then focus on skill building to level them up accordingly.
2) How Complex is Your Prep?
Unfortunately, there is no ruler with which to measure this. However, you need to get a sense of the level of challenge that your associates face with your data. At a minimum, try to determine whether your needs are 1) simple to medium, or 2) complex. Some examples of complex data prep include:
- Use of tagged data (e.g., XML, JSON; see the sketch after this list)
- Use of NoSQL data (e.g., graph, geospatial, document store)
- Use of very large data sets (millions of records and up)
- High number of disparate data sources
- High number of data visualizations being produced
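To make the first item concrete, here is a minimal sketch of flattening nested JSON (tagged data) with pandas. The record structure and field names are hypothetical; real payloads are usually deeper and messier, which is exactly where tools start to separate themselves:

```python
import pandas as pd

# Nested JSON: each order has a nested customer object and a list of items.
# This structure is hypothetical; real payloads are often far deeper.
records = [
    {"order_id": 101, "customer": {"id": 7, "tier": "gold"},
     "items": [{"sku": "A1", "qty": 2}, {"sku": "B2", "qty": 1}]},
    {"order_id": 102, "customer": {"id": 9, "tier": "silver"},
     "items": [{"sku": "A1", "qty": 5}]},
]

# json_normalize flattens one nested path per call; "items" becomes one row
# per item, while the meta fields are carried along onto every row.
flat = pd.json_normalize(
    records,
    record_path="items",
    meta=["order_id", ["customer", "id"], ["customer", "tier"]],
)
print(flat)
```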
3) Do You Use a Data Warehouse?
A data warehouse (or data lakehouse) is typically leveraged when your data needs to be modeled. Data models are critical where the data is used by multiple visualizations, ad hoc analytics, and data science. When using a data warehouse, you prep data first for the model, and second for the target visualization.
This is important because, without a data warehouse, teams prep data directly from source to visualization. In that scenario, the prepped data is not needed by anyone else, so it can exist in a silo. However, if there are multiple dependencies on the prepped data, you will need the warehouse model.
The use of a data warehouse is like a central metro hub, with multiple lines coming in and out. Conversely, data prep without a data warehouse is similar to a single metro line running from beginning to end.
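Here is a minimal sketch of that two-stage pattern in pandas, with hypothetical table and column names. Stage one conforms raw source data into a shared model; stage two shapes that model for one target visualization:

```python
import pandas as pd

# Stage 1: prep raw source data into a conformed warehouse-model table.
raw = pd.DataFrame({
    "txn_ts": ["2024-01-05 10:14", "2024-01-05 16:02", "2024-01-06 12:30"],
    "amt": ["120.50", "35.00", "89.99"],
    "store": ["east-01", "east-01", "west-07"],
})
model = pd.DataFrame({
    "transaction_date": pd.to_datetime(raw["txn_ts"]).dt.date,
    "amount": raw["amt"].astype(float),
    "store_code": raw["store"].str.upper(),
})

# Stage 2: prep the shared model for one target visualization (daily revenue).
# Other dashboards and analysts reuse the same model table for their own prep.
daily_revenue = model.groupby("transaction_date", as_index=False)["amount"].sum()
print(daily_revenue)
```

The hub-and-spoke payoff is in stage one: every downstream consumer starts from the same conformed model instead of re-prepping the raw source.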
4) What Other Tools are In-house?
There are many sister tools that work together on a data visualization platform. There is also a natural tendency to want fewer tools to learn and use. As such, some data prep tools try to become a one-stop shop. Here are some examples of related functionality:
- Automation – Scheduling individual data prep processes to run with inter-dependencies on other processes to ensure integrity of data.
- Data Import/Export – Moving data in and out of a centralized location such as a data warehouse.
- Data Quality – A tool that leverages rules to monitor the health of your data sets and provides alerts when something is out of tolerance (a minimal rule check is sketched after this list).
- Analytics Software – Specialized tools such as R that require unique skill sets to leverage.
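To give the data quality item some flavor, here is a minimal, hypothetical rule check in pandas. Each rule is a name plus a row-level predicate, and rows failing a rule trigger an alert-style message:

```python
import pandas as pd

def run_quality_rules(df: pd.DataFrame, rules: dict) -> None:
    """Apply each named rule to every row and report the rows that fail."""
    for name, predicate in rules.items():
        failures = df[~df.apply(predicate, axis=1)]
        if not failures.empty:
            print(f"ALERT: rule '{name}' failed on {len(failures)} row(s)")

orders = pd.DataFrame({"order_id": [1, 2, 3], "revenue": [120.0, -5.0, None]})
rules = {
    "revenue_not_null": lambda row: pd.notna(row["revenue"]),
    "revenue_non_negative": lambda row: pd.isna(row["revenue"]) or row["revenue"] >= 0,
}
run_quality_rules(orders, rules)
```

A dedicated data quality tool wraps this same idea in rule management, scheduling, and alert routing, which is exactly why a bolt-on feature rarely matches one.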
As mentioned, some data prep tools are strategically trying to include as many of these features as possible. I strongly recommend caution here, as they will never fully replace a stand-alone tool. It’s like when I want some good Chinese food, I go to a Chinese restaurant. I’ve found that Chinese restaurants that also try to serve Japanese, Korean, and Thai food generally deliver poor quality on all of them.
Do-everything data prep tools can’t be the best at everything. Find a data prep tool that focuses on data prep and is awesome at it. Your other in-house tools should handle the rest.
What is Your Test Plan?
The CISO of Uber once said that the demo is the new presentation, and the pilot is the new demo! Get the tool, apply it to real use cases using your own data prep specialists, and push the limits of its functionality. Your test plan should document all of the features you are looking for in advance, and your use cases should then be designed to cover all of them.
Assemble a Selection Team
Selecting a data prep tool is not a solo project by any means. The biggest mistake you can make is letting the decision rest with an individual or team that doesn’t perform data prep themselves. The data landscape is evolving rapidly, and needs can change significantly year over year. You’ll want people involved in the pilot and selection process whose typical day includes hands-on-keyboard data prep.
Now, with your pre-work out of the way, you should be ready to engage vendors and conduct your pilots! If you have any suggestions on additional ways that you can prepare to evaluate data prep tool candidates, please let me know in the comments below. In Part 2 of this post, I’ll dive deeper into the key functionalities of the tool that you should be reviewing.