How to Select the Best Data Prep Tool: Part 2
The data prep landscape has gotten more complex along with the variety of data platforms. This is especially challenging for data prep tools which try to be one-stop-shop for all things data prep. In How to Select the Best Data Prep Tool: Part 1, I covered the pre-work needed. In this post, I’ll focus on the next decision, choosing between a code-based or no-code (low-code) data prep tool.
Choosing between a code-based or no-code data prep tool centers around nine key considerations: 1) Data Prep Complexity; 2) The Interface; 3) Big Data; 4) Data Cleansing/Quality Features; 5) Governance; 6) Automation; 7) Data Warehousing; 8) Data Sourcing; and 9) Special Features.
The Key Considerations of Code-Based and No-Code Tools
1) Data Prep Complexity
Coding tools are focused on maximizing capabilities while no-code tools focus on ease of use. For example, when I built my last website I used a no-code tool because the functionality I needed was basic. However, if my needs become more complex, then I’ll likely need a tool that allows me to write custom HTML code.
Low-code (instead of no-code) tools offer a small compromise by allowing you to add blocks of code to your workflow. Keep in mind though, that a low-code tool is a code “runner” and not a great code “writer”. For teams with a complex environment who still want a low-code tool, they will need BOTH a low-code AND a code-based tool.
Winner: No-Code Tools for Low to Medium Complexity; Coding Tools for Complex Environments
2) The Interface
No-code tools will leverage point and click, or drag and drop methods to build workflows. As such, the interface is typically well designed, intuitive and easy to use. On the other hand, coding tools have very little need for advanced interfaces. When coding, you are simply typing code and running it. There might be some minor navigation in the UI when looking at result sets or searching for data sources. However, you simply don’t have many needs from the interface, and as such they can be fairly basic.
Where no-code tools really shine with the interface is through providing a visual representation of your workflow. This makes it much easier to view and see what the workflow is doing. If you have ever tried to fix someone’s script with 1,000 lines of code, then you can appreciate this!
Winner: No-Code Tool
3) Big Data
A code-based tool has no sampling limitations, meaning if your data set has 1M records, your query will run against all of them. That’s because the processing is actually happening on the data platform itself. You only use your local computer’s processing power when returning results to the screen or downloading the data. This allows you to query large amounts of data using complex commands very quickly.
Conversely, no-code tools typically bring the data from the platform to your computer for processing. That means the power of your computer determines how fast the workflow runs. To speed up processing times, no-code tools leverage sampling. So for that 1M record data set, the no-code tool may only run against a sample of 10,000 records.
The assumption no-code tools are making is that they are grabbing a statistically relevant sample representative of the whole. Unfortunately, unless your data is cleaned with low variability in content, this typically doesn’t hold true. Sampling can actually mask unique data values that need to be addressed in the data prep process.
In consideration of this, many no-code tools will allow you to switch off the sampling. However, turning that off might slow down your workflow significantly. If you work from home over a VPN, the speed will be even more negatively impacted.
Winner: Coding Tool
4) Data Cleansing/Quality Features
If all application developers started as a data analyst, I feel that data sets would be a lot cleaner! Unfortunately that won’t happen, so the data prep tool helps cleanse and validate the integrity of your data.
Data cleansing is the act of making a field ready for use. For example, a user input field might have multiple spellings of “In Process”, such as “In_Process”, “InProcess”, and “In Proces”. For use in visualizations, those values will need to be cleansed by consolidating them into one version. Data Quality is the act of setting up automated rules that look for issues in the data. For example, I may need a rule that identifies if a new variation of the “In Process” value appears. If it did, the rule would trigger and send me an alert so that I could update my workflow.
Both data cleansing and quality have been traditionally handled through coding. However, no-code tools can do the same thing, with the benefit of a modern interface that makes the identification of data issues much easier. A no-code tool provides a quick look at your data set, with visual insights into the volume and distribution of values. Then you can clean the data as well set up your rules for future alerts.
Below is a screenshot of Trifacta, a data wrangling low-code tool with some great data cleansing and data quality features.
Winner: No-Code Tool
5) Governance
Governance is a broad topic, but one element is gaining visibility into what happens in the data prep workflow. Imagine governance wanting to know what your 1,000 line SQL code script was doing. You would literally have to open it up, read through it and create the documentation by hand.
No-code tools are tackling this challenge by automatically producing governance documentation. This documentation can indicate what functions are being performed within the workflow, as well as data lineage. How accurate and useful this documentation will be is still too new to assess. However, my guess is that for governance, a limited something is still better than nothing.
Also, keep in mind that the tool is documenting a single workflow, without visibility into external processes. Taking data lineage as an example, you will only have visibility into the data that is used within that workflow. To understand the upstream source behind what’s used by the workflow would require platform level documentation. It’s still just a flashlight in one room of a dark house, but much better than none at all!
Winner: No-Code Tool
6) Automation
Automation is critical to an analytics platform, but it hasn’t traditionally been in scope for a data prep tool. There are two core features of automation, 1) a task scheduler and 2) a task runner. A task runner should be able to kick off multiple types of tasks. The task scheduler mainly orchestrates the running and interdependencies.
Scripts created by coding tools have long been in scope for task runners. However, no-code tools are new file types generally not handled by these tools. No-code tools are tackling this by providing scheduler and runner functionality within their tool feature set. Here are a few key points to keep in mind regarding automation capabilities in no-code tools:
- Additional Software – You may need a supplemental product to do this, such as server version.
- Limited Scope – If you have other tasks outside of the tool, then you will need a solution to schedule tasks with cross-dependencies.
Frankly, this issue exists with code scripts as well. You cannot use an automation tool that only executes coding tasks. The right solution here is to have an independent scheduler and runner that can handle multiple types of tasks. With some creative solutioning in your no-code workflows, you can integrate these as well, which will be a topic for another post.
Winner: Neither, use a stand-alone automation tool that can automate both no-code flows and code-based scripts.
7) Data Warehousing
In its most basic form, data prep is the act of connecting to existing data, transforming/blending it and then outputting it for use by a visualization. However, if you use a data warehouse (or even a data lakehouse!), then you have a 2-step process.
When using a data warehouse, you will need to first create modeled data sets such as base tables, mapping tables, reporting tables, etc. These data sets are critical for larger teams who leverage common, curated data sets for multiple purposes.
The flows/scripts to create a data model can be handled in both code and no-code tools. However, it’s going to be much simpler in a coding tool, here are a few reasons why:
- Efficiency – There will almost always be complex transforms needed which are much easier and faster to build using code.
- Version Control – Data models require constant changes by the team. The simple before/after code comparison in version control makes management much easier.
- Eliminate Code Duplication – When using a no-code tool, you write your code in a coding tool and then copy/paste into the flow. Now you have two copies of your code that the team must ensure remains in sync.
Winner: Coding Tool
8) Data Sourcing
When importing data there are special considerations that need to be made. Most important is flexibility since you will be sourcing data from different platforms, such as Excel, SQL, Hadoop and SharePoint. Consequently, a dedicated platform coding tool such as MS SQL or HiveQL does not have data movement capabilities across platforms.
Typically we would use a multi-purpose language, such as Python or PowerShell, for moving data before prepping it. That might of course require a specialized skill set on your team commonly found with Data Engineers. The advent of no-code tools makes sourcing easier by having the ability to connect to virtually any data platform. Now a Data Analyst can connect to the source and export it to a data warehouse or visualization tool. The data prep can then happen within the no-code tool or in the data warehouse using a coding tool.
For example, I had one scenario where I had 100+ excel files to ingest, with 5 different core file names plus a date in the name for uniqueness. I was able to create a flow in Alteryx that would ingest each of the five core file types, then looping through to pick up each subsequent date file. The imported data was merged together into 5 SQL tables in our data warehouse for additional data prep.
Winner: No-Code Tool
9) Special Features
There will always be cases where you need some extra functionality which a traditional coding tool cannot accommodate. For example, you might use MS SQL, but want to create a trend forecast data set for visualization. Creating that data would be hard to do in MS SQL, and you might instead turn to Python or R. That in turn forces you to learn another coding language, which might not make sense for a one-time project.
This is where a no-code tool can help. Consider Alteryx, which has R functionality embedded in the tool using the same point-and-click process. You can now create that forecast without having to know R coding. The breadth of special features in a no-code tool gives you more options where the learning curve involved with a separate tool would not make sense.
Winner: No-Code Tool
Conclusion
That wraps up the key considerations I feel are important when selecting a code-based or no-code tool for data prep. There are of course many other considerations such as sharing functionality, security, support and more. You’ll have to modify your list by adding anything else that is important for you to compare between the tools.
In case you are wondering what my decision would be going through this process, I can share that as well. For my situation, we use a data warehouse and have complex data prep needs. We also leverage stand-alone tools which handle data sourcing and automation. As a result, a code-based tool makes the most sense. We can then use code-based solutions for data quality and governance documentation such as data lineage and dictionaries.
With that said, the special features and visual data quality aspects of a no-code tool still add value in our team, especially as our needs evolve. Consequently we also use a low-code tool to supplement our data prep solutions when it makes sense to do so.
As always, I’m happy to hear from you on those items which you think are also critical to the selection process. And I’m even more interested in knowing where you disagree with me! Debate sparks innovation, so feel free to leave a comment below, thank you!