What are Data Tests
Data tests is a powerful tool offered by Y42, in order to assure the quality of data.
We all know that data coming from different sources through the pipeline contain the risk of being faulty and errors are very difficult to spot especially if there are thousands and millions of rows.
This is were Data tests come in the rescue. You can set them up easy through the UI of Y42. They can run:
- Manually or every time directly in the table after you set them up.
- Automatically when a table containing the tests is triggered from an Orchestration (or when the Table containing the tests is triggered manually for a full or incremental import).
Pro tip: you can use data test not only to check the quality of the data but also to make business decisions!
Types of tests
There are 5 types of tests, which you can combine to make your data quality bulletproof.
Schema tests
First you can do is add the schema of the tests, which sets the first conditions under which the actual tests are build.
This will be your first mandatory test!
In the schema you can:
- Add/Remove the Columns you will test further.
- Select the Column type (By default it will be mapped the actual column type).
- Select if you accept null values of them
- Select if these columns are required or not (Should or should not exist in the actual output table)
- Allow new columns to be added in the output table or not.
Note: You can change this setup anytime later.
Column tests
You can add as many validations for columns as you want based on the column values.
If the all the data is true the test will pass.
If any of the data is false (and the threshold is 0) test will fail
You can even add an absolute or relative threshold of the result.
For example in the failed example there are 75 unexpected values.
if we add an absolute 75 threshold value the test will pass, stating the number of failed rows
Note: To make the test pass any time, you can make the Error threshold type Relative and the value 1
Row number Drift
It sets a minimal and an upper threshold number of rows.
If the number of rows is less then the threshold the test will fail.
The threshold can be Relative or Absolute.
Note: To make the test pass any time, you can make the Error threshold type Relative and the value 1
Unique Key
It checks whether a column has unique values or not
If the column does not have unique values it will display how many are duplicated and the test will fail. If every value is unique it will pass.
You can also check if you accept Null values or not.
Low-Code Condition
Works similar as the Column tests, the only difference being that you can write the conditions with low code rather than with the UI.
How to Use them
1- Make sure you have created an output out of a model, or a successful integration job.
Note: The data Test button will not be clickable if you do not have at least 1 successful Job
2- Go to Job Overview (Of an Integration, UI Model, SQL model)
3- Find a Ready status table that you want to test and click the button Data Tests.
4- Add the Schema Test.
First you can do is add the schema of the tests, which sets the first conditions under which the actual tests are build.
This will be your first mandatory test!
In the schema you can:
- Add/Remove the Columns you will test further.
- Select the Column type (By default it will be mapped the actual column type).
- Select if you accept null values of them
- Select if these columns are required or not (Should or should not exist in the actual output table)
- Allow new columns to be added in the output table or not.
Note: You can change this setup anytime later.
You have the possibility to preview the actual table data in the UI of the test setup making it easier to make the necessary decisions while sitting up the tests.
5- Run the test
Click the run button and the Table will be checked for the conditions you setup.
If everything matches then the test will be successful.
If I change one condition intentionally, like for the first column in the image from Float type into String and run it, the test will fail.
6- Save the test.
After you click the save button the first test (Schema test) will be saved.
Now you can add other types of tests.
7- Add other types of tests (Optional)
Now you can add ass many tests as you want.
In this step you can add:
- Column tests
- Row number Drift
- Unique Key
- Low-Code Condition
8- Add an Orchestration (Optional)
To make the test run every time the table is imported/triggered, you need to run an orchestration to make the process automatic and reliable.
9- View the Jobs in the Job overview of the data tests (Optional)
In this section are listed all the statuses of the test runs.