Data Tests in Y42
What are Data Tests
Data tests are a powerful tool in Y42 for assuring the quality of your data.
Data coming from different sources through the pipeline can be faulty, and errors are very difficult to spot, especially when there are thousands or millions of rows.
This is where data tests come to the rescue. You can set them up easily through the Y42 UI. They can run:
- Manually, at any time, directly in the table after you set them up.
- Automatically, when a table containing the tests is triggered from an orchestration (or when the table containing the tests is triggered manually for a full or incremental import).
Pro tip: you can use data tests not only to check the quality of the data but also to support business decisions!
Types of tests
There are 5 types of tests, which you can combine to make your data quality bulletproof.
Schema tests
The first thing you can do is add the schema for the tests, which sets the initial conditions on which the actual tests are built.
This will be your first mandatory test!
In the schema you can:
- Add/Remove the Columns you will test further.
- Select the column type (by default it is mapped to the actual column type).
- Select whether you accept null values in these columns.
- Select whether these columns are required (that is, whether they must exist in the actual output table).
- Allow new columns to be added in the output table or not.
Note: You can change this setup anytime later.
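To make the schema options above concrete, here is a minimal Python sketch of the kind of check they describe. It is an illustration only, not Y42's implementation: the names `ColumnRule` and `schema_test`, and the representation of a table as a dict of column lists, are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class ColumnRule:
    """Hypothetical per-column schema rule (not a Y42 API)."""
    col_type: type          # expected Python type of non-null values
    nullable: bool = True   # whether null (None) values are accepted
    required: bool = True   # whether the column must exist in the output


def schema_test(table, rules, allow_new_columns=True):
    """Return a list of schema problems; an empty list means the test passes.

    `table` maps column names to lists of values (assumed layout).
    """
    problems = []
    for name, rule in rules.items():
        if name not in table:
            if rule.required:
                problems.append(f"missing required column: {name}")
            continue
        values = table[name]
        if not rule.nullable and any(v is None for v in values):
            problems.append(f"null in non-nullable column: {name}")
        if any(v is not None and not isinstance(v, rule.col_type) for v in values):
            problems.append(f"wrong type in column: {name}")
    if not allow_new_columns:
        # Columns present in the output but absent from the schema.
        for name in sorted(set(table) - set(rules)):
            problems.append(f"unexpected column: {name}")
    return problems


table = {"id": [1, 2, 3], "email": ["a@x.io", None, "c@x.io"], "extra": [1, 1, 1]}
rules = {
    "id": ColumnRule(int, nullable=False),
    "email": ColumnRule(str, nullable=False),
    "country": ColumnRule(str, required=False),
}
print(schema_test(table, rules))
print(schema_test(table, rules, allow_new_columns=False))
```

In this sketch the optional `country` column can be absent without failing, while the null in `email` and (when new columns are not allowed) the extra column are reported.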
Column tests
You can add as many validations for columns as you want based on the column values.
If all the data satisfies the validation, the test will pass.
If any of the data fails the validation (and the threshold is 0), the test will fail.
You can also set an absolute or relative error threshold for the result.
For example, suppose a failed test reports 75 unexpected values.
If we set an absolute threshold of 75, the test will pass and still state the number of failed rows.
Note: To make the test pass every time, set the Error threshold type to Relative and the value to 1.
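The threshold behaviour can be sketched in a few lines of Python. This is a hedged illustration rather than Y42's actual logic: the function name `column_test`, its parameters, and the reading of a relative threshold as a share of all values (between 0 and 1) are assumptions consistent with the examples in this section.

```python
def column_test(values, is_valid, threshold=0, threshold_type="absolute"):
    """Return (passed, failed_count) for a column validation.

    Assumption: the test passes when the number (absolute) or share
    (relative) of unexpected values does not exceed the threshold.
    """
    values = list(values)
    failed = sum(1 for v in values if not is_valid(v))
    if threshold_type == "absolute":
        passed = failed <= threshold
    elif threshold_type == "relative":
        share = failed / len(values) if values else 0.0
        passed = share <= threshold
    else:
        raise ValueError(f"unknown threshold type: {threshold_type}")
    return passed, failed


# 1,000 values, 75 of which are unexpected (not positive).
values = [1] * 925 + [-1] * 75
is_positive = lambda v: v > 0

print(column_test(values, is_positive))                  # threshold 0: fails
print(column_test(values, is_positive, threshold=75))    # absolute 75: passes
print(column_test(values, is_positive, 1, "relative"))   # relative 1: always passes
```

Under these assumptions, a relative threshold of 1 tolerates up to 100% unexpected values, which is why it makes the test pass every time.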
Row number Drift
This test sets a minimum and a maximum threshold for the number of rows.
If the number of rows is less than the threshold, the test will fail.
The threshold can be Relative or Absolute.
Note: To make the test pass every time, set the Error threshold type to Relative and the value to 1.
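One plausible reading of this test, sketched in Python, compares the current row count with a baseline and fails when the drop exceeds the threshold. Everything here is an assumption made for illustration: the function name `row_drift_test`, the use of the previous run's row count as the baseline, and the treatment of a relative threshold as a share of that baseline. Only the lower-bound failure described above is modelled.

```python
def row_drift_test(previous_rows, current_rows, threshold=0, threshold_type="absolute"):
    """Return (passed, dropped_rows).

    Assumption: the test fails when the row count has dropped, compared
    with the previous run, by more than the threshold allows.
    """
    dropped = max(previous_rows - current_rows, 0)
    if threshold_type == "absolute":
        passed = dropped <= threshold
    elif threshold_type == "relative":
        share = dropped / previous_rows if previous_rows else 0.0
        passed = share <= threshold
    else:
        raise ValueError(f"unknown threshold type: {threshold_type}")
    return passed, dropped


print(row_drift_test(1000, 900, threshold=50))                  # lost 100 rows, 50 allowed
print(row_drift_test(1000, 900, 0.2, "relative"))               # lost 10%, 20% allowed
print(row_drift_test(1000, 0, 1, "relative"))                   # relative 1 tolerates any drop
```

With this reading, a relative threshold of 1 allows the table to lose up to 100% of its rows, which matches the note that such a setting makes the test pass every time.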
Unique Key
It checks whether a column has unique values or not.
If the column does not have unique values it will display how many are duplicated and the test will fail. If every value is unique it will pass.
You can also choose whether to accept null values or not.
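As a rough illustration of what a unique key check does, the following Python sketch counts the duplicated values in a column. It is not Y42's implementation: the name `unique_key_test`, counting each distinct repeated value once, and leaving nulls out of the duplicate count are all assumptions made for this example.

```python
from collections import Counter


def unique_key_test(values, accept_nulls=True):
    """Return (passed, duplicated_values).

    Assumptions: nulls (None) are never counted as duplicates, and a
    null fails the test only when accept_nulls is False.
    """
    values = list(values)
    null_count = sum(1 for v in values if v is None)
    counts = Counter(v for v in values if v is not None)
    # Number of distinct values that occur more than once.
    duplicated = sum(1 for c in counts.values() if c > 1)
    passed = duplicated == 0 and (accept_nulls or null_count == 0)
    return passed, duplicated


print(unique_key_test([1, 2, 3]))                        # all unique
print(unique_key_test([1, 2, 2, 3, 3, 3]))               # values 2 and 3 repeat
print(unique_key_test([1, None], accept_nulls=False))    # null not accepted
```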
Low-Code Condition
This test works similarly to the Column tests. The only difference is that you write the conditions in low code rather than through the UI.
Learn how to set up data tests by visiting this article.