Data Tests in Y42

What are Data Tests

Data tests is a powerful tool offered by Y42, in order to assure the quality of data.

We all know that data coming from different sources through the pipeline contain the risk of being faulty and errors are very difficult to spot especially if there are thousands and millions of rows.

This is were Data tests come in the rescue. You can set them up easy through the UI of Y42. They can run:

  • Manually or every time directly in the table after you set them up.
  • Automatically when a table containing the tests is triggered from an Orchestration (or when the Table containing the tests is triggered manually for a full or incremental import).

Pro tip: you can use data test not only to check the quality of the data but also to make business decisions!

Types of tests

There are 5 types of tests, which you can combine to make your data quality bulletproof.

Schema tests

First you can do is add the schema of the tests, which sets the first conditions under which the actual tests are build.

This will be your first mandatory test!

In the schema you can:

  • Add/Remove the Columns you will test further.
  • Select the Column type (By default it will be mapped the actual column type).
  • Select if you accept null values of them
  • Select if these columns are required or not (Should or should not exist in the actual output table)
  • Allow new columns to be added in the output table or not.

Note: You can change this setup anytime later.

Column tests

You can add as many validations for columns as you want based on the column values.

If the all the data is true the test will pass.

If any of the data is false (and the threshold is 0) test will fail

You can even add an absolute or relative threshold of the result.

For example in the failed example there are 75 unexpected values.

if we add an absolute 75 threshold value the test will pass, stating the number of failed rows

Note: To make the test pass any time, you can make the Error threshold type Relative and the value 1

Row number Drift

It sets a minimal and an upper threshold number of rows.

If the number of rows is less then the threshold the test will fail.

The threshold can be Relative or Absolute.

Note: To make the test pass any time, you can make the Error threshold type Relative and the value 1

Unique Key

It checks whether a column has unique values or not

If the column does not have unique values it will display how many are duplicated and the test will fail. If every value is unique it will pass.

You can also check if you accept Null values or not.

Low-Code Condition

Works similar as the Column tests, the only difference being that you can write the conditions with low code rather than with the UI.

Learn how to set up data tests by visiting this article.