Import Overview
An import is the process by which an integration extracts data from a data source and loads it into the designated data warehouse.
Every execution inside Y42 is considered a job. You can see more about this concept here.
For integrations, an import is the job in which a connector extracts data from a data source and loads it into a data warehouse.
Import Types
When setting up a new integration, a full import is required before the integration can be used. During this step, the integration loads data from the selected tables available on your source, from the start date until the current day, into the designated destination.
After the first successful full import, you can trigger another full import or an incremental import, either manually or on a schedule in the Orchestration. An incremental import extracts and loads only data that has been added or modified.
In some cases, you may need to re-run a full import to fix a data integrity error or to handle a schema change. This process is called a re-import.
Start date
The start date is the point in time from which the integration extracts historical data from the data source. Some APIs limit the timeframe available to load.
Full Import
Full imports update all selected tables from the start date until the current day.
During a full import, the integration reads all data in the source from the start date to the current day and copies it into the destination. Full imports include all your selected data.
How long an import takes depends on the amount of data and the limitations of the source.
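As a minimal sketch of the idea, assuming rows are plain dictionaries with a hypothetical created_on date column (this is an illustration, not Y42's implementation), a full import can be thought of as rebuilding the destination from every source row that falls between the start date and the current day:

```python
from datetime import date


def full_import(source_rows, start_date, today):
    """Return a fresh copy of every source row dated from start_date through today.

    The function name, the row layout, and the created_on column are
    hypothetical and only illustrate the concept.
    """
    return [
        dict(row)
        for row in source_rows
        if start_date <= row["created_on"] <= today
    ]


source = [
    {"id": 1, "created_on": date(2022, 12, 30)},  # before the start date, skipped
    {"id": 2, "created_on": date(2023, 1, 5)},
    {"id": 3, "created_on": date(2023, 2, 1)},
]

# The destination is rebuilt from scratch rather than patched.
destination = full_import(source, start_date=date(2023, 1, 1), today=date(2023, 3, 1))
print([row["id"] for row in destination])  # [2, 3]
```

Because every qualifying row is read again on each run, the cost grows with the size of the source, which is why the duration depends on the data volume.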
Incremental Import
This import enables you to extract and load new and modified data from your source into your destination. Incremental imports are efficient because they update only new information, instead of re-importing the whole table again.
Incremental Import for Applications
For an import to be incremental, the source table must have a suitable replication key. Some applications provide a cursor field such as updated_at or created_at, but sometimes only a primary key, such as task_id, is available.
Whether a replication key exists, and what type it is, is determined by the app. This influences whether incremental import works at all, and whether it replicates only new records or updated records as well.
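The relationship between the key type and the resulting behavior can be sketched as follows. This is only an illustration under stated assumptions: the function name is hypothetical, the field names are the examples used above, and the preference order is an assumption rather than a documented Y42 rule.

```python
def replication_behavior(available_fields):
    """Return (replication_key, captures_updates) for a table.

    Hypothetical helper: the preference order below is an assumption
    made for illustration only.
    """
    if "updated_at" in available_fields:
        # The value changes whenever a record is modified,
        # so both new and updated records can be detected.
        return "updated_at", True
    for candidate in ("created_at", "task_id"):
        if candidate in available_fields:
            # The value is fixed once a record exists,
            # so only new records can be detected.
            return candidate, False
    # No suitable key: an incremental import is not possible.
    return None, False


print(replication_behavior({"task_id", "title", "updated_at"}))  # ('updated_at', True)
print(replication_behavior({"task_id", "title"}))                # ('task_id', False)
print(replication_behavior({"title"}))                           # (None, False)
```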
Y42 also supports Change Data Capture (CDC) for databases, a powerful feature that replicates all your data in near real-time.
Incremental Import for Databases
Log-based replication uses the transaction logs that some databases implement natively as part of their core functionality.
The replication tool reads these logs to identify changes in a database, such as INSERT, UPDATE, or DELETE operations, and uses them to replicate its data.
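To illustrate the principle, the sketch below replays a list of change events against an in-memory replica. The event layout (op, key, row) and the function name are hypothetical; real transaction logs are database-specific and considerably more detailed.

```python
def apply_change_log(replica, change_log):
    """Replay logged changes, in order, against a replica keyed by primary key.

    The event format used here is a simplified, hypothetical stand-in
    for a real database transaction log.
    """
    for event in change_log:
        op, key = event["op"], event["key"]
        if op in ("INSERT", "UPDATE"):
            replica[key] = event["row"]
        elif op == "DELETE":
            replica.pop(key, None)
        else:
            raise ValueError(f"unsupported operation: {op}")
    return replica


change_log = [
    {"op": "INSERT", "key": 1, "row": {"name": "alpha"}},
    {"op": "INSERT", "key": 2, "row": {"name": "beta"}},
    {"op": "UPDATE", "key": 1, "row": {"name": "alpha-v2"}},
    {"op": "DELETE", "key": 2, "row": None},
]

replica = apply_change_log({}, change_log)
print(replica)  # {1: {'name': 'alpha-v2'}}
```

Unlike a key-based approach, replaying the log also captures deletions, because each DELETE appears as its own event.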
Check our database documentation for coverage of CDC and Incremental import.
When log-based CDC is not possible, the standard incremental import for databases uses the key-based method.
Key-Based Incremental Data Replication
Key-based incremental replication is a method in which the data source identifies new and updated data using a column called the replication key. A key can be a timestamp, integer, or datestamp column that exists in the source table, such as task_id, insert_at, or updated_at.
Note: Selecting an auto-incrementing integer ID as the replication key may cause the replication to miss updated rows, because the ID of an updated row remains the same.
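The effect described in the note can be reproduced with a small sketch. It assumes a hypothetical key_based_sync helper that copies every row whose replication key is greater than the last value seen; timestamps are ISO 8601 strings so that plain string comparison orders them correctly.

```python
def key_based_sync(source_rows, destination, replication_key, last_value):
    """Copy rows whose replication key exceeds last_value; return the new high-water mark.

    Hypothetical helper for illustration; destination is a dict keyed by task_id.
    """
    high_water_mark = last_value
    for row in source_rows:
        if row[replication_key] > last_value:
            destination[row["task_id"]] = dict(row)
            high_water_mark = max(high_water_mark, row[replication_key])
    return high_water_mark


# State of the destination after the previous import.
previous = {
    1: {"task_id": 1, "title": "draft", "updated_at": "2023-01-02"},
    2: {"task_id": 2, "title": "review", "updated_at": "2023-01-03"},
}

# Since then, task 1 was edited and task 3 was created in the source.
source = [
    {"task_id": 1, "title": "draft v2", "updated_at": "2023-01-10"},
    {"task_id": 2, "title": "review", "updated_at": "2023-01-03"},
    {"task_id": 3, "title": "publish", "updated_at": "2023-01-09"},
]

by_id = {key: dict(row) for key, row in previous.items()}
key_based_sync(source, by_id, "task_id", last_value=2)

by_timestamp = {key: dict(row) for key, row in previous.items()}
key_based_sync(source, by_timestamp, "updated_at", last_value="2023-01-05")

print(by_id[1]["title"])         # draft    (the update to task 1 was missed)
print(by_timestamp[1]["title"])  # draft v2 (the update was captured)
```

Both runs pick up the new task 3, but only the timestamp key notices that task 1 changed.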
Re-Import
A re-import is simply a full import of an existing table. Running a full import again completely overwrites the data in your destination with the data from your data source.
It is used to fix data integrity issues in selected tables, which arise when historical data changes in the source and those changes cannot be replicated into your data warehouse. When this occurs, your table is out of sync, and re-importing it solves the issue.