Syncs
A CloudQuery sync fetches data from a source integration and delivers it to one or more destination integrations. This might mean fetching data from AWS and delivering it to ClickHouse, or it could mean fetching data from GCP and delivering it to BigQuery, Kafka and Neo4j, all at once. It all depends on the pipeline.
Data is synced to the destination in a streaming fashion. As soon as data is received for a source integration resource, it is delivered to the destination integration. Source and destination integrations may batch writes for performance reasons, but generally data will be delivered to the destination as the sync progresses.
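The streaming behavior described above can be illustrated with a minimal sketch. This is not CloudQuery's implementation: `fetch_resources`, `BatchingDestination`, and `run_sync` are hypothetical names invented for illustration, and the batch sizes are arbitrary. The sketch only shows the general idea that rows are handed to every destination as soon as they arrive, while each destination may buffer them into batches before writing.

```python
from typing import Iterable, Iterator


def fetch_resources(table: str, count: int) -> Iterator[dict]:
    """Hypothetical source: yields rows one at a time, as they are received."""
    for i in range(count):
        yield {"table": table, "id": i}


class BatchingDestination:
    """Hypothetical destination that buffers rows and writes them in batches."""

    def __init__(self, batch_size: int) -> None:
        self.batch_size = batch_size
        self.buffer: list[dict] = []
        self.written_batches: list[list[dict]] = []

    def write(self, row: dict) -> None:
        # Buffer the row; write a batch as soon as the buffer is full.
        self.buffer.append(row)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        # Write whatever is buffered (stands in for a real database write).
        if self.buffer:
            self.written_batches.append(self.buffer)
            self.buffer = []


def run_sync(rows: Iterable[dict], destinations: list[BatchingDestination]) -> None:
    """Deliver each row to every destination as soon as the source yields it."""
    for row in rows:
        for dest in destinations:
            dest.write(row)
    # Flush partial batches once the source is exhausted.
    for dest in destinations:
        dest.flush()
```

With five rows and destinations that batch by two and by three, the first destination would write batches of 2, 2, and 1 rows, and the second batches of 3 and 2, all while the source is still being consumed.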
Table syncs come in two flavors: full and incremental. A single sync can combine both types; which type is used for a particular table depends on the table definition and is indicated in the table's documentation.
Full syncs are the normal mode of operation for most tables. For tables in this mode, a snapshot of all data is fetched from the corresponding APIs on every sync. Depending on the destination write mode, the data is then appended (write_mode: append), overwritten while keeping stale rows from previous syncs (write_mode: overwrite), or overwritten with rows from previous syncs deleted at the end of the sync (write_mode: overwrite-delete-stale).
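To make the difference between the three write modes concrete, here is a simplified model. It is a sketch under stated assumptions, not how any destination integration is actually implemented: the `apply_sync` function, the `id` primary key, and the `sync_id` column are all hypothetical, chosen only to show what ends up in the destination after two full syncs.

```python
def apply_sync(
    table: list[dict], snapshot: list[dict], sync_id: int, write_mode: str
) -> list[dict]:
    """Return the destination table after writing `snapshot` under `write_mode`.

    Assumptions (hypothetical): rows are keyed by "id", and "sync_id" records
    which sync last wrote each row.
    """
    incoming = [{**row, "sync_id": sync_id} for row in snapshot]

    if write_mode == "append":
        # Every sync adds a new copy of the snapshot; nothing is replaced.
        return table + incoming

    if write_mode not in ("overwrite", "overwrite-delete-stale"):
        raise ValueError(f"unknown write mode: {write_mode}")

    # Overwrite rows that share a primary key; keep everything else.
    by_id = {row["id"]: row for row in table}
    for row in incoming:
        by_id[row["id"]] = row
    rows = list(by_id.values())

    if write_mode == "overwrite-delete-stale":
        # At the end of the sync, drop rows this sync did not write.
        rows = [row for row in rows if row["sync_id"] == sync_id]
    return rows


first = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
second = [{"id": 2, "name": "b2"}, {"id": 3, "name": "c"}]  # id 1 no longer exists upstream

results = {}
for mode in ("append", "overwrite", "overwrite-delete-stale"):
    table = apply_sync([], first, sync_id=1, write_mode=mode)
    results[mode] = apply_sync(table, second, sync_id=2, write_mode=mode)
```

In this model, `append` leaves four rows (both snapshots in full), `overwrite` leaves ids 1, 2, and 3 (id 1 is stale but kept), and `overwrite-delete-stale` leaves only ids 2 and 3.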
Not all destinations support all write modes. ClickHouse, for example, supports only append mode. CloudQuery Platform uses additional views to make it easy to query the latest snapshot of data.
Some APIs lend themselves to being synced incrementally. Rather than fetch all past data on every sync, an incremental table will only fetch data that has changed since the last sync. This is done by storing some metadata in a state backend. The metadata is known as a cursor, and it marks where the last sync ended, so that the next sync can resume from the same point. Incremental syncs can be vastly more efficient than full syncs, especially for tables with large amounts of data. This is because only the data that's changed since the last sync needs to be retrieved, and in many cases this is a small subset of the overall dataset.
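The cursor mechanism can be sketched as follows. Everything here is hypothetical and simplified: `InMemoryStateBackend`, `fetch_changed_since`, and `incremental_sync` are illustrative names, the cursor is an integer `updated_at` value, and the "API" is a plain list. A real state backend persists the cursor between runs, and a real source decides for itself which column acts as the cursor.

```python
from typing import Optional


class InMemoryStateBackend:
    """Hypothetical state backend: stores one cursor value per table."""

    def __init__(self) -> None:
        self._state: dict[str, int] = {}

    def get(self, key: str) -> Optional[int]:
        return self._state.get(key)

    def set(self, key: str, value: int) -> None:
        self._state[key] = value


def fetch_changed_since(api_rows: list[dict], cursor: Optional[int]) -> list[dict]:
    """Stand-in for an API call that returns only rows changed after the cursor."""
    return [r for r in api_rows if cursor is None or r["updated_at"] > cursor]


def incremental_sync(
    table: str, api_rows: list[dict], state: InMemoryStateBackend
) -> list[dict]:
    """Fetch rows changed since the last sync, then advance the stored cursor."""
    cursor = state.get(table)  # where the previous sync ended (None on first run)
    new_rows = fetch_changed_since(api_rows, cursor)
    if new_rows:
        # Mark where this sync ended so the next one can resume from here.
        state.set(table, max(r["updated_at"] for r in new_rows))
    return new_rows


state = InMemoryStateBackend()
api = [{"id": i, "updated_at": i} for i in (1, 2, 3)]

first_run = incremental_sync("events", api, state)   # no cursor yet: all rows
second_run = incremental_sync("events", api, state)  # nothing changed since
api.append({"id": 4, "updated_at": 4})
third_run = incremental_sync("events", api, state)   # only the new row
```

Here the first run fetches all three rows, the second fetches none, and the third fetches only the row added after the cursor was stored, which is the efficiency gain over a full sync.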
Incremental tables are always clearly marked as "incremental" in integration table documentation, along with an indication of which columns are used for the value of the cursor. Because they use state, incremental tables require a little more management. For more details, see Managing Incremental Tables under the Advanced Topics section.