Extracting data incrementally is universally beneficial — it speeds up the extraction and lowers the load on both the API and Keboola Storage (thus saving credits).
After you have incrementally extracted data from an API, the data must be incrementally loaded into Storage. To do that, simply set `"incrementalOutput": true` in the `config` section.
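As a minimal sketch, the setting sits at the top level of the `config` section; the `baseUrl` and the `users` endpoint below are illustrative placeholders:

```json
{
    "api": {
        "baseUrl": "https://example.com/api/"
    },
    "config": {
        "incrementalOutput": true,
        "jobs": [
            {
                "endpoint": "users"
            }
        ]
    }
}
```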
There are, however, a number of implications in incremental loads. It essentially boils down to the following use cases, depending on what kind of data you are importing (extracting from an API):

- If you are importing only new data (existing entries never change) and `incrementalOutput` is turned on, the data will be simply appended to the target table in Storage. Turning `incrementalOutput` to false probably makes no sense because the table will contain only the new entries.
- If you are importing new and modified data and `incrementalOutput` is turned on, set a primary key on the table so that new rows are added and existing rows are updated. If the primary key is not set, the modified entries will be duplicated in the target table. Turning `incrementalOutput` to false probably makes no sense because the table will contain only the new entries.
- If you are importing the full data set on each run, turn `incrementalOutput` to false. Turning `incrementalOutput` to true probably makes no sense because the table will contain duplicate entries. If you set the primary key, new rows will be added and modified rows will be updated. Note that in this case more credits are consumed.

In none of these situations will the missing rows get deleted. If you want to delete them, the only way is to turn `incrementalOutput` to false and do full loads.
Using incremental loads obviously requires some support from the API. Generic Extractor supports incremental loads by using `previousStart` and the `time` function. Setting the primary key is done using mappings.
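For instance, a primary key on an `id` property can be declared in a column mapping along these lines (a sketch; the `users` data type and the `id` column are illustrative):

```json
"mappings": {
    "users": {
        "id": {
            "type": "column",
            "mapping": {
                "destination": "id",
                "primaryKey": true
            }
        }
    }
}
```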
Assume you have an API supporting a `modified_since` parameter which expects a Unix timestamp. The response then contains only the records that were modified after the specified date. The following configuration can be used:
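A sketch of such a configuration, with an illustrative `baseUrl` and `users` endpoint, referencing the internal `time.previousStart` value:

```json
{
    "api": {
        "baseUrl": "https://example.com/api/"
    },
    "config": {
        "incrementalOutput": true,
        "jobs": [
            {
                "endpoint": "users",
                "params": {
                    "modified_since": {
                        "time": "previousStart"
                    }
                }
            }
        ]
    }
}
```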
The configuration adds the `modified_since` parameter as a reference to the internal `time.previousStart` value, which contains the timestamp of the last successful start of the extraction of the particular configuration. The request generated by this configuration is something like:
GET /users?modified_since=1492606006
where `1492606006` is the timestamp of the last successful start. This introduces state into the Generic Extractor configuration, as it now remembers when it last ran successfully. If you run the above configuration every five minutes, it will extract the data modified within the last five minutes; if you run it every hour, it will extract the data modified within the last hour.
Should one of the runs fail or be skipped for any reason, the extraction will pick up where the last successful run ended. See example [EX107].
The last successful time is stored in the configuration state. If for some reason you need to reset it, update the configuration via API.
If an API similar to the one in the above example requires the date to be sent as a string, the following `jobs` configuration (which uses the `date` function) can be used:
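A sketch of such a `jobs` section, with an illustrative `users` endpoint, where the `date` function formats the `time.previousStart` timestamp as a string:

```json
"jobs": [
    {
        "endpoint": "users",
        "params": {
            "modified_since": {
                "function": "date",
                "args": [
                    "Y-m-d H:i:s",
                    {
                        "time": "previousStart"
                    }
                ]
            }
        }
    }
]
```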
This sends a request like:
GET /users?modified_since=2017-04-19%2012%3A46%3A46
or, in a more readable URL-decoded form:
GET /users?modified_since=2017-04-19 12:46:46
Otherwise the configuration behaves the same way as the previous example.
See example [EX108].
Another option is an API which requires `from` and `to` parameters. The following configuration generates the `from` date as the date of the last extraction (using the `time.previousStart` value) and the `to` date as the date of the current extraction (using the `time.currentStart` value):
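A sketch of the corresponding `jobs` section (the `users` endpoint is illustrative); both parameters are formatted as dates via the `date` function:

```json
"jobs": [
    {
        "endpoint": "users",
        "params": {
            "from": {
                "function": "date",
                "args": [
                    "Y-m-d",
                    {
                        "time": "previousStart"
                    }
                ]
            },
            "to": {
                "function": "date",
                "args": [
                    "Y-m-d",
                    {
                        "time": "currentStart"
                    }
                ]
            }
        }
    }
]
```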
This configuration will send a request similar to this one:
GET /109-incremental-load-from-to/users?from=2017-04-19&to=2017-04-24
See example [EX109].
Suppose you have an API supporting the `from` and `to` parameters as in the above example, and you want to extract the last day's data. It can be done using the following configuration:
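One way to sketch this is with the `strtotime` function to compute the relative date; the `users` endpoint is illustrative, and `date` with a single argument is assumed to format the current time:

```json
"jobs": [
    {
        "endpoint": "users",
        "params": {
            "from": {
                "function": "date",
                "args": [
                    "Y-m-d",
                    {
                        "function": "strtotime",
                        "args": ["-1 day"]
                    }
                ]
            },
            "to": {
                "function": "date",
                "args": ["Y-m-d"]
            }
        }
    }
]
```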
This configuration leads to a request similar to this one:
GET /110-incremental-relative/users?from=2017-04-23&to=2017-04-24
Remember, this is not a truly reliable incremental load. If you put such a configuration into an orchestration and the configuration does not run for some reason, you may miss some data. However, this may still be a useful approach for obtaining samples of data for POCs.
See example [EX110].