After you have incrementally extracted data from an API, the data must be
into Storage. To do that, simply set
"incrementalOutput": true in the
There are, however, a number of implications in the incremental loads. It essentially boils downs to the following use cases, depending on what kind of data you are importing (extracting from an API):
incrementalOutput is turned on, the data will be
simply appended to the target table in Storage. Turning
incrementalOutput to false probably makes no sense
because the table will contain only the new entries.
incrementalOutput is turned on, set a primary key on the table so that new rows are added and existing rows are updated.
If the primary key is not set, the modified entries will be duplicated in the target table. Turning
incrementalOutput to false probably makes no sense because the table will contain only the new entries.
incrementalOutput to false. Turning
incrementalOutput to true probably makes no sense because the table will
contain duplicate entries. If you set the primary key, new rows will be added and modified rows will be updated.
Note that in this case more credits are consumed.
In neither of these situations will the missing rows get deleted. If you want to do so, the only way is
incrementalOutput to false and do full loads.
Using incremental loads obviously requires some support from the API. Generic Extractor supports incremental
loads by using
previousStart and the
time function. Setting the primary key is done using
Assume you have an API supporting a parameter
modified_since which expects a
Unix Timestamp. The response then contains only the
records that were modified after the specified date. The following configuration can be used:
The configuration adds the
modified_since parameter as a reference to the internal
time.previousStart value, which contains the timestamp of the last
successful start of the extraction of the particular configuration. The request generated by this configuration is something like:
1492606006 is the variable timestamp of the last successful start. This introduces state into the
Generic Extractor configuration as it now remembers when it last successfully ran. This means
that if you run the above configuration every five minutes, it will extract the data modified within the last five minutes.
If you run it every hour, it will extract the data modified within the last hour.
Should one of the runs fail or be skipped for any reason, the extraction will pick up where it ended the last time it was successful. See example [EX107].
This sends a request like:
in a more readable url-decoded form:
GET /users?modified_since=2017-04-19 12:46:46
Otherwise the configuration behaves the same way as the previous example.
See example [EX108].
Another option is an API which requires the
to parameters. The following
configuration generates the
from date as the date of the last extraction (using the
value). It also generates the
to date as the date
of the current extraction (using the
This configuration will send a request similar to this one:
See example [EX109].
Suppose you have an API supporting the
to parameters as in the above example and
want to extract the last day data. It can be done using the following configuration:
This configuration leads to a request similar to this one:
Remember, this is not a truly reliable incremental load. If you put such configuration into an orchestration, and the configuration does not run for some reason, you may miss some data. However, this may still be a useful approach for obtaining samples of data for POCs.
See example [EX110].