Incremental Loading

Extracting data incrementally is universally beneficial — it speeds up the extraction and lowers the load on both the API and Keboola Storage (thus saving credits).

Options

After you have incrementally extracted data from an API, the data must be incrementally loaded into Storage. To do that, simply set "incrementalOutput": true in the config section.

There are, however, a number of implications in the incremental loads. It essentially boils downs to the following use cases, depending on what kind of data you are importing (extracting from an API):

  • The imported data contains only added entries. When incrementalOutput is turned on, the data will be simply appended to the target table in Storage. Turning incrementalOutput to false probably makes no sense because the table will contain only the new entries.
  • The imported data contains added and modified entries. When incrementalOutput is turned on, set a primary key on the table so that new rows are added and existing rows are updated. If the primary key is not set, the modified entries will be duplicated in the target table. Turning incrementalOutput to false probably makes no sense because the table will contain only the new entries.
  • The imported data contains all rows. In this case, set a primary key for the table or turn incrementalOutput to false. Turning incrementalOutput to true probably makes no sense because the table will contain duplicate entries. If you set the primary key, new rows will be added and modified rows will be updated. Note that in this case more credits are consumed.

In neither of these situations will the missing rows get deleted. If you want to do so, the only way is to turn incrementalOutput to false and do full loads.

Using incremental loads obviously requires some support from the API. Generic Extractor supports incremental loads by using previousStart and the time function. Setting the primary key is done using mappings.

Examples

Previous Start Example

Assume you have an API supporting a parameter modified_since which expects a Unix Timestamp. The response then contains only the records that were modified after the specified date. The following configuration can be used:

{
    "config": {
        "incrementalOutput": true,
        "outputBucket": "mock-server",
        "jobs": [
            {
                "endpoint": "users",
                "dataType": "users",
                "params": {
                    "modified_since": {
                        "time": "previousStart"
                    }
                }
            }
        ]
    }
}

The configuration adds the modified_since parameter as a reference to the internal time.previousStart value, which contains the timestamp of the last successful start of the extraction of the particular configuration. The request generated by this configuration is something like:

GET /users?modified_since=1492606006

where 1492606006 is the variable timestamp of the last successful start. This introduces state into the Generic Extractor configuration as it now remembers when it last successfully ran. This means that if you run the above configuration every five minutes, it will extract the data modified within the last five minutes. If you run it every hour, it will extract the data modified within the last hour.

Should one of the runs fail or be skipped for any reason, the extraction will pick up where it ended the last time it was successful. See example [EX107].

The last successful time is stored in the configuration state. If for some reason you need to reset it, update the configuration via API.

Previous Start Date

If an API similar to the one in the above example requires the date to be sent as a string, the following jobs configuration (which uses the date function) can be used:

{
    "jobs": [
        {
            "endpoint": "users",
            "dataType": "users",
            "params": {
                "modified_since": {
                    "function": "date",
                    "args": [
                        "Y-m-d H:i:s",
                        {
                            "time": "previousStart"
                        }
                    ]
                }
            }
        }
    ]
}

This sends a request like:

GET /users?modified_since=2017-04-19%2012%3A46%3A46

in a more readable url-decoded form:

GET /users?modified_since=2017-04-19 12:46:46

Otherwise the configuration behaves the same way as the previous example.

See example [EX108].

Incremental Load From To

Another option is an API which requires the from and to parameters. The following configuration generates the from date as the date of the last extraction (using the time.previousStart value). It also generates the to date as the date of the current extraction (using the time.currentStart value):

{
    "jobs": [
        {
            "endpoint": "users",
            "dataType": "users",
            "params": {
                "from": {
                    "function": "date",
                    "args": [
                        "Y-m-d",
                        {
                            "time": "previousStart"
                        }
                    ]
                },
                "to": {
                    "function": "date",
                    "args": [
                        "Y-m-d",
                        {
                            "time": "currentStart"
                        }
                    ]
                }
            }
        }
    ]
}

This configuration will send a request similar to this one:

GET /109-incremental-load-from-to/users?from=2017-04-19&to=2017-04-24

See example [EX109].

Incremental Relative Load

Suppose you have an API supporting the from and to parameters as in the above example and want to extract the last day data. It can be done using the following configuration:

{
    "jobs": [
        {
            "endpoint": "users",
            "dataType": "users",
            "params": {
                "from": {
                    "function": "date",
                    "args": [
                        "Y-m-d",
                        {
                            "function": "strtotime",
                            "args": [
                                "-1 day",
                                {
                                    "time": "currentStart"
                                }
                            ]
                        }
                    ]
                },
                "to": {
                    "function": "date",
                    "args": [
                        "Y-m-d",
                        {
                            "time": "currentStart"
                        }
                    ]
                }
            }
        }
    ]
}

This configuration leads to a request similar to this one:

GET /110-incremental-relative/users?from=2017-04-23&to=2017-04-24

Remember, this is not a truly reliable incremental load. If you put such configuration into an orchestration, and the configuration does not run for some reason, you may miss some data. However, this may still be a useful approach for obtaining samples of data for POCs.

See example [EX110].