Generic Extractor Configuration

To configure your first Generic Extractor, follow our tutorial.

To get an overall idea of what to expect when configuring Generic Extractor, look at the following overview of various configuration sections.

Then review a sample configuration featuring all configuration options and their nesting. The configuration map is also available as a separate article.

User Interface

Public Beta Warning:
This feature is currently in public beta. Please provide feedback using the feedback button in your project.

We recently introduced a user interface that lets you build a Generic Extractor configuration without writing JSON. You can set up and test the connection in a few clicks, much as you would in other popular API development tools.

Features such as cURL import, request tests, an output mapping generator, and dynamic function templates and evaluation make the configuration process easier than ever.

You can switch between the JSON representation and the user interface in the upper right corner of the configuration editor.

UI Switch

Backward compatibility

The new user interface is mostly backward compatible with the old JSON configuration. However, some features are not yet supported in the new UI. In such cases, the UI tells you which sections are not supported.

NOTE: The new UI does not affect the functionality of old configurations. All configurations will continue to work. However, in some cases, you might need to adjust a configuration manually to make it compatible with the UI.

JSON Configuration Sections

Click on the section names if you want to learn more.

  • parameters
    • api — sets the basic properties of the API.
      • baseUrl — defines the URL to which the API requests should be sent.
      • caCertificate — defines a custom certificate authority bundle in crt/pem format.
      • #clientCertificate — defines a client certificate and private key in crt/pem format.
      • pagination — breaks a result with many items into separate pages.
      • authentication — needs to be configured for any API which is not public.
      • retryConfig — automatically and repeatedly retries failed HTTP requests.
      • http — sets the timeouts, default headers, and parameters sent with each API call.
    • aws
      • signature — defines the AWS credentials used to sign requests.
    • config — describes the actual extraction.
      • debug — shows all HTTP requests sent by Generic Extractor.
      • outputBucket — defines the name of a Storage Bucket in which the extracted tables will be stored.
      • http — sets the HTTP headers sent with every request.
      • jobs — describes the API endpoints (resources) to be extracted.
      • mappings — describes how the JSON response is converted into CSV files that will be imported into Storage.
      • incrementalOutput — loads the extracted data into Storage incrementally.
      • userData — adds arbitrary data to extracted records.
    • sshProxy — securely accesses HTTP(S) endpoints inside your private network.
    • iterations — executes a configuration multiple times, each time with different values.
  • authorization — allows injecting OAuth authentication.
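To make the iterations entry above more concrete, here is a minimal Python sketch of the idea, not Generic Extractor's actual implementation. It assumes that each iteration's values override same-named keys in config, and that a job parameter written as {"attr": "userType"} (as in the sample configuration in this article) reads its value from the resulting configuration. The helper names expand_iterations and resolve_params are hypothetical.

```python
from copy import deepcopy


def resolve_params(params, config):
    """Replace {"attr": name} references with the value of config[name]."""
    resolved = {}
    for key, value in params.items():
        if isinstance(value, dict) and set(value) == {"attr"}:
            resolved[key] = config[value["attr"]]
        else:
            resolved[key] = value
    return resolved


def expand_iterations(config, iterations):
    """Yield one effective config per iteration (the base config if there are none)."""
    for overrides in iterations or [{}]:
        effective = deepcopy(config)
        # Assumption: iteration values override same-named config values.
        effective.update(overrides)
        yield effective


# Abbreviated stand-ins for the "config" and "iterations" sections.
config = {
    "outputBucket": "ge-tutorial",
    "jobs": [{"endpoint": "users", "params": {"type": {"attr": "userType"}}}],
}
iterations = [{"userType": "active"}, {"userType": "inactive"}]

runs = []
for effective in expand_iterations(config, iterations):
    for job in effective["jobs"]:
        runs.append((job["endpoint"], resolve_params(job["params"], effective)))

print(runs)
```

Under these assumptions, the users endpoint would be requested twice, once with type=active and once with type=inactive, while the base config itself stays unchanged.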

There are also simple pre-defined functions available, adding extra flexibility when needed.

Generic Extractor can be run either from within the Keboola user interface, where only the configuration JSON is needed, or locally, which requires Docker.

Configuration Map

The following sample configuration shows various configuration options and their nesting. You can use the map to navigate between them. The parameter map is also available separately, and we recommend pinning it to your toolbar for quick reference.

{
    "parameters": {
        "api": {
            "baseUrl": "https://example.com/v3.0/",
            "caCertificate": "-----BEGIN CERTIFICATE-----\nMIIFaz....",
            "pagination": {
                "method": "multiple",
                "scrollers": {
                    "offset_scroll": {
                        "method": "offset",
                        "offsetParam": "offset",
                        "limitParam": "count"
                    }
                }
            },
            "authentication": {
                "type": "basic"
            },
            "retryConfig": {
                "maxRetries": 3
            },
            "http": {
                "headers": {
                    "Accept": "application/json"
                },
                "defaultOptions": {
                    "params": {
                        "company": 123
                    }
                },
                "requiredHeaders": ["X-AppKey"],
                "ignoreErrors": [405],
                "connectTimeout": 30,
                "requestTimeout": 300
            }
        },
        "aws": {
            "signature": {
                "credentials": {
                    "accessKeyId": "testAccessKey",
                    "#secretKey": "testSecretKey",
                    "serviceName": "testService",
                    "regionName": "testRegion"
                }
            }
        },
        "config": {
            "debug": true,
            "username": "dummy",
            "#password": "secret",
            "outputBucket": "ge-tutorial",
            "incrementalOutput": true,
            "compatLevel": 2,
            "http": {
                "headers": {
                    "X-AppKey": "ThisIsSecret"
                }
            },
            "jobs": [
                {
                    "endpoint": "users",
                    "method": "get",
                    "dataField": "items",
                    "dataType": "users",
                    "params": {
                        "type": {
                            "attr": "userType"
                        }
                    },
                    "responseFilter": "additional.address/details",
                    "responseFilterDelimiter": "/",
                    "scroller": "offset_scroll",
                    "children": [
                        {
                            "endpoint": "users/{user_id}/orders",
                            "dataField": "items",
                            "recursionFilter": "id>20",
                            "placeholders": {
                                "user_id": "id"
                            }
                        }
                    ]
                }
            ],
            "mappings": {
                "content": {
                    "parent_id": {
                        "type": "user",
                        "mapping": {
                            "destination": "campaign_id",
                            "primaryKey": true
                        }
                    },
                    "name": {
                        "type": "column",
                        "mapping": {
                            "destination": "text"
                        }
                    },
                    "address": {
                        "type": "table",
                        "destination": "addresses",
                        "tableMapping": {
                            "street": {
                                "type": "column",
                                "mapping": {
                                    "destination": "streetName"
                                }
                            }
                        }
                    },
                    "created.date": {
                        "delimiter": "/",
                        "type": "column",
                        "mapping": {
                            "destination": "createdDate"
                        }
                    }
                }
            },
            "userData": {
                "tag": "development"
            }
        },
        "iterations": [
            {
                "userType": "active"
            },
            {
                "userType": "inactive"
            }
        ],
        "sshProxy": {
            "host": "proxy.example.com",
            "user": "proxy",
            "port": 22,
            "#privateKey": "-----BEGIN RSA PRIVATE KEY-----\n...\n-----END RSA PRIVATE KEY-----"
        }
    },
    "authorization": {
        "oauth_api": {
            "credentials": {
                "#data": "{\"status\": \"ok\",\"refresh_token\": \"1234abcd5678efgh\"}",
                "appKey": "someId",
                "#appSecret": "clientSecret"
            }
        }
    }
}
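
If you keep a configuration like the one above in a file, a short script can print its top-level nesting as a quick navigation aid. The following Python sketch is only an illustration: the outline helper is hypothetical, and the embedded JSON is an abbreviated stand-in for the sample, with most values replaced by empty objects.

```python
import json

# Abbreviated stand-in for the sample configuration above.
SAMPLE = """
{
    "parameters": {
        "api": {"baseUrl": "https://example.com/v3.0/", "pagination": {}},
        "aws": {"signature": {}},
        "config": {"debug": true, "outputBucket": "ge-tutorial", "jobs": []},
        "iterations": [],
        "sshProxy": {}
    },
    "authorization": {"oauth_api": {}}
}
"""


def outline(node, max_depth, depth=0):
    """Return indented key names of nested JSON objects, down to max_depth levels."""
    lines = []
    if isinstance(node, dict) and depth < max_depth:
        for key, value in node.items():
            lines.append("    " * depth + key)
            lines.extend(outline(value, max_depth, depth + 1))
    return lines


lines = outline(json.loads(SAMPLE), max_depth=2)
print("\n".join(lines))
```

Raising max_depth lists deeper levels, such as the keys inside api or config. Lists are not descended into, so the individual jobs would need separate handling.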