Generic Extractor Configuration

To configure your first Generic Extractor, follow our tutorial.

To get an overall idea of what to expect when configuring Generic Extractor, take a look at the following overview of various configuration sections.

Then go through the sample configuration, which shows the configuration options and their nesting. The configuration map is also available as a separate article.

Configuration Sections

Click on the section names if you want to learn more.

  • parameters
    • api — sets the basic properties of the API.
      • baseUrl — defines the URL to which the API requests should be sent.
      • caCertificate — defines a custom certificate authority bundle in crt/pem format.
      • #clientCertificate — defines the client certificate and private key in crt/pem format.
      • pagination — breaks a result with a large number of items into separate pages.
      • authentication — needs to be configured for any API which is not public.
      • retryConfig — automatically and repeatedly retries failed HTTP requests.
      • http — sets the timeouts, default headers and parameters sent with each API call.
    • aws
      • signature — defines the AWS credentials for request signing.
    • config — describes the actual extraction.
      • debug — shows all HTTP requests sent by Generic Extractor.
      • outputBucket — defines the name of a Storage Bucket in which the extracted tables will be stored.
      • http — sets the HTTP headers sent with every request.
      • jobs — describes the API endpoints (resources) to be extracted.
      • mappings — describes how the JSON response is converted into CSV files that will be imported into Storage.
      • incrementalOutput — loads the extracted data into Storage incrementally.
      • userData — adds arbitrary data to extracted records.
    • sshProxy — securely accesses HTTP(S) endpoints inside your private network.
    • iterations — executes a configuration multiple times, each time with different values.
  • authorization — allows injecting OAuth authentication.
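
As a rough sketch of the nesting listed above, the Python snippet below builds a minimal skeleton and prints the dotted path of each section. The values are placeholders borrowed from the sample configuration. `section_paths` is a hypothetical helper, not part of Generic Extractor, and the skeleton makes no claim about which keys are required.

```python
import json

# Placeholder values are borrowed from the sample configuration.
# Which keys are mandatory is NOT stated here, so treat this as a shape only.
skeleton = {
    "parameters": {
        "api": {
            "baseUrl": "https://example.com/v3.0/",
        },
        "config": {
            "outputBucket": "ge-tutorial",
            "jobs": [
                {"endpoint": "users", "dataField": "items"},
            ],
        },
    },
}


def section_paths(node, prefix=""):
    """Return dotted paths for every key, descending into dicts and lists."""
    paths = []
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.append(path)
            paths.extend(section_paths(value, path))
    elif isinstance(node, list):
        for item in node:
            # All list items share one "[]" marker instead of an index.
            paths.extend(section_paths(item, f"{prefix}[]"))
    return paths


# Print the skeleton as configuration JSON, then its section paths.
print(json.dumps(skeleton, indent=4))
for path in section_paths(skeleton):
    print(path)
```

The same helper can be pointed at any configuration loaded with `json.load` to get a quick, flat view of its nesting.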

There are also simple pre-defined functions available, adding extra flexibility when needed.

Generic Extractor can be run from within the Keboola user interface (only the configuration JSON is needed) or locally (Docker is needed).

Configuration Map

The following sample configuration shows various configuration options and their nesting. You can use the map to navigate between them. The configuration map is also available separately, and we recommend pinning it to your toolbar for quick reference.

{
    "parameters": {
        "api": {
            "baseUrl": "https://example.com/v3.0/",
            "caCertificate": "-----BEGIN CERTIFICATE-----\nMIIFaz....",
            "pagination": {
                "method": "multiple",
                "scrollers": {
                    "offset_scroll": {
                        "method": "offset",
                        "offsetParam": "offset",
                        "limitParam": "count"
                    }
                }
            },
            "authentication": {
                "type": "basic"
            },
            "retryConfig": {
                "maxRetries": 3
            },
            "http": {
                "headers": {
                    "Accept": "application/json"
                },
                "defaultOptions": {
                    "params": {
                        "company": 123
                    }
                },
                "requiredHeaders": ["X-AppKey"],
                "ignoreErrors": [405],
                "connectTimeout": 30,
                "requestTimeout": 300
            }
        },
        "aws": {
            "signature": {
                "credentials": {
                    "accessKeyId": "testAccessKey",
                    "#secretKey": "testSecretKey",
                    "serviceName": "testService",
                    "regionName": "testRegion"
                }
            }
        },
        "config": {
            "debug": true,
            "username": "dummy",
            "#password": "secret",
            "outputBucket": "ge-tutorial",
            "incrementalOutput": true,
            "compatLevel": 2,
            "http": {
                "headers": {
                    "X-AppKey": "ThisIsSecret"
                }
            },
            "jobs": [
                {
                    "endpoint": "users",
                    "method": "get",
                    "dataField": "items",
                    "dataType": "users",
                    "params": {
                        "type": {
                            "attr": "userType"
                        }
                    },
                    "responseFilter": "additional.address/details",
                    "responseFilterDelimiter": "/",
                    "scroller": "offset_scroll",
                    "children": [
                        {
                            "endpoint": "users/{user_id}/orders",
                            "dataField": "items",
                            "recursionFilter": "id>20",
                            "placeholders": {
                                "user_id": "id"
                            }
                        }
                    ]
                }
            ],
            "mappings": {
                "content": {
                    "parent_id": {
                        "type": "user",
                        "mapping": {
                            "destination": "campaign_id",
                            "primaryKey": true
                        }
                    },
                    "name": {
                        "type": "column",
                        "mapping": {
                            "destination": "text"
                        }
                    },
                    "address": {
                        "type": "table",
                        "destination": "addresses",
                        "tableMapping": {
                            "street": {
                                "type": "column",
                                "mapping": {
                                    "destination": "streetName"
                                }
                            }
                        }
                    },
                    "created.date": {
                        "delimiter": "/",
                        "type": "column",
                        "mapping": {
                            "destination": "createdDate"
                        }
                    }
                }
            },
            "userData": {
                "tag": "development"
            }
        },
        "iterations": [
            {
                "userType": "active"
            },
            {
                "userType": "inactive"
            }
        ],
        "sshProxy": {
            "host": "proxy.example.com",
            "user": "proxy",
            "port": 22,
            "#privateKey": "-----BEGIN RSA PRIVATE KEY-----\n...\n-----END RSA PRIVATE KEY-----"
        }
    },
    "authorization": {
        "oauth_api": {
            "credentials": {
                "#data": "{\"status\": \"ok\",\"refresh_token\": \"1234abcd5678efgh\"}",
                "appKey": "someId",
                "#appSecret": "clientSecret"
            }
        }
    }
}
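
The sample combines `iterations` with a job parameter of the form `{"attr": "userType"}`. To illustrate how the two might interact, the Python sketch below expands a trimmed copy of the sample into one set of job parameters per iteration. It assumes that each iteration's values are merged over the `config` section and that `attr` looks up a key there. `resolve_params` and `expand_iterations` are hypothetical helper names, and this is not Generic Extractor's actual code.

```python
import copy

# Trimmed copy of the sample configuration above.
parameters = {
    "config": {
        "outputBucket": "ge-tutorial",
        "jobs": [
            {"endpoint": "users", "params": {"type": {"attr": "userType"}}},
        ],
    },
    "iterations": [
        {"userType": "active"},
        {"userType": "inactive"},
    ],
}


def resolve_params(params, config):
    """Replace {"attr": name} references with config[name] (assumed semantics)."""
    resolved = {}
    for key, value in params.items():
        if isinstance(value, dict) and set(value) == {"attr"}:
            resolved[key] = config[value["attr"]]
        else:
            resolved[key] = value
    return resolved


def expand_iterations(parameters):
    """Yield (endpoint, resolved params) pairs, one run per iteration."""
    # Without an "iterations" section, run once with the config unchanged.
    for overrides in parameters.get("iterations") or [{}]:
        config = copy.deepcopy(parameters["config"])
        config.update(overrides)  # assumed: iteration values override config
        for job in config["jobs"]:
            yield job["endpoint"], resolve_params(job.get("params", {}), config)


for endpoint, params in expand_iterations(parameters):
    print(endpoint, params)
```

Under these assumptions, the `users` endpoint is requested twice, once with `type` set to `active` and once with `type` set to `inactive`.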