Before configuring Generic Extractor, you should have a basic understanding of REST APIs and the JSON format. This tutorial uses the MailChimp API, so have its documentation at hand. You also need a MailChimp API key.
Generic Extractor configuration is written in JSON format and comprises several sections (a configuration map for navigation is available).
A user interface is available that can help you with the configuration and generate the JSON configuration for you.
The first configuration part is the Base Configuration section, where you can set the Base URL and Authentication method of the API you connect to.
In our case, we will use the MailChimp API, so the Base URL will be https://us13.api.mailchimp.com/3.0/, and the Authentication method will be Basic Authentication.
Important: Make sure that the baseUrl value ends with a slash!
In the Destination section, you can set:
- Output Bucket: where the data will be stored. It will be set to the ID of the Storage Bucket.
- Incremental Output: defines whether you want the result to overwrite the existing data or append to it.
If you switch to the JSON mode, the created configuration will translate to the api section, where you set the basic properties of the API.
In the simplest case, these are the baseUrl and authentication properties.
Important: Make sure that the baseUrl value ends with a slash!
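Why the trailing slash matters can be sketched in a few lines. The Python snippet below writes an api section as a dictionary and resolves a relative endpoint path against the base URL. Two things are assumptions here: the exact nesting (a top-level parameters key and an authentication object with "type": "basic"), and that Generic Extractor resolves URLs by the standard RFC 3986 rules, which is what urllib.parse.urljoin implements. Treat it as an illustration, not as a description of the component's internals.

```python
import json
from urllib.parse import urljoin

# Sketch of the api section. The "parameters" wrapper and the
# "type": "basic" value are assumptions, not taken from the text above.
configuration = {
    "parameters": {
        "api": {
            "baseUrl": "https://us13.api.mailchimp.com/3.0/",
            "authentication": {"type": "basic"},
        }
    }
}

base_url = configuration["parameters"]["api"]["baseUrl"]

# With the trailing slash, a relative path is appended to the base URL.
with_slash = urljoin(base_url, "campaigns")

# Without it, the last path segment ("3.0") is replaced by the relative path.
without_slash = urljoin(base_url.rstrip("/"), "campaigns")

print(json.dumps(configuration, indent=4))
print(with_slash)
print(without_slash)
```

Under these assumptions, only the first URL keeps the 3.0 part of the path, which is the reason for the warning above.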
The config section describes the actual extraction. Its most important parts are the outputBucket and jobs properties. The outputBucket property must be set to the ID of the Storage Bucket where the data will be stored. If the bucket does not exist, it will be created.
The config section also contains the authentication parameters, such as username and password. Start with a configuration section that sets these properties.
The password property is prefixed with the hash mark #, meaning the value will be encrypted once you save the configuration.
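A starting config section might look roughly like the dictionary below, shown in Python so that the hash-mark convention can be checked in code. All values are placeholders: "dummy" is a stand-in user name, the password is where the MailChimp API key would go, and the bucket name ge-tutorial is inferred from the table names that appear later in this tutorial. The encrypted_keys helper is hypothetical and only illustrates the naming rule; the encryption itself happens when the configuration is saved.

```python
import json

# Sketch of the config section; every value here is a placeholder.
config_section = {
    "username": "dummy",                    # assumption: stand-in user name
    "#password": "YOUR-MAILCHIMP-API-KEY",  # leading '#' marks the value for encryption
    "outputBucket": "ge-tutorial",          # inferred from in.c-ge-tutorial.* table names
    "jobs": [],                             # endpoints are added in the next step
}


def encrypted_keys(section):
    """Return the property names marked for encryption by a leading '#'."""
    return [key for key in section if key.startswith("#")]


print(json.dumps(config_section, indent=4))
print(encrypted_keys(config_section))
```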
Once you set up the Base Configuration, you can set up the actual endpoint to be queried.
Start by clicking the + New Endpoint button:
You will be asked to provide the relative endpoint URL path. In our case, we will use the campaigns endpoint.
The path is relative to the Base URL you set up in the Base Configuration section, so do not start it with a slash. A path such as /campaigns would resolve to https://us13.api.mailchimp.com/campaigns, which is invalid (it is missing the 3.0 part). An alternative would be to put /3.0/campaigns in the endpoint property.
Now you are getting close to a runnable configuration, and you may proceed with testing the configuration by clicking the TEST ENDPOINT button:
In the test endpoint popup, you will see the following sections:
- Records: The actual data that will be used for parsing.
- Response: The response from the API. It includes headers, status code, and response body in the data property.
- Request: The request that has been sent to the API.
- Debug log: A log outputted by the component for debugging purposes.
In the Records section, you will now see the following:
[
"The root element of the response is not a list; please change your Data Selector path to list"
]
Also, if you try to run this configuration, you will get an error similar to this:
The response contains more than one array! Use the 'dataField' parameter to specify a key to the data array.
(endpoint: campaigns, arrays in the response root: campaigns, _links)
That means that the extractor got the response but cannot automatically process it. The Data Selector path doesn’t point to an array.
Examine the data attribute of the response, and you will see the following properties: campaigns, total_items, and _links.
Generic Extractor expects the response to be an array of items. If it receives an object, it searches its properties to find an array. If it finds more than one array, it cannot tell which one you want.
To fix this, set the Data Selector parameter (aka dataField) to campaigns so that it points to the array of items you want to extract.
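The selection logic described above can be approximated in a few lines of Python. This is a simplified sketch, not the component's actual code: the select_records function, its error messages, and the abbreviated sample response are all made up for illustration, and only a top-level key is supported as the data field.

```python
def select_records(response_body, data_field=None):
    """Pick the list of records from a parsed JSON response body."""
    if isinstance(response_body, list):
        return response_body  # the root is already a list of items

    if data_field is not None:
        records = response_body[data_field]
        if not isinstance(records, list):
            raise ValueError(f"'{data_field}' does not point to a list")
        return records

    # No data field given: look for exactly one array among the root properties.
    arrays = [key for key, value in response_body.items() if isinstance(value, list)]
    if len(arrays) != 1:
        raise ValueError(f"cannot pick one array automatically, candidates: {arrays}")
    return response_body[arrays[0]]


# Abbreviated, made-up response with the three root properties named above.
sample = {
    "campaigns": [{"id": "a1"}, {"id": "b2"}],
    "total_items": 2,
    "_links": [{"rel": "self", "href": "https://example.com/campaigns"}],
}

print(select_records(sample, data_field="campaigns"))
```

Calling select_records(sample) without a data field raises an error in this sketch, because both campaigns and _links are arrays; naming campaigns explicitly removes the ambiguity.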
Now, run the configuration by clicking the Run button and go to the job details to see what happened:
The extraction produced two tables. The in.c-ge-tutorial.campaigns table contains all the fields of a campaign and as many rows as you have campaigns.
The table in.c-ge-tutorial.campaigns__links contains the contents of the _links property.
Because the _links property is a nested array within a single campaign object, it cannot be easily represented in a single column of the campaigns table. Generic Extractor, therefore, replaces the column value with a generated key, for example, campaigns_75d5b14d79d034cd07a9d95d5f0ca5bd, and automatically creates a new table that has the column JSON_parentId with that value so that you can join the tables together.
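To make the parent and child relationship concrete, the following Python sketch splits nested _links arrays into a second table and then joins the two tables back together. Only the campaigns_ key prefix and the JSON_parentId column name come from the text above. The split_nested function, the sample rows, and the way the key is derived (an MD5 hash of the serialized row) are assumptions chosen for illustration; the tutorial does not say how Generic Extractor generates its keys.

```python
import hashlib
import json


def split_nested(rows, parent_name, nested_field):
    """Move a nested array into a child table linked by a generated key."""
    parents, children = [], []
    for row in rows:
        # Hypothetical key scheme: table name plus a hash of the whole row.
        serialized = json.dumps(row, sort_keys=True).encode("utf-8")
        key = f"{parent_name}_{hashlib.md5(serialized).hexdigest()}"

        parent = {k: v for k, v in row.items() if k != nested_field}
        parent[nested_field] = key  # the column now holds the key, not the array
        parents.append(parent)

        for item in row.get(nested_field, []):
            children.append({**item, "JSON_parentId": key})
    return parents, children


# Made-up campaign objects with a nested _links array.
campaigns = [
    {"id": "a1", "_links": [{"rel": "self"}, {"rel": "parent"}]},
    {"id": "b2", "_links": [{"rel": "self"}]},
]

campaign_rows, link_rows = split_nested(campaigns, "campaigns", "_links")

# Join the child rows back to their parents through the generated key.
parents_by_key = {parent["_links"]: parent for parent in campaign_rows}
joined = [(parents_by_key[link["JSON_parentId"]]["id"], link["rel"]) for link in link_rows]
print(joined)
```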
The main parts of the configuration and their nesting are shown in the following schema:
The resulting JSON configuration will look like this:
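As a sketch, the pieces described in this tutorial could be assembled as follows. The structure is written as a Python dictionary and printed as JSON. The property names baseUrl, authentication, username, #password, outputBucket, jobs, endpoint, and dataField come from the text; the parameters wrapper, the "type": "basic" value, and the credential placeholders are assumptions.

```python
import json

configuration = {
    "parameters": {  # assumption: top-level wrapper key
        "api": {
            "baseUrl": "https://us13.api.mailchimp.com/3.0/",
            "authentication": {"type": "basic"},  # assumption: value name
        },
        "config": {
            "username": "dummy",                    # placeholder user name
            "#password": "YOUR-MAILCHIMP-API-KEY",  # placeholder, encrypted on save
            "outputBucket": "ge-tutorial",
            "jobs": [
                {
                    "endpoint": "campaigns",   # path appended to baseUrl
                    "dataField": "campaigns",  # key of the array in the response
                }
            ],
        },
    }
}

print(json.dumps(configuration, indent=4))
```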
Important: It may seem confusing that the endpoint and dataField properties are both set to campaigns.
This is just a coincidence; the endpoint property refers to the campaigns in the resource URL, and the dataField refers to the campaigns property in the JSON retrieved as the API response.
The above tutorial demonstrates a very basic configuration of Generic Extractor. The extractor is capable of doing much more; see other parts of this tutorial for an explanation of pagination, jobs, and mapping.