Before you start configuring Generic Extractor, you should have a basic understanding of REST APIs and the JSON format. This tutorial uses the MailChimp API, so have its documentation at hand. You also need a MailChimp API key.
The main parts of the configuration and their nesting are shown in the following schema:
The first configuration part is the api section, where you set the basic properties of the API. In the simplest case, these are the baseUrl property and authentication, as shown in this JSON snippet:
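A minimal sketch of such an api section follows. The baseUrl is inferred from the MailChimp URL used later in this tutorial (the us13 data center belongs to the example account and will differ for yours), and basic is assumed as the authentication type because the configuration supplies a username and a password:

```json
{
    "api": {
        "baseUrl": "https://us13.api.mailchimp.com/3.0/",
        "authentication": {
            "type": "basic"
        }
    }
}
```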
Important: Make sure that the baseUrl URL ends with a slash!
The config section describes the actual extraction. Its most important parts are the outputBucket and jobs properties. outputBucket must be set to the ID of the Storage Bucket where the data will be stored. If no bucket exists, it will be created. The config section also contains the authentication parameters username and #password. Start with this configuration section:
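As a sketch, the config section at this stage could look as follows. The bucket name ge-tutorial is inferred from the table names that appear later in this tutorial. The username and API key values are placeholders, on the assumption that MailChimp accepts any username as long as the API key is sent as the password:

```json
{
    "config": {
        "username": "dummy",
        "#password": "your-mailchimp-api-key",
        "outputBucket": "ge-tutorial"
    }
}
```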
The password property is prefixed with the hash mark #, which means that the value will be encrypted once you save the configuration.
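The jobs section is the most complex part of the whole configuration. The first part of the jobs configuration is the endpoint property, which holds the resource URL relative to baseUrl (campaigns in this tutorial).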
Important: Make sure not to start the URL with a slash. If you do so, the URL will be absolute from the domain: https://us13.api.mailchimp.com/campaigns, which is not valid (it is missing the 3.0 part). An alternative would be to put /3.0/campaigns in the endpoint property.
Now you are getting close to a runnable configuration:
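Putting the pieces together, the configuration might look like the sketch below. The top-level parameters wrapper is an assumption about how Generic Extractor expects the api and config sections to be nested, and the credential values are placeholders to replace with your own:

```json
{
    "parameters": {
        "api": {
            "baseUrl": "https://us13.api.mailchimp.com/3.0/",
            "authentication": {
                "type": "basic"
            }
        },
        "config": {
            "username": "dummy",
            "#password": "your-mailchimp-api-key",
            "outputBucket": "ge-tutorial",
            "jobs": [
                {
                    "endpoint": "campaigns"
                }
            ]
        }
    }
}
```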
If you run this configuration, you will get an error similar to this:
More than one array found in the response! Use the 'dataField' parameter to specify a key to the data array. (endpoint: campaigns, arrays in the response root: campaigns, _links)
This means that the extractor got the response, but cannot automatically process it. Examine the sample response in the documentation, and you will see that it is an object with three items:
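In outline, the response has roughly the following shape. The campaigns and _links names come from the error message above. The total_items name is assumed from the MailChimp documentation, and the values are empty placeholders rather than real data:

```json
{
    "campaigns": [],
    "total_items": 0,
    "_links": []
}
```

Two of the three items are arrays, which is why the extractor cannot pick one on its own.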
Generic Extractor expects the response to be an array of items. If it receives an object, it
searches through its properties to find an array. If it finds multiple arrays, it becomes confused
because it is unclear which array you want. To fix this, add the dataField parameter, as the error message suggests:
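A sketch of the complete configuration with dataField added to the job is shown below. The top-level parameters wrapper is an assumption about the expected nesting, and the username and API key are placeholders:

```json
{
    "parameters": {
        "api": {
            "baseUrl": "https://us13.api.mailchimp.com/3.0/",
            "authentication": {
                "type": "basic"
            }
        },
        "config": {
            "username": "dummy",
            "#password": "your-mailchimp-api-key",
            "outputBucket": "ge-tutorial",
            "jobs": [
                {
                    "endpoint": "campaigns",
                    "dataField": "campaigns"
                }
            ]
        }
    }
}
```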
Important: It may seem confusing that both the endpoint and dataField properties are set to campaigns. This is just a coincidence; the endpoint property refers to campaigns in the resource URL, whereas dataField refers to the campaigns property in the JSON retrieved as the API response.
Now run the above configuration by simply pasting it into the Generic Extractor configuration field:
Notice that when you save the configuration, the #password property gets encrypted.
Hit the Run button and go to the job details to see what happened:
The extraction produced two tables. The in.c-ge-tutorial.campaigns table contains all the fields of a campaign, and as many rows as you have campaigns. The in.c-ge-tutorial.campaigns__links table contains the contents of the _links property. Because the _links property is a nested array within a single campaign object, it cannot be easily represented in a single column of the campaigns table. Generic Extractor therefore replaces the column value with a generated key, for example, campaigns_75d5b14d79d034cd07a9d95d5f0ca5bd, and automatically creates a new table which has the column JSON_parentId with that value so that you can join the tables together.
The above tutorial demonstrates a very basic configuration of Generic Extractor. The extractor is capable of doing much more; see other parts of this tutorial for an explanation of pagination, jobs and mapping.