Basic Configuration

Before you start configuring Generic Extractor, you should have a basic understanding of REST API and JSON format. This tutorial uses the MailChimp API, so have its documentation at hand. You also need the MailChimp API key.

Configuration

Generic Extractor configuration is written in JSON format and is composed of several sections (a configuration map for navigation is available).

The main parts of the configuration and their nesting are shown in the following schema:

Schema - Generic Extractor configuration

API Section

The first configuration part is the api section where you set the basic properties of the API. In the most simple case, this is the baseUrl property and authentication, as shown in this JSON snippet:

{
    "api": {
        "baseUrl": "https://us13.api.mailchimp.com/3.0/",
        "authentication": {
            "type": "basic"
        }
    }
}

Important: Make sure that the baseUrl URL ends with a slash!

Configuration Section

The config section describes the actual extraction. Its most important parts are the outputBucket and jobs properties. outputBucket must be set to the id of the Storage Bucket where the data will be stored. If no bucket exists, it will be created.

It also contains the authentication parameters username and password. Start with this configuration section:

"config": {
    "username": "dummy",
    "#password": "c40xxxxxxxxxxxxxxxxxxxxxxxxxxxxx-us13",
    "outputBucket": "ge-tutorial",
    "jobs": []
}

The password property is prefixed with the hash mark #, which means that the value will be encrypted once you save the configuration.

Jobs Section

The jobs section is the most complex part of the whole configuration. The first part of the jobs configuration is the endpoint:

"jobs": [
    {
        "endpoint": "campaigns"
    }
]

Important: Make sure not to start the URL with a slash. If you do so, the URL will be absolute from the domain: https://us13.api.mailchimp.com/campaigns, which is not valid (it is missing the 3.0 part). An alternative would be to put /3.0/campaigns in the endpoint property.

Now you are getting close to a runnable configuration:

{
    "parameters": {
        "api": {
            "baseUrl": "https://us13.api.mailchimp.com/3.0/",
            "authentication": {
                "type": "basic"
            }
        },
        "config": {
            "username": "dummy",
            "#password": "c40xxxxxxxxxxxxxxxxxxxxxxxxxxxxx-us13",
            "outputBucket": "ge-tutorial",
            "jobs": [
                {
                    "endpoint": "campaigns"
                }
            ]
        }
    }
}

If you run this configuration, you will get an error similar to this:

More than one array found in the response! Use the 'dataField' parameter to specify a key to the data array.
(endpoint: campaigns, arrays in the response root: campaigns, _links)

This means that the extractor got the response, but cannot automatically process it. Examine the sample response in the documentation, and you will see that it is an object with three items: campaigns, total_items and _links:

{
  "campaigns": [
    {
      "id": "42694e9e57",
      "type": "regular",
      ...
    },
    {
      "id": "f6276207cc",
      "type": "regular",
      ...
    }
  ],
  "total_items": 2,
  "_links": [
    {
      "rel": "parent",
      "href": "https://usX.api.mailchimp.com/3.0/",
      "method": "GET",
      "targetSchema": "https://api.mailchimp.com/schema/3.0/Root.json"
    },
    {
      "rel": "self",
      "href": "https://usX.api.mailchimp.com/3.0/campaigns",
      "method": "GET",
      "targetSchema": "https://api.mailchimp.com/schema/3.0/Campaigns/Collection.json",
      "schema": "https://api.mailchimp.com/schema/3.0/CollectionLinks/Campaigns.json"
    }
  ]
}

Generic Extractor expects the response to be an array of items. If it receives an object, it searches through its properties to find an array. If it finds multiple arrays, it becomes confused because it is unclear which array you want. To fix this, add the dataField parameter as the error message suggests:

{
    "parameters": {
        "api": {
            "baseUrl": "https://us13.api.mailchimp.com/3.0/",
            "authentication": {
                "type": "basic"
            }
        },
        "config": {
            "username": "dummy",
            "#password": "c40xxxxxxxxxxxxxxxxxxxxxxxxxxxxx-us13",
            "outputBucket": "ge-tutorial",
            "jobs": [
                {
                    "endpoint": "campaigns",
                    "dataField": "campaigns"
                }
            ]
        }
    }
}

Important: It may seem confusing that both the endpoint and dataField properties are set to campaigns. This is just a coincidence; the endpoint property refers to the campaigns in the resource URL. The dataField refers to the campaigns property in the JSON retrieved as the API response.

Now run the above configuration by simply pasting it into the Generic Extractor configuration field:

Screenshot - Generic Extractor configuration

Notice that when you save the configuration, the #password property gets encrypted. Hit the Run button and go to the job details to see what happened:

Screenshot - Generic Extractor job

The extraction produced two tables. The in.c-ge-tutorial.campaigns table contains all the fields of a campaign, and as many rows as you have campaigns.

Screenshot - Campaigns Table

The table in.c-ge-tutorial.campaigns__links contains the contents of the _links property. Because the _links property is a nested array within a single campaign object, it cannot be easily represented in a single column of the campaigns table. Generic Extractor therefore replaces the column value with a generated key, for example, campaigns_75d5b14d79d034cd07a9d95d5f0ca5bd, and automatically creates a new table which has the column JSON_parentId with that value so that you can join the tables together.

Summary

The above tutorial demonstrates a very basic configuration of Generic Extractor. The extractor is capable of doing much more; see other parts of this tutorial for an explanation of pagination, jobs and mapping:

  • Pagination — breaks a result with a large number of items into separate pages.
  • Jobs — describe the API endpoints (resources) to be extracted.
  • Mapping — describes how the JSON response is converted into CSV files that will be imported into Storage.