Mapping Tutorial

In the previous part of the tutorial, you extracted the content of a MailChimp campaign. Now it’s time to clean up the response.

This is the initial configuration:

{
    "parameters": {
        "api": {
            "baseUrl": "https://us13.api.mailchimp.com/3.0/",
            "authentication": {
                "type": "basic"
            },
            "pagination": {
                "method": "offset",
                "offsetParam": "offset",
                "limitParam": "count",
                "limit": 1
            }
        },
        "config": {
            "debug": true,
            "username": "dummy",
            "#password": "c40xxxxxxxxxxxxxxxxxxxxxxxxxxxxx-us13",
            "outputBucket": "ge-tutorial",
            "jobs": [
                {
                    "endpoint": "campaigns",
                    "dataField": "campaigns",
                    "children": [
                        {
                            "endpoint": "campaigns/{campaign_id}/send-checklist",
                            "dataField": "items",
                            "placeholders": {
                                "campaign_id": "id"
                            }
                        },
                        {
                            "endpoint": "campaigns/{campaign_id}/content",
                            "dataField": ".",
                            "placeholders": {
                                "campaign_id": "id"
                            }
                        }
                    ]
                }
            ]
        }
    }
}

It extracts MailChimp campaigns together with the send-checklist items and campaign content. However, there are some parts of the content resource you are probably not interested in. Also,the table contains duplicates.

Technical note on duplicates: If you examine the job events, you will see the request GET /3.0/campaigns/f7ed43aaea/content?count=1&offset=0 sent. That is to say, the pagination applies to all API requests. Generic Extractor tries to page the unpaged /content resource. This may ultimately lead to duplicates because the extraction of that resource is terminated only after the resource returns the same response twice.

Mapping

A mapping defines the shape of Generic Extractor outputs. It is stored in the config.mappings property and is identified by the resource data type. When a resource is assigned an internal dataType, a mapping can be created for it. To be able to use a mapping, first define a dataType in the job property. For example:

{
    "endpoint": "campaigns/{campaign_id}/content",
    "dataField": ".",
    "dataType": "content",
    "placeholders": {
        "campaign_id": "id"
    }
}

The value of the dataType property is an arbitrary name. Apart from identifying the resource type, it is also used as the output table name. If you run the job, the content will be stored in in.c-ge-tutorial.content.

Each mapping item is identified by the property name of the resource and must contain mapping.destination with the target column name in the output table. For example:

"mappings": {
    "content": {
        "plain_text": {
            "mapping": {
                "destination": "text"
            }
        }

The above mapping setting defines that for the content data type, the resource property plain_text will be stored in the table column text. No other properties of the content resource will be imported. In other words, the mapping defines all columns of the output table.

To give an example, if you are interested in having the plain_text and html versions of the campaign content, use a mapping like this:

"mappings": {
    "content": {
        "plain_text": {
            "mapping": {
                "destination": "text"
            }
        },
        "html": {
            "mapping": {
                "destination": "html"
            }
        }
    }
}

Note that the destination value is arbitrary, but it must be a valid column name. The data type name (content) must match the value of the dataType property as defined in some of the jobs.

Parent Reference

The above mapping works, but is missing the campaign id and you would not be able to match the content to some campaign records. Therefore you need to extract the campaign id from the context (i.e. from the job parameter). This can be done using a special user mapping.

When the mapping type is set to user, use the special prefix parent_ to refer to a placeholder defined in the job. You can create the following mapping:

"mappings": {
    "content": {
        "parent_id": {
            "type": "user",
            "mapping": {
                "destination": "campaign_id"
            }
        }
    }
}

The above configuration defines a mapping for the content data type. In the result table named content, the column campaign_id will be created. Its content will be the value of the id placeholder (parent_id minus the parent_ prefix) in the respective job.

Apart from specifying what columns should be present in the output table, the mapping allows you to set a column to be part of a primary key. The entire configuration would then look like this:

{
    "parameters": {
        "api": {
            "baseUrl": "https://us13.api.mailchimp.com/3.0/",
            "authentication": {
                "type": "basic"
            },
            "pagination": {
                "method": "offset",
                "offsetParam": "offset",
                "limitParam": "count",
                "limit": 10
            }
        },
        "config": {
            "debug": true,
            "username": "dummy",
            "#password": "c40xxxxxxxxxxxxxxxxxxxxxxxxxxxxx-us13",
            "outputBucket": "ge-tutorial",
            "jobs": [
                {
                    "endpoint": "campaigns",
                    "dataField": "campaigns",
                    "children": [
                        {
                            "endpoint": "campaigns/{campaign_id}/send-checklist",
                            "dataField": "items",
                            "placeholders": {
                                "campaign_id": "id"
                            }
                        },
                        {
                            "endpoint": "campaigns/{campaign_id}/content",
                            "dataField": ".",
                            "dataType": "content",
                            "placeholders": {
                                "campaign_id": "id"
                            }
                        }
                    ]
                }
            ],
            "mappings": {
                "content": {
                    "parent_id": {
                        "type": "user",
                        "mapping": {
                            "destination": "campaign_id",
                            "primaryKey": true
                        }
                    },
                    "plain_text": {
                        "mapping": {
                            "destination": "text"
                        }
                    },
                    "html": {
                        "mapping": {
                            "destination": "html"
                        }
                    }
                }
            }
        }
    }
}

Review

Now, let’s review what parts are connected and how. Note that the values in blue have been chosen arbitrarily when the configuration was created:

Configuration Schema

Summary

Mapping lets you define precisely what the extraction output will look like; it also defines primary keys.

If you are doing a one time ad-hoc extraction, you may skip setting up the mapping and clean the extracted data later in Transformations. However, if you intend to use your configuration regularly, or want to make it into its own component, setting up a mapping is recommended.