If new to Generic Extractor, learn about jobs in our tutorial first. Use Parameter Map to help you navigate among various configuration options.
The jobs section of the extraction configuration contains descriptions of the API resources to be
extracted. The jobs
configuration property is an array of processed API endpoints. A
single job represents a single API resource.
Generic Extractor reads and processes the responses from the API endpoints in a pretty complex
way. Each response is processed in the following steps:
dataField
property or default rules.The first two steps are the responsibility of Jobs resulting in an array of objects. Generic Extractor then tries to find a common super-set of properties of all objects, for example, with the following response:
The super-set of object properties consists of id
, name
, color
and size
. In the Generic Extractor
configuration, this is referred to as dataType
. If the dataType
configuration is not set, a
name is automatically generated. Merging the object structure requires that the objects are compatible.
The responses are merged into type-less tables. This means that values 42
and apples
are perfectly compatible
because they get converted to a string. Also, the scalar and array values are compatible because the
scalar is upgraded to an array. The following are incompatible combinations:
For example, this would not be allowed:
If you want to process the above response, use the
responseFilter
setting.
The endpoint property is required and represents the URL of the resource. It can be either of the following:
baseURL
property of the API definitionbaseURL
property of the API definitionAssume the following API definition:
The relative endpoint must not start with a slash; so, with
endpoint
set to campaign
, the final resource URL would be
https://example.com/3.0/campaign
.
The absolute endpoint must start with a slash. So, with /endpoint
set to campaign
, the final resource URL would be https://example.com/campaign
.
This means that the path part specified in the baseURL
is ignored and fully
replaced by the value specified in endpoint
.
The full absolute URL must start with a protocol. So, with the endpoint set to
https://eu.example.com/campaign
, this would be the final resource URL
and the path specified in the baseURL
is completely ignored.
The following table summarizes possible outcomes:
baseURL |
endpoint |
actual URL |
---|---|---|
https://example.com/3.0/ |
campaign |
https://example.com/3.0/campaign |
https://example.com/3.0/ |
campaign/ |
https://example.com/3.0/campaign/ |
https://example.com/3.0/ |
/1.0/campaign |
https://example.com/1.0/campaign |
https://example.com/3.0/ |
https://eu.example.com/3.0/ |
https://eu.example.com/3.0/campaign |
https://example.com/ |
campaign |
https://example.com/campaign |
https://example.com |
campaign |
https://example.comcampaign |
https://example.com/ |
https://elpmaxe.com/endpoint |
https://elpmaxe.com/endpoint |
It is highly recommended to use the relative URL fragments. This means that the
baseURL
property of the api
section must end with a slash.
Use the other two options for handling exceptions in the API extraction (for instance, falling back to an older API version). Note that using a different domain (or even a base path) may interfere with the authentication — depending on the specification of the target API.
Also, closely follow the target API specification regarding trailing slashes. For some APIs,
both https://example.com/3.0/campaign
and https://example.com/3.0/campaign/
URLs may
be accepted and valid. For other APIs, however, only one version may be supported.
The params
section defines request parameters. They
may be optional or required, depending on the target API specification. The params
section is
an object with arbitrary properties (or, more precisely, parameters understood by the target
API). It is also allowed to use function calls.
Assume that api.baseUrl
is set to https://example.com/3.0/
, jobs[].endpoint
is set to mock-api
and that the param
parameters are set as follows:
See our examples.
The method
parameter defines the HTTP request method.
The following are the allowed values:
GET
(default)POST
FORM
The HTTP method encodes the parameters in the URL. Therefore the above params
definition gets transformed
in the following URL:
https://example.com/3.0/mock-api?startDate=2016-01-20&types%5B0%5D=new&types%5B1%5D=active&types%5B2%5D=finished&filter%5Bquery%5D=q%3Duser%3AjohnDoe&filter%5Btags%5D%5Bfirst%5D=1&filter%5Btags%5D%5Bsecond%5D=0
or, in a more readable URLDecoded form:
https://example.com/3.0/mock-api?/mock-server/web/users/12/orders/2/tickets/000/comments?startDate=2016-01-20&types[0]=new&types[1]=active&types[2]=finished&filter[query]=q=user:johnDoe&filter[tags][first]=1&filter[tags][second]=0
The HTTP POST method sends the parameters in the request body. They are sent as a JSON object in the same form
as entered in the configuration. For the above defined params
property, the request body would be:
Also, the Content-Type: application/json
HTTP header is added to the request.
See our examples.
The FORM
method type sends the request the same way the HTTP POST method does. However,
the parameters from the param
object are encoded as form data, mimicking the request being sent by
a web form. This method does not support nested objects in the param
object.
For example, the following params
field:
will be sent as the following POST request body:
startDate=2016-01-20&types%5B0%5D=new&types%5B1%5D=active&types%5B2%5D=finished
or, in a more readable URLDecoded form:
startDate=2016-01-20&types[0]=new&types[1]=active&types[2]=finished
Also, the Content-Type: application/x-www-form-urlencoded
HTTP header will be added to the request.
The dataType
parameter assigns a name to the object(s) obtained from the endpoint.
Setting it is optional. If not set, a name will be generated automatically from the endpoint
value and parent jobs.
Data types are used in mappings and for naming output tables within their output buckets.
Note that you can use the same dataType
for multiple resources, provided that the result objects may
be merged into a single one. This can be used,
for example, in a situation where two API endpoints return the same resource:
In the above case, only a single tickets
table will be produced in the output bucket. It
will contain records from both API endpoints.
The dataField
parameter is used to determine what part of the API response will be
extracted. The following rules apply by default:
dataField
is configured.Apart from cases where required, the dataField
configuration may also be set to override the
above default behaviour. The dataField
parameter contains a
dot separated path to the response property you want to
extract. The dataField
parameter may be written in two ways — either as a simple string or
as an object with the path
property. For instance, these two configurations are equivalent:
The path to the response property is by default expected to be dot separated. That is — a path
members.active
refers to the property active
nested inside the property members
. If you need to refer to a
property containing a dot, you have to change the data field path delimiter to some other character. This can be
done using the delimiter
property:
The above configuration refers to the property named members.active
. To refer to the property items
nested
inside the property members.active
you have to use:
The delimiter
character is completely arbitrary but must be something that is not used in the property names in the response.
See example [EX120].
The responseFilter
option allows you to skip parts of the API response from processing. This can
be useful in these cases:
The value of the responseFilter
property is either a path to a property in the response, or
an array of such paths. The path is dot-separated unless set otherwise in the responseFilterDelimiter
configuration.
If you want to refer to the items of an array, use []
— see an example below.
The same result can be achieved using forceType
parameter in
column mapping.
The children
configuration allows you to retrieve sub-resources of the processes API resource.
These child jobs (nested jobs) are executed for each object retrieved from the
parent response. The definition of child jobs is the same as the definition of parent jobs,
except for placeholders. The children configuration is described in a
separate article.
The scroller
parameter assigns a predefined scroller when
multiple
pagination is used,
and is pointless when the multiple
pagination method is not used.
If scroller
is not set, the pagination method specified in the api
configuration
is used. If there is no pagination method specified, the job has no pagination.
The following examples show how simple objects are extracted from different objects.
To extract data from the following API response:
do not set the dataField
parameter at all, or set it to an empty string ("dataField": ""
).
The following table will be extracted:
id | name | married |
---|---|---|
123 | John Doe | 1 |
234 | Jane Doe |
Notice that the boolean value married
is converted
to 1
when true and left empty otherwise (false
and null
).
See example [EX001].
To extract data from the following API response:
do not set the dataField
parameter at all, or set it to an empty string or to the value users
("dataField": ""
or "dataField": "users"
).
The following table will be extracted:
id | name |
---|---|
123 | John Doe |
234 | Jane Doe |
See example [EX002].
To extract data from the following API response:
set the dataField
parameter to the value users
("dataField": "users"
). Not setting the
dataField
parameter would result in an error
(More than one array found in the response! Use the 'dataField' parameter to specify a key to the data array.
).
The following table will be extracted:
id | name |
---|---|
123 | John Doe |
234 | Jane Doe |
See example [EX003].
To extract data from the following API response:
set the dataField
parameter to the value members.active
("dataField": "members.active"
). Not setting the
dataField
parameter would result in a warning (No data array found in the response!
).
The following table will be extracted:
id | name |
---|---|
123 | John Doe |
234 | Jane Doe |
See example [EX004].
To extract both active
and inactive
arrays from the above API response, use two jobs:
In the first job, set the dataField
parameter to the value members.active
. In the second job, set
the dataField
parameter to the value members.inactive
. The entire jobs
section will look like this:
The following table will be extracted:
id | name |
---|---|
123 | John Doe |
234 | Jane Doe |
345 | Jimmy Doe |
See example [EX005].
You may encounter an API response like this:
You have to set the dataField
parameter to the value .
("dataField": "."
). Not setting the
dataField
parameter would result in a warning (No data array found in the response!
) and no data extracted.
The following table will be extracted:
id | name |
---|---|
123 | John Doe |
See example [EX006].
You may encounter an API response like this:
Set the dataField
parameter to the value user
("dataField": "user"
). Not setting the
dataField
parameter would result in a warning (No data array found in the response!
) and no data extracted.
The following table will be extracted:
id | name |
---|---|
123 | John Doe |
See example [EX007].
You may encounter an API response like this:
To extract the first item from the history
array, set the dataField
parameter to the value member.history.0
.
The following table will be extracted:
id | name | version |
---|---|---|
123 | John Doe | 2 |
See example [EX008].
You may encounter an API response like this:
To extract the items
from the members
array, set the dataField
parameter to the value members.0.items
.
The following table will be extracted:
id | name |
---|---|
123 | John Doe |
234 | Jane Doe |
See example [EX009].
The above examples show how simple objects are extracted from different objects. Generic Extractor can also extract objects with non-scalar properties. The default JSON to CSV mapping flattens nested objects and produces secondary tables from nested arrays.
You may encounter an API response like this:
To extract the members
array, set the dataField
parameter to the value members
or to an empty value.
The following tables will be extracted:
Users:
id | name | tags |
---|---|---|
123 | John Doe | users-10_3ca896f39b257a4f2d2f4784e7680c87 |
234 | Jane Doe | users-10_a15f4be71e739e1b2ea32bd4209d756e |
Tags:
data | JSON_parentId |
---|---|
active | users-10_3ca896f39b257a4f2d2f4784e7680c87 |
admin | users-10_3ca896f39b257a4f2d2f4784e7680c87 |
active | users-10_a15f4be71e739e1b2ea32bd4209d756e |
Each member contains a nested array of tags
that cannot be serialized into a single
database (CSV) column. Therefore the [JSON-CSV mapper] creates another table for the
tags
with tag values. It also generates a unique member identifier, puts it
in the tags
column and uses it in a new JSON_parentId
column. This
way, the 1:N relationship between Members and Tags is represented.
See example [EX010].
If the response contains an array nested in an array, Generic extractor is not able to process it. In such case the, contents of the array are extracted as a JSON encoded string. See example [EX130]. You will also see a warning in the extraction events, e.g.:
Converting nested array 'rows.[]' to JSON string.
You may encounter the following API response:
When you extract the members
array (set the dataField
parameter to the value members
or to an empty value),
the following tables will be extracted:
Users:
id | name | tags |
---|---|---|
123 | John Doe | users-17_c6f3e32262682b6efd6c85ad97d2d503 |
234 | Jane Doe | users-17_92df9d5b9af8821316172285b196318e |
Tags:
data | JSON_parentId |
---|---|
active | users-17_c6f3e32262682b6efd6c85ad97d2d503 |
active | users-17_92df9d5b9af8821316172285b196318e |
admin | users-17_92df9d5b9af8821316172285b196318e |
As you can see, the scalar value tags
in the first member object was automatically upgraded to
a single-element array because the tags
property is an array elsewhere (second member) in the response.
See example [EX017].
You may encounter an API response like this:
To extract the members
array, set the dataField
parameter to the value members
or to an empty value.
The following table will be extracted:
id | name | address_street | address_city | address_state |
---|---|---|---|---|
123 | John Doe | Elm Street | New York | |
234 | Jane Doe | Bates Street | Chicago | USA |
The properties of nested address
objects are automatically flattened into the parent object. Therefore
the address.city
property is flattened into the address_city
column.
See example [EX011].
The above two examples show the basic principles of the JSON-CSV mapping used by Generic Extractor. They are applied to all child properties. So, when you encounter an API response like this:
the following two tables will be extracted:
Users:
id | name | contacts |
---|---|---|
123 | John Doe | users-12_8505d6585e28c00d461ba64f085d1055 |
234 | Jane Doe | users-12_ec8c48efecb10334072f03a860113ea2 |
Contacts:
type | properties_street | properties_city | properties_address | properties_state | properties_number | primary | JSON_parentId |
---|---|---|---|---|---|---|---|
address | Elm Street | New York | users-12_8505d6585e28c00d461ba64f085d1055 | ||||
john.doe@example.com | 1 | users-12_8505d6585e28c00d461ba64f085d1055 | |||||
address | Bates Street | Chicago | USA | users-12_ec8c48efecb10334072f03a860113ea2 | |||
phone | 123 456 789 | 1 | users-12_ec8c48efecb10334072f03a860113ea2 |
The obtained table is rather sparse because the properties of the nested contacts
objects do not match exactly. For example, the properties_number
column was created
as a result of flattening the properties.number
object that is contained in the response
only once. Therefore the column has a single value.
The rows in the Contacts table are again linked through an
auto-generated key to the parent Users table. Also notice that the
Boolean value
primary
is converted to 1
when true and left empty otherwise.
See example [EX012].
If you have an API response like this:
and extract the members
array with the
default settings, two tables will be
produced. If you set the response filter to "responseFilter": "tags"
, then the tags
property of the members
items will not be processed and will be stored as a serialized
JSON string. The following table will be extracted:
id | name | tags |
---|---|---|
123 | John Doe | [“active”,”admin”] |
234 | Jane Doe | [“active”] |
The tags
column contains serialized JSON fragments which can be processed by
the JSON capable database (e.g., Snowflake).
See example [EX013].
If you have the following API response:
and extract the members
array with the
default settings,
two tables will be produced and the properties
object will be flattened into a sparse table.
To avoid that, set the response filter to "responseFilter": "contacts[].properties"
. This will
leave the properties
child of the contacts
array of the members
array unprocessed.
The following two tables will be produced:
Users:
id | name | contacts |
---|---|---|
123 | John Doe | users-12_0b9650e0f68b0c6738843d5b4ff0a961 |
234 | Jane Doe | users-12_cf76fb6794380244946d2bc4fa3aa04a |
Contacts:
type | properties | primary | JSON_parentId |
---|---|---|---|
address | {“street”:”Elm Street”,”city”:”New York”} | users-12_0b9650e0f68b0c6738843d5b4ff0a961 | |
{“address”:”john.doe@example.com”} | 1 | users-12_0b9650e0f68b0c6738843d5b4ff0a961 | |
address | {“street”:”Bates Street”,”city”:”Chicago”,”state”:”USA”} | users-12_cf76fb6794380244946d2bc4fa3aa04a | |
phone | {“number”:”123 456 789”} | 1 | users-12_cf76fb6794380244946d2bc4fa3aa04a |
The properties
column contains JSON serialized objects. When setting the responseFilter
parameter,
remember to use the correct path to the properties you wish to skip from processing. That is to say that
setting responseFilter
to
contacts
skips the entire contacts
property and does not create the Contacts: table at all.properties
does nothing because there is no properties
property under the members
array items.contacts.properties
does nothing because there is no properties
property under the contacts
array.The last two options might seem inconsistent. This is because the responseFilter
path is set relative to
the objects of the processed array (not to the array itself, not to the JSON root). Thus the only correct
setting in this case is contacts[].properties
.
See example [EX014].
If you have an API response like this:
you will receive an error similar to Error parsing response JSON: Unhandled type change from "scalar" to "object" in 'users-16.color'
. This means that the objects returned in the response are incompatible and cannot
be merged into a table by Generic Extractor.
To avoid the error and still retrieve the data,
use the responseFilter
to skip the color
property. When you set "responseFilter": "color"
, you
will obtain the following table:
id | name | color |
---|---|---|
123 | foo | “green” |
321 | bar | {“items”:[“red”,”blue”]} |
See example [EX016].
You might have a complex API response like this:
Because both tags
and contacts.properties
properties are inconsistent (sometimes using an object,
sometimes using a scalar value), you have to define multiple response filters. This can be done by using
an array of paths:
Then you will obtain the following tables:
Users:
id | name | tags | contacts |
---|---|---|---|
123 | John Doe | {“items”:[“active”,”admin”]} | users-18_19318ac6aa76a92c8d90e603f69e02f6 |
234 | Jane Doe | “none” | users-18_3fdf6b12b11f85cb4eb9c34ce0322ecd |
Contacts:
type | properties | primary | JSON_parentId |
---|---|---|---|
address | {“street”:”Elm Street”,”city”:”New York”} | users-18_19318ac6aa76a92c8d90e603f69e02f6 | |
“john.doe@example.com” | 1 | users-18_19318ac6aa76a92c8d90e603f69e02f6 | |
address | {“street”:”Bates Street”,”city”:”Chicago”,”state”:”USA”} | users-18_3fdf6b12b11f85cb4eb9c34ce0322ecd | |
phone | “123 456 789” | 1 | users-18_3fdf6b12b11f85cb4eb9c34ce0322ecd |
See example [EX018].
The default delimiter used for referencing nested properties is a dot .
. If the names of
properties in the API response contain dots, it might be necessary to change the default delimiter.
The API response might look like this:
If you want to filter the secondary.address
field, you cannot set the responseFilter
setting to
secondary.address
because it would be interpreted as an address
property of the secondary
property.
If you set "responseFilter": "secondary.address
, the extraction will work as if you did not set the
filter at all; it will be filtering the non-existent address
property.
For the filter to work correctly, set the responseFilterDelimiter
to an arbitrary character not
used in the response property names. The following would be a valid configuration:
It might by tempting to change the response filter to secondary#address
. However, this would be
incorrect as it would again mean that we are referring to an address
property nested in the secondary
object. With the above settings you will obtain a table like this:
id | name | primary_address_street | primary_address_city | primary_address_state | secondary_address |
---|---|---|---|---|---|
123 | John Doe | Elm Street | New York | {“street”:”Cemetery Ridge”,”city”:”New York”} | |
234 | Jane Doe | Blossom Avenue | U.K. | {“street”:”1313 Webfoot Walk”,”city”:”Duckburg”,”state”:”Calisota”} |
See example [EX019].
For the custom set delimiter in the response filter, you need to have a complex API response. For example:
To filter out all the tags
properties, you need to set the following:
You will obtain a table similar to the one below:
id | name | primary_address_street | primary_address_city | primary_address_tags | primary_address_state | secondary_address_street | secondary_address_city | secondary_address_tags | secondary_address_state |
123 | John Doe | Elm Street | New York | Cemetery Ridge | New York | [“work”,”usaddress”] | |||
234 | Jane Doe | Blossom Avenue | [“home”] | U.K. | 1313 Webfoot Walk | Duckburg | Calisota |
See example [EX020].
Assume that you have an API with the endpoint users
which requires the
GET parameter type
to specify which
users are to be retrieved. For example, a request to /users?type=active
returns a response
with active users:
To retrieve inactive users, send a request to /users?type=inactive
. This
can be solved using the following jobs configuration:
The params
configuration option specifies the
parameters to be sent to the API. Therefore the type
property is the name defined by the API itself.
The above configuration produces the following table:
id | name | married |
---|---|---|
123 | John Doe | 1 |
234 | Jane Doe | |
345 | Jimmy Doe |
See example [EX033] or example [EX136] which is also using functions.
You may encounter an API which is not exactly RESTful
and has to be queried using the HTTP POST method.
Assume that you have an API with the endpoint getUsers
that expects an empty HTTP POST request. The endpoint
then returns the following response:
Generic Extractor can handle this too, using the method
configuration:
The above configuration produces the following table:
id | name | married |
---|---|---|
123 | John Doe | 1 |
234 | Jane Doe |
See example [EX034].
A not-exactly-RESTful API (see above) may require some JSON
parameters in the request. Let’s say you have the getUsers
endpoint which requires an HTTP POST request with
the following body:
The request returns the following JSON:
The above situation can be handled by passing the entire request JSON to the
params
configuration.
The above configuration produces the following table:
id | name |
---|---|
123 | John Doe |
234 | Jane Doe |
See example [EX035].
Sometimes even the HTTP GET requests require complex parameters. Suppose the API
endpoint /users
requires the filter
and return
definitions. The API may describe
the configuration in many different ways, for instance:
filter | name | example value |
---|---|---|
Name of property for filtering | field | type |
Filtering operator | operator | equal |
Value to use in filter | value | active |
return | name | example value |
---|---|---|
Names of properties to return in response | fields | id,name |
In the HTTP protocol, this would be encoded in the following query string:
filter[field]=type&filter[operator]=equal&filter[value]=active&return[fields][0]=id&return[fields][1]=name
or, in the URL Encoded form:
filter%5Bfield%5D%3Dtype%26filter%5Boperator%5D%3Dequal%26filter%5Bvalue%5D%3Dactive%26return%5Bfields%5D%5B0%5D%3Did%26return%5Bfields%5D%5B1%5D%3Dname
The following JSON is returned:
The above situation can be handled by encoding the parameters in a JSON into the
params
configuration.
The above configuration produces the following table:
id | name |
---|---|
123 | John Doe |
234 | Jane Doe |
See example [EX036].