To configure your first Generic Extractor, follow our tutorial. Use Parameter Map to help you navigate among various configuration options.
The `config` section of Generic Extractor configuration describes the actual extraction, including properties of HTTP requests, and mapping between source JSON and target CSV.

A sample `config` configuration can look like this:
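(An illustrative sketch only; the `api` section and the `users` endpoint are placeholders.)

```json
{
    "parameters": {
        "api": {
            "baseUrl": "https://example.com/api/"
        },
        "config": {
            "debug": false,
            "outputBucket": "ge-tutorial",
            "jobs": [
                {
                    "endpoint": "users"
                }
            ]
        }
    }
}
```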
Apart from the properties listed below, the `config` section can contain any number of other properties which are not used by Generic Extractor itself, but may be referenced from within functions.
The keys prefixed by the hash character `#` are automatically encrypted when the configuration is saved. It is advisable to store sensitive information in such fields. Note, however, that they are not automatic aliases to un-encrypted fields. That means that when using a `#password` field, you must always refer to it as `#password` (for instance, in functions). Also, you cannot encrypt any Generic Extractor configuration fields (such as `jobs`, `mappings`, …).
The Jobs configuration describes the API endpoints (resources) which will be extracted. This includes configuring the HTTP method and parameters. The `jobs` configuration is required and is described in a separate article.
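A minimal sketch of a `jobs` entry (the `users` endpoint and the `type` parameter are illustrative):

```json
"jobs": [
    {
        "endpoint": "users",
        "method": "GET",
        "params": {
            "type": "active"
        }
    }
]
```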
The `outputBucket` option defines the name of the Storage Bucket in which the extracted tables will be stored. The configuration is required unless the extractor is published as a standalone component with the Default Bucket option. The following configuration will make Generic Extractor put all extracted tables in the `ge-tutorial` bucket (the names of the tables are defined by the `dataType` setting):
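(The `api` section and the `campaigns` job shown here are illustrative.)

```json
{
    "parameters": {
        "api": {
            "baseUrl": "https://example.com/api/"
        },
        "config": {
            "outputBucket": "ge-tutorial",
            "jobs": [
                {
                    "endpoint": "campaigns",
                    "dataType": "campaigns"
                }
            ]
        }
    }
}
```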
If you omit the `outputBucket` configuration, you will receive an error similar to this:

`CSV file 'campaigns' file name is not a valid table identifier, either set output mapping for 'campaigns' or make sure that the file name is a valid Storage table identifier.`
The Mappings configuration describes how the JSON response is converted into CSV files that will be imported into Storage. The `mappings` configuration is optional and is described in a separate article.
The `debug` boolean option allows you to turn on more verbose logging, which shows all HTTP requests sent by Generic Extractor. The default value is `false`. Read more about running Generic Extractor in a separate article.
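For example, turning on verbose logging (a sketch of the relevant part of the `config` section; the `users` endpoint is a placeholder):

```json
"config": {
    "debug": true,
    "jobs": [
        {
            "endpoint": "users"
        }
    ]
}
```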
The `http` option allows you to set the HTTP headers sent with every request. This primarily serves the purpose of providing values for the `api.http.requiredHeaders` option. It is also possible to use the `http` option without `api.http.requiredHeaders`, in which case it is essentially equal to `api.http.headers`.
See example [EX074].
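A minimal sketch pairing the two options (the `X-AppKey` header name and its value are illustrative):

```json
{
    "parameters": {
        "api": {
            "baseUrl": "https://example.com/api/",
            "http": {
                "requiredHeaders": ["X-AppKey"]
            }
        },
        "config": {
            "http": {
                "headers": {
                    "X-AppKey": "ThisIsSecret"
                }
            },
            "jobs": [
                {
                    "endpoint": "users"
                }
            ]
        }
    }
}
```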
The `incrementalOutput` boolean option allows you to load the extracted data into Storage incrementally. This flag in no way affects the data extraction. When `incrementalOutput` is set to `true`, the contents of the target table in Storage will not be cleared. The default value is `false`.
How to configure Generic Extractor to extract data in increments from an API is described in a dedicated article.
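For example (a sketch of the relevant part of the `config` section; the `users` endpoint is a placeholder):

```json
"config": {
    "incrementalOutput": true,
    "outputBucket": "ge-tutorial",
    "jobs": [
        {
            "endpoint": "users"
        }
    ]
}
```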
See example [EX075].
The `userData` option allows you to add arbitrary data to extracted records. It is an object with arbitrary property names which are added as columns to all records extracted from parent jobs. The property values are the column values. It is also possible to use functions as `userData` property values.
The following configuration:
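(A reconstructed sketch: the `api` section and the `users` job are assumptions; the `userData` values match the table below.)

```json
{
    "parameters": {
        "api": {
            "baseUrl": "https://example.com/api/"
        },
        "config": {
            "outputBucket": "ge-tutorial",
            "jobs": [
                {
                    "endpoint": "users",
                    "dataType": "users"
                }
            ],
            "userData": {
                "tag": "fullExtract",
                "mode": "development"
            }
        }
    }
}
```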
and the following response:
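(Reconstructed to match the table below.)

```json
[
    {
        "id": 123,
        "name": "John Doe"
    },
    {
        "id": 234,
        "name": "Jane Doe"
    }
]
```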
will produce the following `users` table:

| id | name | tag | mode |
|---|---|---|---|
| 123 | John Doe | fullExtract | development |
| 234 | Jane Doe | fullExtract | development |
The `userData` values are added to the parent jobs only. They will not affect the child jobs. If the result table already contains a column with the same name as a `userData` property, the `userData` column will be renamed.
See example [EX076].
As we develop the Generic Extractor, some of the new features might lead to minor differences in extraction results. When such a situation arises, a new compatibility level is introduced. The `compatLevel` setting allows you to force the old compatibility level and temporarily maintain the old behavior. The current compatibility level is 3. The `compatLevel` setting is intended only to ease updates and migrations; never use it in new configurations (any version of old behavior is considered unsupported).
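A sketch of forcing an older level (here Level 2, described below) inside the `config` section; the `users` endpoint is a placeholder:

```json
"config": {
    "compatLevel": 2,
    "jobs": [
        {
            "endpoint": "users"
        }
    ]
}
```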
When a new level is introduced, the `compatLevel` of existing configurations will stay unchanged, so their behavior is not affected. Note that there is an exception: all configurations running before Level 3 was introduced use compatibility Level 1. This means that they use the legacy (Level 1) JSON parser, and you will see the following warning in the events: *Using legacy JSON parser, because it is in configuration state.*
Level 2 has different behavior in `responseFilter` handling. In the current behavior (Level 3 and above), a filtered JSON property consistently produces valid JSON. Previously (Level 2 and below), a scalar value was not filtered. Given the data:
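(Reconstructed to match the tables below.)

```json
[
    {
        "id": 1,
        "data": {"a": "b"}
    },
    {
        "id": 2,
        "data": "c"
    }
]
```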
With `responseFilter` set to `data`, the Level 2 version produces the following table:

| id | data |
|---|---|
| 1 | {"a":"b"} |
| 2 | c |
Level 3 and above produces:
| id | data |
|---|---|
| 1 | {"a":"b"} |
| 2 | "c" |
That means that the `data` column is always a valid JSON string.
Compare the results of examples [EX121] and [EX122], using compatibility level 2, with the results produced by examples [EX016] and [EX018], which use the current JSON parser.
Level 1 uses a JSON parser which cannot handle duplicate columns properly. This applies to a number of situations:

- the JSON response itself produces duplicate column names,
- a column of the response conflicts with a column generated by the extractor (`parent_id`, `JSON_parentId`), e.g., in a child job.

In either of these situations, the Level 1 extractor generates an empty column with a hash, and the original (or first encountered) column values are overwritten. In the current version (Level 2 and above), both columns are retained. The second encountered column has a numbered suffix. If you are upgrading from a Level 1 extractor, delete the column with the hash from the target Storage table, otherwise you'll get an error (*Some columns are missing in the csv files*).
There are also some differences in the naming of very long columns. For example, a property `data.modules.#DistributionGroups.outputs.groupCharacteristics.persistent` is shortened to `d__m__DistributionGroups_outputs_groupCharacteristics_persistent` in a Level 1 extractor, and to `DistributionGroups_outputs_groupCharacteristics_persistent` in Level 2 and above.
Compare the results of examples [EX124] and [EX125], using compatibility level 1, with the results produced by examples [EX025] and [EX076], which use the current JSON parser.