The config section of the Generic Extractor configuration describes the actual extraction, including properties of HTTP requests,
and mapping between source JSON and target CSV.
A sample config configuration can look like this:
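As a minimal sketch with placeholder values: the bucket name, the password, and the job properties endpoint and dataType are assumptions chosen for illustration, not values taken from the original example.

```json
{
    "config": {
        "debug": false,
        "#password": "secret",
        "outputBucket": "my-bucket",
        "incrementalOutput": false,
        "jobs": [
            {
                "endpoint": "campaigns",
                "dataType": "campaigns"
            }
        ]
    }
}
```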
Apart from the properties listed below, the
config section can contain any number of
other properties which are not used by Generic Extractor itself, but may be referenced
from within functions.
The keys prefixed by the hash character
# are automatically encrypted when the
configuration is saved. It is advisable to store sensitive information in such fields. Note, however, that they
are not automatic aliases to un-encrypted fields. That means that when using a
#password field, you
must always refer to it as
#password (for instance, in functions).
Also, you cannot encrypt any Generic Extractor configuration fields (such as
The Jobs configuration describes the API endpoints (resources) which will be extracted. This
includes configuring the HTTP method and parameters. The
jobs configuration is
required and is described in a separate article.
The outputBucket option defines the name of the Storage Bucket
in which the extracted tables will be stored. The configuration is required unless
the extractor is registered as a standalone component with the
Default Bucket option.
The following configuration will make Generic Extractor put all extracted tables in the
(the names of the tables are defined by the
If you omit the
outputBucket configuration, you will receive an error similar to this:
CSV file 'campaigns' file name is not a valid table identifier, either set output mapping for 'campaigns' or make sure that the file name is a valid Storage table identifier.
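One way to avoid this error is sketched below. The bucket name my-bucket and the job properties endpoint and dataType are hypothetical; only outputBucket and the campaigns table name come from the text above.

```json
{
    "config": {
        "outputBucket": "my-bucket",
        "jobs": [
            {
                "endpoint": "campaigns",
                "dataType": "campaigns"
            }
        ]
    }
}
```

Under these assumptions, the campaigns table would be expected to land in the bucket named by outputBucket instead of triggering the identifier error.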
The Mappings configuration describes how the JSON response is converted into
CSV files that will be imported into Storage. The
mappings configuration is optional and
is described in a separate article.
The debug boolean option allows you to turn on more verbose logging, which shows
all HTTP requests sent by Generic Extractor. The default value is
Read more about running Generic Extractor in a separate article.
The http option allows you to set the HTTP headers sent with every request. This primarily serves the purpose of providing values for
It is also possible to use the
http option without
which case it is essentially equal to
See example [EX074].
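A hedged sketch of the http option follows. It assumes that the headers are listed under a headers key nested in http, and the header name X-AppKey and its value are made up for illustration. Other required properties, such as jobs, are omitted for brevity.

```json
{
    "config": {
        "http": {
            "headers": {
                "X-AppKey": "ThisIsSecret"
            }
        }
    }
}
```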
The incrementalOutput boolean option allows you to load the extracted data into
Storage incrementally. This flag in no way affects the data extraction.
When incrementalOutput is set to
true, the contents of the target table in Storage will not be cleared.
The default value is
How to configure Generic Extractor to extract data in increments from an API is described in a dedicated article.
See example [EX075].
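As a small illustration, the fragment below turns on incremental loading. The bucket name is a placeholder, and other required properties, such as jobs, are left out.

```json
{
    "config": {
        "outputBucket": "my-bucket",
        "incrementalOutput": true
    }
}
```

With this flag set, repeated runs keep the existing contents of the target table rather than clearing it. The extraction itself is unchanged, so the API is still queried in the same way.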
The userData option allows you to add arbitrary data to extracted records.
It is an object with arbitrary property names which are added as columns to all records extracted
from parent jobs. The property values are the column values. It is also possible to use
userData property values.
The following configuration:
and the following response:
will produce the following
The userData values are added to the parent jobs only. They will not affect the
child jobs. If the result table already contains a column with the same name as a
userData property, the userData column will be renamed.
See example [EX076].
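To make the userData behavior concrete, here is a hedged sketch. The property names tag and mode, their values, the sample records, and the job properties endpoint and dataType are illustrative assumptions, not taken from the original example.

```json
{
    "config": {
        "userData": {
            "tag": "fullExtract",
            "mode": "development"
        },
        "jobs": [
            {
                "endpoint": "users",
                "dataType": "users"
            }
        ]
    }
}
```

Assume the API returns the following response:

```json
[
    {"id": 123, "name": "John Doe"},
    {"id": 234, "name": "Jane Doe"}
]
```

Under these assumptions, the resulting users table would be expected to contain the columns id, name, tag, and mode, with tag set to fullExtract and mode set to development on both rows.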
As we develop the Generic Extractor, some of the new features might lead to minor differences in extraction results.
When such a situation arises, a new compatibility level is introduced. The
compatLevel setting allows
you to force the old compatibility level and temporarily maintain the old behavior. The current
compatibility level is 3. The
compatLevel setting is intended only to ease updates and migrations.
Never use it in new configurations (any version of the old behavior is considered unsupported).
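For an existing configuration that has to keep the older behavior during a migration, the setting could be pinned as sketched below. The value 2 is only an example, and other properties of the config section are omitted.

```json
{
    "config": {
        "compatLevel": 2
    }
}
```

Removing the property again returns the configuration to the current compatibility level.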
When a new Level is introduced, the following will happen:
compatLevel will stay unchanged.
Note that there is an exception: all configurations running before Level 3 was introduced will use compatibility
Level 1. This means that they use the legacy (Level 1) JSON parser, and you will see the following warning in
Using legacy JSON parser, because it is in configuration state.
Level 2 has different behavior in
In the current behavior (Level 3 and above), a filtered JSON property consistently produces valid JSON.
Previously (Level 2 and below), a scalar value was not filtered. Given the data:
and responseFilter set to
data, the Level 2 version produces the following table:
Level 3 and above produces:
That means that the
data column is always a valid JSON string.
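The difference can be sketched with a small, made-up response. The records are illustrative, and the job properties endpoint and dataType are assumptions; only responseFilter and the data property come from the text above.

```json
{
    "config": {
        "jobs": [
            {
                "endpoint": "users",
                "dataType": "users",
                "responseFilter": "data"
            }
        ]
    }
}
```

Assume the API returns these records:

```json
[
    {"id": 1, "data": "scalar"},
    {"id": 2, "data": {"object": "value"}}
]
```

With this input, Level 2 would be expected to store the data column of the first row as the bare text scalar, while Level 3 and above would store it as the JSON string "scalar", including the quotes. The object in the second row would be serialized to JSON at both levels.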
Compare the results of examples
using compatibility level 2 with the result produced by examples
which use the current JSON parser.
Level 1 uses a JSON parser which cannot handle duplicate columns properly. This applies to a number of situations:
JSON_parentId), e.g. – in a child job:
In either of these situations, the Level 1 extractor generates an empty column with hash, and
the original (or first encountered) column values are overwritten. In the current version
(Level 2 and above), both columns are retained. The second encountered column has a
numbered suffix. If you are upgrading from a Level 1 extractor, delete the column
with hash from the target Storage table, otherwise you’ll get an error
(Some columns are missing in the csv files).
There are also some differences in the naming of very long columns. For example, a property
data.modules.#DistributionGroups.outputs.groupCharacteristics.persistent is shortened to
d__m__DistributionGroups_outputs_groupCharacteristics_persistent in a Level 1 extractor, and
DistributionGroups_outputs_groupCharacteristics_persistent in Level 2 and above.