Processors are additional components which may be used before or after running an arbitrary component (extractor, writer, etc.).
When Docker Runner runs a Docker image (a container is created), a processor
may be used to pre-process the inputs (files or tables) supplied to that container, or it may be used to post-process
the container outputs. For example, if an extractor extracts CSV data in a non-UTF8 encoding, you can use the iconv processor as a post-processor to convert the CSV to UTF-8 as expected by Storage. See the tutorial for a quick example of using processors.
Processors are technically supported in any configuration of any component. However, as an advanced feature, they have little to no support in the UI. To manually configure processors, you have to use the Component Configuration API. See the respective part of our documentation for examples of working with the Component Configuration API. If you want to implement your own processor, see our implementation notes.
If the component has neither the respective configuration field nor an advanced configuration mode, processors are completely invisible in the UI. In such a case, modifying the configuration through the UI may delete the processor configuration (though you can always roll back). Therefore, be sure to add an appropriate warning to the configuration description.
By running the Get Configuration Detail request for a specific component ID and configuration ID, you obtain the actual configuration contents. You can see an example request for getting a configuration with ID 365111648 for the component called Email Attachments extractor (ID keboola.ex-email-attachments):
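As a rough sketch, such a request could be prepared in Python as follows. The base URL, the `X-StorageApi-Token` header, and the helper names are assumptions made for illustration, and the sample response body is a made-up, heavily trimmed stand-in rather than real API output.

```python
import json
import urllib.request

# Assumed base URL; your project may live on a different host.
BASE_URL = "https://connection.keboola.com/v2/storage"


def build_get_config_request(component_id, config_id, token):
    """Prepare (but do not send) a Get Configuration Detail request."""
    url = f"{BASE_URL}/components/{component_id}/configs/{config_id}"
    return urllib.request.Request(
        url, headers={"X-StorageApi-Token": token}, method="GET"
    )


def extract_configuration(response_body):
    """Return only the contents of the `configuration` node."""
    return json.loads(response_body)["configuration"]


request = build_get_config_request(
    "keboola.ex-email-attachments", "365111648", "your-token"
)

# Hypothetical, trimmed response used only to demonstrate the extraction step.
sample_body = '{"id": "365111648", "configuration": {"parameters": {}}}'
configuration = extract_configuration(sample_body)
```

To actually send the request, pass it to `urllib.request.urlopen` and feed the decoded response body to `extract_configuration`.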
From this, the actual configuration is the contents of the configuration node. Therefore:
Processors are configured in the processors section in the before array or the after array (rarely both). For example, you might want to configure the processor-skip-lines:
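A minimal sketch of how such a configuration could be assembled in Python is shown here. The `definition` and `component` nesting, the `lines` value, and the helper name `with_after_processor` are assumptions for illustration; check the processor documentation and the Component Configuration API for the authoritative structure.

```python
import json


def with_after_processor(configuration, component_id, parameters=None):
    """Return a copy of `configuration` with one processor appended to processors.after."""
    updated = json.loads(json.dumps(configuration))  # deep copy of JSON-like data
    entry = {"definition": {"component": component_id}}  # assumed entry shape
    if parameters:
        entry["parameters"] = parameters
    updated.setdefault("processors", {}).setdefault("after", []).append(entry)
    return updated


# Placeholder for the contents of the `configuration` node obtained earlier.
configuration = {"parameters": {}}

updated = with_after_processor(
    configuration, "keboola.processor-skip-lines", {"lines": 1}
)
print(json.dumps(updated, indent=4))
```

The helper leaves the original dictionary untouched, so the unmodified configuration stays available if the update has to be abandoned.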
The configuration parameters of a processor are always described in its documentation.
The above configuration defines that keboola.processor-skip-lines (which removes a certain number of lines from the file) will run after this particular configuration of the Email Attachments extractor is finished, but before its results are loaded into Storage. When the processor is finished, its outputs are loaded into Storage as if they were the outputs of the extractor itself.
To save the configuration, you need to use the Update Configuration API call. When updating the configuration, you must provide componentId, configurationId, and the actual contents of the configuration in the configuration form field. Make sure to supply only the contents of the configuration node and to properly escape the form data.
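The escaping step is easy to get wrong by hand, so here is a hedged sketch that lets the standard library do it. The base URL, the PUT method, the header name, and the helper name are assumptions for illustration; the configuration shown is a placeholder.

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL; your project may live on a different host.
BASE_URL = "https://connection.keboola.com/v2/storage"


def build_update_config_request(component_id, config_id, configuration, token):
    """Prepare (but do not send) an Update Configuration request."""
    url = f"{BASE_URL}/components/{component_id}/configs/{config_id}"
    # Serialize only the contents of the configuration node, then form-encode it.
    body = urllib.parse.urlencode(
        {"configuration": json.dumps(configuration)}
    ).encode("utf-8")
    headers = {
        "X-StorageApi-Token": token,
        "Content-Type": "application/x-www-form-urlencoded",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="PUT")


# Placeholder configuration contents with one processor in the after array.
configuration = {
    "parameters": {},
    "processors": {
        "after": [{"definition": {"component": "keboola.processor-skip-lines"}}]
    },
}
request = build_update_config_request(
    "keboola.ex-email-attachments", "365111648", configuration, "your-token"
)
```

`urllib.parse.urlencode` percent-encodes the braces, quotes, and colons of the JSON string, which is the escaping the form field needs.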
See our configuration documentation for a more thorough description and the Add processor to Email Attachments Extractor Configuration example in our collection. Remember that processors can be chained to achieve more advanced processing.
You can obtain a list of available processors using the Developer Portal UI or the List Components Public API of the Developer Portal. The important parts are id, which is required for configuration, and documentationUrl, which describes additional parameters of the processor.
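Once the component list has been downloaded, picking out the processors is a simple filter. The sketch below assumes that each entry carries a `type` field with the value `processor`; that field, the helper name, and the sample entries (including their URLs) are illustrative assumptions, not actual API output.

```python
def list_processors(components):
    """Return (id, documentationUrl) pairs for entries that look like processors."""
    return [
        (component["id"], component.get("documentationUrl"))
        for component in components
        if component.get("type") == "processor"  # assumed field and value
    ]


# Made-up sample entries standing in for a List Components response.
SAMPLE_COMPONENTS = [
    {
        "id": "keboola.processor-skip-lines",
        "type": "processor",
        "documentationUrl": "https://example.com/skip-lines-docs",
    },
    {
        "id": "keboola.ex-aws-s3",
        "type": "extractor",
        "documentationUrl": "https://example.com/s3-docs",
    },
]

processors = list_processors(SAMPLE_COMPONENTS)
```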
A processor may allow (or require) parameters. These are entered in the parameters section. The configuration below sets values for two parameters, lines and direction_from:
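A hedged sketch of a processor entry with these two parameters follows. The values `2` and `"top"`, the `definition` nesting, and the helper name `processor_entry` are assumptions for illustration; the processor documentation lists the values that are actually accepted.

```python
import json


def processor_entry(component_id, **parameters):
    """Build one entry for the before or after array of the processors section."""
    entry = {"definition": {"component": component_id}}  # assumed entry shape
    if parameters:
        entry["parameters"] = parameters
    return entry


# Illustrative values only; consult the processor documentation for valid ones.
skip_lines = processor_entry(
    "keboola.processor-skip-lines", lines=2, direction_from="top"
)
processors_section = {"processors": {"after": [skip_lines]}}
print(json.dumps(processors_section, indent=4))
```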
The names and allowed values of the parameters are entirely up to the processor, which interprets and validates them; they are described in the respective processor documentation.
If the configuration uses Configuration Rows, you have to use the Update Configuration Row API call to set the processors. Provide componentId, configurationId, rowId, and the contents of the configuration in the same manner as when adding a processor to a configuration.
See an example Add processor to S3 Extractor configuration Row in our collection. It shows how to set a processor for the configuration row with ID 364481153 in configuration 364479526 of the AWS S3 extractor (component ID keboola.ex-aws-s3). The configuration is the following:
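As a rough sketch, the row update request for these IDs could be prepared as shown here. The base URL, the `rows` path segment, the PUT method, and the header name are assumptions for illustration, and the row configuration is a placeholder rather than the configuration from the collection example.

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL; your project may live on a different host.
BASE_URL = "https://connection.keboola.com/v2/storage"


def build_update_row_request(component_id, config_id, row_id, configuration, token):
    """Prepare (but do not send) an Update Configuration Row request."""
    url = f"{BASE_URL}/components/{component_id}/configs/{config_id}/rows/{row_id}"
    body = urllib.parse.urlencode(
        {"configuration": json.dumps(configuration)}
    ).encode("utf-8")
    headers = {
        "X-StorageApi-Token": token,
        "Content-Type": "application/x-www-form-urlencoded",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="PUT")


# Placeholder row configuration; the processor choice is illustrative only.
row_configuration = {
    "parameters": {},
    "processors": {
        "after": [{"definition": {"component": "keboola.processor-skip-lines"}}]
    },
}
row_request = build_update_row_request(
    "keboola.ex-aws-s3", "364479526", "364481153", row_configuration, "your-token"
)
```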
Remember, processors can be chained and therefore should be as simple as possible. For example, a processor reading tables in CSV should assume that these are available in the standard format and that the table manifests are available.
For example, assume that you have a component which extracts the following data:
Dump from ACME Anvil CRM
SLA: 24h
Day|AnvilsDelivered
2050-12-10|100|5|4|4
2050-12-11|56|1|2
2050-12-12|131|9|7|3
First apply the processor-skip-lines to obtain something resembling a CSV file:
Day|AnvilsDelivered
2050-12-10|100|5|4|4
2050-12-11|56|1|2
2050-12-12|131|9|7|3
Then apply the processor-create-manifest to set the delimiter and enclosure in the file manifest.
After that, use the processor-format-csv to convert the file from the format specified in the manifest to the standard format:
"Day","AnvilsDelivered"
"2050-12-10","100","5","4","4"
"2050-12-11","56","1","2"
"2050-12-12","131","9","7","3"
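Finally, you can use the processor-headers to make the data orthogonal, so that every row has the same number of columns and the missing header columns get generated names: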
"Day","AnvilsDelivered","col1","col2","col3"
"2050-12-10","100","5","4","4"
"2050-12-11","56","1","2",""
"2050-12-12","131","9","7","3"
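To make the effect of each step concrete, the following sketch imitates the chain locally in plain Python. It is a simulation under stated assumptions, not the actual processors: the function names are hypothetical, two leading lines are skipped to match the sample above, and the manifest step is reduced to passing the delimiter directly.

```python
import csv
import io

RAW = """Dump from ACME Anvil CRM
SLA: 24h
Day|AnvilsDelivered
2050-12-10|100|5|4|4
2050-12-11|56|1|2
2050-12-12|131|9|7|3
"""


def skip_lines(text, lines):
    """Drop the first `lines` lines (stand-in for processor-skip-lines)."""
    return "".join(text.splitlines(keepends=True)[lines:])


def write_standard_csv(rows):
    """Write rows as comma-separated CSV with every value quoted."""
    out = io.StringIO()
    csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n").writerows(rows)
    return out.getvalue()


def format_csv(text, delimiter):
    """Convert delimited text to the standard format (stand-in for processor-format-csv)."""
    return write_standard_csv(csv.reader(io.StringIO(text), delimiter=delimiter))


def fill_headers(text):
    """Pad the header and short rows to equal width (stand-in for processor-headers)."""
    rows = list(csv.reader(io.StringIO(text)))
    width = max(len(row) for row in rows)
    extra = width - len(rows[0])
    header = rows[0] + [f"col{i}" for i in range(1, extra + 1)]
    body = [row + [""] * (width - len(row)) for row in rows[1:]]
    return write_standard_csv([header] + body)


result = fill_headers(format_csv(skip_lines(RAW, 2), "|"))
print(result)
```

Each function does one small job and relies on the previous step having left the data in a known state, which is the same reason the real processors are kept simple.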
A chain similar to the above can be used for a writer too. Assume that you need to send the following data to the very special ACME Anvil CRM:
Import: CRM
ImportFormat: AnvilPSV
Date: 2018-10-01
Type: MANF-DLVR-PLAN
Day|AnvilManufacturingPlan|AnvilDeliveryPlan
2050-12-10|100|533
2050-12-11|100|695
2050-12-12|100|923
The data exported from Storage will be in the following format:
"Day","AnvilManufacturingPlan","AnvilDeliveryPlan"
"2050-12-10","100","533"
"2050-12-11","100","695"
"2050-12-12","100","923"
First, apply the processor-format-csv to convert the file from the standard format to the format required by the Anvil CRM writer:
Day|AnvilManufacturingPlan|AnvilDeliveryPlan
2050-12-10|100|533
2050-12-11|100|695
2050-12-12|100|923
Then create a custom processor to prepend the header:
Import: CRM
ImportFormat: AnvilPSV
Date: 2018-10-01
Type: MANF-DLVR-PLAN
Day|AnvilManufacturingPlan|AnvilDeliveryPlan
2050-12-10|100|533
2050-12-11|100|695
2050-12-12|100|923
Finally, the Anvil CRM writer can send the result to the CRM system. Alternatively, adding the header can be part of the writer itself. Which option to choose depends on whether the header must always be present (make it part of the writer) or is optional (make it a processor).
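The two steps before the writer can be imitated locally with a short sketch. It is a simulation, not the real processor-format-csv or a finished custom processor: the function names are hypothetical, and the header lines are hard-coded from the sample above, whereas a real processor would presumably take them from its parameters.

```python
import csv
import io

STANDARD_CSV = '''"Day","AnvilManufacturingPlan","AnvilDeliveryPlan"
"2050-12-10","100","533"
"2050-12-11","100","695"
"2050-12-12","100","923"
'''

# Header lines copied from the sample; hard-coded for illustration only.
HEADER_LINES = [
    "Import: CRM",
    "ImportFormat: AnvilPSV",
    "Date: 2018-10-01",
    "Type: MANF-DLVR-PLAN",
]


def to_pipe_separated(standard_csv):
    """Convert standard CSV to pipe-separated text, quoting only when required."""
    rows = csv.reader(io.StringIO(standard_csv))
    out = io.StringIO()
    csv.writer(out, delimiter="|", lineterminator="\n").writerows(rows)
    return out.getvalue()


def prepend_header(body, header_lines):
    """Put the header lines in front of the data (the custom processor step)."""
    return "\n".join(header_lines) + "\n" + body


payload = prepend_header(to_pipe_separated(STANDARD_CSV), HEADER_LINES)
print(payload)
```

Keeping `prepend_header` separate from the conversion mirrors the decision described above: it can be dropped from the chain when the header is optional, or folded into the writer when it is mandatory.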