Data folders are one of the possible channels to exchange data between your component and Keboola.
The /data/ folder is the root folder for exchanging data. Your component reads its input from the /data/in folder and writes its results to the /data/out folder. Keboola takes care of injecting the required tables and files into the input folder and of picking up the tables and files generated by your code.
The data folders contain the actual data files (tables and files) and metadata. For each data file, a manifest file is created; it contains metadata information (creation time, keys for tables, etc.).
The data folder is always available in the component under the absolute path /data/. The relative path to the data folder depends fully on your component code (or Dockerfile). If you want to use a different path (for component development), use the KBC_DATADIR environment variable. In production, this variable is always set to /data/. During development, you can set it to your liking.
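For illustration only, a component written in Python might resolve the data folder like this (the fallback to /data/ mirrors the production value):

```python
import os

# KBC_DATADIR points to /data/ in production; override it for local development.
data_dir = os.environ.get("KBC_DATADIR", "/data/")
input_tables_dir = os.path.join(data_dir, "in", "tables")
output_tables_dir = os.path.join(data_dir, "out", "tables")
```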
To create a data folder sample, use the Debug API call via the Docker Runner API. All the resources you need in your component will be provided in a ZIP archive.
The predefined data exchange folder structure is as follows:
/data/in/tables
/data/in/files
/data/out/tables
/data/out/files
This folder structure is always available to your component.
Do not put arbitrary files in the /data/ folder, as they will be uploaded into the user project (or cause errors in the output mapping). For working or temporary files, use the /tmp/ folder. The other directories have 10 GB of free space in total.
The /data/in/tables folder contains the tables defined in the input mapping. They are serialized in the CSV format with a double quote (") as the enclosure and a comma (,) as the delimiter. File names are specified in the input mapping and default to {tableId} (a file name can be changed in the UI). The table metadata is stored in a manifest file.
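A minimal sketch of reading one input table in Python; the file name my-table.csv is hypothetical and would be whatever the input mapping defines:

```python
import csv
import json
import os

data_dir = os.environ.get("KBC_DATADIR", "/data/")
table_path = os.path.join(data_dir, "in", "tables", "my-table.csv")  # hypothetical name from the input mapping

# The accompanying manifest carries the table metadata.
with open(table_path + ".manifest") as manifest_file:
    manifest = json.load(manifest_file)

# The default csv dialect already matches the ',' delimiter and '"' enclosure.
with open(table_path, encoding="utf-8", newline="") as table_file:
    for row in csv.DictReader(table_file):
        pass  # process the row
```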
All output tables from your component must be placed in the /data/out/tables folder. The destination table in Storage is defined by the following rules (listed in order):

- If defaultBucket (see below) is specified, the table will be uploaded to a bucket whose name is created from the component and configuration names.
- If an output mapping is set for a given file in the /data/out/tables folder, the destination is the name of the table in Storage. An output mapping which cannot be matched to a physical file produces an error (i.e., fulfilling the output mapping is mandatory).
- Otherwise, the file name itself defines the destination (e.g., out.c-data.my-table.csv will create a my-table table in the out.c-data bucket). The file name must have at least two dots.

Manifests allow you to process files in the /data/out folder without their being explicitly defined in the output mapping. That allows for a flexible and dynamic output mapping where the structure is unknown at the beginning. Using file names (e.g., out.c-data.my-table.csv) for an output mapping is great for saving implementation time in a simple or POC component.

Important: All files in the /data/out/tables folder will be uploaded, not only those specified in the output mapping or manifests.
This is the expected CSV format (RFC 4180): a double quote (") as the enclosure and a comma (,) as the delimiter. A manifest file can specify a different enclosure and delimiter.
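A sketch of writing an output table in this format together with a manifest that sets the destination explicitly; the file and table names are only examples:

```python
import csv
import json
import os

data_dir = os.environ.get("KBC_DATADIR", "/data/")
out_path = os.path.join(data_dir, "out", "tables", "results.csv")  # example file name

# Write the table with the ',' delimiter and '"' enclosure.
with open(out_path, "w", encoding="utf-8", newline="") as out_file:
    writer = csv.writer(out_file, delimiter=",", quotechar='"', quoting=csv.QUOTE_ALL)
    writer.writerow(["id", "name"])
    writer.writerow(["1", "first"])

# The manifest sets the destination table in Storage (example table ID).
with open(out_path + ".manifest", "w") as manifest_file:
    json.dump({"destination": "out.c-data.results"}, manifest_file)
```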
If you cannot define a bucket, or want it assigned automatically, set the Default Bucket option for your component in the Developer Portal. All tables in /data/out/tables will then be uploaded to a bucket identified by your component id (generated when the component was created), the configuration id (generated when an end user adds a new component configuration) and the stage (in or out). The file name (without the .csv suffix) will be used as the table name. The destination attributes in the output mapping and file manifests will be overridden.

Important: The Default Bucket flag always requires the config parameter when creating a job manually using the API, even if the referenced configuration does not exist in Storage.
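With the Default Bucket option on, the component only picks the table name through the file name; a sketch (the generated bucket name shown in the comment is purely illustrative):

```python
import os

data_dir = os.environ.get("KBC_DATADIR", "/data/")

# No destination is needed anywhere; the file name (without .csv) becomes the
# table name and the bucket is generated from the component and configuration
# ids, e.g., something like in.c-my-component-1234.my-table (illustrative only).
with open(os.path.join(data_dir, "out", "tables", "my-table.csv"), "w", newline="") as f:
    f.write('"id","name"\n"1","first"\n')
```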
Sometimes your component will download the CSV file in slices (chunks). You do not need to merge them manually; simply put them in a subfolder with the same name you would use for a single file. All files found in the subfolder are considered slices of the table.
/data/out/tables/myfile.csv/part01
/data/out/tables/myfile.csv/part02
/data/out/tables/myfile.csv.manifest
Sliced files cannot have header rows. They must have their columns specified in the manifest file or in the output mapping configuration.
The following is an example of specifying columns in the manifest file /data/out/tables/myfile.csv.manifest
:
{
"destination": "in.c-mybucket.table",
"columns": ["col1", "col2", "col3"]
}
All files from the folder are uploaded irrespective of their name or extension. They are uploaded to Storage in parallel and in an undefined order. Use sliced tables in case you want to upload tables larger than 5GB. The slices may be compressed by gzip. A rule of thumb is that slices are best around 10-100 MB in size compressed.
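A sketch of producing a gzip-compressed sliced table with the columns declared in the manifest; the names reuse the myfile.csv example above:

```python
import csv
import gzip
import json
import os

data_dir = os.environ.get("KBC_DATADIR", "/data/")
slice_dir = os.path.join(data_dir, "out", "tables", "myfile.csv")  # a folder, not a file
os.makedirs(slice_dir, exist_ok=True)

# Slices have no header row and may be gzip-compressed.
slices = {"part01.gz": [["1", "a", "x"], ["2", "b", "y"]],
          "part02.gz": [["3", "c", "z"]]}
for name, rows in slices.items():
    with gzip.open(os.path.join(slice_dir, name), "wt", newline="") as part:
        csv.writer(part).writerows(rows)

# Column names must be given in the manifest (or in the output mapping).
manifest = {"destination": "in.c-mybucket.table", "columns": ["col1", "col2", "col3"]}
with open(slice_dir + ".manifest", "w") as manifest_file:
    json.dump(manifest, manifest_file)
```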
All files defined in the input mapping are stored in the /data/in/files folder in their raw form. The file names are numeric and equal to {fileId}_{filename} in Storage. All other information about the files is available in the manifest file.
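A sketch of iterating over the staged input files and their manifests:

```python
import json
import os

data_dir = os.environ.get("KBC_DATADIR", "/data/")
files_dir = os.path.join(data_dir, "in", "files")

# Every {fileId}_{filename} data file has a matching .manifest with its metadata.
for name in sorted(os.listdir(files_dir)):
    if name.endswith(".manifest"):
        continue
    with open(os.path.join(files_dir, name + ".manifest")) as manifest_file:
        manifest = json.load(manifest_file)
    with open(os.path.join(files_dir, name), "rb") as data_file:
        payload = data_file.read()  # process the raw file content
```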
All files placed in the /data/out/files folder are uploaded to Storage. File names are preserved, and tags and other upload options can be specified in the manifest file. Note that all files in the /data/out/files folder will be uploaded, not only those specified in the output mapping.
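A sketch of producing an output file with upload options in its manifest; the file name and the tag values are only examples:

```python
import json
import os

data_dir = os.environ.get("KBC_DATADIR", "/data/")
out_file = os.path.join(data_dir, "out", "files", "report.txt")  # example file name

with open(out_file, "w") as f:
    f.write("some content")

# Tags and other upload options go into the manifest (example tags).
with open(out_file + ".manifest", "w") as manifest_file:
    json.dump({"tags": ["my-component", "report"]}, manifest_file)
```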
The component may also exchange data with Storage using Amazon S3. In this case, the data folders contain only manifest files and not the actual data. This mode of operation can be enabled by setting the Staging storage input option to AWS S3 in the component settings. If this option is enabled, all the data folders will contain only manifest files, extended with an additional s3 section.
Note: Exchanging data via S3 is currently only available for input mapping.
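For illustration, the component can read the s3 section and download the data itself; the key names below (region, bucket, key, credentials and its sub-keys) are assumptions about the manifest layout, so verify them against a manifest generated for your component:

```python
import json

import boto3  # assumes the boto3 package is available in the component image

with open("/data/in/tables/my-table.csv.manifest") as manifest_file:
    manifest = json.load(manifest_file)

s3_info = manifest["s3"]  # key names below are assumptions; check an actual manifest
s3_client = boto3.client(
    "s3",
    region_name=s3_info["region"],
    aws_access_key_id=s3_info["credentials"]["access_key_id"],
    aws_secret_access_key=s3_info["credentials"]["secret_access_key"],
    aws_session_token=s3_info["credentials"]["session_token"],
)
s3_client.download_file(s3_info["bucket"], s3_info["key"], "/tmp/my-table.csv")
```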
The component may also exchange data with Storage using Azure Blob Storage (ABS). In this case, the data folders contain only manifest files and not the actual data. This mode of operation can be enabled by setting the Staging storage input option to ABS in the component settings. If this option is enabled, all the data folders will contain only manifest files, extended with an additional abs section.
Note: Exchanging data via ABS is currently only available for input mapping.
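Analogously for ABS, a sketch using the azure-storage-blob package; the key names inside the abs section (container, name, credentials.sas_connection_string) are assumptions, so verify them against a real manifest:

```python
import json

from azure.storage.blob import BlobClient  # assumes azure-storage-blob is installed

with open("/data/in/tables/my-table.csv.manifest") as manifest_file:
    manifest = json.load(manifest_file)

abs_info = manifest["abs"]  # key names below are assumptions; check an actual manifest
blob = BlobClient.from_connection_string(
    conn_str=abs_info["credentials"]["sas_connection_string"],
    container_name=abs_info["container"],
    blob_name=abs_info["name"],
)
with open("/tmp/my-table.csv", "wb") as out:
    out.write(blob.download_blob().readall())
```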
Note: this is a preview feature and may change considerably in the future.
The component may also exchange data with Storage using Workspaces. This mode of operation can be enabled by setting the Staging storage input or Staging storage output option to Workspace Snowflake, Workspace Redshift, or Workspace Synapse. A workspace is an isolated database to which data are loaded before the component job is run and unloaded when the job finishes. The workspace is created just before the job starts and is deleted when the job is terminated.
Using this option will load Storage Tables into the provided storage workspace, but Storage Files will still be loaded into the local filesystem like in the standard configuration.
If this option is enabled, the table data folders (/data/in/tables and /data/out/tables) will contain only manifest files. The actual data will be loaded as database tables into the workspace database. The destination in the input mapping and the source in the output mapping refer to database table names. This mode of operation is useful for components that manipulate data using SQL queries.
The component can run arbitrary queries against the workspace database. The database credentials are available in the authorization section of the configuration file. Note that some of the values might be empty for certain workspace backends (e.g., Redshift does not use a warehouse), but the keys will always be present.
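A minimal sketch of picking up the workspace credentials from config.json; the key names under authorization.workspace (host, warehouse, database, schema, user, password) are assumptions modelled on a Snowflake workspace, so adapt them to the configuration file your component actually receives:

```python
import json
import os

data_dir = os.environ.get("KBC_DATADIR", "/data/")
with open(os.path.join(data_dir, "config.json")) as config_file:
    config = json.load(config_file)

# Assumed key names for a Snowflake workspace; e.g., warehouse may be empty
# for other backends, but the keys themselves should be present.
workspace = config["authorization"]["workspace"]
credentials = {
    "host": workspace["host"],
    "warehouse": workspace["warehouse"],
    "database": workspace["database"],
    "schema": workspace["schema"],
    "user": workspace["user"],
    "password": workspace["password"],
}
# Pass these to the database driver of your choice and run SQL against the workspace.
```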
When exchanging data via a workspace, there are a couple of differences from loading data into files:

- The days attribute is not supported for filtering tables; use changed_since instead.
- The columns of the output table must be specified; this can be done either in the output manifest or in the output mapping.

Note: Currently, only some combinations of input/output staging storage settings are supported: local<->local, local<->s3, workspace-snowflake<->workspace-snowflake and workspace-redshift<->workspace-redshift.
Note: This is a preview feature and may change considerably in the future. Currently, only Azure Blob Storage workspaces (abs-workspace) are supported for this type, and they work only with the Synapse storage backend.
The component may also exchange data with a provisioned file workspace (Azure Blob Storage) using Workspaces. This mode of operation can be enabled by setting the Staging storage input or Staging storage output option to Workspace ABS. A filesystem workspace is an isolated file storage to which data are loaded before the component job is run (when staging storage input is set) and unloaded from when the job finishes (when staging storage output is set). The workspace is created just before the job starts and is deleted when the job is terminated.
If this option is enabled, the data and the manifests will be loaded into the Azure Blob Storage container under the data folder, similarly to how they are laid out in the default local filesystem.
Files are loaded into the workspace as [file name]/[file ID]. For example, if a file 'test.txt' with ID '12345' is in the input mapping, the file will appear in the storage blob container with the URL https://[storage_account_name].blob.core.windows.net/[container-name]/data/in/files/test.txt/12345. Note that this is only available on the Synapse storage backend.
Synapse only exports tables as sliced files. For example, if your table input mapping has in.c-main.my-input as the source and my-input.csv as the destination, then in the ABS workspace you will find a data/in/tables/my-input.csv/ folder containing multiple sliced files, one per slice.
To sum up, below is a sample storage configuration and where the files are written from and to:
| Direction | Source | Destination |
|---|---|---|
| input | in.c-main.my-table-from-abs-workspace | many slices like [abs-workspace-root]/data/in/tables/my-input-table.csv/[random identifier].txt |
| input | file with tag my-input-files named input-file.txt | [abs-workspace-root]/data/in/files/test.txt/12345 |
| output | [abs-workspace-root]/data/out/tables/my-output-table.csv | out.c-main.my-table-from-abs-workspace |
| output | [abs-workspace-root]/data/out/files/my-file.txt | file my-file.txt with tag uploaded-from-abs-workspace |
To connect to the ABS workspace, use the SAS connection string, which is stored in the authorization section of the configuration file.
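A minimal sketch of connecting with the SAS connection string and listing the staged input tables; the key names under authorization.workspace (connectionString, container) are assumptions, so verify them against the configuration file your component receives:

```python
import json
import os

from azure.storage.blob import ContainerClient  # assumes azure-storage-blob is installed

data_dir = os.environ.get("KBC_DATADIR", "/data/")
with open(os.path.join(data_dir, "config.json")) as config_file:
    config = json.load(config_file)

workspace = config["authorization"]["workspace"]  # assumed key names; check the real file
container = ContainerClient.from_connection_string(
    conn_str=workspace["connectionString"],
    container_name=workspace["container"],
)

# Table slices are staged under data/in/tables/<destination file name>/.
for blob in container.list_blobs(name_starts_with="data/in/tables/"):
    print(blob.name)
```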