Data Folders Specification

Data folders are one of the possible channels to exchange data between your component and Keboola Connection (KBC).

Root Folder /data/

The /data/ folder is the root folder for exchanging data. Your component reads its input from the /data/in folder and writes its results to the /data/out folder. Keboola Connection takes care of injecting required tables and files into the input folder and picking up tables and files generated by your code. The data folders contain the actual data (tables and files) together with metadata. For each data file, a manifest file is created; it contains the metadata (creation time, keys for tables, etc.).

The data folder is always available in the component under the absolute /data/ path. Any relative path to the data folder depends entirely on your component code (or Dockerfile). If you want to use a different path (for component development), use the KBC_DATADIR environment variable. In production, this variable is always set to /data/; during development, you can set it to whatever suits you.
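For example, the component can resolve the data folder once at startup and build all paths from it. This is a minimal Python sketch, assuming nothing beyond the KBC_DATADIR variable described above:

import os

# In production KBC_DATADIR is always "/data/"; during local development
# it can point anywhere, so fall back to the production default.
DATA_DIR = os.environ.get("KBC_DATADIR", "/data/")

IN_TABLES = os.path.join(DATA_DIR, "in", "tables")
OUT_TABLES = os.path.join(DATA_DIR, "out", "tables")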

To create a sample data folder, use the Debug API call via the Docker Runner API. All the resources your component needs will be provided in a ZIP archive.

The predefined data exchange folder structure is as follows:

/data/in/tables
/data/in/files
/data/out/tables
/data/out/files

This folder structure is always available to your component. Do not put arbitrary files in the /data/ folder, as they will be uploaded into the user project (or cause errors in the output mapping). For working or temporary files, use the /tmp/ folder. Directories other than /data/ have 10GB of free space in total.

Folder /data/in/tables/

The folder contains tables defined in the input mapping; they are serialized in the CSV format:

  • string enclosure "
  • delimiter ,
  • no escape character

File names are specified in the input mapping and default to {tableId} (the file name can be changed in the UI). The table metadata is stored in a manifest file.
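A minimal Python sketch of reading one input table with the standard csv module (the table file name is illustrative):

import csv
import os

data_dir = os.environ.get("KBC_DATADIR", "/data/")
table_path = os.path.join(data_dir, "in", "tables", "in.c-main.customers.csv")

# Input tables are RFC 4180 CSV: comma delimiter, double-quote enclosure.
with open(table_path, encoding="utf-8") as f:
    for row in csv.DictReader(f):
        ...  # process one row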

Folder /data/out/tables/

All output tables from your component must be placed in this folder. The destination table in Storage is defined by the following rules (listed in order):

  • If defaultBucket (see below) is specified, the table will be uploaded to a bucket whose name is created from the component ID and configuration ID.
  • If the output mapping is specified (through the UI, usually) and its source matches the physical file name in the /data/out/tables folder, the destination is the name of the table in Storage. An output mapping which cannot be matched to a physical file produces an error (i.e., fulfilling the output mapping is mandatory).
  • If a manifest file exists, it can specify the destination of the table in Storage if no output mapping is present.
  • If none of the above options are used, the destination is defined by the name of the file (for example, out.c-data.my-table.csv will create a my-table table in the out.c-data bucket). The file name must have at least two dots.
  • If none of these rules can be applied, an error is thrown.

Manifests allow you to process files in the /data/out folder without defining them explicitly in the output mapping. This allows for a flexible and dynamic output mapping whose structure is not known in advance. Using file names (e.g., out.c-data.my-table.csv) for the output mapping is great for saving implementation time in simple or POC components.

Important: All files in the /data/out/tables folder will be uploaded, not only those specified in the output mapping or manifests.

This is the expected CSV format (RFC 4180):

  • string enclosure "
  • delimiter ,
  • no escape character

A manifest file can specify a different enclosure and delimiter.
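For example, a component might produce an output table together with its manifest as follows. This is a minimal Python sketch; the file name and destination table are illustrative:

import csv
import json
import os

data_dir = os.environ.get("KBC_DATADIR", "/data/")
out_path = os.path.join(data_dir, "out", "tables", "result.csv")

# Write the table in the expected RFC 4180 format.
with open(out_path, "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "name"])
    writer.writerow(["1", "John"])

# The manifest pins the destination, so no explicit output mapping is needed.
with open(out_path + ".manifest", "w", encoding="utf-8") as f:
    json.dump({"destination": "out.c-data.result"}, f)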

Default Bucket

If you cannot define the bucket in advance, or want it derived automatically, set the Default Bucket option for your component in the Developer Portal.

All tables in /data/out/tables will then be uploaded to a bucket identified by your component id (generated when the component was created), configuration id (generated when an end-user adds a new component configuration) and stage (in or out). The file name (without the .csv suffix) will be used as the table name. The destination attributes in the output mapping and file manifests will be overridden.

Important: The Default Bucket flag always requires the config parameter when creating a job manually using the API, even if the referenced configuration does not exist in Storage.

Sliced Tables

Sometimes your component will download a CSV file in slices (chunks). You do not need to merge them manually; simply put them in a subfolder with the same name you would use for a single file. All files found in the subfolder are considered slices of the table:

/data/out/tables/myfile.csv/part01
/data/out/tables/myfile.csv/part02
/data/out/tables/myfile.csv.manifest

Sliced files cannot have header rows. They must have their columns specified in the manifest file or in the output mapping configuration.

The following is an example of specifying columns in the manifest file /data/out/tables/myfile.csv.manifest:

{
    "destination": "in.c-mybucket.table",
    "columns": ["col1", "col2", "col3"]
}

All files from the folder are uploaded irrespective of their name or extension. They are uploaded to Storage in parallel and in an undefined order. Use sliced tables when you want to upload tables larger than 5GB. The slices may be compressed with gzip. As a rule of thumb, slices work best at around 10-100 MB compressed.
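A minimal Python sketch of writing gzip-compressed slices and their shared manifest (the slice names, columns, and destination are illustrative):

import gzip
import json
import os

data_dir = os.environ.get("KBC_DATADIR", "/data/")
slice_dir = os.path.join(data_dir, "out", "tables", "myfile.csv")
os.makedirs(slice_dir, exist_ok=True)

# Slices have no header row; the columns are declared in the manifest below.
for i, rows in enumerate([["1,John\n"], ["2,Jane\n"]]):
    part = os.path.join(slice_dir, f"part{i:02d}.gz")
    with gzip.open(part, "wt", encoding="utf-8") as f:
        f.writelines(rows)

with open(slice_dir + ".manifest", "w", encoding="utf-8") as f:
    json.dump({"destination": "in.c-mybucket.table", "columns": ["id", "name"]}, f)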

Folder /data/in/files/

All files defined in the input mapping are stored in their raw form. Files are named {fileId}_{filename}, where fileId is the numeric ID of the file in Storage. All other information about the files is available in the manifest files.
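A minimal Python sketch of iterating over the input files and reading their manifests (the tags field shown here is an assumption about the manifest contents):

import json
import os

data_dir = os.environ.get("KBC_DATADIR", "/data/")
files_dir = os.path.join(data_dir, "in", "files")

for name in sorted(os.listdir(files_dir)):
    if name.endswith(".manifest"):
        continue
    # Every data file has a sibling manifest holding its metadata.
    with open(os.path.join(files_dir, name + ".manifest"), encoding="utf-8") as f:
        manifest = json.load(f)
    print(name, manifest.get("tags", []))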

Folder /data/out/files/

All files in this folder are uploaded to Storage. File names are preserved, and tags and other upload options can be specified in the manifest file. Note that all files in the /data/out/files folder will be uploaded, not only those specified in the output mapping.
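A minimal Python sketch of producing an output file with a manifest (the tags value and the is_permanent option are assumptions; consult the manifest specification for the full list of upload options):

import json
import os

data_dir = os.environ.get("KBC_DATADIR", "/data/")
file_path = os.path.join(data_dir, "out", "files", "report.pdf")

with open(file_path, "wb") as f:
    f.write(b"%PDF-...")  # raw file contents, illustrative

# Upload options go into the sibling manifest.
with open(file_path + ".manifest", "w", encoding="utf-8") as f:
    json.dump({"tags": ["my-component"], "is_permanent": False}, f)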

Exchanging Data via S3

The component may also exchange data with Storage using Amazon S3. In this case, the data folders contain only manifest files, not the actual data. This mode of operation can be enabled by setting the Staging storage input option to AWS S3 in the component settings. With this option enabled, every manifest file is extended with an additional s3 section.

Note: Exchanging data via S3 is currently only available for input mapping.

Exchanging Data via Workspace

Note: this is a preview feature and may change considerably in the future.

The component may also exchange data with Storage using Workspaces. This mode of operation can be enabled by setting the Staging storage input or Staging storage output option to Workspace Snowflake or Workspace Redshift. A workspace is an isolated database into which data is loaded before the component job runs and from which it is unloaded when the job finishes. The workspace is created just before the job starts and is deleted when the job terminates.

If this option is enabled, the table data folders will contain only manifest files. The actual data will be loaded as database tables into the workspace database. The destination in the input mapping and the source in the output mapping refer to database table names. This mode of operation is useful for components that manipulate data using SQL queries; the component can run arbitrary queries against the database. The database credentials are available in the authorization section of the configuration file:

{
  "storage": {

  },
  "parameters": {
    ...
  },
  "authorization": {
    "workspace": {
      "host": "database.example.com",
      "warehouse": "test",
      "database": "my-db",
      "schema": "my-schema"
      "user": "john-doe",
      "password": "secret"
    }
  }
}

Notice that some of the values might be empty for certain workspace backends (e.g., Redshift does not use warehouse). They will always be present, though.
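For instance, with the Snowflake backend, the component might open a connection as follows. This is a minimal sketch assuming the snowflake-connector-python package and the configuration file config.json in the data folder; the host-to-account conversion is an assumption, so check your driver's documentation:

import json
import os

import snowflake.connector

data_dir = os.environ.get("KBC_DATADIR", "/data/")

# The credentials live in the authorization.workspace section of the configuration.
with open(os.path.join(data_dir, "config.json"), encoding="utf-8") as f:
    workspace = json.load(f)["authorization"]["workspace"]

conn = snowflake.connector.connect(
    # The connector expects an account identifier rather than a full host name;
    # stripping the domain suffix here is an assumption.
    account=workspace["host"].replace(".snowflakecomputing.com", ""),
    user=workspace["user"],
    password=workspace["password"],
    warehouse=workspace["warehouse"],
    database=workspace["database"],
    schema=workspace["schema"],
)
conn.cursor().execute("SELECT 1")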

When exchanging data via a workspace, there are a couple of differences from exchanging data via local files:

  • Loading to workspaces supports only Storage tables; Storage files are always saved to the directory structure.
  • The days attribute is not supported for filtering tables; use changed_since instead.
  • Automatic Incremental Processing (also known as Adaptive Input Mapping) is not supported.
  • When used for output mapping, the columns of the output table must be specified; this can be done either in the output manifest or in the output mapping.

Note: Currently only some combinations of input/output staging storage settings are supported: local<->local, local<->s3, workspace-snowflake<->workspace-snowflake, workspace-redshift<->workspace-redshift.