Data Folders Specification

Data folders are one of the possible channels to exchange data between your component and Keboola.

Root Folder /data/

The /data/ folder is the root folder for exchanging data. Your component reads its input from the /data/in folder and writes its results to the /data/out folder. Keboola takes care of injecting the required tables and files into the input folder and picking up the tables and files generated by your code. The data folders contain the actual data (tables and files) and metadata. For each data file, a manifest file is created; it contains metadata (creation time, keys for tables, etc.).

The data folder is always available in the component under the absolute /data/ path. The relative path to the data folder depends fully on your component code (or Dockerfile). If you want to use a different path (for component development), use the KBC_DATADIR environment variable. In production, this variable will always be set to /data/. During development, you can set it to your liking.
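
For example, a minimal Python sketch resolving the data folder (the fallback to /data/ mirrors the production default described above):

import os

# Honor KBC_DATADIR when set (useful during development),
# otherwise fall back to the production default /data/.
data_dir = os.environ.get("KBC_DATADIR", "/data/")

tables_in = os.path.join(data_dir, "in", "tables")
tables_out = os.path.join(data_dir, "out", "tables")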

To create a data folder sample, use the Debug API call via the Docker Runner API. All the resources you need in your component will be provided in a ZIP archive.

The predefined data exchange folder structure is as follows:

/data/in/tables
/data/in/files
/data/out/tables
/data/out/files

This folder structure is always available to your component. Do not put arbitrary files in the /data/ folder, as they will be uploaded into the user's project (or cause errors in the output mapping). For working or temporary files, use the /tmp/ folder. Other directories have 10 GB of free space in total.

Folder /data/in/tables/

The folder contains tables defined in the input mapping; they are serialized in the CSV format:

  • string enclosure "
  • delimiter ,
  • no escape character

File names are specified in the input mapping, defaulting to {tableId} (a file name can be changed in the UI). The table metadata is stored in a manifest file.
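
As an illustration, a minimal Python sketch reading one of these tables (the file name my-table.csv is a hypothetical example; the dialect matches the format listed above):

import csv
import os

data_dir = os.environ.get("KBC_DATADIR", "/data/")
table_path = os.path.join(data_dir, "in", "tables", "my-table.csv")  # hypothetical file name

with open(table_path, encoding="utf-8", newline="") as f:
    reader = csv.reader(f, delimiter=",", quotechar='"')
    header = next(reader)  # the first row carries the column names
    for row in reader:
        pass  # process the row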

Folder /data/out/tables/

All output tables from your component must be placed in this folder. The destination table in Storage is defined by the following rules (listed in order):

  • If defaultBucket (see below) is specified, the table will be uploaded to a bucket whose name is derived from the component ID and configuration ID.
  • If the output mapping is specified (through the UI, usually) and its source matches the physical file name in the /data/out/tables folder, the destination is the name of the table in Storage. An output mapping which cannot be matched to a physical file produces an error (i.e., fulfilling the output mapping is mandatory).
  • If a manifest file exists, it can specify the destination of the table in Storage if no output mapping is present.
  • If none of the above options are used, the destination is defined by the name of the file (for example, out.c-data.my-table.csv will create the table my-table in the out.c-data bucket). The file name must have at least two dots.
  • If neither rule can be applied, an error is thrown.

Manifests allow files in the /data/out folder to be processed without being explicitly defined in the output mapping. That allows for a flexible and dynamic output mapping whose structure is unknown at the beginning. Using file names (e.g., out.c-data.my-table.csv) for the output mapping is a great way to save implementation time in a simple or proof-of-concept component.

Important: All files in the /data/out/tables folder will be uploaded, not only those specified in the output mapping or manifests.

This is the expected CSV format (RFC 4180):

  • string enclosure "
  • delimiter ,
  • no escape character

A manifest file can specify a different enclosure and delimiter.
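
The following is a minimal Python sketch writing an output table together with its manifest (the file name and the destination bucket/table are hypothetical examples; destination, delimiter and enclosure are the manifest fields mentioned above):

import csv
import json
import os

data_dir = os.environ.get("KBC_DATADIR", "/data/")
out_path = os.path.join(data_dir, "out", "tables", "result.csv")  # hypothetical file name

# Write the table in the expected CSV format ("," delimiter, '"' enclosure).
with open(out_path, "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f, delimiter=",", quotechar='"')
    writer.writerow(["id", "name"])
    writer.writerow(["1", "first"])

# The manifest tells Storage where to load the table; it is optional
# if an output mapping or the file-name convention is used instead.
manifest = {
    "destination": "out.c-my-bucket.result",  # hypothetical destination table
    "delimiter": ",",
    "enclosure": '"'
}
with open(out_path + ".manifest", "w", encoding="utf-8") as f:
    json.dump(manifest, f)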

Default Bucket

If you cannot define a bucket or want to get it automatically, set the Default Bucket for your component in the Developer Portal.

All tables in /data/out/tables will then be uploaded to a bucket identified by your component id (generated when the component was created), configuration id (generated when an end-user adds a new component configuration) and stage (in or out). The file name (without the .csv suffix) will be used as the table name. The destination attributes in the output mapping and file manifests will be overridden.

Important: The Default Bucket flag always requires the config parameter when creating a job manually using the API, even if the referenced configuration does not exist in Storage.

Sliced Tables

Sometimes your component will download the CSV file in slices (chunks). You do not need to merge them manually; simply put them in a subfolder with the same name you would use for a single file. All files found in the subfolder are considered slices of the table.

/data/out/tables/myfile.csv/part01
/data/out/tables/myfile.csv/part02
/data/out/tables/myfile.csv.manifest

Sliced files cannot have header rows. They must have their columns specified in the manifest file or in the output mapping configuration.

The following is an example of specifying columns in the manifest file /data/out/tables/myfile.csv.manifest:

{
    "destination": "in.c-mybucket.table",
    "columns": ["col1", "col2", "col3"]
}

All files from the folder are uploaded irrespective of their name or extension. They are uploaded to Storage in parallel and in an undefined order. Use sliced tables if you want to upload tables larger than 5 GB. The slices may be compressed with gzip. As a rule of thumb, compressed slices of around 10-100 MB work best.
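
A minimal Python sketch producing a sliced table as described above (the slice names and data are hypothetical; the slices have no header rows, so the columns are declared in the manifest):

import csv
import json
import os

data_dir = os.environ.get("KBC_DATADIR", "/data/")
slice_dir = os.path.join(data_dir, "out", "tables", "myfile.csv")
os.makedirs(slice_dir, exist_ok=True)

# Write two slices without header rows.
for name, rows in [("part01", [["1", "a"]]), ("part02", [["2", "b"]])]:
    with open(os.path.join(slice_dir, name), "w", encoding="utf-8", newline="") as f:
        csv.writer(f).writerows(rows)

# Columns must be specified because the slices carry no header.
manifest = {
    "destination": "in.c-mybucket.table",
    "columns": ["col1", "col2"]
}
with open(slice_dir + ".manifest", "w", encoding="utf-8") as f:
    json.dump(manifest, f)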

Folder /data/in/files/

All files defined in the input mapping are stored in their raw form. File names follow the pattern {fileId}_{filename}, where {fileId} is the numeric ID of the file in Storage. All other information about the files is available in the manifest file.
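
A minimal Python sketch iterating over the input files and their metadata (the manifest naming pattern {fileId}_{filename}.manifest and the name and tags fields are assumptions based on the description above; verify them against an actual data folder sample):

import json
import os

data_dir = os.environ.get("KBC_DATADIR", "/data/")
files_dir = os.path.join(data_dir, "in", "files")

for entry in sorted(os.listdir(files_dir)):
    if entry.endswith(".manifest"):
        continue  # skip the metadata files themselves
    with open(os.path.join(files_dir, entry + ".manifest"), encoding="utf-8") as f:
        meta = json.load(f)
    # The manifest typically carries the original name, tags, etc. (assumed field names).
    print(entry, meta.get("name"), meta.get("tags"))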

Folder /data/out/files/

All files in this folder are uploaded to Storage. File names are preserved, and tags and other upload options can be specified in the manifest file. Note that all files in the /data/out/files folder will be uploaded, not only those specified in the output mapping.
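
A minimal Python sketch writing an output file with tags in its manifest (the file name and tag values are hypothetical examples):

import json
import os

data_dir = os.environ.get("KBC_DATADIR", "/data/")
out_path = os.path.join(data_dir, "out", "files", "report.txt")  # hypothetical file name

with open(out_path, "w", encoding="utf-8") as f:
    f.write("some content\n")

# Tags (and other upload options) go into the accompanying manifest.
with open(out_path + ".manifest", "w", encoding="utf-8") as f:
    json.dump({"tags": ["my-component", "report"]}, f)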

Exchanging Data via S3

The component may also exchange data with Storage using Amazon S3. In this case, the data folders contain only manifest files and not the actual data. This mode of operation can be enabled by setting the Staging storage input option to AWS S3 in component settings. If this option is enabled, all the data folders will contain only manifest files, extended with an additional s3 section.

Note: Exchanging data via S3 is currently only available for input mapping.
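
A minimal Python sketch downloading one input table using the s3 section of its manifest (the table name is hypothetical, and the field names bucket, key, region and credentials are assumptions about the shape of the s3 section; check them against a real manifest):

import json
import os

import boto3

data_dir = os.environ.get("KBC_DATADIR", "/data/")
table_path = os.path.join(data_dir, "in", "tables", "my-table.csv")  # hypothetical table name

with open(table_path + ".manifest", encoding="utf-8") as f:
    s3_info = json.load(f)["s3"]

# Credentials and object location come from the manifest (assumed field names).
creds = s3_info["credentials"]
client = boto3.client(
    "s3",
    region_name=s3_info["region"],
    aws_access_key_id=creds["access_key_id"],
    aws_secret_access_key=creds["secret_access_key"],
    aws_session_token=creds["session_token"],
)
client.download_file(s3_info["bucket"], s3_info["key"], table_path)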

Exchanging Data via ABS

The component may also exchange data with Storage using Azure Blob Storage (ABS). In this case, the data folders contain only manifest files and not the actual data. This mode of operation can be enabled by setting the Staging storage input option to ABS in component settings. If this option is enabled, all the data folders will contain only manifest files, extended with an additional abs section.

Note: Exchanging data via ABS is currently only available for input mapping.

Exchanging Data via Database Workspace

Note: this is a preview feature and may change considerably in the future.

The component may also exchange data with Storage using Workspaces. This mode of operation can be enabled by setting the Staging storage input or Staging storage output option to Workspace Snowflake, Workspace Redshift, or Workspace Synapse. A workspace is an isolated database to which data are loaded before the component job is run and unloaded when the job finishes. The workspace is created just before the job starts and is deleted when the job is terminated.

Using this option will load Storage Tables into the provided storage workspace, but Storage Files will still be loaded into the local filesystem as in the standard configuration.

If this option is enabled, the table data folder will contain only manifest files. The actual data will be loaded as database tables into the workspace database. The destination in input and source in output refer to database table names. This mode of operation is useful for components which want to manipulate data using SQL queries. The component can run arbitrary queries against the database. The database credentials are available in the authorization section of the configuration file:

{
  "storage": {

  },
  "parameters": {
    ...
  },
  "authorization": {
    "workspace": {
      "host": "database.example.com",
      "warehouse": "test",
      "database": "my-db",
      "schema": "my-schema"
      "user": "john-doe",
      "password": "secret"
    }
  }
}

Note that some of the values might be empty for certain workspace backends (e.g., Redshift does not use warehouse). They will always be present, though.
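
For a Snowflake workspace, a minimal Python sketch using these credentials might look like this (the configuration file is read from /data/config.json; deriving the account identifier from host and the queried table name are assumptions):

import json

import snowflake.connector

with open("/data/config.json", encoding="utf-8") as f:
    workspace = json.load(f)["authorization"]["workspace"]

# The Snowflake connector expects an account identifier rather than a host name;
# stripping the snowflakecomputing.com suffix is an assumption about the host format.
account = workspace["host"].replace(".snowflakecomputing.com", "")

conn = snowflake.connector.connect(
    account=account,
    user=workspace["user"],
    password=workspace["password"],
    warehouse=workspace["warehouse"],
    database=workspace["database"],
    schema=workspace["schema"],
)
cur = conn.cursor()
cur.execute('SELECT COUNT(*) FROM "my-input-table"')  # hypothetical table loaded by the input mapping
print(cur.fetchone()[0])
cur.close()
conn.close()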

When exchanging data via a workspace, there are a couple of differences compared to loading data into files:

  • Loading to workspaces supports only Storage tables; Storage files are always saved to the directory structure.
  • The days attribute is not supported for filtering tables; use changed_since instead.
  • Automatic Incremental Processing (also known as Adaptive Input Mapping) is not supported.
  • When used for output mapping, the columns of the output table must be specified; this can be done either in the output manifest or in the output mapping.

Note: Currently only some combinations of input/output staging storage settings are supported: local<->local, local<->s3, workspace-snowflake<->workspace-snowflake, workspace-redshift<->workspace-redshift.

Exchanging Data via File System Workspace

Note: this is a preview feature and may change considerably in the future. Currently, only Azure Blob Storage workspaces (abs-workspace) are supported for this type, and they only work with the Synapse storage backend.

The component may also exchange data with a provisioned file workspace (Azure Blob Storage) using Workspaces. This mode of operation can be enabled by setting the Staging storage input or Staging storage output option to Workspace ABS. A file system workspace is an isolated file storage into which data are loaded before the component job is run (when Staging storage input is set) and from which data are unloaded when the job finishes (when Staging storage output is set). The workspace is created just before the job starts and is deleted when the job is terminated.

If this option is enabled, the data and the manifests will be loaded into the Azure Storage blob container under the data folder, laid out similarly to the default local filesystem.

Files

Files are loaded into the workspace as [file name]/[file ID]. For example, if a file test.txt with ID 12345 is in the input mapping, it will appear in the storage blob container at the URL https://[storage_account_name].blob.core.windows.net/[container-name]/data/in/files/test.txt/12345

Tables

Note that this is only available on the Synapse storage backend.

Synapse exports tables only as sliced files. For example, if the table input mapping specifies in.c-main.my-input as the source and my-input.csv as the destination, you will find the table in the ABS workspace with the following structure:

  • [containerName]/data/in/tables/my-input.csv/[random identifier1].txt
  • [containerName]/data/in/tables/my-input.csv/[random identifier2].txt
  • [containerName]/data/in/tables/my-input.csv/[random identifier3].txt

Mappings

To sum up, below is a sample storage configuration and an overview of where the data is read from and written to:

Direction | Source | Destination
input | table in.c-main.my-table-from-abs-workspace | many slices like [abs-workspace-root]/data/in/tables/my-input-table.csv/[random identifier].txt
input | file input-file.txt with tag my-input-files | [abs-workspace-root]/data/in/files/input-file.txt/[file ID]
output | [abs-workspace-root]/data/out/tables/my-output-table.csv | table out.c-main.my-table-from-abs-workspace
output | [abs-workspace-root]/data/out/files/my-file.txt | file my-file.txt with tag uploaded-from-abs-workspace

The corresponding storage configuration:
{
  "storage": {
    "input": {
      "tables": [
        {
          "source": "in.c-main.my-table-from-abs-workspace",
          "destination": "my-input-table.csv"
        },
        ...
      ],
      "files": [
        {
          "tags": ["my-input-files"]
        }
      ]
    },
    "output": {
      "tables": [
        {
          "source": "my-output-table.csv",
          "destination": "out.c-main.my-table-from-abs-workspace"
        },
        ...
      ],
      "files": [
        {
          "source": "my-file.txt",
          "tags": ["uploaded-from-abs-workspace"]
        }
      ]
    }
  },
  ...
}

Authorization


To connect to the ABS workspace, use the SAS connection string stored in the authorization section of the configuration file, as shown below.

{
  "storage": {
    ...
  },
  "parameters": {
    ...
  },
  "authorization": {
    "workspace": {
      "container": "azure-storage-blob-container",
      "connectionString": "azure-storage-blob-SAS-connection-string",
    }
  }
}
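
A minimal Python sketch using these values to list the slices of one input table in the workspace container (the azure-storage-blob package and the table name are assumptions; the data/in/tables/ prefix follows the layout described above):

import json

from azure.storage.blob import ContainerClient

with open("/data/config.json", encoding="utf-8") as f:
    workspace = json.load(f)["authorization"]["workspace"]

container = ContainerClient.from_connection_string(
    conn_str=workspace["connectionString"],
    container_name=workspace["container"],
)

# List the slices of a hypothetical input table my-input-table.csv.
for blob in container.list_blobs(name_starts_with="data/in/tables/my-input-table.csv/"):
    print(blob.name, blob.size)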