Your Python Custom Science Application can be created in multiple ways (as described below). There are no known limitations to the architecture of your Python code. We recommend that you use our library. It provides useful functions for working with our environment. Please note:
To install a custom package, use e.g.:
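One possible way is to call pip from within your script; a minimal sketch follows (the package name requests is only an illustration, and the exact pip invocation may need adjusting for your environment):

```python
import subprocess
import sys

# install an additional package at runtime by calling pip;
# 'requests' is only an example package name
subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'requests'])
```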
Here is our current list of pre-installed packages; you can use them with a simple import. If you know of another useful standard package to pre-install, we would like to hear about it.
Tables from Storage are imported to the Python script as CSV files. The CSV files can be read by standard Python functions from the csv package, either into lists (numbered columns) or into dictionaries (named columns). The directory structure follows our general Docker interface: input tables are stored as CSV files in in/tables/ and output tables are written to out/tables/. It is recommended to explicitly specify the CSV formatting options.
Below is a sketch of the code for basic reading and writing of files; a full example is also available in our sample repository.
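A minimal sketch, assuming the input table is in/tables/source.csv and the output table is out/tables/destination.csv (both names, like the formatting options, are only illustrative):

```python
import csv

with open('in/tables/source.csv', mode='rt', encoding='utf-8') as in_file, \
        open('out/tables/destination.csv', mode='wt', encoding='utf-8', newline='') as out_file:
    # strip null characters while reading the file lazily, line by line
    lazy_lines = (line.replace('\0', '') for line in in_file)

    # explicitly specify the CSV formatting options
    reader = csv.DictReader(lazy_lines, delimiter=',', quotechar='"')
    writer = csv.DictWriter(out_file, fieldnames=reader.fieldnames,
                            delimiter=',', quotechar='"', lineterminator='\n')
    writer.writeheader()

    for row in reader:
        # process each row here and write it to the output table
        writer.writerow(row)
```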
The above example shows how to process the file line by line; this is the most memory-efficient way and allows you to process data files of any size. The expression lazy_lines = (line.replace('\0', '') for line in in_file) is a generator which makes sure that null characters are properly handled. It is also important to use encoding='utf-8' when reading and writing files.
To test the above code, you can use a sample source table in Storage and the following runtime configuration:
The KBC Python extension package provides functions for reading the configuration file and its parameters, the input/output mapping, and table and file manifests. Additionally, it defines the KBC CSV dialect to shorten the CSV manipulation code.
The library is a standard Python package that is available by default in the production environment. It is also available on GitHub, so it can be installed locally with pip install git+git://github.com/keboola/python-docker-application.git.
Generated documentation is available for the package; an actual working example can be found in our sample repository. Also note that the library does no special magic; it is just a means to simplify things a bit for you.
To read the user-supplied configuration parameter ‘myParameter’, use code along the following lines.
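A minimal sketch, assuming the library is imported as keboola.docker and exposes a get_parameters() accessor (check the package documentation for the exact API):

```python
from keboola import docker

# the constructor parameter is the path to the data directory
cfg = docker.Config('/data/')
params = cfg.get_parameters()

# value of the user-supplied 'myParameter' parameter
my_parameter = params['myParameter']
print(my_parameter)
```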
The library contains a single class, Config; the parameter of its constructor is the path to the data directory. The above code reads the myParameter parameter from the user-supplied configuration.
An example of the above approach is available in our repository.
Note that we have also simplified reading and writing of the CSV files by using the dialect='kbc' option. The dialect is registered automatically when the Config class is initialized.
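A brief sketch of using the dialect; the output file name and column names are hypothetical:

```python
import csv
from keboola import docker

# initializing Config registers the 'kbc' CSV dialect
cfg = docker.Config('/data/')

with open('/data/out/tables/result.csv', mode='wt', encoding='utf-8', newline='') as out_file:
    writer = csv.DictWriter(out_file, fieldnames=['id', 'value'], dialect='kbc')
    writer.writeheader()
    writer.writerow({'id': 1, 'value': 'hello'})
```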
You can test the code with the following runtime configuration:
And with the following parameters:
In the Quick start tutorial and the above examples, we have shown applications which have the names of their input/output tables hard-coded. The following example shows how to read an input and output mapping specified by the end-user, which is accessible in the configuration file. It demonstrates how to read and write tables and table manifests; file manifests are handled the same way. For a full, authoritative list of items returned in the table list and manifest contents, see the specification.
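A rough sketch of such dynamic processing is shown below. It assumes the library exposes get_input_tables() and get_expected_output_tables() accessors returning the items of the input and output mapping (verify the exact method names against the package documentation), and the processing itself is only a pass-through copy:

```python
import csv
from keboola import docker

cfg = docker.Config('/data/')

# tables defined by the end-user in the input and output mapping
in_tables = cfg.get_input_tables()
out_tables = cfg.get_expected_output_tables()

for in_table, out_table in zip(in_tables, out_tables):
    # 'full_path' points to the CSV file on disk,
    # 'destination' is the table name from the mapper's perspective
    with open(in_table['full_path'], mode='rt', encoding='utf-8') as in_file, \
            open(out_table['full_path'], mode='wt', encoding='utf-8', newline='') as out_file:
        lazy_lines = (line.replace('\0', '') for line in in_file)
        reader = csv.DictReader(lazy_lines, dialect='kbc')
        writer = csv.DictWriter(out_file, fieldnames=reader.fieldnames, dialect='kbc')
        writer.writeheader()
        for row in reader:
            # do the actual processing here; this sketch only copies the rows
            writer.writerow(row)
```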
Note that the destination label in the script refers to the destination from the mapper's perspective. The input mapper takes source tables from the user's Storage and produces destination tables that become the input of your extension. The output tables of your extension are consumed by the output mapper, whose destination tables are the resulting tables in Storage.
A complete version of this code is located in a sample repository, so you can use it with the runtime settings and supply any number of input tables. To test the code, set an arbitrary number of input/output mapping tables; remember to set the same number of inputs and outputs. The names of the CSV files are arbitrary.
An important part of the application is handling errors. As per the specification, we assume the following command return codes: 0 = no error, 1 = user error (shown to the end-user in KBC), > 1 = application error (the end-user will receive only a generic message). To implement this, you should wrap your entire script in a try/except block.
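A minimal sketch of such a wrapper; the do_something() function stands in for your actual application code:

```python
import sys
import traceback


def do_something():
    # the actual work of your application goes here
    pass


try:
    do_something()
except ValueError as err:
    # user error - the message is shown to the end-user in KBC
    print(err, file=sys.stderr)
    sys.exit(1)
except Exception:
    # application error - the end-user receives only a generic message
    traceback.print_exc(file=sys.stderr)
    sys.exit(2)
```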
In this case, we consider everything derived from ValueError to be an error which should be shown to the end-user. Every other error will lead to a generic message, and only developers will see the details. You can, of course, modify this logic to your liking.
Since the organization of your code is completely up to you, nothing prevents you from wrapping the entire application in a class.
Actually, we do recommend this approach since it makes the code more contained and allows for greater testability.
To wrap the code in a class, create a file for the application class (for example, sample_application.py).
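A minimal sketch of such a file; the class name App is only a placeholder used throughout this example:

```python
class App:
    def run(self):
        # the actual application logic goes here - e.g. reading input
        # tables and writing output tables as shown in the examples above
        pass
```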
Then create a runner main.py for your application.
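A sketch of the runner, again using the placeholder class name App and the error-handling convention described above:

```python
import sys
import traceback

from sample_application import App

try:
    App().run()
except ValueError as err:
    # user error - shown to the end-user in KBC
    print(err, file=sys.stderr)
    sys.exit(1)
except Exception:
    # application error - the end-user sees only a generic message
    traceback.print_exc(file=sys.stderr)
    sys.exit(2)
```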
Here, sample_application is the name of the file in which the application class is defined, and App (the placeholder class name used in the sketches above) is the name of the actual class. This arrangement also takes care of handling errors and keeps the main.py runner clean.
You can look at the sample application or
at a more complicated TextSpliter application.
Once you have your script contained, use standard Python testing methods such as unittest. It is important to run tests automatically. Set them up to run every time you push a commit into your repository.
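For illustration, a tiny unittest sketch using the placeholder names from above; a test file like this would typically live in the tests/ directory:

```python
import unittest

from sample_application import App


class TestApp(unittest.TestCase):
    def test_run_completes(self):
        # a trivial smoke test - the application should run without raising
        self.assertIsNone(App().run())


if __name__ == '__main__':
    unittest.main()
```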
Travis offers an easy way to set up continuous integration with GitHub repositories. To set up the integration, create a .travis.yml file in the root of your repository and then link the repository to Travis.
Travis has Python support; you only need to add the KBC package (if you are using it) and set the data directory using the KBC_DATADIR environment variable, which will be picked up automatically by the KBC package.
The above option is easy to set up, but it has two disadvantages:
To fix both, take advantage of the fact that we will run your application code in a Docker container. By using Docker Compose, you can set up the testing environment in exactly the same manner as the production environment. Have a look at the sample repository described below.
To run your tests in our Docker container, create a docker-compose.yml file in the root of your repository defining a tests service with the options described below.
The image option defines what Docker image is used for running the tests; quay.io/keboola/docker-custom-python:latest refers to the image we use to run Custom Science extensions on our production servers. The latest part refers to an image tag, which points to the highest stable version.
The volumes option defines that the current directory will be mapped to the /src/ directory inside the container and that the test/data/ directory will be mapped to the /data/ directory inside the container (this replicates the production data directory).
The command option defines the command for running the tests: python -m unittest discover tests. It will be run inside the Docker container, so you do not need to have a Python shell available on your machine. The above command will run the tests with unittest. Don't forget that the /src/ directory maps to the root directory of your repository (we have defined this mapping in the volumes option).
The KBC_DATADIR=/src/test/data/ option sets the KBC_DATADIR environment variable to the data directory so that it refers to the test/data/ directory in the root of your repository.
To run the tests in the Docker container, install Docker on your machine, and execute the following command line (in the root of your repository):
docker-compose run --rm tests
Docker Compose will process the docker-compose.yml and execute the tests service as defined on its 4th line. This service will take our docker-custom-python image and map the current directory into the /src/ directory inside the created container. Then it will execute the python -m unittest discover command inside that container. This will check your Python class.
To run the tests in a Docker container automatically, set them up, again, to run on every push to your git repository. Now you are not limited to CI services with Python support; you can use any CI service with Docker support. You can also use Travis for this; the docker-compose run --rm tests command is the most important part of the configuration, as it actually runs docker-compose and, subsequently, all the tests (it is the same command you can use locally).
All the above configuration is available in the sample repository.