Custom Science R

Your R Custom Science application can be created in multiple ways, as described below. There are no known limitations on the architecture of your R code. We recommend using our library, which provides useful functions for working with our environment. Please note:

  • The repository of your R application must always contain the main.R script.
  • All result .csv files must be written with the row.names = FALSE option; otherwise, KBC cannot read the file because it contains an unnamed column.
  • If an output mapping is set, your application must always produce all the tables and files listed in it (even if they are empty).

Packages

To install a package, use install.packages('packageName'). It is not necessary to specify the repository. If you wish to install a package from source, use devtools::install_github() (and friends). The R version is the same as for R transformations.
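
For example, installing one package from CRAN and another from source on GitHub might look like this (a minimal sketch; someUser/somePackage is a placeholder):

# install from CRAN (no need to specify the repository)
install.packages('data.table')

# install from source hosted on GitHub
devtools::install_github('someUser/somePackage', ref = 'master')

# load an installed package
library(data.table)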

Here is our current list of pre-installed packages. You can load them with the library() command. If you know of another useful standard package to pre-install, we would like to hear about it.

Using the KBC Package

The KBC R extension package provides functions to:

  • Read and parse the configuration file and parameters - configData property and getParameters() method.
  • List input files and tables - getInputFiles(), getInputTables() methods.
  • Work with manifests containing table and file metadata - getTableManifest(), getFileManifest(), writeTableManifest(), writeFileManifest() methods.
  • List expected outputs - getExpectedOutputFiles() and getExpectedOutputTables() methods.

The library is a standard R package that is available by default in the production environment. It is available on Github, so it can be installed locally with devtools::install_github('keboola/r-docker-application', ref = 'master').

To use the library to read the user-supplied configuration parameter ‘myParameter’:

library(keboola.r.docker.application)
# initialize application
app <- keboola.r.docker.application::DockerApplication$new('/data/')
app$readConfig()

# access the supplied value of 'myParameter'
app$getParameters()$myParameter

The library contains a single RC class, DockerApplication; the constructor parameter is the path to the data directory. Call readConfig() to actually read and parse the configuration file. The code above reads the myParameter parameter from the user-supplied configuration:

{
    "myParameter": "myValue"
}

You can obtain inline help and the list of library functions by running the ?DockerApplication command.

Dynamic Input/Output Mapping

In the Quick start tutorial, we have shown applications with hard-coded names of their input/output tables. This example shows how to read the input and output mapping specified by the end user, which is accessible in the configuration file. It demonstrates how to read and write tables and table manifests. File manifests are handled the same way. For a full authoritative list of items returned in the table list and manifest contents, see the specification.

Note that the destination label in the script refers to the destination from the mapper's perspective. The input mapper takes source tables from the user's Storage and produces destination tables that become the input of the extension. The output tables of the extension are consumed by the output mapper, whose destination is the resulting tables in Storage.
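
For illustration, the relevant part of a configuration file with such a mapping might look like this (a hedged sketch; the table names are placeholders, and the specification above is the authoritative reference):

{
    "storage": {
        "input": {
            "tables": [
                { "source": "in.c-main.customers", "destination": "customers.csv" }
            ]
        },
        "output": {
            "tables": [
                { "source": "result.csv", "destination": "out.c-main.results" }
            ]
        }
    }
}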

library(keboola.r.docker.application)

# initialize application
app <- DockerApplication$new('/data/')
app$readConfig()

# get list of input tables
tables <- app$getInputTables()
for (i in seq_len(nrow(tables))) {
    # get csv file name
    name <- tables[i, 'destination']

    # get csv full path and read table data
    data <- read.csv(tables[i, 'full_path'])

    # read table metadata
    manifest <- app$getTableManifest(name)
    if ((length(manifest$primary_key) == 0) && (nrow(data) > 0)) {
        # no primary key present, create one
        data[['primary_key']] <- seq(1, nrow(data))
    } else {
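        # a primary key is already defined (or the table is empty); drop any 'primary_key' column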
        data[['primary_key']] <- NULL
    }

    # do something clever
    names(data) <- paste0('batman_', names(data))

    # get csv file name with full path from output mapping
    outName <- app$getExpectedOutputTables()[i, 'full_path']
    # get file name from output mapping
    outDestination <- app$getExpectedOutputTables()[i, 'destination']

    # write output data
    write.csv(data, file = outName, row.names = FALSE)

    # write table metadata - set new primary key
    app$writeTableManifest(outName, destination = outDestination, primaryKey = c('batman_primary_key'))
}

The above code is located in a sample repository, so you can use it with the following runtime settings. Supply any number of input tables.

  • Repository: https://github.com/keboola/docs-custom-science-example-dynamic.git
  • Version: 0.0.1

To test the code, set an arbitrary number of input/output mapping tables. Remember to set the same number of inputs and outputs. The names of the CSV files are arbitrary.

Dynamic mapping screenshot

KBC Package Integration Options

Simple Example

In the simplest case, you can use the code from an R transformation to create a simple R script. It must be named main.R. To see a sample R script, go to our repository. Although this approach is the simplest and quickest, it offers limited options for testing and is generally suitable only for one-liners (i.e., you have an existing library which does all the work; all you need to do is execute it). In the example below, we supply the value /data/ to the constructor as the data directory, as that will always be true in our production environment.

library('keboola.r.docker.application')

# initialize application
app <- DockerApplication$new('/data/')
app$readConfig()

# read input
data <- read.csv("/data/in/tables/source.csv")

# do something
data['double_number'] <- data['number'] * app$getParameters()$multiplier

# write output
write.csv(data, file = "/data/out/tables/result.csv", row.names = FALSE)

Package Example

This example shows how an R package can be made to interact with our environment; the code is available in a git repository. We strongly recommend this approach over the previous simple example.

Wrapping the application logic into an R package makes testing and portability much easier.

Code

The application entry point is main.R in the package root folder.

devtools::load_all('/home/')
library(keboola.r.custom.application)
doSomething(Sys.getenv("KBC_DATADIR"))

This loads the package from the /home/ directory. It attaches the package defined in the DESCRIPTION file and calls the doSomething() function. The package name is arbitrary, but it must match the one defined in the DESCRIPTION file. The availability of the doSomething() function is determined by the contents of the NAMESPACE file. The NAMESPACE file is generated automatically by Roxygen when you Check the package in RStudio.

With this approach, you can organize your code and name your functions as you please. In the sample repository, the actual code is contained in the doSomething() function in the R/myPackage.R file. The code itself is identical to the previous example.
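
For illustration, such a function might look like this (a minimal sketch with roxygen comments; the exact body in the sample repository may differ, but the @export tag is what makes the function available through the generated NAMESPACE file):

#' Double the 'number' column of the source table and write the result.
#'
#' @param datadir Path to the KBC data directory.
#' @export
doSomething <- function(datadir) {
    # initialize the KBC application and read the configuration
    app <- keboola.r.docker.application::DockerApplication$new(datadir)
    app$readConfig()

    # read input, transform, write output
    data <- read.csv(file.path(datadir, 'in/tables/source.csv'))
    data['double_number'] <- data['number'] * app$getParameters()$multiplier
    write.csv(data, file = file.path(datadir, 'out/tables/result.csv'), row.names = FALSE)
}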

Test the sample code with this runtime setting:

  • Repository: https://github.com/keboola/docs-custom-science-example-r-package.git
  • Version: 0.0.5

Tests

Tests are organized in the /tests/ directory, which contains:

  • Subdirectory data/, which contains pre-generated sample data from the sandbox.
  • An optional config.R file, which can be used to set up the environment for running the tests; it can be created by copying config_template.R.
  • Subdirectory test_that/, which contains the actual testthat tests; a sample test is sketched below.
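
A minimal test in test_that/ might look like this (a sketch, assuming the doSomething() function from the package above and pre-generated sample data in tests/data/):

library(testthat)

test_that("doSomething produces the result table", {
    # KBC_DATADIR points to the sample data directory (set in config.R or by CI)
    datadir <- Sys.getenv("KBC_DATADIR")
    doSomething(datadir)

    # the output table must contain the computed column
    result <- read.csv(file.path(datadir, 'out/tables/result.csv'))
    expect_true('double_number' %in% names(result))
})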

You can run the tests locally from RStudio,

RStudio tests

or you can set them to run automatically using the Travis continuous integration server every time you push into your git repository. For that, use the provided .travis.yml file. See below for more information about continuous integration.

For a more thorough tutorial on developing R packages, see the R packages book.

Subclass Example

This example defines a subclass of the DockerApplication RC class from the KBC R package. RC (Reference) classes are one of the class systems available in R. This approach is fully comparable with the previous package example; there are no major differences or (dis)advantages. The repository, again, has to have the file main.R in its root. The difference is that we create the RC class CustomApplicationExample and call its run() method.

devtools::load_all('/home/')
library(keboola.r.custom.application.subclass)
app <- CustomApplicationExample$new(Sys.getenv("KBC_DATADIR"))
app$run()

The name of the class CustomApplicationExample is completely arbitrary and is defined in R/myApp.R. The application code itself is formally different, as all the methods are defined in the class. So, instead of

app <- DockerApplication$new(datadir)
app$readConfig()
data['double_number'] <- data['number'] * app$getParameters()$multiplier

use

readConfig()
data['double_number'] <- data['number'] * getParameters()$multiplier

within the body of CustomApplicationExample’s run method.
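
For illustration, the skeleton of such a subclass might look like this (a hedged sketch; the actual class in the sample repository is defined in R/myApp.R and may differ in detail):

CustomApplicationExample <- setRefClass(
    'CustomApplicationExample',
    contains = 'DockerApplication',
    methods = list(
        run = function() {
            # methods inherited from DockerApplication are called directly
            readConfig()

            # same logic as the simple example
            datadir <- Sys.getenv("KBC_DATADIR")
            data <- read.csv(file.path(datadir, 'in/tables/source.csv'))
            data['double_number'] <- data['number'] * getParameters()$multiplier
            write.csv(data, file = file.path(datadir, 'out/tables/result.csv'), row.names = FALSE)
        }
    )
)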

Test the sample code with this runtime setting:

  • Repository: https://github.com/keboola/docs-custom-science-example-r-subclass.git
  • Version: 0.0.4

Continuous Integration and Testing

When using the Package or Subclass approach, you can use standard R testing methods. We like the testthat package. Since it is important to run tests automatically, set them up to run every time you push a commit into your repository.

Integration with Travis

Travis offers an easy way of setting up continuous integration with GitHub repositories. To set up the integration, create a .travis.yml file in the root of your repository, and then link the repository to Travis. Travis offers R support. Add the KBC package (if you are using it) and set the data directory using the KBC_DATADIR environment variable, which will be picked up automatically by the KBC package:

language: r

sudo: required

# Be strict when checking our package
warnings_are_errors: true

# Install KBC Package
r_github_packages:
 - keboola/r-application
 - keboola/r-docker-application

# Set the data directory
before_install:
 - export KBC_DATADIR=$TRAVIS_BUILD_DIR/tests/data/

Integration Using Docker Compose

The above option is easy to set up, but it has two disadvantages:

  • it is specific to the Travis CI (Continuous Integration) service, and
  • it does not run your application in the exact same environment as the production code.

To fix both, take advantage of the fact that we run your application code in a Docker container. By using Docker Compose, you can set up the testing environment in exactly the same way as the production environment. Take a look at the sample repository described below.

Configuration

To run your tests in our Docker Container, you need to create a docker-compose.yml file in the root of your repository:

version: "2"

services:
  tests:
    image: quay.io/keboola/docker-custom-r:1.0.4
    tty: true
    stdin_open: true
    volumes:
      - ./:/src/
    command: /bin/sh /src/tests.sh

The image option defines which Docker image is used for running the tests. quay.io/keboola/docker-custom-r:1.0.4 refers to the image we use to run Custom Science extensions on our production servers. The 1.0.4 part is an image tag, which changes from time to time; you should generally use the highest version. The volumes option maps the current directory to the /src/ directory inside the image. The command option defines the command for running the tests: /bin/sh /src/tests.sh. It is run inside the Docker image, so you do not need to have a shell available on your own machine.

This leads us to the tests.sh file, which should also be created in the root of your repository:

#!/bin/sh

R CMD build /src/
R CMD check --as-cran /src/

The above simple shell script will first try to build your package using R CMD build, and then check it (which runs the tests) with R CMD check. This assumes you are using the package approach. If you are using another approach, modify these commands to run your tests; see the sketch below for one possibility. Don't forget that the /src/ directory maps to the root directory of your repository (we have defined this in docker-compose.yml).
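
If you are not building a package, a minimal alternative tests.sh might run the testthat tests directly (a sketch, assuming your tests live in /src/tests/test_that/):

#!/bin/sh

# run the testthat tests directly instead of building and checking a package
Rscript -e 'testthat::test_dir("/src/tests/test_that")'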

Running

To run the tests in the Docker container, make sure you have Docker installed on your machine, and execute the following command (in the root of your repository):

docker-compose run --rm -e KBC_DATADIR=/src/tests/data/ tests

Docker Compose will process docker-compose.yml and execute the tests service defined in it. This service takes our docker-custom-r image and maps the current directory into the /src/ directory inside the image. Then it executes the shell script /src/tests.sh inside that image, where /src/tests.sh refers to the tests.sh script in the root of your repository. This will build and check the R package. The option -e KBC_DATADIR=/src/tests/data/ sets the environment variable KBC_DATADIR to the data directory, so that it refers to the tests/data/ directory in the root of your repository.

Running on Travis

To run the tests in a Docker container automatically, set them up, again, to run on every push to your git repository. Now you are not limited to CI services with R support; you can use any CI service with Docker support. You can still use Travis, as shown in the following .travis.yml configuration:

sudo: required

services:
  - docker

before_script:
  - docker -v
  - docker-compose -v
  - docker-compose build tests

script:
  - docker-compose run --rm -e KBC_DATADIR=/src/tests/data/ tests

Most of the configuration is related to setting up Docker; the only important parts are the last two lines. docker-compose build tests will build the Docker image; this step is effectively skipped when you are not using your own Dockerfile. The docker-compose run --rm -e KBC_DATADIR=/src/tests/data/ tests command is the most important, as it actually runs docker-compose and, subsequently, all the tests. It is the same command you can use locally.

Also, create an .Rbuildignore file to avoid warnings about unrecognized files in the root of your package repository:

^.*\.Rproj$
^\.Rproj\.user$
^main\.R$
^\.travis\.yml$
^docker-compose\.yml$
^tests\.sh$
^\.git$
^\.gitignore$

All the above configuration is available in the sample repository.