Here are some good practices for developing component code. They are best followed across all components, especially if you want your component to be published. We also recommend that you check our component templates.
Developing a component is a challenging task. To maximize your efficiency, follow these basic rules:
Before you create any complex components, be sure to read about configurations and processors as they can substantially simplify your component code. We also recommend that you use our common interface library, which is available for Python, R, and PHP.
You may use any Docker image you see fit. We recommend basing your images on those from an official repository because they are the most stable ones.
We publicly provide images for transformations and sandboxes. Both share a common ancestor image with a couple of pre-installed packages (which saves a lot of time when building the image yourself). This means that the images for R and Python share the same code base and always use the exact same version of R and Python, respectively.
All of the repositories use Semantic versioning tags. These are always fixed to a specific image build.
The latest tag is also available and always points to the latest tagged build. That means that the latest tag can be used safely (though it refers to different versions over time).
KBC components can be used to process substantial amounts of data (e.g., dozens of gigabytes) that will not fit into memory. Every component should therefore be written so that it processes data in chunks of a limited size (typically rows of a table). Many of the KBC components run with less than 100 MB of memory. While the KBC platform is capable of running jobs with ~8 GB of memory without problems, we are not particularly happy to allow it, and we certainly do not want to allow components whose memory use depends on the size of the processed data.
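As an illustration of chunked processing, the following minimal Python sketch streams a CSV table row by row, so memory use stays bounded regardless of the table size. The function name and the transformation (upper-casing one column) are hypothetical and serve only as an example.

```python
import csv


def process_table(src_path, dst_path, column_index=0):
    """Stream a CSV file row by row; only one row is held in memory at a time.

    The transformation (upper-casing one column) is a placeholder.
    Returns the number of rows written.
    """
    rows_written = 0
    with open(src_path, newline="", encoding="utf-8") as src, \
            open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            # Skip the transformation for rows that are too short.
            if len(row) > column_index:
                row[column_index] = row[column_index].upper()
            writer.writerow(row)
            rows_written += 1
    return rows_written
```

Because the reader is consumed lazily, the same code handles a table of a few rows and one of many gigabytes with roughly the same memory footprint.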
Depending on the component exit code, the component job is marked as successful or failed.
exit code = 0: The job is considered successful.
exit code = 1: The job fails with a user error.
exit code > 1: The job fails with an application error.
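One way to map outcomes to these exit codes is sketched below in Python. The `UserError` class and the `run_with_exit_code` helper are hypothetical names introduced for this example; they are not part of any library mentioned here.

```python
import sys
import traceback


class UserError(Exception):
    """Hypothetical exception type for problems the end user can fix."""


def run_with_exit_code(main):
    """Run main() and translate its outcome into an exit code.

    0 = success, 1 = user error, 2 = application error (any value > 1).
    """
    try:
        main()
        return 0
    except UserError as err:
        # Short message describing what the user should change.
        print(str(err), file=sys.stderr)
        return 1
    except Exception:
        # Unexpected failure: print the full traceback to STDERR.
        traceback.print_exc(file=sys.stderr)
        return 2
```

In the component entry point, the result would typically be passed to `sys.exit`, for example `sys.exit(run_with_exit_code(main))`.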
During component execution, all output sent to STDOUT is captured and sent live to job events. The output to STDERR is captured too, and if the job is successful or fails with a user error, it is displayed as the last event of the job. If the job ends with an application error, the entire contents of STDERR are hidden from the end user and sent only to vendor internal logs. The end user will see only a canned response (‘An application error occurred’) with the option to contact our support.
This means that you do not have to worry about the internals of your component leaking to the end user, provided that the component exit code is correct. On the other hand, a user error is supposed to be solvable by the end user. When creating an error message, stick to the following rules:
Also keep in mind that the output of the components (job events) serves to pass only informational and error messages; no data can be passed through. The event message size is limited (about 64 KB). If the limit is exceeded, the message will be trimmed. If the component produces an excessive amount of output (dozens of MBs) in a very short time, it may be terminated with an internal error. Also make sure your component does not use any output buffering; otherwise, all events will be held back and delivered only after the application finishes.
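To avoid output buffering in a Python component, each message can be flushed as soon as it is written. The `log_event` helper below is a hypothetical name used only for this sketch.

```python
import sys


def log_event(message, stream=None):
    """Write one message and flush immediately so it is not held in a buffer."""
    stream = stream if stream is not None else sys.stdout
    # Normalize to exactly one trailing newline per message.
    stream.write(message.rstrip("\n") + "\n")
    stream.flush()
```

Alternatively, running the interpreter with `python -u` or setting the `PYTHONUNBUFFERED` environment variable disables buffering of the standard streams without changing the code.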
Processors allow the end user to customize the input to the component and the output from it. That means that many custom requirements can be solved by processors, keeping the component code general.
Choosing whether to implement a specific feature as a processor or as part of your component may be difficult. A processor might be a good solution if the following are true:
The first condition is especially important. Another way to read it is that a processor must never supply a function expected from the component. In other words, each component should be able to consume or generate valid input or output without any processors. For example, if an extractor can produce tables without any further processing, it should output tables; if it cannot, it should output only files, and processors should do the rest. If processors are used together with configuration rows, the last condition is weakened, because a different set of processors may be applied to each configuration row.
Implementing a processor is in principle the same as implementing any other component. However, processors are designed to be single-responsibility components. This means, for example, that processors should require no or very little configuration, should not communicate over a network, and should be fast.
Processors take data from the in data folders and store it in the out data folders, like any other component. Keep in mind, however, that any files not copied to the out folders will be ignored (i.e., lost). That means that if a processor is supposed to “not touch” something, it actually has to copy that something to the out folder.
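A minimal Python sketch of this pass-through behavior follows. The helper name `pass_through` and the `handled` parameter are hypothetical; the in and out folder locations are passed in as arguments rather than assumed.

```python
import shutil
from pathlib import Path


def pass_through(in_dir, out_dir, handled=()):
    """Copy every file from in_dir to out_dir, except those the processor rewrites itself.

    `handled` holds relative paths (POSIX style) that the processor writes on its own.
    Returns the sorted list of relative paths that were copied unchanged.
    """
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    skip = set(handled)
    copied = []
    for src in in_dir.rglob("*"):
        if not src.is_file():
            continue
        rel = src.relative_to(in_dir)
        if rel.as_posix() in skip:
            continue
        dst = out_dir / rel
        # Recreate sub-folders so the relative layout is preserved.
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        copied.append(rel.as_posix())
    return sorted(copied)
```

The processor would write its own modified files first and then call the helper with those paths in `handled`, so that everything else reaches the out folder untouched.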
The processors should be aware of manifest files. This means that the processor:
in folder, modify and store it in the out folder). A typical example is the modification of table columns, which must be reflected in the manifest.
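The column example can be sketched in Python as follows. This assumes, without confirmation from the text above, that a manifest is a JSON file with a "columns" list; the function name is hypothetical as well.

```python
import json


def drop_column_from_manifest(in_manifest, out_manifest, column):
    """Read a manifest, remove one column name, and write the result.

    Assumption: the manifest is JSON and may contain a "columns" list.
    All other keys are copied unchanged.
    """
    with open(in_manifest, encoding="utf-8") as f:
        manifest = json.load(f)
    if "columns" in manifest:
        manifest["columns"] = [c for c in manifest["columns"] if c != column]
    with open(out_manifest, "w", encoding="utf-8") as f:
        json.dump(manifest, f)
    return manifest
```

A processor that removes a column from the table data would call such a helper for the matching manifest, so that the data and the manifest in the out folder stay consistent.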
Keep in mind that processors can be chained; you can, for example, rely on
If the above conditions are not met, then another processor should be added before yours. That is, you should keep the processor simple and delegate the assumptions to other processors (and document them). If possible, the processor should also assume that the CSV files are headless and stored in arbitrary sub-folders. When implemented with this assumption, the processor will support sliced tables.
Registering a processor follows the same process as publishing any other component. However, many of the fields do not apply, because processors have no UI. The following fields are important:
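A minimal Python sketch of a processor written under these assumptions is shown below. It treats every file as headless CSV and walks arbitrary sub-folders, so a table stored as a folder of slices is handled like a single file. The function name and the `transform` callback are hypothetical, and manifest handling is left out for brevity.

```python
import csv
import os


def process_csv_tree(in_dir, out_dir, transform):
    """Apply transform(row) to every row of every file under in_dir.

    Files are treated as headless CSV (the first line is data), and the
    relative path of each file is preserved under out_dir.
    Returns the number of files processed.
    """
    files = 0
    for folder, _subdirs, names in os.walk(in_dir):
        rel = os.path.relpath(folder, in_dir)
        target = os.path.normpath(os.path.join(out_dir, rel))
        os.makedirs(target, exist_ok=True)
        for name in names:
            src_path = os.path.join(folder, name)
            dst_path = os.path.join(target, name)
            with open(src_path, newline="", encoding="utf-8") as src, \
                    open(dst_path, "w", newline="", encoding="utf-8") as dst:
                writer = csv.writer(dst)
                # No header handling: every line is processed as data.
                for row in csv.reader(src):
                    writer.writerow(transform(row))
            files += 1
    return files
```

For example, `process_csv_tree(in_dir, out_dir, lambda row: [v.strip() for v in row])` would trim whitespace in every cell of every file or slice.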