Most operations, such as extracting data or running an application, are executed in Keboola as background, asynchronous jobs. When an operation is triggered (for example, you run an extractor), a job is created. The job starts executing, or waits in the queue until it can start executing. The job execution and queuing are fully automatic. The job execution is asynchronous, so you need to poll the API for the job status to find out when it finishes.
The core API for working with jobs is the Queue API. It provides operations for creating (running), terminating and listing jobs. Components differ in their upper limits on how long a job can run and how much memory it is allowed to consume. These limits are set by the component developer and act primarily as a safeguard.
When you create a job, it automatically transitions through states until it reaches one of the final states. When you create or retrieve a job, you obtain a JSON representation of the Job object, whose properties are described below in more detail.
A job can have different values for status:

- created (the job is created, but has not started executing yet)
- waiting (the job is waiting for other jobs to finish)
- processing (job stuff is being done)
- success (the job is finished)
- error (the job is finished)
- warning (the job is finished, but one of its child jobs failed)
- terminating (the user has requested to abort the job)
- cancelled (the job was created, but it was aborted before its execution actually began)
- terminated (the job was created and it was aborted in the middle of its execution)

When you create a job, it is in the created state. In a success scenario, it transitions to the processing state and, when the actual work is done, to the success state. If you change your mind and terminate a job, it enters the terminating state and then ends in either terminated (execution was terminated) or cancelled (execution did not actually start). The difference is that you can be sure a cancelled job performed no operations at all, whereas a terminated job may have done anything up to all of the work it was supposed to do.
If a job cannot be executed, it enters the waiting state. The waiting state means that the job cannot be executed due to reasons on the Keboola project side. This means that the reasons for waiting jobs lie solely in what jobs are already running in the given project. The core reasons for waiting jobs include:

- Another job of the same configuration is already running; further jobs of that configuration enter the waiting state until it finishes.
- The project's limit on parallel jobs has been reached; if you create more jobs than the limit allows, the jobs within the limit enter the processing state and the rest of them will immediately enter the waiting state.

If a job cannot be run due to platform reasons (e.g., insufficient resources or a platform outage), it remains in the created state. In rare situations (e.g., hardware failure), the job may return to the created state. Moving the job out of the created state is out of the end user's control.
Of all the states a job can be in, only the state processing is considered to be job runtime (see the durationSeconds field) and therefore billable. That means waiting or created jobs do not have any costs associated with them; they represent a plan of what is going to happen. The states terminated, cancelled, success and error are final and end the job transitions. When a job is in a final state, the isFinished flag is true.
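For illustration, here is a minimal Python sketch (not part of the API itself) of how these lifecycle fields of a job object could be inspected; the field names come from the Job object described on this page:

# Final states as listed above; the isFinished flag is the authoritative check.
FINAL_STATES = {"terminated", "cancelled", "success", "error"}

def summarize_job(job: dict) -> str:
    """Return a one-line summary of where a job is in its lifecycle."""
    status = job["status"]
    if job["isFinished"]:
        return f"finished with status={status}, billable runtime {job['durationSeconds']}s"
    if status == "processing":
        return "running; billable time is accruing"
    return f"status={status}; not running yet, no cost incurred"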
The job object is both immutable and eventually consistent. Once you create a job, you cannot change any of its properties. The properties that do change (such as status) change on their own, and they stop changing once the job reaches one of the final states.
Apart from the status field, the job also has a desiredStatus field, which is either processing or terminating. The desired status is processing until a job termination is requested, which changes the desired status to terminating. Other changes are not permitted.
When a job is created, an id, a runId and optionally a parentRunId are assigned to it. The runId and parentRunId represent the parent-child relationship between jobs. The hierarchical relationship can be defined via the X-KBC-RunId header; when it is used, the newly created job becomes a child of the job with the provided RunId.
The runId field contains job ids representing the job hierarchy. If there is no hierarchy, runId is equal to id. If there is a hierarchy, runId is the parentRunId concatenated with id, with a dot (.) as the delimiter. Examples:

- id=123, runId=123, parentRunId=null – Job has no parent.
- id=345, runId=123.345, parentRunId=123 – Job is a child of job 123.
- id=678, runId=123.345.678, parentRunId=123.345 – Job is a child of job 345, which in turn is a child of job 123.

Jobs may be nested without limits. The parent-child relationship itself is a weak relationship. By itself, it means nothing special beyond grouping in the UI and the fact that terminating a parent job issues a termination request to all its children. Running a job as a child of another job does not by itself make the parent wait for the child's completion or add any other functionality. Such functionality is implemented in specific components (e.g., Orchestrator) or for specific job types.
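To create a child job explicitly, pass the parent's runId in the X-KBC-RunId header. A minimal Python sketch follows; it assumes the Job Queue API create endpoint is POST /jobs on the queue host shown in the responses below, and that authentication uses the X-StorageApi-Token header (verify both against the Job Queue API documentation for your stack):

import requests

response = requests.post(
    "https://queue.keboola.com/jobs",  # assumed create-job endpoint
    headers={
        "X-StorageApi-Token": "YOUR-TOKEN",  # assumed auth header
        "X-KBC-RunId": "123",  # runId of the intended parent job
    },
    json={"component": "keboola.ex-db-snowflake", "config": "554424643", "mode": "run"},
)
child = response.json()
print(child["runId"])        # e.g. "123.<new job id>"
print(child["parentRunId"])  # "123"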
To create a job, you must provide the configuration to run. A configuration is always tied to a specific component. A configuration can be provided in multiple ways. The easiest is to provide a reference to a stored configuration ID using the config field, as shown in the examples below. Configurations can be stored and listed using the Component Configurations API endpoint.
When using a configuration which contains Configuration Rows, the job can optionally execute only certain rows. Use the configRowIds field to list the IDs of the rows to execute. Note that if you do not list any rows, all rows except disabled ones will be executed. When you enumerate the rows to execute, the enumerated rows will be executed even if they are disabled. To run a job of a configuration in a branch, provide the branch ID in the branchId field. If you do not provide branchId, the default branch is used. Take care that only the combination of component ID, configuration ID and branch ID is unique. It is possible for two configurations with the same ID to exist (either for different components or for different branches).
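A sketch of a request body that runs only selected rows in a specific branch follows; the row and branch IDs are made-up illustrative values:

# Sketch of a create-job request body limiting execution to selected rows.
payload = {
    "component": "keboola.ex-db-snowflake",
    "config": "554424643",
    "mode": "run",
    "configRowIds": ["101", "103"],  # these rows run even if disabled
    "branchId": "1234",              # omit to run in the default branch
}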
Another option is to provide the entire configuration in the configData field. In that case, the whole configuration data has to be provided in the request. If you are retrieving a stored configuration, take note that the configuration data is the contents of the configuration node and not the entire response. When the configData field is used, the configRowIds and branchId values are ignored. The config field is then ignored for the purpose of reading the configuration, but it may still be required if the component uses Default Bucket. In that case, the configuration referenced in config is used to generate the name of the output bucket; configuration data is still not read from it. In short, configData always fully overrides the config field.
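A sketch of a create-job request body with an inline configuration follows; the parameters shown are purely illustrative:

# Sketch of a create-job request body using configData.
payload = {
    "component": "keboola.ex-db-snowflake",
    "mode": "run",
    "configData": {
        "parameters": {"tables": ["my-table"]},  # hypothetical component parameters
    },
    # "config" is ignored for reading configuration data here, but may still be
    # required when the component uses Default Bucket (see above).
}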
When creating a job, you need to provide mode. This can be one of run, forceRun and debug. The basic mode choice is run. Use the forceRun mode to run a configuration that is disabled. The debug mode can be used during Component Development & Debugging.
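For illustration, a forceRun request body is identical to a regular run except for the mode field (a sketch using the IDs shown elsewhere on this page):

# Sketch: running a disabled configuration.
payload = {
    "component": "keboola.ex-db-snowflake",
    "config": "554424643",
    "mode": "forceRun",  # a plain "run" of a disabled configuration fails
}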
You may provide runtime settings for a job. Runtime settings do not affect what the job does; they affect how the job does it. The currently available runtime settings are backend and parallelism. The backend parameter defines which Snowflake warehouse is used for the job and currently affects mostly Snowflake transformations. The available values for the backend size come from the Workspace Create API call and are small, medium and large. The parallelism setting allows running Configuration Rows (if present in the configuration) in parallel. Allowed values are integer values and infinity, which runs all rows in parallel. When parallelism is not specified, the rows are run sequentially.
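A sketch of runtime settings in a create-job request body follows; the exact shape of the backend value is an assumption here, so verify it against the Job Queue API documentation:

# Sketch of runtime settings when creating a job.
payload = {
    "component": "keboola.snowflake-transformation",  # illustrative component
    "config": "554424643",
    "mode": "run",
    "backend": {"type": "small"},  # assumed shape; sizes: small / medium / large
    "parallelism": "infinity",     # or an integer value; omit to run rows sequentially
}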
You may also specify tag if you want to run the component with a specific version of its code. This is mostly used during component development, testing and debugging.
Runtime parameters can be specified on various levels. The values can be specified in the component configuration. They can also be specified when creating a job, in which case they override the configuration. They may also be specified for an orchestration, in which case they override what is specified in the individual jobs of that orchestration.
A job can be of one of four types: standard, container, phaseContainer and orchestrationContainer. A standard job is one that does actual work. Only standard jobs consume billable time and count towards the consumption of any resources. The other job types are virtual containers encapsulating standard jobs. A container job represents a job containing parallel executions of configuration rows. The phaseContainer type contains the standard jobs of a single phase of an orchestration. The orchestrationContainer job type represents an orchestration and contains the phase jobs of that orchestration. What these container types have in common is a strong parent-child relationship: when a child job fails, the container fails too. This behavior can be further controlled by the onError setting. You cannot specify the job type when creating a job; it is selected automatically as needed.
The main API for running jobs is the Job Queue API. Some API calls from other services may also be useful when working with jobs.
Component jobs are asynchronous operations; this means that you create a job and then have to actively wait for the result. Note that there are other, unrelated cases of asynchronous operations in the Keboola platform which are in principle the same but may differ in small details. The most common ones are Storage jobs, triggered, for instance, by asynchronous imports or exports.
You need to know the component ID and configuration ID to create a job. You can get these from the UI links. To use the API to obtain a list of all components available in the project, and their configurations, use the Get components API call. See an example. An abbreviated snippet of the response is below:

[
  {
    "id": "keboola.ex-db-snowflake",
    "configurations": [
      {
        "id": "554424643"
      }
    ]
  }
]

From there, the important parts are the id field and the configurations.id field. For instance, in the above, there is a database extractor with the ID keboola.ex-db-snowflake and a configuration with the ID 554424643.
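If you prefer to do this programmatically, the following Python sketch lists component and configuration IDs; the Storage API endpoint, its include parameter and the connection hostname are assumptions here, so adjust them to your stack and the Storage API documentation:

import requests

# Assumed Storage API call for listing components with their configurations.
response = requests.get(
    "https://connection.keboola.com/v2/storage/components",
    params={"include": "configuration"},
    headers={"X-StorageApi-Token": "YOUR-TOKEN"},
)
response.raise_for_status()
for component in response.json():
    for configuration in component.get("configurations", []):
        print(component["id"], configuration["id"])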
Then use the Create a job API call and pass the configuration ID and component ID in the request body:
{
"component": "keboola.ex-db-snowflake",
"config": "554424643",
"mode": "run"
}
See an example. When a job is created, you will obtain a response similar to this:
{
"id": "807932655",
"runId": "807932655",
"parentRunId": "",
"project": {
"id": "7150",
"name": "Sandbox"
},
"token": {
"id": "199182",
"description": "ondrej.popelka@keboola.com"
},
"status": "created",
"desiredStatus": "processing",
"mode": "run",
"component": "keboola.ex-db-snowflake",
"config": "554424643",
"configData": [],
"configRowIds": [],
"tag": "5.5.0",
"createdTime": "2022-01-25T16:34:40+00:00",
"startTime": null,
"endTime": null,
"durationSeconds": 0,
"result": [],
"usageData": [],
"isFinished": false,
"url": "https://queue.keboola.com/jobs/807932655",
"branchId": null,
"variableValuesId": null,
"variableValuesData": {
"values": []
},
"backend": [],
"metrics": [],
"behavior": {
"onError": null
},
"parallelism": null,
"type": "standard"
}
This means that the job is in the created state and will automatically start executing.
From the above response, the most important part is url, which gives you the URL of the resource for job status polling.
If you want to get the actual job result, poll the Job API for the current state of the job. See an example. You will receive a response in the same format as when you created the job:
{
"id": "807933826",
"runId": "807933826",
"parentRunId": "",
"project": {
"id": "7150",
"name": "7150"
},
"token": {
"id": "199182",
"description": "ondrej.popelka@keboola.com"
},
"status": "processing",
"desiredStatus": "processing",
"mode": "run",
"component": "keboola.ex-db-snowflake",
"config": "554424643",
"configData": [],
"configRowIds": [],
"tag": "5.5.0",
"createdTime": "2022-01-25T16:41:12+00:00",
"startTime": "2022-01-25T16:41:22+00:00",
"endTime": null,
"durationSeconds": 0,
"result": [],
"usageData": [],
"isFinished": false,
"url": "https://queue.keboola.com/jobs/807933826",
"branchId": null,
"variableValuesId": null,
"variableValuesData": {
"values": []
},
"backend": [],
"metrics": [],
"behavior": {
"onError": null
},
"parallelism": null,
"type": "standard"
}
From the above response, the most important part is the status field (processing, in this case).
To obtain the job result, periodically send the above API call until the job status changes to one of the finished states, or until isFinished is true.
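Putting it together, a minimal Python sketch that creates a job and polls until it finishes might look like this (create endpoint and auth header assumed, as in the earlier sketch):

import time
import requests

headers = {"X-StorageApi-Token": "YOUR-TOKEN"}  # assumed auth header

# Create the job (assumed endpoint, see above).
job = requests.post(
    "https://queue.keboola.com/jobs",
    headers=headers,
    json={"component": "keboola.ex-db-snowflake", "config": "554424643", "mode": "run"},
).json()

# Poll the job resource from the "url" field until the job reaches a final state.
while not job["isFinished"]:
    time.sleep(5)  # the job may spend a while in the created/waiting states
    job = requests.get(job["url"], headers=headers).json()

print(job["status"])  # success, error, terminated, ...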
To run a debug job, use debug for the mode. Optionally, you can provide the component version that should run, to live-test an image:
{
"component": "keboola.ex-db-snowflake",
"config": "554424643",
"mode": "debug",
"tag": "5.5.0"
}
The debug mode creates a job that prepares the data folder, including the serialized configuration files. Then it compresses the data folder and uploads it to your project's Files in Storage. This way, you will get a snapshot of what the data folder looked like before the component started. If processors are used, a snapshot of the data folder is created before each processor. After the entire component finishes, another snapshot is made. For example, if you run component A with processors B and C in the after section, you will receive:
- a stage_0 file with the contents of the data folder before component A was run
- a stage_1 file with the contents of the data folder before processor B was run
- a stage_2 file with the contents of the data folder before processor C was run
- a stage_output file with the contents of the data folder before output mapping was about to be performed (after C finished)

If configuration rows are used, the above is repeated for each configuration row. If the job finishes with an error, only the stages before the error are uploaded.
This API call does not upload any tables or files to Storage. That is, when the component finishes, its output is discarded and the output mapping to Storage is not performed. This makes the API call generally very safe to call, because it cannot break the Keboola project in any way. Keep in mind, however, that if the component has any outside side effects, these will get executed. This typically applies to writers, which will write data into the external system even with this debug API call.
Note that the snapshot archive will contain all files in the data folder, including any temporary files produced by the component. The snapshot will not contain the output state.json file. This is because the snapshot is made before the component is run, when the out state of the previous run is no longer available. Also note that all encrypted values are removed from the configuration file and there is no way to retrieve them. It is also advisable to run this command with a limited input mapping so that you don't end up with gigabyte-sized archives.