If you are new to Generic Extractor, learn about pagination in our tutorial first. Use the Parameter Map to help you navigate among the various configuration options.
Pagination, or paging, describes how an API splits a large list of items into separate pages. Pagination may also be called scrolling or traversing (scrolling through a large result set). Sometimes it is also referred to as setting a cursor (pointing to a current result).
Almost every API has some form of pagination because returning extensive lists of large results is impractical for many reasons, such as memory overflow issues and long transfer and processing times. So, unless you only want to do an ad-hoc query to extract thousands of items at most, setting pagination is important.
When configuring Generic Extractor, there is a slight distinction between pagination and scrolling:
As long as the API uses the same pagination method for all resources, there is no need to distinguish between the two. Setting up pagination for Generic Extractor boils down to two crucial questions:
An example pagination configuration looks like this:
Generic Extractor supports the following paging strategies (scrollers); they are configured using the method option:
response.url: uses a URL provided in the response.
offset: uses the page size (limit) and item offset (like in SQL).
pagenum: uses the page size (limit) and page number.
response.param: uses a specific value (token) provided in the response.
cursor: uses the identifier of an item in the response to maintain a scrolling cursor.
multiple: allows setting different scrollers for different API endpoints.
If the API responses contain direct links to the next set of results, use the response.url method. This applies to the APIs following the JSON API specification. The response usually contains a links section:
If the API response contains a parameter used to obtain the next page, use the response.param method. Using an authoritative value provided by the API is preferable to any of the following methods. The parameter can be some kind of scrolling token or even the page number of the next page, for example:
If the API does not provide a scrolling hint within the response, use one of the offset, pagenum or cursor methods:
Use the pagenum method if the API expects the page number/index. For example, /users?page=2 retrieves the 2nd page regardless of how many items the page contains.
Use the offset method if the API expects the item number/index. For example, /users?startWith=20 retrieves the 20th and following items.
Use the cursor method if the API expects an item identifier. For example, /users?startWith=20 retrieves an item with ID 20 and the following items.
If the API uses different paging methods for different endpoints, use the multiple method together with any of the above methods.
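
To make the difference between the three methods concrete, here is a minimal Python sketch of how the query parameters of the next request could be derived. It is an illustrative model, not Generic Extractor's implementation: the parameter names (page, offset, startWith, limit), the id field, and the "last ID plus one" rule for the cursor are all assumptions made for this example.

```python
def next_params(method, limit, pages_done, last_item=None):
    """Return hypothetical query parameters for the next request.

    pages_done is the number of pages already fetched; last_item is the
    last item of the previous page (only used by the cursor method).
    """
    if method == "pagenum":
        # Addressed by page number (assumed to be 1-based here).
        return {"page": pages_done + 1, "limit": limit}
    if method == "offset":
        # Addressed by the index of the first item to return.
        return {"offset": pages_done * limit, "limit": limit}
    if method == "cursor":
        # Addressed by an item identifier taken from the data itself
        # (assumption: continue with the ID following the last one seen).
        return {"startWith": last_item["id"] + 1}
    raise ValueError(f"unsupported method: {method}")


print(next_params("pagenum", limit=10, pages_done=1))
print(next_params("offset", limit=10, pages_done=2))
print(next_params("cursor", limit=10, pages_done=2, last_item={"id": 19}))
```

Note that the same URL parameter can mean an item index for one API and an item identifier for another, which is why the choice between offset and cursor depends on the API's semantics rather than on the parameter name.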
Generic Extractor stops scrolling based on any of the following:
nextPageFlag condition configuration.
forceStop condition configuration.
limitStop condition configuration.
Apart from those, each pagination method may have its own stopping strategy.
The same result condition deals with the situation when there is no clear limit to stop the scrolling. Generic Extractor keeps requesting higher and higher pages from the API. Let’s say that there are 150 pages of results in total. When Generic Extractor asks for page 151, various situations can arise:
The API returns an empty response: the pagenum and offset methods will stop, and other methods will probably stop too (depending on how empty the response is).
In other situations, nextPageFlag or forceStop has to be used.
Less common: the API keeps returning the last page, and the extraction is stopped when a page is obtained twice (see example [041]). If the API returns the last page and it is the same as the previous page, the extraction is stopped. You will see this in the Generic Extractor logs as the following message:
Job '1234567890' finished when last response matched the previous!
The same check also applies to a request such as users?offset=6&limit=2. If the result is the same as the previous page, the check kicks in and the extraction is stopped too. However, the results from the first page will be duplicated.
The above describes the automatic behavior of Generic Extractor regarding the stopping of scrolling. Using Next Page Flag allows you to set up the stopping strategy manually: Generic Extractor analyzes the response, looks for a particular field (the flag), and decides whether to continue scrolling based on the value or presence of that flag.
Next Page Flag is configured using three options:
field (required): name of a field containing any value. The field must be in the root of the response. It will be converted to boolean.
stopOn (required): value to which the field will be compared. When the values are equal, the scrolling stops.
ifNotSet: assumed value of the field in case it is not present in the response. It defaults to the stopOn value.
The boolean conversion has the following rules:
false, 0, null, the string "0", and an empty array [] are false.
Everything else is true.
Example nextPageFlag setting:
See our Next Page Flag Examples.
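
The decision described above can be modeled in a few lines of Python. This is only a sketch of the documented rules, not Generic Extractor's code: the function names to_bool and should_stop are hypothetical, and the conversion covers only the falsy values listed above.

```python
# Values the rules above list as converting to false; everything else is true.
FALSY = (False, 0, None, "0", [])


def to_bool(value):
    """Convert a decoded JSON value to boolean using the listed rules."""
    return value not in FALSY


def should_stop(response, field, stop_on, if_not_set=None):
    """Return True when the nextPageFlag condition says to stop scrolling."""
    if if_not_set is None:
        # ifNotSet defaults to the stopOn value.
        if_not_set = stop_on
    if field in response:  # the field must be in the root of the response
        flag = to_bool(response[field])
    else:
        flag = if_not_set
    return flag == stop_on


print(should_stop({"hasMore": True}, "hasMore", stop_on=False))   # continue
print(should_stop({"hasMore": False}, "hasMore", stop_on=False))  # stop
print(should_stop({}, "hasMore", stop_on=False))                  # stop (default)
```

Because ifNotSet defaults to stopOn, a missing field stops the scrolling unless ifNotSet is set explicitly to the opposite value.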
Force stop configuration allows you to stop scrolling when some extraction limits are hit. The supported options are:
pages: maximum number of pages to extract
time: maximum number of seconds the extraction should run
volume: maximum number of bytes which can be extracted
This is an example of the forceStop setting:
The volume of the response is measured as the number of bytes in compressed JSON. Therefore, the response
is compressed (minified) to:
which makes it 69 bytes long.
The following is a force stop example configuration that will stop scrolling after extracting two pages of results, or after extracting 69 bytes of minified JSON data (whichever comes first).
See example [EX048], example [EX116] (combining multiple conditions), and example [EX140] (combining with child jobs).
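
As a rough illustration of the volume measurement, the following Python sketch serializes a response as minified JSON and counts the bytes, then checks the limits. The sample response, the function names, and the use of "greater than or equal" comparisons are assumptions for this example; Generic Extractor's own serializer may escape some characters differently, so the byte counts can differ slightly.

```python
import json


def response_volume(response):
    """Size in bytes of the response serialized as minified JSON."""
    minified = json.dumps(response, separators=(",", ":"), ensure_ascii=False)
    return len(minified.encode("utf-8"))


def force_stop(pages_done, bytes_done, seconds_elapsed, limits):
    """True when any configured limit (pages, volume, time) has been reached."""
    current = {"pages": pages_done, "volume": bytes_done, "time": seconds_elapsed}
    return any(current[name] >= limit for name, limit in limits.items())


sample = {"a": 1, "b": [1, 2]}  # hypothetical response body
print(response_volume(sample))
print(force_stop(1, 40, 3, {"pages": 2, "volume": 69}))  # neither limit reached
print(force_stop(2, 40, 3, {"pages": 2, "volume": 69}))  # page limit reached
```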
Limit stop configuration allows you to stop scrolling when a specified number of items is extracted. The supported options are:
count (required, integer): total number of items to extract
field (required, string): path to the key which contains the value with the total number of items
The two options are mutually exclusive, but one of them is required. In both cases, the total number of items may not be honored exactly. If the total amount is not divisible by the page size, then the leftover from the last page (if any) is extracted too (see example EX127 and example EX138, which combines it with child jobs).
This is an example of the limitStop setting:
The above configuration will search the response for the key count inside the key items. The obtained value is expected to be the total number of items to extract. In the sample response below, it will be 4:
Note that if the field does not exist in the response (e.g., you misspell it in the configuration), paging stops after the first page.
See example [EX126] (a modified version of EX049). For the count configuration, see example [EX127] (a modified version of EX051).
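
The limit stop behavior described in this section can be sketched in Python as follows. This is a simplified model under stated assumptions, not the actual implementation: the function names are hypothetical, and the dot used as a path delimiter in field is an assumption for this example.

```python
def lookup(response, path, delimiter="."):
    """Follow a delimited path of keys into a nested response; None if missing."""
    current = response
    for key in path.split(delimiter):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def limit_reached(items_extracted, response, count=None, field=None):
    """True when the limitStop condition says to stop scrolling."""
    if (count is None) == (field is None):
        raise ValueError("set exactly one of count or field")
    if field is not None:
        total = lookup(response, field)
        if total is None:
            # Missing field: paging stops after the first page.
            return True
    else:
        total = count
    return items_extracted >= total


response = {"items": {"count": 4}}  # hypothetical response fragment
print(limit_reached(3, response, field="items.count"))  # 3 of 4: continue
print(limit_reached(6, response, field="items.count"))  # 6 of 4: stop
```

With a page size of 3 and a total of 4, the check is evaluated after whole pages, so six items are extracted. This mirrors the note above that the leftover from the last page is extracted too.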
All stopping strategies are evaluated simultaneously, and for the scrolling to continue, none of the stopping conditions may be met. In other words, the scrolling continues until any of the stopping conditions is true. On top of this, you need to account for the specific stopping strategies of each scroller. For example, the scrolling of this configuration:
will stop if any of the following is true:
A stopping condition specific to the offset scroller is met (two such conditions apply).
The forceStop condition is met.
The isLast field is present in the response and is true (nextPageFlag).
The isLast field is not present in the response.
In this section, we want to show you the following examples of the Next Page Flag stopping strategy:
Assume that the API returns a response which contains a hasMore field. The field is present in every response and always has the value true, except for the last response, where it is false.
The following pagination configuration can be used to configure the stopping strategy:
It means that the scrolling will continue as long as the field hasMore is present in the response and true. In this case, setting ifNotSet is not necessary.
See example [EX045] and example [EX139] (combining with child jobs).
Assume that the API returns a response which contains a hasMore field. The field is present only in the last response and has the value "no" there.
The following pagination configuration can be used to configure the stopping strategy:
The configuration:
means that the scrolling will continue until the field hasMore is present. This takes advantage of the boolean conversion, which converts the value "no" to true. If the field hasMore is not present, it defaults to false. In this case, setting ifNotSet is mandatory.
See example [EX046].
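
To see why ifNotSet is mandatory in this scenario, the following Python sketch walks through three hypothetical responses, of which only the last contains hasMore. It models the documented rules only (the helper names and sample data are made up for this example) and assumes a configuration with field hasMore, stopOn true and ifNotSet false.

```python
# Values the documented rules convert to false; everything else is true.
FALSY = (False, 0, None, "0", [])


def pages_fetched(responses, field, stop_on, if_not_set):
    """Count how many pages are requested before the flag stops the scrolling."""
    fetched = 0
    for response in responses:
        fetched += 1
        if field in response:
            flag = response[field] not in FALSY  # "no" converts to true
        else:
            flag = if_not_set
        if flag == stop_on:
            break
    return fetched


responses = [
    {"users": ["a", "b"]},
    {"users": ["c", "d"]},
    {"users": ["e"], "hasMore": "no"},
]

# With ifNotSet false, the scrolling runs until hasMore appears.
print(pages_fetched(responses, "hasMore", stop_on=True, if_not_set=False))
# If ifNotSet fell back to the stopOn value (true), it would stop right away.
print(pages_fetched(responses, "hasMore", stop_on=True, if_not_set=True))
```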
Assume that the API returns a response which contains an isLast field. The field is present only in the last response and has the value true there.
The following pagination configuration can be used to configure the stopping strategy:
The scrolling will stop when the field isLast is present in the response and true. Because the field isLast is not present at all times, the ifNotSet configuration is required.
See example [EX047].