Here we document the differences between this new API and CKAN version
By default, all parameters and results in the new API are JSON. This differs from CKAN, where the parameters of the GET call follow a non-standard, ad-hoc syntax.
The new API is built entirely on a GraphQL DB querying system, which changes the kinds of queries we can make.
CKAN targets a PostgreSQL DB, whereas the new API targets a GraphQL DB, so the queries that can be implemented are quite different.
GraphQL queries must be explicit, down to the fields ("columns" in a traditional DB). The caller therefore has to know the database and "table" schema before making the query in order to ask for the needed fields. As a result, every query that requests "all" the fields needs two queries behind the scenes: one to fetch the schema and one to fetch the data. An optimization would be to cache the schema.
This section discusses each CKAN parameter and whether it is implemented in the new API:
resource_id (string) – id or alias of the resource to be searched against.
Mandatory parameter; implemented in the new API.
filters (dictionary) – matching conditions to select, e.g. {"key1": "a", "key2": "b"} (optional)
Optional parameter; use the q parameter instead in the new API.
q (string or dictionary) – full text query. If it’s a string, it’ll search on all fields on each row. If it’s a dictionary as {"key1": "a", "key2": "b"}, it’ll search on each specific field (optional)
This field is different in the new API: the main difference is that the new API only accepts JSON. The current new API implementation is equivalent to filters.
distinct (bool) – return only distinct rows (optional, default: false)
This parameter is very different from the CKAN implementation. In GraphQL there is no notion of a row (graphs simply do not have rows), so the distinct option cannot mean the same thing in the new API, which leaves room for a different design.
To make it evident that the behavior is not the same, the new parameter is called distinct_on and takes a list of fields; results are compared for differences on each of those fields.
For backwards compatibility, the new API also accepts a boolean value. In that case it queries the graph schema and applies distinct to every field of the schema, which is equivalent to the CKAN distinct implementation.
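The handling of both input forms can be sketched as a small normalization step. This is a minimal illustration, not the actual implementation: `resolve_distinct_on` is a hypothetical helper, and `schema_fields` stands for the field list obtained from the graph schema.

```python
from typing import List, Optional, Sequence, Union


def resolve_distinct_on(
    distinct: Union[bool, Sequence[str], None],
    schema_fields: Sequence[str],
) -> List[str]:
    """Normalize the parameter into the list of fields to deduplicate on."""
    if distinct is True:
        # CKAN-compatible boolean: distinct over every field of the schema.
        return list(schema_fields)
    if not distinct:
        # False, None or an empty list: no deduplication requested.
        return []
    unknown = [field for field in distinct if field not in schema_fields]
    if unknown:
        raise ValueError(f"unknown fields in distinct_on: {unknown}")
    return list(distinct)
```

For example, `resolve_distinct_on(True, ["id", "name"])` expands to both fields, while `resolve_distinct_on(["name"], ["id", "name"])` keeps only the requested one.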
plain (bool) – treat as plain text query (optional, default: true)
language (string) – language of the full text query (optional, default: english)
Full-text search will not be implemented, as discussed in the GitHub issue.
limit (int) – maximum number of rows to return (optional, default: 100, unless set in the site’s configuration ckan.datastore.search.rows_default, upper limit: 32000 unless set in site’s configuration ckan.datastore.search.rows_max)
There is no difference in the implementation of this parameter.
offset (int) – offset this number of rows (optional)
This parameter, which implies paginating the response, has not been fully implemented in the new API. The reason is that GraphQL offers two different ways of implementing pagination, and the choice needs deeper analysis.
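The text does not name the two approaches; presumably they are offset-based and cursor-based pagination, which are the two styles commonly used with GraphQL. The sketch below contrasts them on an in-memory list, assuming rows sorted by a unique `id` field. Both function names are hypothetical.

```python
from typing import Any, Dict, List, Optional, Tuple

Row = Dict[str, Any]


def page_by_offset(rows: List[Row], limit: int, offset: int) -> List[Row]:
    """Offset style: skip `offset` rows, then return up to `limit` rows."""
    return rows[offset:offset + limit]


def page_by_cursor(
    rows: List[Row], limit: int, after_id: Optional[int] = None
) -> Tuple[List[Row], Optional[int]]:
    """Cursor style: return rows after a known id, plus the cursor for the next page.

    Assumes `rows` is sorted by a unique, increasing "id" field.
    """
    remaining = [r for r in rows if after_id is None or r["id"] > after_id]
    page = remaining[:limit]
    # A next cursor is only returned when more rows exist beyond this page.
    next_cursor = page[-1]["id"] if len(remaining) > limit else None
    return page, next_cursor
```

The offset style maps directly onto the CKAN offset parameter, while the cursor style stays stable when rows are inserted between requests but cannot jump to an arbitrary page.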
fields (list or comma separated string) – fields to return (optional, default: all fields in original order)
List of fields to return to the caller. The only difference is that in the new API the input is a JSON list.
sort (string) – comma separated field names with ordering e.g.: “fieldname1, fieldname2 desc”
Not yet included in the new API. The only difference will be that it takes JSON input instead of a comma-separated list of field names. The input parameter will have the following format:
{ fieldname1: order, fieldname2: order } where order is asc or desc
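To show how the two formats relate, the following sketch converts a CKAN sort string into the planned JSON form. `parse_ckan_sort` is a hypothetical helper, and treating a field without an explicit order as ascending is an assumption.

```python
from typing import Dict


def parse_ckan_sort(sort: str) -> Dict[str, str]:
    """Convert 'fieldname1, fieldname2 desc' into {'fieldname1': 'asc', 'fieldname2': 'desc'}."""
    result: Dict[str, str] = {}
    for part in sort.split(","):
        tokens = part.split()
        if not tokens:
            continue  # tolerate empty segments such as a trailing comma
        if len(tokens) > 2:
            raise ValueError(f"cannot parse sort clause: {part!r}")
        field = tokens[0]
        # Assumption: a missing order means ascending.
        order = tokens[1].lower() if len(tokens) == 2 else "asc"
        if order not in ("asc", "desc"):
            raise ValueError(f"invalid sort order: {order!r}")
        result[field] = order
    return result
```

Python dictionaries preserve insertion order, so the sort priority of the original string is kept when the result is serialized to JSON.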
include_total (bool) – True to return total matching record count (optional, default: true)
Not implemented in the new API. It is CPU-intensive: it needs either an extra query or post-processing of the result to count the matching elements.
total_estimation_threshold (int or None) – If “include_total” is True and “total_estimation_threshold” is not None and the estimated total (matching record count) is above the “total_estimation_threshold” then this datastore_search will return an estimate of the total, rather than a precise one. This is often good enough, and saves computationally expensive row counting for larger results (e.g. >100000 rows). The estimated total comes from the PostgreSQL table statistics, generated when Express Loader or DataPusher finishes a load, or by autovacuum. NB Currently estimation can’t be done if the user specifies ‘filters’ or ‘distinct’ options. (optional, default: None)
This is not feasible with the current GraphQL DB, apart from the global statistics on the schema. To estimate the count we would need additional aggregate statistics for the different kinds of queries.
records_format (controlled list) – the format for the records return value: ‘objects’ (default) list of {fieldname1: value1, …} dicts, ‘lists’ list of [value1, value2, …] lists, ‘csv’ string containing comma-separated values with no header, ‘tsv’ string containing tab-separated values with no header
Setting the plain flag to false enables the entire PostgreSQL full text search query language.
The result format in the new API is JSON; other return formats can be implemented in the future if needed.
The result returned to the caller is a JSON object containing the following:
{
schema: {JSON schema definition},
data: [JSON list of elements]
}
The following list shows the CKAN return elements and their implementation in the new API:
- fields (list of dictionaries) – fields/columns and their extra metadata
  - Not returned. In a JSON response it is not needed, since the field names are already present in the returned data and the extra metadata is present in the schema field.
- offset (int) – query offset value
  - Not currently implemented as a return value.
- limit (int) – queried limit value (if the requested limit was above the ckan.datastore.search.rows_max value then this response limit will be set to the value of ckan.datastore.search.rows_max)
  - Not currently implemented as a return value.
- filters (list of dictionaries) – query filters
  - Not currently implemented as a return value.
- total (int) – number of total matching records
  - Not currently implemented as a return value. Needs more analysis and development to be implemented.
- total_was_estimated (bool) – whether or not the total was estimated
  - Will not be implemented.
- records (depends on records_format value passed) – list of matching results
  - Now named data.
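A client that still expects the CKAN result shape could adapt the new response with a small helper. The sketch below is illustrative only: `to_ckan_result` is a hypothetical name, and because the layout of the schema element is not specified here, field ids are derived from the first element of data instead. It assumes CKAN-style fields entries keyed by `id`.

```python
from typing import Any, Dict, List


def to_ckan_result(response: Dict[str, Any]) -> Dict[str, Any]:
    """Map the new {schema, data} response onto CKAN-style 'records' and 'fields' keys."""
    data: List[Dict[str, Any]] = response["data"]
    # Field names are repeated in every JSON element, so the first one is enough.
    field_ids = list(data[0].keys()) if data else []
    return {
        "records": data,
        "fields": [{"id": field_id} for field_id in field_ids],
    }
```

The return values that are not implemented (offset, limit, filters, total) are deliberately left out of the adapted result.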
Depending on the response size, the data transfer can have adverse effects on the network or overload the server's connections. This is why asynchronous and streaming responses will be needed.
The current implementation responds in JSON by default, which adds overhead that formats like CSV avoid, since CSV does not repeat the field names for every response element.
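The overhead can be measured with a short experiment. The sketch below serializes the same rows as a JSON list of objects and as CSV with a single header row, then compares the byte counts; the helper names and the sample data are made up for illustration.

```python
import csv
import io
import json
from typing import Any, Dict, List


def json_size(rows: List[Dict[str, Any]]) -> int:
    """Bytes needed for a JSON list of objects (field names repeated per element)."""
    return len(json.dumps(rows).encode("utf-8"))


def csv_size(rows: List[Dict[str, Any]], fields: List[str]) -> int:
    """Bytes needed for CSV, where field names appear once in the header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(fields)
    for row in rows:
        writer.writerow([row[field] for field in fields])
    return len(buffer.getvalue().encode("utf-8"))


# Made-up sample data for the comparison.
sample = [{"fieldname1": i, "fieldname2": "x"} for i in range(100)]
print(json_size(sample), csv_size(sample, ["fieldname1", "fieldname2"]))
```

The gap grows with the number of rows and with the length of the field names, because JSON pays for every field name in every element.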
Another option for large responses is to return an SFTP URL (or a URL for another secure file transfer protocol) instead. Once the operation is complete, this URL will point to a file with the response data. This way the connection can be freed and the client can poll for the file.
This solution has two extra advantages:
- the file acts like a cache and can be used in intermediate computations
- if there is any problem during the file transfer, the download can be restarted without recomputing the response; partial downloads are supported by most file transfer protocols.
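On the client side, polling for the result file could be sketched as follows. The function is hypothetical and transport-agnostic: `is_ready` stands for whatever check the chosen file transfer protocol offers (for example, testing whether the file exists at the returned URL), and the interval and attempt count are arbitrary defaults.

```python
import time
from typing import Callable


def wait_for_result(
    is_ready: Callable[[], bool],
    interval_s: float = 1.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the result file is available; return False if attempts run out."""
    for attempt in range(max_attempts):
        if is_ready():
            return True
        if attempt < max_attempts - 1:
            sleep(interval_s)  # no connection is held open between checks
    return False
```

Passing `sleep` as a parameter keeps the helper easy to test; a real client would probably add backoff between attempts.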
Security needs to be implemented. One option is JWT, which allows signed requests in the GET query.
JWT has the advantage of already being compatible with the current JSON parameter implementation in the New API.
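Because a JWT payload is itself JSON, the existing JSON parameters can be carried as the token payload. The sketch below signs and verifies such a token with HS256 using only the standard library. It is an illustration under assumptions: `sign_params` and `verify_params` are hypothetical names, a shared secret is assumed, and claims such as expiry are omitted. A real deployment would more likely rely on a maintained JWT library.

```python
import base64
import hashlib
import hmac
import json
from typing import Any, Dict


def _b64url(raw: bytes) -> str:
    """Base64url-encode without padding, as JWT requires."""
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    """Decode base64url text, restoring the stripped padding."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_params(params: Dict[str, Any], secret: bytes) -> str:
    """Return an HS256-signed JWT whose payload is the JSON query parameters."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, params)
    )
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(signature)}"


def verify_params(token: str, secret: bytes) -> Dict[str, Any]:
    """Check the signature and return the parameters; raise ValueError otherwise."""
    try:
        signing_input, signature = token.rsplit(".", 1)
        header_b64, payload_b64 = signing_input.split(".")
    except ValueError:
        raise ValueError("malformed token")
    header = json.loads(_b64url_decode(header_b64))
    if header.get("alg") != "HS256":
        raise ValueError("unsupported algorithm")
    expected = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(signature)):
        raise ValueError("bad signature")
    return json.loads(_b64url_decode(payload_b64))
```

The resulting token is URL-safe, so it can be passed as a single GET query parameter; how the parameter is named is left open here.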