Add `nango.paginate` helper for sync/action scripts #1103

motnyk · 2023-10-13T08:52:54Z

Closes #856

Problem

A common task developers face in these scripts is retrieving extensive lists of records from external APIs. These records often reside behind paginated endpoints. Given the variability in how different APIs handle pagination (some use cursors, others links, and so on), developers find themselves re-writing detailed pagination logic for each individual script.
This is error-prone, time-consuming, and raises an entry bar for anyone who wants to write a script.

Solution

nango.paginate helper that will do everything under the hood and just give a user batches of records to work with.

Examples of migrated integration templates

https://github.com/NangoHQ/nango/compare/migrate-templates?expand=1

Interface

nango.paginate is just another proxy-like method that accepts ProxyConfiguration as input handles the pagination params, and passes the request on to nango.proxy. At this point, I implemented the support of 3 pagination types:

Cursor-based.
Offset-based.
Link-based. This one includes link to the next page in a form of a full URL(https://example.com/issues?cursor=123) or path(/issues?cursor=123). The link can be fetched from body or link header.

There are two sides of the interface here - providers.yaml and action/sync script. Configs in provider.yaml should cover 95% of use-cases and for the rest, the user can pass overrides from scripts(partial overrides allowed so no need to pass all the values every time). Supported configs for both of those interfaces:
Common

type - pagination type that is used to define the pagination algorithm under the hood.
limit - batch size for each request. By default, we rely on the provider default value to avoid setting arbitrary limits on provider level in providers.yaml. Override can still be passed from the script.
limit_parameter_name- different providers have different names for the limit(e.g. limit, per_page, etc.). This parameter defines the name of the limit param.
response_data_path - sometimes batch APIs return an object with batch of records + some meta info like next pagination cursor. In this cases, user can use response_data_path to tell us how the batch should be extracted from the response body.

{
  "results": [...],
  "next_cursor": '123'
}

Cursor

next_cursor_parameter_path - same as response_data_path but for finding next cursor value.
cursor_parameter_name - the idea is the same as limit_parameter_name. Different providers call this differently(e.g. offset, cursor, etc.)

Link

link_rel - if this parameter is passed, we will extract the next URL from link header trying to extract value with specific relationship(in most cases it's next). See GH pagination docs for example.
link_body_parameter_path - if this parameter is passed, we will extract the next URL body.

Offset

offset_parameter_name - same as cursor_parameter_name but for offset value.

Script interface example

    const endpoint: string = `/repos/${owner}/${repo}/git/trees/${branch}`;
    const proxyConfig = {
        endpoint,
        params: { recursive: '1' },
        paginate: { response_data_path: 'tree' } // overrides
    };

    await nango.log(`Fetching files from endpoint ${endpoint}.`);

    for await (const fileBatch of nango.paginate(proxyConfig)) {
        const blobFiles = fileBatch.filter((item: any) => item.type === 'blob');
        count += blobFiles.length;
        await nango.batchSave(blobFiles.map(mapToFile), Models.GithubRepoFile);
    }

WIP

Resolve minor TODOs.
Figure out if there is a need to return the full response body from paginate in some cases. For example, some APIs return 200 OK + error response in body. Would it be useful to have access to those? How critical is this?

We are going to rely on API defaults and only allow passing limit as an override

motnyk · 2023-10-13T09:08:11Z

I'm working on the build fix.

bastienbeurier · 2023-10-13T10:01:20Z

packages/shared/lib/models/Provider.ts

@@ -27,6 +28,7 @@ export interface Template {
            after?: string;
        };
        decompress?: boolean;
+        paginate?: NextUrlPagination | CursorPagination | OffsetPagination;


I would make the Next implicit to get to UrlPagination.

Just FYI, NextUrlPagination is an old name that I didn't change before pushing a PR.
I decided to update the type from next_url to link last minute to express the fact that it isn't only the URL that is supported by this pagination method but also path links. I haven't updated this class name yet. It's gonna be LinkPagination soon.

Essentially, NextUrlPagination currently handles two types of links:

Full URL https://example.com/issues?offset=1231312

Path /issues?offset=1231312

I feel like LinkPagination is more or less universal term that can be used in both cases IMO. Another option I had in mind is LocationPagination but it's less common and obvious.

How do you feel about UrlPagination vs LinkPagination?

I agree LinkPagination work well!

bastienbeurier · 2023-10-13T10:02:06Z

packages/shared/lib/sdk/sync.ts

+interface Pagination {
+    type: string;
+    limit?: number;
+    response_data_path?: string;


Would make response_path, IMO data doesn't make it more specific.

Or records_path_in_response or data_path_in_response. Not sure what I like best.

bastienbeurier · 2023-10-13T10:04:55Z

packages/shared/lib/sdk/sync.ts

+    limit_parameter_name: string;
+}
+
+export interface CursorPagination extends Pagination {


The tricky thing to understand here IMO is that one is for the response, the other for the request.

Just a suggestion, but we could have:

cursor_path_in_response

cursor_name_in_request

I would also imply the next_ here.

In any case, I would brainstorm about the best names here because they will likely stay a while (unless we break the interface).

In my current scheme, I use path for something that should be fetched from the response and name for something that is set on the request later on. I agree that in_request/in_response are a bit more explicit about the way these parameters are going to be used so I like this approach.

I would also imply the next_ here.

👍

@khaliqgant curious to have your opinion on naming!

bastienbeurier · 2023-10-13T10:06:28Z

packages/shared/lib/sdk/sync.ts

+    cursor_parameter_name: string;
+}
+
+export interface NextUrlPagination extends Pagination {


Here I would suggest (but please provide feedback), since we call the NextUrlPagination or UrlPagination:

link_path_in_response

link_name_in_request

Both of these params define the path in response.
If link_rel is passed, we will extract the next URL from link header and then we will extract value with specific relationship(in most cases it's next). See GH pagination docs for example.
if link_body_parameter_path is passed, we will extract next url from body. This is similar to cursor.

As I mentioned above, I do like in_response postfix more than path one so I'd update these like this:

link_path_in_response_headers

link_path_in_response/link_path_in_response_body

One thing I'd note is that link_path_in_response_headers is not like all the other path fields out there since it simply represents the relationship of the URL that we should get from link header. I guess it might also make sense to have it called link_rel_in_response_header since this name reveals the intent better.

Agreed, with all of that. I think it'd be great to have a quick sync on naming before shipping to harmonize everything.

Agreed, with all of that. I think it'd be great to have a quick sync on naming before shipping to harmonize everything.

Yep, I'll post the latest version summary once this PR has been approved or once it's close to that point so we all can take a final look.

bastienbeurier · 2023-10-13T10:07:31Z

packages/shared/lib/sdk/sync.ts

+}
+
+export interface OffsetPagination extends Pagination {
+    offset_parameter_name: string;


Could be offset_name_in_request

bastienbeurier · 2023-10-13T10:08:48Z

packages/shared/lib/sdk/sync.ts

+        const templatePaginationConfig: Pagination | undefined = template.proxy?.paginate;
+
+        if (!templatePaginationConfig) {
+            throw Error(`Pagination is not supported for '${providerConfigKey}'. Please, add pagination config to 'providers.yaml' file`);


Can't we suggest that they provide an override here too?

Adding things to providers.yaml is heavy and requires a code review. There's little chance they'll do it.

I have put this on my follow-up list.

Letting them do this is my ideal vision but I feel a bit hesitant about this change since there is no runtime type validation so I wanted to postpone this till we introduce something like zod. Basically, if I were to remove this check, the user won't really receive a useful error message from us if they supply a wrong/partial pagination config. This affects DX and potentially slightly increases support volume.

That said, I introduced this check back when there was no compile time check of the pagination config override so it should be safer now as we have Partial<CursorPagination> | Partial<NextUrlPagination> | Partial<OffsetPagination> check in place.

The tradeoff here is that we will temporarily force users to create PRs to update providers.yaml but reduce the chance of error.

I'm starting to lean toward removing this check. @khaliqgant @rguldener Any thoughts on this?

Wait a second, PR won't be enough. They will need to wait till the release of this change, right? I'm leaning toward opening this API even more now.

Sorry I didn't mean removing the check, I was thinking more of just changing the error message to suggest to change the providers.yaml or override the pagination config in code.

override the pagination config in code.

The current check doesn't really care if there is an override in the code or not. It throws anyway if it sees that there is no default config in providers.yaml. That's why I mentioned that we will have to update it to throw an error if there is no default config and the user didn't provide overrides.

bastienbeurier · 2023-10-13T10:10:24Z

packages/shared/lib/sdk/sync.ts

+
+                    nextCursor = this.getNestedField(response.data, cursorPagination.next_cursor_parameter_path);
+
+                    if (!nextCursor || nextCursor.trim().length === 0) {


Might be silly, but do we want to interrupt pagination if we get less objects than the limit? Same for the other pagination types?

For cursor-based pagination, we don't need to rely on limit since once we receive a response with smaller object count, we also get empty or no cursor in that same response.

Same ☝️ is true for link. There won't be any if there are no more records.

I will introduce this as a small optimization for offset-based pagination. We won't be able to always do this because we rely on default API endpoint limit, which we don't know. So, the optimization will only kick in if users explicitly pass limit override from the script.

This is dangerous approach for non-offset-based API in general because some of them specifically suggest that they could return a smaller number of objects in a batch. That's why they strongly suggest relying on their defined pagination mechanism instead of relying on the count of objects. Here is an example of this from Slack docs.

Understood, this sounds reasonable to me!

bastienbeurier · 2023-10-13T10:14:41Z

packages/shared/lib/sdk/sync.ts

+        }
+    }
+
+    private getNestedField(object: any, path: string, defaultValue?: any): any {


Isn't there a library for this? e.g. Lodash?

Example of using Lodash for this (from a previous Nango version):

// Cursor-based pagination (with cursor field in the metadata of the response). if ( sync.paging_cursor_metadata_response_path != null && _.get(res.data, sync.paging_cursor_metadata_response_path) && results.length < maxNumberOfRecords ) { pageCursor = _.get(res.data, sync.paging_cursor_metadata_response_path); continue; } // Cursor-based pagination (with cursor field in the last object of the response). if ( sync.paging_cursor_object_response_path != null && _.get(newResults[newResults.length - 1], sync.paging_cursor_object_response_path) && results.length < maxNumberOfRecords ) { pageCursor = _.get(newResults[newResults.length - 1], sync.paging_cursor_object_response_path); continue; } // URL-based pagination (with URL field in the metadata of the response). if (sync.paging_url_path != null && isValidHttpUrl(_.get(res.data, sync.paging_url_path)) && results.length < maxNumberOfRecords) { sync.url = _.get(res.data, sync.paging_url_path); continue; } // URL-based pagination (with URL field in the Link header). if (sync.paging_header_link_rel != null && res.headers['link'] != null) { let linkHeader = parseLinksHeader(res.headers['link']); if (linkHeader != null && sync.paging_header_link_rel in linkHeader && isValidHttpUrl(linkHeader[sync.paging_header_link_rel]['url'])) { let nextPageUrl = linkHeader[sync.paging_header_link_rel]['url']; sync.url = nextPageUrl; continue; } } // Offset + limit pagination if (sync.paging_offset_request_path != null && sync.paging_limit_request_path != null && sync.paging_limit != null) { if (newResults.length > 0) { offset += newResults.length; continue; } else { // The last page was empty, so we should not count it. page -= 1; } }

Yeah, I have a TODO inside this method to figure out how to use lodash. For some reason, I couldn't easily import so I created this helper method to move forward and get back to this later.

khaliqgant · 2023-10-19T13:52:19Z

Will integrate 671ef08 in a separate PR

## Describe your changes - Do not cast offset as `string` Not sure why it was done this way. I assume that those values were set to URL and supposedly be string only (#1103) but at the same time, it's set as `record<string, any` directly. To be cautious I have introduced an if.

omotnyk added 30 commits October 9, 2023 14:10

Add hardcoded offset-limit pagination helper

feaa5e9

Use generics in paginate helper

99f1792

Move pagination from proxy to sync

800c78c

Add pagination based on paginate param

f7598c1

Add GitHub pagination config to providers.yaml

2500967

Simplify ifs

4445f2a

Implement separate method for pagination helper

c0b9eaf

Support cursor based pagination

63dc50d

Refactor pagination interface

e22a6a6

Revert generator return types

23cc704

Prohibit passing boolean param for paginate

3733549

Specify default limit for Slack and HubSpot

daae1fe

Return batch from generator

a06baf6

Support passing pagination params in body

adb682a

Extract body/params update into helper method

8d1ef8c

Fix limit

9cf10aa

Allow specifying page parameter name

198653c

Simplify cursor type names

4d93b69

Fix config names & allow custom limit param name

9a3cbd1

Add support for header link rel pagination

6c8656a

Remove 'page' pagination support

7787ed5

Do not support limit parameter

cbf9b9a

We are going to rely on API defaults and only allow passing limit as an override

Support gettign next URL from reponse body

a4593ab

Add missing 'parse-link-header' dependency

d534f90

Rename params for clarity

6f993f6

Support offset(by count) pagination

969e487

Backfill pagination params in providers.yaml

91ee000

Add error logging in case next URL is invalid

5e8ef30

Unify pagination interfaces

cc01c72

Extact next URL extration into separate method

0bbf118

omotnyk added 5 commits October 12, 2023 18:32

Revert template updates

3e9fecc

Merge branch 'master' of github.com:NangoHQ/nango into pagination-helper

bec4745

Revert accidental changes

ffde95d

Update the interface for link pagination

bab9e62

Cover most of pagination with unit tests

3336fc9

motnyk requested review from rguldener, khaliqgant and bastienbeurier October 13, 2023 08:52

bastienbeurier reviewed Oct 13, 2023

View reviewed changes

omotnyk and others added 11 commits October 13, 2023 14:25

Update pagination interface

05d4b2d

Slightly optimize offset pagination

220c233

Replace mentions of url with link

aff3f00

Use lodash

5129e1b

bring in latest

216c062

formatting

29f3df4

Merge branch 'master' of github.com:NangoHQ/nango into pagination-helper

c691bda

initial minor clean up

ec7c58b

remove formatting update

8307e09

organizational refactor

b77b030

add notion and add validation

e02e11d

khaliqgant marked this pull request as ready for review October 19, 2023 13:22

khaliqgant changed the title ~~[WIP] Add nango.paginate helper for sync/action scripts~~ Add nango.paginate helper for sync/action scripts Oct 19, 2023

khaliqgant requested a review from bastienbeurier October 19, 2023 13:51

bastienbeurier approved these changes Oct 20, 2023

View reviewed changes

khaliqgant merged commit 6d3f311 into master Oct 20, 2023

khaliqgant deleted the pagination-helper branch October 20, 2023 13:05

bodinsamuel mentioned this pull request Nov 8, 2024

fix(pagination): offset type casting #2964

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add `nango.paginate` helper for sync/action scripts #1103

Add `nango.paginate` helper for sync/action scripts #1103

motnyk commented Oct 13, 2023 •

edited

Loading

motnyk commented Oct 13, 2023

bastienbeurier Oct 13, 2023

motnyk Oct 13, 2023

bastienbeurier Oct 13, 2023

bastienbeurier Oct 13, 2023

motnyk Oct 13, 2023

bastienbeurier Oct 13, 2023

bastienbeurier Oct 13, 2023

motnyk Oct 13, 2023

bastienbeurier Oct 13, 2023

bastienbeurier Oct 13, 2023

motnyk Oct 13, 2023 •

edited

Loading

motnyk Oct 13, 2023

bastienbeurier Oct 13, 2023

motnyk Oct 13, 2023

bastienbeurier Oct 13, 2023

bastienbeurier Oct 13, 2023

motnyk Oct 13, 2023

motnyk Oct 13, 2023

bastienbeurier Oct 13, 2023

motnyk Oct 13, 2023

bastienbeurier Oct 13, 2023

motnyk Oct 13, 2023 •

edited

Loading

bastienbeurier Oct 13, 2023

bastienbeurier Oct 13, 2023

motnyk Oct 13, 2023

bastienbeurier Oct 13, 2023

khaliqgant commented Oct 19, 2023


		nextCursor = this.getNestedField(response.data, cursorPagination.next_cursor_parameter_path);

		if (!nextCursor \|\| nextCursor.trim().length === 0) {

Add nango.paginate helper for sync/action scripts #1103

Add nango.paginate helper for sync/action scripts #1103

Conversation

motnyk commented Oct 13, 2023 • edited Loading

Problem

Solution

Examples of migrated integration templates

Interface

Script interface example

WIP

motnyk commented Oct 13, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

motnyk Oct 13, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

motnyk Oct 13, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

khaliqgant commented Oct 19, 2023

Add `nango.paginate` helper for sync/action scripts #1103

Add `nango.paginate` helper for sync/action scripts #1103

motnyk commented Oct 13, 2023 •

edited

Loading

motnyk Oct 13, 2023 •

edited

Loading

motnyk Oct 13, 2023 •

edited

Loading