Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Add nango.paginate helper for sync/action scripts #1103

Merged
merged 53 commits into from
Oct 20, 2023
Merged

Conversation

motnyk
Copy link
Contributor

@motnyk motnyk commented Oct 13, 2023

Closes #856

Problem

A common task developers face in these scripts is retrieving extensive lists of records from external APIs. These records often reside behind paginated endpoints. Given the variability in how different APIs handle pagination (some use cursors, others links, and so on), developers find themselves re-writing detailed pagination logic for each individual script.
This is error-prone, time-consuming, and raises an entry bar for anyone who wants to write a script.

Solution

nango.paginate helper that will do everything under the hood and just give a user batches of records to work with.

Examples of migrated integration templates

https://github.com/NangoHQ/nango/compare/migrate-templates?expand=1

Interface

nango.paginate is just another proxy-like method that accepts ProxyConfiguration as input handles the pagination params, and passes the request on to nango.proxy. At this point, I implemented the support of 3 pagination types:

  • Cursor-based.
  • Offset-based.
  • Link-based. This one includes link to the next page in a form of a full URL(https://example.com/issues?cursor=123) or path(/issues?cursor=123). The link can be fetched from body or link header.

There are two sides of the interface here - providers.yaml and action/sync script. Configs in provider.yaml should cover 95% of use-cases and for the rest, the user can pass overrides from scripts(partial overrides allowed so no need to pass all the values every time). Supported configs for both of those interfaces:
Common

  • type - pagination type that is used to define the pagination algorithm under the hood.
  • limit - batch size for each request. By default, we rely on the provider default value to avoid setting arbitrary limits on provider level in providers.yaml. Override can still be passed from the script.
  • limit_parameter_name- different providers have different names for the limit(e.g. limit, per_page, etc.). This parameter defines the name of the limit param.
  • response_data_path - sometimes batch APIs return an object with batch of records + some meta info like next pagination cursor. In this cases, user can use response_data_path to tell us how the batch should be extracted from the response body.
{
  "results": [...],
  "next_cursor": '123'
}

Cursor

  • next_cursor_parameter_path - same as response_data_path but for finding next cursor value.
  • cursor_parameter_name - the idea is the same as limit_parameter_name. Different providers call this differently(e.g. offset, cursor, etc.)

Link

  • link_rel - if this parameter is passed, we will extract the next URL from link header trying to extract value with specific relationship(in most cases it's next). See GH pagination docs for example.
  • link_body_parameter_path - if this parameter is passed, we will extract the next URL body.

Offset

  • offset_parameter_name - same as cursor_parameter_name but for offset value.

Script interface example

    const endpoint: string = `/repos/${owner}/${repo}/git/trees/${branch}`;
    const proxyConfig = {
        endpoint,
        params: { recursive: '1' },
        paginate: { response_data_path: 'tree' } // overrides
    };

    await nango.log(`Fetching files from endpoint ${endpoint}.`);

    for await (const fileBatch of nango.paginate(proxyConfig)) {
        const blobFiles = fileBatch.filter((item: any) => item.type === 'blob');
        count += blobFiles.length;
        await nango.batchSave(blobFiles.map(mapToFile), Models.GithubRepoFile);
    }

WIP

  • Resolve minor TODOs.
  • Figure out if there is a need to return the full response body from paginate in some cases. For example, some APIs return 200 OK + error response in body. Would it be useful to have access to those? How critical is this?

@motnyk
Copy link
Contributor Author

motnyk commented Oct 13, 2023

I'm working on the build fix.

@@ -27,6 +28,7 @@ export interface Template {
after?: string;
};
decompress?: boolean;
paginate?: NextUrlPagination | CursorPagination | OffsetPagination;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would make the Next implicit to get to UrlPagination.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just FYI, NextUrlPagination is an old name that I didn't change before pushing a PR.
I decided to update the type from next_url to link last minute to express the fact that it isn't only the URL that is supported by this pagination method but also path links. I haven't updated this class name yet. It's gonna be LinkPagination soon.

Essentially, NextUrlPagination currently handles two types of links:

  • Full URL https://example.com/issues?offset=1231312
  • Path /issues?offset=1231312

I feel like LinkPagination is more or less universal term that can be used in both cases IMO. Another option I had in mind is LocationPagination but it's less common and obvious.

How do you feel about UrlPagination vs LinkPagination?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree LinkPagination work well!

interface Pagination {
type: string;
limit?: number;
response_data_path?: string;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would make response_path, IMO data doesn't make it more specific.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or records_path_in_response or data_path_in_response. Not sure what I like best.

limit_parameter_name: string;
}

export interface CursorPagination extends Pagination {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tricky thing to understand here IMO is that one is for the response, the other for the request.

Just a suggestion, but we could have:

  • cursor_path_in_response
  • cursor_name_in_request

I would also imply the next_ here.

In any case, I would brainstorm about the best names here because they will likely stay a while (unless we break the interface).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my current scheme, I use path for something that should be fetched from the response and name for something that is set on the request later on. I agree that in_request/in_response are a bit more explicit about the way these parameters are going to be used so I like this approach.

I would also imply the next_ here.

👍

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@khaliqgant curious to have your opinion on naming!

cursor_parameter_name: string;
}

export interface NextUrlPagination extends Pagination {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I would suggest (but please provide feedback), since we call the NextUrlPagination or UrlPagination:

  • link_path_in_response
  • link_name_in_request

Copy link
Contributor Author

@motnyk motnyk Oct 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both of these params define the path in response.
If link_rel is passed, we will extract the next URL from link header and then we will extract value with specific relationship(in most cases it's next). See GH pagination docs for example.
if link_body_parameter_path is passed, we will extract next url from body. This is similar to cursor.

As I mentioned above, I do like in_response postfix more than path one so I'd update these like this:

  • link_path_in_response_headers
  • link_path_in_response/link_path_in_response_body

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I'd note is that link_path_in_response_headers is not like all the other path fields out there since it simply represents the relationship of the URL that we should get from link header. I guess it might also make sense to have it called link_rel_in_response_header since this name reveals the intent better.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, with all of that. I think it'd be great to have a quick sync on naming before shipping to harmonize everything.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, with all of that. I think it'd be great to have a quick sync on naming before shipping to harmonize everything.

Yep, I'll post the latest version summary once this PR has been approved or once it's close to that point so we all can take a final look.

}

export interface OffsetPagination extends Pagination {
offset_parameter_name: string;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be offset_name_in_request

const templatePaginationConfig: Pagination | undefined = template.proxy?.paginate;

if (!templatePaginationConfig) {
throw Error(`Pagination is not supported for '${providerConfigKey}'. Please, add pagination config to 'providers.yaml' file`);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we suggest that they provide an override here too?

Adding things to providers.yaml is heavy and requires a code review. There's little chance they'll do it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have put this on my follow-up list.

Letting them do this is my ideal vision but I feel a bit hesitant about this change since there is no runtime type validation so I wanted to postpone this till we introduce something like zod. Basically, if I were to remove this check, the user won't really receive a useful error message from us if they supply a wrong/partial pagination config. This affects DX and potentially slightly increases support volume.

That said, I introduced this check back when there was no compile time check of the pagination config override so it should be safer now as we have Partial<CursorPagination> | Partial<NextUrlPagination> | Partial<OffsetPagination> check in place.

The tradeoff here is that we will temporarily force users to create PRs to update providers.yaml but reduce the chance of error.

I'm starting to lean toward removing this check. @khaliqgant @rguldener Any thoughts on this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait a second, PR won't be enough. They will need to wait till the release of this change, right? I'm leaning toward opening this API even more now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I didn't mean removing the check, I was thinking more of just changing the error message to suggest to change the providers.yaml or override the pagination config in code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

override the pagination config in code.

The current check doesn't really care if there is an override in the code or not. It throws anyway if it sees that there is no default config in providers.yaml. That's why I mentioned that we will have to update it to throw an error if there is no default config and the user didn't provide overrides.


nextCursor = this.getNestedField(response.data, cursorPagination.next_cursor_parameter_path);

if (!nextCursor || nextCursor.trim().length === 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be silly, but do we want to interrupt pagination if we get less objects than the limit? Same for the other pagination types?

Copy link
Contributor Author

@motnyk motnyk Oct 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • For cursor-based pagination, we don't need to rely on limit since once we receive a response with smaller object count, we also get empty or no cursor in that same response.
  • Same ☝️ is true for link. There won't be any if there are no more records.
  • I will introduce this as a small optimization for offset-based pagination. We won't be able to always do this because we rely on default API endpoint limit, which we don't know. So, the optimization will only kick in if users explicitly pass limit override from the script.
  • This is dangerous approach for non-offset-based API in general because some of them specifically suggest that they could return a smaller number of objects in a batch. That's why they strongly suggest relying on their defined pagination mechanism instead of relying on the count of objects. Here is an example of this from Slack docs.
image

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood, this sounds reasonable to me!

}
}

private getNestedField(object: any, path: string, defaultValue?: any): any {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't there a library for this? e.g. Lodash?

Example of using Lodash for this (from a previous Nango version):

// Cursor-based pagination (with cursor field in the metadata of the response).
            if (
                sync.paging_cursor_metadata_response_path != null &&
                _.get(res.data, sync.paging_cursor_metadata_response_path) &&
                results.length < maxNumberOfRecords
            ) {
                pageCursor = _.get(res.data, sync.paging_cursor_metadata_response_path);
                continue;
            }

            // Cursor-based pagination (with cursor field in the last object of the response).
            if (
                sync.paging_cursor_object_response_path != null &&
                _.get(newResults[newResults.length - 1], sync.paging_cursor_object_response_path) &&
                results.length < maxNumberOfRecords
            ) {
                pageCursor = _.get(newResults[newResults.length - 1], sync.paging_cursor_object_response_path);
                continue;
            }

            // URL-based pagination (with URL field in the metadata of the response).
            if (sync.paging_url_path != null && isValidHttpUrl(_.get(res.data, sync.paging_url_path)) && results.length < maxNumberOfRecords) {
                sync.url = _.get(res.data, sync.paging_url_path);
                continue;
            }

            // URL-based pagination (with URL field in the Link header).
            if (sync.paging_header_link_rel != null && res.headers['link'] != null) {
                let linkHeader = parseLinksHeader(res.headers['link']);

                if (linkHeader != null && sync.paging_header_link_rel in linkHeader && isValidHttpUrl(linkHeader[sync.paging_header_link_rel]['url'])) {
                    let nextPageUrl = linkHeader[sync.paging_header_link_rel]['url'];
                    sync.url = nextPageUrl;
                    continue;
                }
            }

            // Offset + limit pagination
            if (sync.paging_offset_request_path != null && sync.paging_limit_request_path != null && sync.paging_limit != null) {
                if (newResults.length > 0) {
                    offset += newResults.length;
                    continue;
                } else {
                    // The last page was empty, so we should not count it.
                    page -= 1;
                }
            }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I have a TODO inside this method to figure out how to use lodash. For some reason, I couldn't easily import so I created this helper method to move forward and get back to this later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it!

@khaliqgant khaliqgant marked this pull request as ready for review October 19, 2023 13:22
@khaliqgant khaliqgant changed the title [WIP] Add nango.paginate helper for sync/action scripts Add nango.paginate helper for sync/action scripts Oct 19, 2023
@khaliqgant
Copy link
Member

Will integrate 671ef08 in a separate PR

@khaliqgant khaliqgant merged commit 6d3f311 into master Oct 20, 2023
@khaliqgant khaliqgant deleted the pagination-helper branch October 20, 2023 13:05
bodinsamuel added a commit that referenced this pull request Nov 8, 2024
## Describe your changes

- Do not cast offset as `string`
Not sure why it was done this way. I assume that those values were set
to URL and supposedly be string only
(#1103) but at the same time, it's
set as `record<string, any` directly.
To be cautious I have introduced an if.
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pagination helper for sync scripts
4 participants