Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Pagination feature feedback #1052

Closed
sindresorhus opened this issue Feb 6, 2020 · 32 comments
Closed

Pagination feature feedback #1052

sindresorhus opened this issue Feb 6, 2020 · 32 comments

Comments

@sindresorhus
Copy link
Owner

sindresorhus commented Feb 6, 2020

Try out the new pagination feature and let us know how it works. We will improve it based on your feedback. The feature lets you more easily handle the case where you need to request multiple pages from an API.


Here are relevant issues:

@sindresorhus sindresorhus pinned this issue Feb 6, 2020
@andymantell
Copy link

andymantell commented Mar 13, 2020

Just used this with the WordPress Rest API and it just works. I've come back to the documentation to check because it was too easy and I don't know if I believe it! I feel like I've missed something - I didn't even have to specify any of the callbacks.

let posts = await got.paginate(`${process.env.CRAWL_URL}wp-json/wp/v2/posts`);

console.log(posts.length);  // outputs 152

for await (const post of posts) {
  ...
}

@szmarczak
Copy link
Collaborator

I think you confused got.paginate with got.paginate.all. The correct example is:

let posts = await got.paginate.all(`${process.env.CRAWL_URL}wp-json/wp/v2/posts`);

console.log(posts.length);  // outputs 152

for (const post of posts) {
  ...
}

Otherwise posts.length will be undefined as posts would be an iterator.

@andymantell
Copy link

Yes sorry you are right, the example I pasted isn't quite what I've ended up with - I trimmed it down for pasting into here and effectively broke it. got.paginate.all looks good though, I hadn't seen that.

Thanks!

@Slaventsiy

This comment has been minimized.

@fiznool
Copy link
Contributor

fiznool commented Apr 25, 2020

I've tried the pagination feature out today on a Google API and it works really nicely. Thank you for implementing this ❤️

I have a couple of bits of feedback:

  • It would be nice to somehow type the response body, as the current types are defaulting to Response<unknown>, meaning type casting is necessary inside the paginate and transform functions.
  • It's not really obvious what the difference is between pagination and pagination.all. After much trial and error I realised that the former is if you want to use for await on the returned generator, whereas the latter must be awaited directly and returns the regular array of results.

As a sidenote, I would imagine the major use case for the pagination feature is for pagination.all, i.e. get all of the results from the endpoint? for await syntax is not as well known, and (unless I'm missing something) will likely only be used in advanced scenarios. Hence, would it make sense for pagination.all to be the recommended way of paginating? And, if so, should it be used as the normal pagination function, with the for await version being renamed to (e.g.) pagination.each? I came to this expecting pagination to work just like pagination.all does today, given that most of the rest of got has this architecture (call the function, await it, get the result directly).

Edit: after refining the code, I am now using pagination directly with for await after all. However my comment above still stands, that I would expect the pagination API to work like the rest of the library by default, with an opt-in to the for await version.

@fiznool
Copy link
Contributor

fiznool commented Apr 25, 2020

For reference here's my code snippet for interfacing with the Google My Business API, using TypeScript.

interface Account {
  name: string;
  accountName: string;
}

interface AccountsListResponseBody {
  accounts: Account[];
  nextPageToken?: string;
}

async function fetchLocations(accessToken: string) {
  const myBusinessAPIClient = got.extend({
    prefixUrl: 'https://mybusiness.googleapis.com/v4',
    headers: {
      authorization: `Bearer ${accessToken}`,
    },
    responseType: 'json',
  });

  const accountsSearchParams = {
    pageSize: 20,
  };
  const accounts = await myBusinessAPIClient.paginate.all<Account>(
    'accounts',
    {
      searchParams: accountsSearchParams,
      pagination: {
        transform: (response) =>
          (response.body as AccountsListResponseBody).accounts,
        paginate: (response) => {
          const { nextPageToken } = response.body as AccountsListResponseBody;
          if (!nextPageToken) {
            return false;
          }

          return {
            searchParams: {
              ...accountsSearchParams,
              pageToken: nextPageToken,
            },
          };
        },
      },
    }
  );

  console.log(accounts);
}

@szmarczak
Copy link
Collaborator

It would be nice to somehow type the response body, as the current types are defaulting to Response, meaning type casting is necessary inside the paginate and transform functions.

You can (and should) cast it via a template:

got/source/types.ts

Lines 53 to 63 in 3da16e0

export interface GotPaginate {
<T>(url: string | URL, options?: OptionsWithPagination<T>): AsyncIterableIterator<T>;
all<T>(url: string | URL, options?: OptionsWithPagination<T>): Promise<T[]>;
// A bug.
// eslint-disable-next-line @typescript-eslint/adjacent-overload-signatures
<T>(options?: OptionsWithPagination<T>): AsyncIterableIterator<T>;
// A bug.
// eslint-disable-next-line @typescript-eslint/adjacent-overload-signatures
all<T>(options?: OptionsWithPagination<T>): Promise<T[]>;
}

It's not really obvious what the difference is between pagination and pagination.all. After much trial and error I realised that the former is if you want to use for await on the returned generator, whereas the latter must be awaited directly and returns the regular array of results.

all stands for all items, which obviously is an array.

As a sidenote, I would imagine the major use case for the pagination feature is for pagination.all, i.e. get all of the results from the endpoint? for await syntax is not as well known, and (unless I'm missing something) will likely only be used in advanced scenarios.

I strongly disagree. People may want to use pagination(...) instead of pagination.all(...) to transform data on the fly. Storing 10k items in memory is not a good idea.

for await version being renamed to (e.g.) pagination.each?

IMO that's a good idea. +1 on this one. You can open an issue about this so we don't forget.

I came to this expecting

That's what many people do. They expect their code to just work. If they don't use TypeScript, they should always read the docs unless they have memorized the API. Related with #920

@PopGoesTheWza
Copy link
Contributor

I am using paginate on Sugar CRM api with Typescript node applications.

First thing, it's a brilliant feature which would benefit from more documentation, example, etc. (then again it is still maturing)

Using got@10.x it works like a charm (aside some minor typing things maybe and possibly a heap crash when processing a large set of pages, but I have not fully investigate this yet)

Migrating to got@11.x is promising but currently broken for my usage. I opened several issues. Once I can have got@11.x to work, I will likely contribute a got template for Sugar CRM.

Thanks for the many great repos.

@fiznool
Copy link
Contributor

fiznool commented Apr 27, 2020

@szmarczak thank you for your comments.

You can (and should) cast it via a template

Yes, that is a good feature - however the casting is only for the list of results. In some cases (such as the Google My Business API), the response body is an object with a property containing the list, like so:

{
  "accounts": [...],
  "nextPageToken": "some_token"
}

I believe the typings currently account for this possible difference, but they use the Response type without a generic, and so the response body defaults to Response<unknown> - hence the need for the manual casting inside transform and paginate.

export interface PaginationOptions<T> {
pagination?: {
transform?: (response: Response) => Promise<T[]> | T[];
filter?: (item: T, allItems: T[], currentItems: T[]) => boolean;
paginate?: (response: Response, allItems: T[], currentItems: T[]) => Options | false;
shouldContinue?: (item: T, allItems: T[], currentItems: T[]) => boolean;
countLimit?: number;
};
}

That's what many people do. They expect their code to just work.

I understand where you are coming from, but I actually did read the documentation on the pagination API several times, and still didn't fully understand until I had stepped through the code to see what was going on. I think this is simply an issue with the documentation - I would be happy to submit a PR to improve this if it would be acceptable. As also discussed above, I think that having pagination.all and pagination.each as the two methods would also help to clear up any misunderstandings.

I hope my original comment didn't offend in any way - I think this is a fantastic feature which has saved me a lot of time, and for that I thank all of the contributors. 💜

@szmarczak
Copy link
Collaborator

In some cases (such as the Google My Business API), the response body is an object with a property containing the list, like so:

That's an interesting one. But I think you can just return response.body.accounts in transform() and before you do that just response.request.options.context = {token: nextPageToken};. Then you can access the token in the paginate() function.

hence the need for the manual casting inside transform and paginate.

Exactly!

I think this is simply an issue with the documentation - I would be happy to submit a PR to improve this if it would be acceptable.

A PR would be lovely, don't hestitate to send one ;)

I hope my original comment didn't offend in any way

No, it did not. You don't have anything to worry about. Actually your points are good!

@fiznool
Copy link
Contributor

fiznool commented Apr 27, 2020

But I think you can just return response.body.accounts in transform() and before you do that just response.request.options.context = {token: nextPageToken};. Then you can access the token in the paginate() function.

Indeed, this is totally possible, it's just that TypeScript complains as response.body is currently typed as unknown in the transform function, as there is no way to pass through the response body typings.

Here is a screenshot of VSCode complaining:

Screenshot 2020-04-27 at 10 03 41

Casting manually in transform fixes the complaint:

Screenshot 2020-04-27 at 10 04 02

I wonder if we could pass in a second (optional) generic to paginate, which could be passed down to the Response type? This would solve the issue I believe.

Something like:

export type OptionsWithPagination<T = unknown, B = unknown> = Merge<Options, PaginationOptions<T, B>>;

export interface GotPaginate { 
 	<T, B>(url: string | URL, options?: OptionsWithPagination<T, B>): AsyncIterableIterator<T>;
}

export interface PaginationOptions<T, B> { 
 	pagination?: { 
 		transform?: (response: Response<B>) => Promise<T[]> | T[]; 
 		paginate?: (response: Response<B>, allItems: T[], currentItems: T[]) => Options | false; 
 	}; 
 } 

A PR would be lovely, don't hestitate to send one ;)

I will try to put something together this week. 😄

@szmarczak
Copy link
Collaborator

Casting manually in transform fixes the complaint:

Can you try

transform: (response: Response<AccountsListResponseBody>) => {

I wonder if we could pass in a second (optional) generic to paginate, which could be passed down to the Response type? This would solve the issue I believe.

👍

@sinedied
Copy link

I'm trying to use this new feature but can't make it work, not sure what I'm doing wrong 😞
Some feedback:

  • paginate.all is not mentioned in the docs, and it's not clear that paginate returns an async iterator. Had to go through issues to discover that
  • The paginate example is wrong:
(async () => {
	const limit = 10;

	const items = got.paginate('https://example.com/items', {
		searchParams: {
			limit,
			offset: 0
		},
		pagination: {
			paginate: (response, allItems, currentItems) => {
				const previousSearchParams = response.request.options.searchParams;
				const {offset: previousOffset} = previousSearchParams;

				if (currentItems.length < limit) {
					return false;
				}

				return {
					searchParams: {
						...previousSearchParams,
						offset: previousOffset + limit,
					}
				};
			}
		}
	});
})();

2 issues here:

  • previousSearchParams is an object that does not work properly with destructuring (I get symbols but not the intended object)
  • previousOffset is a string and not a number, it needs to be parsed

Even though I fixed those 2 issues, the pagination does not work properly for me, search params keep getting appended each iteration, ie:

https://dev.to/api/articles/me/all?per_page=2&page=1
https://dev.to/api/articles/me/all?per_page=2&page=2&per_page=2&page=1
https://dev.to/api/articles/me/all?per_page=2&page=3&per_page=2&page=2&per_page=2&page=1
https://dev.to/api/articles/me/all?per_page=2&page=4&per_page=2&page=3&per_page=2&page=2&per_page=2&page=1
...

This is my code, any help appreciated here:

    const result = await got.paginate.all(`${apiUrl}/articles/me/all`, {
      searchParams: { per_page: paginationLimit, page: 1 },
      headers: { 'api-key': devtoKey },
      responseType: 'json',
      pagination: {
        paginate: (response, allItems, currentItems) => {
          const previousPage = Number(response.request.options.searchParams.get('page'));
          if (currentItems.length < paginationLimit) {
            return false;
          }
          return {
            searchParams: {
              per_page: paginationLimit,
              page: previousPage + 1
            }
          };
        }
      }
    });

@szmarczak
Copy link
Collaborator

paginate.all is not mentioned in the docs

Nice find. Please open an issue about this.

it's not clear that paginate returns an async iterator.

I disagree: https://github.com/sindresorhus/got#gotpaginateurl-options

@szmarczak
Copy link
Collaborator

2 issues here:

Another one 😅 Please make an issue for this too.

This is my code, any help appreciated here:

ATM I'm fixing bugs, will take a look later.

@sinedied
Copy link

sinedied commented Apr 27, 2020

I disagree: https://github.com/sindresorhus/got#gotpaginateurl-options

This is where I landed from a google search, and the first occurence of it in the docs: https://github.com/sindresorhus/got#pagination
Maybe adding a reference to the full function below would help.

@fiznool
Copy link
Contributor

fiznool commented Apr 30, 2020

@szmarczak

Can you try
transform: (response: Response) => {

This isn't the nicest, as the Response type is not exported from the default got export. Instead, you'd need to do

import { Response } from 'got/dist/source/core';

Totally do-able, just doesn't look very nice 😄

@szmarczak
Copy link
Collaborator

export interface Response<T = unknown> extends RequestResponse<T> {

export * from './types';

export * from './as-promise';

You sure?

@fiznool
Copy link
Contributor

fiznool commented Apr 30, 2020

I've hit another interesting use case today when working with the Facebook Graph API.

As per the Facebook docs, the API returns the entire 'next' page URL in the response body. The idea is that you only need this URL to fetch the next page of results, saving the hassle of constructing it yourself.

Taking the endpoint me/accounts as an example. A request can be made as follows:

const { body } = await got.get('https://graph.facebook.com/v6.0/me/accounts', {
  searchParams: {
    access_token: '<access_token>',
    fields: 'id,name',
    limit: 1
  },
  responseType: 'json'
});

Here is the response body for this request:

{
  "data": [
    {
      "id": "<id>",
      "name": "<name>"
    }
  ],
  "paging": {
    "cursors": {
      "before": "<cursor_before>",
      "after": "<cursor_after>"
    },
    "next": "https://graph.facebook.com/v6.0/<id>/accounts?access_token=<redacted>&fields=id%2Cname&limit=1&after=<cursor_after>"
  }
}

Notice that paging.next returns the entire URL that should be used to fetch the next set of data. This URL contains an after search param cursor which is used to paginate.

So now let's hook up the pagination API:

const accounts = await got.paginate.all('https://graph.facebook.com/v6.0/me/accounts', {
  searchParams: {
    access_token: '<access_token>',
    fields: 'id,name',
    limit: 1
  },
  responseType: 'json',
  pagination: {
    transform: (response) => response.body.data,
    paginate: (response) => {
      const { paging } = response.body;
      if(!paging.next) {
        // As per the Facebook docs, if there is no `next`, there is no more data.
        return false;
      }
      // Otherwise, we should return the whole URL.
      return {
        url: paging.next
      };
    }
  }
});

This almost works, but the issue is that the specified searchParams in the options takes precedence over the query string parameters in the paging.next url that is set in the returned url property. As a result, the after search param which is present in paging.next is ignored.

The workaround is to return both a url and a blank string as the searchParams from the paginate function:

      return {
        url: paging.next,
        searchParams: ''  // undefined / null / {} does not work here, it has to be a blank string
      };

This is also an issue for other options such as prefixUrl, which need to be set to the empty string to ensure that got does not include them in subsequent paged requests.

I wonder if there needs to be a way for got to accept a full URL as a return from the paginate function, and to use that URL as-is (with no merging of previous options). I imagine the system that Facebook has in place is not atypical.

@fiznool
Copy link
Contributor

fiznool commented Apr 30, 2020

You sure?

Sorry you are right. VSCode didn't seem to pick this up via auto importing. Thanks for clarifying 🙂

@szmarczak
Copy link
Collaborator

VSCode didn't seem to pick this up via auto importing

Sometimes I experience this too with other TS projects.

@szmarczak
Copy link
Collaborator

This almost works, but the issue is that the specified searchParams in the options takes precedence over the query string parameters in the paging.next url that is set in the returned url property.

That works as intended. It's in the docs. searchParams override the ones in the URL. But an empty string should still replace the query string in the URL. Setting that to undefined should fix this. If it doesn't, please make an issue about this.

This is also an issue for other options such as prefixUrl

That works as intended too. You can avoid this behavior by passing url: new URL(next).

@fiznool
Copy link
Contributor

fiznool commented Apr 30, 2020

Setting that to undefined should fix this. If it doesn't, please make an issue about this.

Strangely, setting to undefined causes the original searchParams to be added to the URL again for the second page. I don't have enough data to test against subsequent pages but I suspect the params would continue to be duplicated for each subsequent page. Setting to '' (empty string) was the only thing I could get to work.

I will submit an issue.

You can avoid this behavior by passing url: new URL(next)

Thanks for the tip! That works nicely.

@szmarczak
Copy link
Collaborator

There are two more relevant issues:

Any feedback is greatly appreciated ❤️

@yosiasz
Copy link

yosiasz commented Jul 24, 2020

For now I simply do the following until I wrap my head around pagination

      const clusters = await got.get(url, options);

      response.send(JSON.parse(clusters.body).value);

great module! thanks!

@PopGoesTheWza
Copy link
Contributor

PopGoesTheWza commented Mar 27, 2021

Would it make sense for got.paginate to act as an event emitter or a readable stream?

We use pagination to query various rest API which will typically return thousands of records, and most of the time the goal is "per record" processing/transformation. In such use cases, we would rather have got.paginate not to produce a final result but to consume records (or chunks of records) as they arrive.

One possible approach (and perhaps recommended) could be to use pagination.filter:

  • to emit an event or push into a readable stream
  • filter out the record (got.paginate needs not waste resources in stacking "processed" records)

Working at the pagination.transform level would allow emitting "chunks" of records (at this point I am unsure if working with chunks or single records would be more resource efficient)... but it feel messy/kludgy.

Is this use case meaningful to others? Should it be covered by the documentation with some sample code? Would extending the pagination API to propose event emitter or readable stream interface be a valuable feature request?

Edit

There is of cause the issue of closing the stream / emitting the done ending event which I have not yet fullythought over...

@szmarczak
Copy link
Collaborator

to act as an event emitter

You can use the functions in options.pagination. Currently I don't see any other usage, but if you see one, please share :)

or a readable stream?

Well, in that case you can just make your own loop. got.paginate is just a convenient wrapper around the Promise API.

@PopGoesTheWza
Copy link
Contributor

You can use the functions in options.pagination. Currently I don't see any other usage, but if you see one, please share :)

I will experiment on a large set of data this week, with custom pagination.filter to emit data to process and pagination.paginate to emit a done closing event

Well, in that case you can just make your own loop. got.paginate is just a convenient wrapper around the Promise API.

Would a convenient wrapper around the stream API make sense for got and be worth a PR? The idea came to me with how node-mssql allows the result of queries to be processed as streams. But maybe not everyone is interested ;)

@szmarczak
Copy link
Collaborator

I just noticed that shouldContinue executes only when filter passes. So something like this does not work:

import got from '../../dist/source/index.js';
import Bourne from '@hapi/bourne';

const max = Date.now() - 1000 * 86400 * 7;

const iterator = got.paginate('https://api.github.com/repos/sindresorhus/got/commits', {
	pagination: {
		transform: response => Bourne.parse(response.body),
		filter: ({item}) => {
			const date = new Date(item.commit.committer.date);
			return date.getTime() - max >= 0;
		},
		shouldContinue: ({item}) => {
			const date = new Date(item.commit.committer.date);
			return date.getTime() - max >= 0;
		},
		countLimit: 50,
		backoff: 100,
		requestLimit: 1_000,
		stackAllItems: false
	}
});

console.log('Last 50 commits from now to week ago:');
for await (const item of iterator) {
	console.log(item.commit.message.split('\n')[0]);
}

I mean it works but it makes 40 requests instead of 2. The workaround here is:

paginate: data => {
	const found = data.currentItems.find(item => {
			const date = new Date(item.commit.committer.date);
			return date.getTime() - max >= 0;
	});

	if (found) {
		return false;
	}

	return got.defaults.options.pagination.paginate(data);
}

We could make the first approach do only 2 requests if shouldContinue was executed on all items.

@sindresorhus What do you think?

@szmarczak
Copy link
Collaborator

szmarczak commented Jul 13, 2021

This is also an issue for other options such as prefixUrl, which need to be set to the empty string to ensure that got does not include them in subsequent paged requests.

@fiznool Indeed it makes sense to disable prefixUrl here. Will fix this now. Actually it has been fixed in the main branch (upcoming Got 12).

@sindresorhus
Copy link
Owner Author

We could make the first approach do only 2 requests if shouldContinue was executed on all items.

In what way would we do that? Would it be a breaking change? If so how?

@szmarczak
Copy link
Collaborator

Yes, it would be a breaking change. There is a scenario where filter may be used to validate responses and shouldContinue may be used on validated repsonses only. Let's leave it as it is for now.

If anybody is concerned about this issue, please open a new one. I'm closing this one as pagination got stable and v12 is being released very soon.

@szmarczak szmarczak unpinned this issue Jul 17, 2021
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants