Optimize our most time consuming query #1668


Merged
merged 1 commit into master on Mar 20, 2019

Conversation

sgrif
Contributor

@sgrif sgrif commented Mar 14, 2019

Our database spends most of its time processing /api/v1/crates with no
parameters other than pagination. This query is the main one hit by
crawlers, and it is taking over 100ms to run, so it's at the top of our
list (for posterity's sake, #2 is copying `crate_downloads` during
backups, #3 and #4 are the updates run from bin/update-downloads, and #5
is the query run from the download endpoint).

The query has to perform the full join between `crates` and
`recent_downloads`, and then count the results of that. Since we have no
search parameters of any kind, this count is equivalent to just counting
the crates table, which we can do much more quickly. We still need to
count over the whole join if there's any where clause, but we can
optimize the case where there's no search.
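
The equivalence above can be sketched in plain Rust (a toy in-memory model, not the actual SQL or crates.io code): a LEFT JOIN whose right side has at most one row per key preserves the base table's row count, so the unfiltered count can skip the join entirely.

```rust
use std::collections::HashMap;

/// Count the rows produced by `crates LEFT JOIN recent_downloads`.
/// Each crate contributes exactly one row whether or not a match exists,
/// which is what makes the shortcut below valid.
fn count_via_join(crates: &[u32], recent_downloads: &HashMap<u32, u64>) -> usize {
    crates
        .iter()
        .map(|id| (id, recent_downloads.get(id)))
        .count()
}

/// The shortcut: with no search filters, just count the base table.
fn count_base_table(crates: &[u32]) -> usize {
    crates.len()
}
```

The shortcut only holds because `recent_crate_downloads` has at most one row per `crate_id`; a one-to-many join would inflate the count.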

This implicitly relies on the fact that we only change the select
clause in branches where we also set a where clause. Diesel 2
will probably have a feature that lets us avoid this. We could also
refactor the "exact match" check to happen client side instead of in the
DB, and get rid of all the cases where we modify the select clause.
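
As a rough illustration of that invariant (a hypothetical builder, not the real Diesel code): the fast count select is only valid while no where clause exists, so every branch that adds a filter must also swap the select back to the general form in the same branch.

```rust
/// Hypothetical query builder sketching the invariant described above.
struct Query {
    where_clauses: Vec<String>,
    count_select: &'static str,
}

impl Query {
    fn new() -> Self {
        // No filters yet: counting the base table is equivalent and cheap.
        Query {
            where_clauses: Vec::new(),
            count_select: "(SELECT COUNT(*) FROM crates)",
        }
    }

    fn filter(mut self, clause: &str) -> Self {
        self.where_clauses.push(clause.to_string());
        // Adding a WHERE clause invalidates the shortcut, so the select
        // is updated in the same branch that sets the filter.
        self.count_select = "COUNT(*) OVER ()";
        self
    }
}
```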

Before:

```
 Limit  (cost=427.87..470.65 rows=100 width=877) (actual time=109.698..109.739 rows=100 loops=1)
   ->  WindowAgg  (cost=0.14..10119.91 rows=23659 width=877) (actual time=109.277..109.697 rows=1100 loops=1)
         ->  Nested Loop Left Join  (cost=0.14..9966.13 rows=23659 width=869) (actual time=0.051..85.429 rows=23659 loops=1)
               ->  Index Scan using index_crates_name_ordering on crates  (cost=0.08..7604.30 rows=23659 width=860) (actual time=0.037..34.975 rows=23659 loops=1)
               ->  Index Scan using recent_crate_downloads_crate_id on recent_crate_downloads  (cost=0.06..0.10 rows=1 width=12) (actual time=0.002..0.002 rows=1 loops=23659)
                     Index Cond: (crate_id = crates.id)
 Planning time: 1.307 ms
 Execution time: 111.840 ms
```

After:

```
 Limit  (cost=1052.34..1094.76 rows=100 width=877) (actual time=11.536..12.026 rows=100 loops=1)
   InitPlan 1 (returns $0)
     ->  Aggregate  (cost=627.96..627.96 rows=1 width=8) (actual time=4.966..4.966 rows=1 loops=1)
           ->  Index Only Scan using packages_pkey on crates crates_1  (cost=0.06..616.13 rows=23659 width=0) (actual time=0.015..3.513 rows=23659 loops=1)
                 Heap Fetches: 811
   ->  Subquery Scan on t  (cost=0.14..10037.11 rows=23659 width=877) (actual time=5.019..11.968 rows=1100 loops=1)
         ->  Nested Loop Left Join  (cost=0.14..9966.13 rows=23659 width=869) (actual time=0.051..6.831 rows=1100 loops=1)
               ->  Index Scan using index_crates_name_ordering on crates  (cost=0.08..7604.30 rows=23659 width=860) (actual time=0.038..3.331 rows=1100 loops=1)
               ->  Index Scan using recent_crate_downloads_crate_id on recent_crate_downloads  (cost=0.06..0.10 rows=1 width=12) (actual time=0.003..0.003 rows=1 loops=1100)
                     Index Cond: (crate_id = crates.id)
 Planning time: 1.377 ms
 Execution time: 12.106 ms
```

@jtgeibel
Member

Looks great to me! It took me way too long to notice that the pagination helper was adding the total count for the original queries.

@bors r+

@bors
Contributor

bors commented Mar 20, 2019

📌 Commit c9f4394 has been approved by jtgeibel

@bors
Contributor

bors commented Mar 20, 2019

⌛ Testing commit c9f4394 with merge 907a2d4...

bors added a commit that referenced this pull request Mar 20, 2019
Optimize our most time consuming query

@bors
Contributor

bors commented Mar 20, 2019

☀️ Test successful - checks-travis
Approved by: jtgeibel
Pushing 907a2d4 to master...

@bors bors merged commit c9f4394 into rust-lang:master Mar 20, 2019
@sgrif sgrif deleted the sg-query-optimization branch March 25, 2019 21:25
sgrif added a commit to sgrif/crates.io that referenced this pull request Aug 18, 2019
This continues the changes made in rust-lang#1763. Since that PR was merged, one
of the non-code steps has been taken care of -- all users hitting any
endpoint with `?page=20` (an arbitrary search pattern that seemed high
enough to catch any crawlers going through pagination) have been
contacted about the change, with a PR opened against any that included a
repo. (Interesting aside: there are *zero* records of this for any
endpoint other than search, which perhaps implies we can get rid of a
few of these endpoints, but that's a separate discussion.)

This PR does not change any functionality, but moves some code around
to better encapsulate things for upcoming changes. Specifically:

- Change our frontend to show "next/prev page" links on the all crates
  page
- Stop returning the "total" meta item when the next/prev page links
  will be cursor based (which I'd actually just like to start omitting
  in general)

The main goal of this change was to stop having any code outside of
`Paginated` (which has been renamed to `PaginatedQuery`, as there's a
new type called `Paginated`) care about *how* pagination happens (with
the exception of `reverse_dependencies`, which uses raw SQL, and sort of
has to... That was a bit of a rabbit hole, see diesel-rs/diesel#2150 for
details. Given the low traffic to that endpoint, I think we can safely
ignore it).

The distribution of responsibilities is as follows:

- `PaginatedQuery<T>`: Given the query params, decides how to paginate
  things, generates appropriate SQL, loads a `Paginated<T>`.
- `Paginated<T>`: Handles providing an iterator over the records,
  getting the total count (to be removed in the near future), and
  providing the next/prev page params.
- `Request`: Takes the pagination-related query params and turns them
  into an actual URL (note: due to jankiness in our router, this will
  only ever be a query string; we have no way of getting the actual path).
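
The split above can be sketched with plain Rust types (a simplified in-memory model using the PR's names; the real `PaginatedQuery` wraps a Diesel query and generates SQL):

```rust
/// Decides how to paginate and loads a `Paginated<T>`.
struct PaginatedQuery {
    page: u32,
    per_page: u32,
}

/// Exposes only the records, the total, and next/prev page params.
struct Paginated<T> {
    records: Vec<T>,
    total: i64, // to be removed once the UI stops showing page numbers
    page: u32,
    per_page: u32,
}

impl PaginatedQuery {
    /// Load one page from an in-memory "table" (stands in for the SQL query).
    fn load<T: Clone>(&self, rows: &[T]) -> Paginated<T> {
        let start = ((self.page - 1) * self.per_page) as usize;
        let records = rows.iter().skip(start).take(self.per_page as usize).cloned().collect();
        Paginated { records, total: rows.len() as i64, page: self.page, per_page: self.per_page }
    }
}

impl<T> Paginated<T> {
    fn iter(&self) -> std::slice::Iter<'_, T> {
        self.records.iter()
    }

    /// Query-string params for the next page, if there is one
    /// (the router only lets us build a query string, not a full path).
    fn next_page_params(&self) -> Option<String> {
        if (self.page as i64) * (self.per_page as i64) < self.total {
            Some(format!("?page={}&per_page={}", self.page + 1, self.per_page))
        } else {
            None
        }
    }
}
```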

The next step from here is to change our UI to stop showing page
numbers, and then remove the `total` field.

This PR will introduce a performance regression that was addressed by
rust-lang#1668. That PR was addressing "this will become a problem in the
future", not "this is about to take the site down". Given the intent to
remove the `total` field entirely, I think it is fine to cause this
regression in the short term. If we aren't confident that the changes to
remove this field will land in the near future, or want to be
conservative about this, I can add some additional complexity/git churn
to retain the previous performance characteristics.
bors added a commit that referenced this pull request Sep 5, 2019
Refactor pagination to prepare for cursor-based pagination.
