Speed up table counts when displaying index and database pages #2398

Closed
simonw opened this issue Aug 20, 2024 · 12 comments

simonw commented Aug 20, 2024

Original title: The query that counts all matching rows on table page should have a shorter time limit

I thought it did already, but apparently not.

simonw commented Aug 20, 2024

This code here:

# Otherwise run a select count(*) ...
if count_sql and count is None and not nocount:
    try:
        count_rows = list(await db.execute(count_sql, from_sql_params))
        count = count_rows[0][0]
    except QueryInterrupted:
        pass
return count

Currently it can take as long as the default SQL time limit, which means that if a table is huge you could wait a second or more for the table page to fail to count the rows.

simonw commented Aug 20, 2024

There are other things that could happen here, for example:

  1. If count times out, link to a SQL query page that runs the full count
  2. Counts occur via a fetch() call and show a loading indicator while they are running
  3. Caching! Counts could be cached for the given query filters (see the sketch after this list)
  4. Super-fancy caching: any time the table is modified (captured via triggers) the count cache is invalidated, using a mechanism like simonw/sqlite-chronicle

I probably won't implement any of the fancy options as part of this initial issue.
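
For illustration, option 3 could be as simple as an in-memory cache keyed on the count SQL and its parameters. This is only a sketch, not anything Datasette does today - cached_count() and the TTL are made up:

import time

# Hypothetical cache: (count_sql, params) -> (count, when_cached)
_count_cache = {}
COUNT_CACHE_TTL = 60  # seconds - an arbitrary choice for this sketch

async def cached_count(db, count_sql, params):
    key = (count_sql, tuple(sorted(params.items())))
    hit = _count_cache.get(key)
    if hit is not None and time.monotonic() - hit[1] < COUNT_CACHE_TTL:
        return hit[0]
    count = (await db.execute(count_sql, params)).rows[0][0]
    _count_cache[key] = (count, time.monotonic())
    return count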

simonw commented Aug 20, 2024

What should the count timeout be? I'm inclined to add a new setting for it, joining sql_time_limit_ms, facet_time_limit_ms and facet_suggest_time_limit_ms.

Could call it sql_count_time_limit_ms.
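
For illustration, that setting could be threaded through to the count query from the earlier comment via the existing custom_time_limit parameter of db.execute() - a sketch, since sql_count_time_limit_ms doesn't exist yet:

# Hypothetical: sql_count_time_limit_ms is the proposed new setting
count_time_limit = datasette.setting("sql_count_time_limit_ms")

if count_sql and count is None and not nocount:
    try:
        count_rows = list(
            await db.execute(
                count_sql,
                from_sql_params,
                custom_time_limit=count_time_limit,
            )
        )
        count = count_rows[0][0]
    except QueryInterrupted:
        pass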

simonw commented Aug 20, 2024

We do have a count time limit elsewhere, but that's for listing tables:

async def table_counts(self, limit=10):
    if not self.is_mutable and self.cached_table_counts is not None:
        return self.cached_table_counts
    # Try to get counts for each table, $limit timeout for each count
    counts = {}
    for table in await self.table_names():
        try:
            table_count = (
                await self.execute(
                    f"select count(*) from [{table}]",
                    custom_time_limit=limit,
                )
            ).rows[0][0]
            counts[table] = table_count
        # In some cases I saw "SQL Logic Error" here in addition to
        # QueryInterrupted - so we catch that too:
        except (QueryInterrupted, sqlite3.OperationalError, sqlite3.DatabaseError):
            counts[table] = None
    if not self.is_mutable:
        self._cached_table_counts = counts
    return counts
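
For context, limit here is passed through as custom_time_limit, which is measured in milliseconds - so a call like this gives each table a 10ms budget:

# Tables whose count can't finish in ~10ms come back as None
counts = await db.table_counts(limit=10)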

simonw commented Aug 20, 2024

Currently we have a ?_nocount=1 query string option which can be used to suppress that count query - e.g. compare:

https://datasette.io/content/assets?_sort=id&size__gt=100

[screenshot: the table page including the row count]

to https://datasette.io/content/assets?_sort=id&_nocount=1&size__gt=100

[screenshot: the same page with the count omitted]

simonw commented Aug 20, 2024

I think that second one should say "Rows where size > 100", not just "where size > 100".

simonw commented Aug 20, 2024

Maybe add a ?_count=1 option which forces the count to happen, which we can then link to for cases where the count timed out.
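
A sketch of how the two query string options might interact - ?_count=1 is only proposed here, so the handling below is an assumption:

# Proposed: ?_count=1 forces the count even when it would be skipped
force_count = request.args.get("_count") == "1"
nocount = request.args.get("_nocount") == "1" and not force_count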

simonw commented Aug 21, 2024

Another option here: I could cap the count at a maximum of 1,000 using this pattern:

with original as (select ...),
truncated as (select * from original limit 1001)
select count(*) from truncated;

Demo: https://datasette.io/content?sql=with+original+as+%28select+*+from+releases%29%2C%0D%0Atruncated+as+%28select+*+from+original+limit+101%29%0D%0Aselect+count%28*%29+from+truncated%3B

Then if that comes back as 1001 I can show >1,000. This should have a pretty tight limit on how much work the count operation ends up doing, even for brute force table scans as those would only have to consult 1,000 records max.
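
As a sketch, wrapping an arbitrary count query in that pattern might look like this - truncated_count() is a hypothetical helper, not existing code:

async def truncated_count(db, inner_sql, params, threshold=1000):
    # limit threshold + 1 distinguishes "exactly threshold rows"
    # from "more than threshold rows"
    sql = (
        f"with original as ({inner_sql}), "
        f"truncated as (select * from original limit {threshold + 1}) "
        "select count(*) from truncated"
    )
    count = (await db.execute(sql, params)).rows[0][0]
    # truncated=True means the UI should display ">1,000" style output
    return count, count > threshold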

Would need to change the JSON API to include a count_truncated field if you request ?_extra=count and it got truncated. Maybe have a ?_extra=full_count that ignores that logic?

simonw commented Aug 21, 2024

This tiny diff makes a HUGE difference all over the place:

diff --git a/datasette/database.py b/datasette/database.py
index c761dad7..428d8703 100644
--- a/datasette/database.py
+++ b/datasette/database.py
@@ -376,7 +376,7 @@ class Database:
             try:
                 table_count = (
                     await self.execute(
-                        f"select count(*) from [{table}]",
+                        f"select count(*) from (select * from [{table}] limit 10001)",
                         custom_time_limit=limit,
                     )
                 ).rows[0][0]
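
The reason this helps so much: the subquery means SQLite can stop scanning after 10,001 rows, so even a brute-force count against an enormous table now does a small, bounded amount of work instead of visiting every row.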

simonw changed the title from "The query that counts all matching rows on table page should have a shorter time limit" to "Speed up table counts when displaying index and database pages" on Aug 21, 2024

simonw commented Aug 21, 2024

I've decided to go with a limit of 10,000, which seems to make everything feel really snappy while still showing useful results for most databases.

I'm now showing "Many rows" if the time limit was hit, ">10,000 rows" if the row count came back as 10,001, and the actual number of rows otherwise.
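
That display logic could look something like this sketch (describe_count() is a hypothetical name):

def describe_count(count, timed_out, threshold=10_000):
    if timed_out:
        return "Many rows"
    if count > threshold:  # the subquery caps count at threshold + 1
        return f">{threshold:,} rows"
    return f"{count:,} rows"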

simonw commented Aug 21, 2024

I'm going to deploy this branch to Datasette Cloud and see how it feels.

simonw commented Aug 21, 2024

It feels really good. I'm happy with this change.
