Skip to content

🗃 Remove visible when acking jobs #16

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Merged
merged 1 commit into from
Feb 7, 2025
Merged

🗃 Remove visible when acking jobs #16

merged 1 commit into from
Feb 7, 2025

Conversation

alecgibson
Copy link
Collaborator

This is a non-breaking change that forms part of the migration path to improving our performance in #15

When acking a job, we now unset the visible property, which makes visible and deleted mutually exclusive properties, so that the presence of one implies the absence of the other.

We can do this, because the only time we query on visible is when we also query for deleted: {$exists: false}. Similarly, the only times we query for deleted: {$exists: true} are times when we don't query visible.

This is a non-breaking change that forms part of the migration path to
improving our performance in #15

When acking a job, we now unset the `visible` property, which makes
`visible` and `deleted` mutually exclusive properties, so that the
presence of one implies the absence of the other.

We can do this, because the only time we query on `visible` is when we
*also* query for `deleted: {$exists: false}`. Similarly, the only times
we query for `deleted: {$exists: true}` are times when we don't query
`visible`.
alecgibson added a commit that referenced this pull request Feb 6, 2025
This is a **BREAKING CHANGE** that aims to improve database performance
by changing the queries and indexes used to perform operations. There
will be no changes visible to consumers of the JavaScript API.

The break will:

 - require consumers to run a migration query before upgrading
 - add new indexes
 - allow dropping of an old index

More details on migration are at the bottom of this commit message.

Motivation
----------
This library seems to have been written with the assumption that its
collections are small. However, we have hundreds of thousands of jobs in
various queues on Production, which causes some slow queries because of
the design choices made in the schema of this library.

In particular, we aim to address two issues:

 1. `{deleted: {$exists: boolean}}` calls are inefficient, but used by
    practically every query in this library
 2. counting inflight jobs has awful performance when there are many
    available jobs

The result is that no further filtering is needed beyond the index on
any of the issued queries.

`$exists`
---------
MongoDB's `$exists` operator has [tricky index performance][1]. In the
best cases, `$exists: true` can use the index, but only if it's sparse,
and `$exists: false` can **never** just use the index: it always needs
to fetch documents.

In order to avoid the constant use of `$exists` in this library, we rely
on a logical paradigm shift: we `$unset` `visible` when acking the job,
so that `visible` and `deleted` are mutually exclusive fields. Therefore
we can:

 - add a sparse index for both of these fields
 - query on the field we care about, and **know** that it implies the
   absence of the other field, allowing the removal of `$exists`
   assertions

It should be noted that in local testing, I observed
[strange behaviour][2] when trying to use this partial index: we have to
use `ack: {$gt: ''}` instead of `ack: {$exists: true}` to get MongoDB to
leverage this index for some reason.

`inFlight()`
------------
The existing `inFlight()` query has particularly bad performance in
cases where there are many (hundreds of thousands) of jobs available to
pick up.

This is because the existing query uses the `deleted_1_visible_1` index,
but even after filtering by `deleted` and `visible` with the index, the
database will need to fetch every single job that could be picked up,
and check for `ack`, which is very slow.

We improve the performance here by:

 - the removal of the `$exists` query (see above)
 - the addition of a partial index that only contains unacked jobs that
   have been retrieved at some point by `get()`. We can then filter
   these by the current time to find in-flight jobs

Migration path
--------------
This performance improvement is built upon a shift in the assumptions
made about underlying job structure: namely, that `deleted` and
`visible` are now mutually exclusive properties (which was not true
before).

 1. Bump patch version to [`7.1.1`][3]: this will start removing the
    `visible` property from acked jobs in a non-breaking way
 2. Deploy the patch to Production
 3. Update any existing documents to match this new schema:

    ```js
    db.collection.updateMany(
      {deleted: {$exists: true}},
      {$unset: {visible: 1}},
    )
    ```

 4. Bump major version to `8.0.0` and deploy
 5. Drop old index `delted_1_visible_1`, which is no longer used

[1]: https://www.mongodb.com/docs/manual/reference/operator/query/exists/#use-a-sparse-index-to-improve--exists-performance
[2]: https://www.mongodb.com/community/forums/t/partial-index-is-not-used-during-search/290507/2
[3]: #16
@alecgibson alecgibson mentioned this pull request Feb 6, 2025
1 task
@alecgibson alecgibson merged commit 3b121da into main Feb 7, 2025
2 checks passed
@alecgibson alecgibson deleted the remove-visible branch February 7, 2025 09:30
alecgibson added a commit that referenced this pull request Feb 10, 2025
This is a **BREAKING CHANGE** that aims to improve database performance
by changing the queries and indexes used to perform operations. There
will be no changes visible to consumers of the JavaScript API.

The break will:

 - require consumers to run a migration query before upgrading
 - add new indexes
 - allow dropping of an old index

More details on migration are at the bottom of this commit message.

Motivation
----------
This library seems to have been written with the assumption that its
collections are small. However, we have hundreds of thousands of jobs in
various queues on Production, which causes some slow queries because of
the design choices made in the schema of this library.

In particular, we aim to address two issues:

 1. `{deleted: {$exists: boolean}}` calls are inefficient, but used by
    practically every query in this library
 2. counting inflight jobs has awful performance when there are many
    available jobs

The result is that no further filtering is needed beyond the index on
any of the issued queries.

`$exists`
---------
MongoDB's `$exists` operator has [tricky index performance][1]. In the
best cases, `$exists: true` can use the index, but only if it's sparse,
and `$exists: false` can **never** just use the index: it always needs
to fetch documents.

In order to avoid the constant use of `$exists` in this library, we rely
on a logical paradigm shift: we `$unset` `visible` when acking the job,
so that `visible` and `deleted` are mutually exclusive fields. Therefore
we can:

 - add a sparse index for both of these fields
 - query on the field we care about, and **know** that it implies the
   absence of the other field, allowing the removal of `$exists`
   assertions

It should be noted that in local testing, I observed
[strange behaviour][2] when trying to use this partial index: we have to
use `ack: {$gt: ''}` instead of `ack: {$exists: true}` to get MongoDB to
leverage this index for some reason.

`inFlight()`
------------
The existing `inFlight()` query has particularly bad performance in
cases where there are many (hundreds of thousands) of jobs available to
pick up.

This is because the existing query uses the `deleted_1_visible_1` index,
but even after filtering by `deleted` and `visible` with the index, the
database will need to fetch every single job that could be picked up,
and check for `ack`, which is very slow.

We improve the performance here by:

 - the removal of the `$exists` query (see above)
 - the addition of a partial index that only contains unacked jobs that
   have been retrieved at some point by `get()`. We can then filter
   these by the current time to find in-flight jobs

Migration path
--------------
This performance improvement is built upon a shift in the assumptions
made about underlying job structure: namely, that `deleted` and
`visible` are now mutually exclusive properties (which was not true
before).

 1. Bump patch version to [`7.1.1`][3]: this will start removing the
    `visible` property from acked jobs in a non-breaking way
 2. Deploy the patch to Production
 3. Update any existing documents to match this new schema:

    ```js
    db.collection.updateMany(
      {deleted: {$exists: true}},
      {$unset: {visible: 1}},
    )
    ```

 4. Bump major version to `8.0.0` and deploy
 5. Drop old index `delted_1_visible_1`, which is no longer used

[1]: https://www.mongodb.com/docs/manual/reference/operator/query/exists/#use-a-sparse-index-to-improve--exists-performance
[2]: https://www.mongodb.com/community/forums/t/partial-index-is-not-used-during-search/290507/2
[3]: #16
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants