Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

feat: don't store duplicate events in the notice queue #1372

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

tonyandrewmeyer
Copy link
Contributor

@tonyandrewmeyer tonyandrewmeyer commented Sep 13, 2024

When a new event is emitted, check if there is already an exact duplicate (the same observer, same event type, and same event snapshot) in the storage from a deferral in a previous run. If a duplicate does exist, then don't store a new notice and snapshot.

There are performance implications:

  • For every event emitted, there's a full iteration through the notices queue. If there are no deferred events, then the notices queue will be empty, so in the majority of cases this should only be one additional storage query (to get the empty result). If the deferral queue is very large, then this may be noticeable, although the point of the change is to reduce the length of the queue, because there are already issues when it's large.
  • For each notice in the queue (again: normally none) the snapshot is loaded (a storage query; currently all snapshots are quite small in size) for any events that have matching observers and kinds. If the queue had a lot of events of the same type, it's likely the observer would match, and if the snapshots were different (for example, many secret-changed events but for different secrets) then there is a cost to doing the comparison.
  • In cases where the queue currently builds up with a lot of duplicates, there will be a significant performance improvement, because only one of the (notice+snapshot) events will be processed each run.

There is also a (deliberate) behaviour change that does impact event ordering. For example, consider this sequence of events:

  1. config-changed deferred
  2. secret-changed deferred
  3. config-changed

Currently, this would result in:

  1. ConfigChangedEvent
  2. ConfigChangedEvent, SecretChangedEvent
  3. ConfigChangedEvent, SecretChangedEvent, ConfigChangedEvent

With this change, this would result in:

  1. ConfigChangedEvent
  2. ConfigChangedEvent, SecretChangedEvent
  3. ConfigChangedEvent, SecretChangedEvent

More generally, there could currently be duplicate (notice+snapshot) events intermixed throughout the queue, and each run they will be re-emitted in the order in which they originally occurred. With this change, they will be re-emitted in the order in which they originally occurred, except if they have already emitted this run. The particularly noticeable change is that the Juju event that triggered the run may not be the last event (if it was a duplicate of one in the queue).

We could potentially do this differently - for example, updating the sequence so that when a duplicate occurs it moves the event to the end of the queue (by dropping and adding the notice+snapshot, or explicitly setting the sequence field for SQL and by just reordering the list for Juju). This would add complexity and have a performance penalty, however, and it seems more correct to have the original order.

For unit tests:

  • Harness: this change is incompatible with some Harness use - specifically, if the test code emits the same event more than once, where it's deferred at least once, there is no reemit() call like there would be in production, and the test expects the handler to be called more than once. For this reason, the skipping is disabled for Harness.
  • Scenario: Scenario is more explicit with deferred events, so if you want to have had the 'skipping' behaviour occur before the event you are ctx.runing then you need to manage that in the list of deferred events passed into the State. We could add a consistency check to alert if there are duplicates in that list (this would be easier to do when the Scenario code is in this repo). However, the Scenario behaviour does still change: if the event is deferred in the ctx.run and there's already a match in the state's deferred list, then the new (deferred) event does not get added, and the output state doesn't change (which is what we want). We get this behaviour automatically because Scenario mimics the runtime behaviour more closely, actually running the framework emitting/reemitting. So: Scenario tests are both safer, and can be used to match the new behaviour.

This can be tested manually with a charm that optionally defers events, e.g. with the code below.

charm.py and charmcraft.yaml content
class NoticeQueueCharm(ops.CharmBase):
    def __init__(self, framework: ops.Framework):
        super().__init__(framework)
        framework.observe(self.on.config_changed, self._on_config_changed)
        framework.observe(self.on.secret_changed, self._on_secret_changed)

    def _on_config_changed(self, event):
        logger.info("Running config-changed")
        if self.config.get("secretopt"):
            secret = self.model.get_secret(id=self.config["secretopt"])
            # Get the content so we are 'subscribed' to the updates.
            secret.get_content()
        if self.config.get("secretopt2"):
            secret = self.model.get_secret(id=self.config["secretopt2"])
            # Get the content so we are 'subscribed' to the updates.
            secret.get_content()
        if self.config.get("opt", "").startswith("defer"):
            event.defer()

    def _on_secret_changed(self, event):
        logger.info("Running secret-changed")
        if self.config.get("opt", "").startswith("defer"):
            event.defer()
config:
  options:
    opt:
      description: dummy option to trigger config-changed
    secretopt:
      type: secret
      description: a user secret
    secretopt2:
      type: secret
      description: a user secret

If you want to see the queue while you're doing this, you can use code like this:

        store = self.framework._storage
        for event_path, observer_path, method_name in store.notices(None):
            handle = ops.Handle.from_path(event_path)
            snapshot_data = store.load_snapshot(handle.path)
            logger.info(
                "event_path: %s, observer_path: %s, method_name: %s, snapshot data: %r",
                event_path, observer_path, method_name, snapshot_data,
            )

If you set opt to anything not starting with "defer" then you should get a config-changed event every time. If you set it to something starting with "defer", then it will run exactly once each time you set the config (remember to change the value, or Juju will skip the event) - with ops@main you'll instead get a config-changed event for every time that you change the config, every time you change it (ie. the queue will build up).

You can also check it with an event that has a snapshot, such as secret-changed. If the config is set to defer, every time you change the content of the first secret, you'll get one secret-changed event (but with ops@main, each time you'll get multiple, depending on how many times you've done it). If you also change the second secret, you'll get two secret-changed events, one for each secret (because the snapshots differ).

You can intermix the different events and should always have exactly zero or one (event type+snapshot) in the queue. If you change the opt value back to something not starting with "defer", then you should see all the events complete and have an empty queue.

Scenario test that shows the behaviour
import ops
from ops import testing

class MyCharm(ops.CharmBase):
    def __init__(self, framework):
        super().__init__(framework)
        framework.observe(self.on.secret_changed, self._on_sc)
        framework.observe(self.on.update_status, self._on_us)

    def _on_us(self, event):
        print("update-status", event)

    def _on_sc(self, event):
        print("secret-changed", event)
        event.defer()


ctx = testing.Context(MyCharm, meta={"name": "foo"})
secret = testing.Secret({"foo": "bar"})
devent1 = ctx.on.update_status().deferred(MyCharm._on_us)
devent2 = ctx.on.secret_changed(secret).deferred(MyCharm._on_sc)
state_in = testing.State(secrets={secret}, deferred=[devent1, devent2])
state_out = ctx.run(ctx.on.secret_changed(secret), state_in)

assert state_out.unit_status == testing.UnknownStatus()
print(state_out.deferred)

Note that this requires fixing a small bug in Secret snapshoting (PR). That's unrelated to this change - it also occurs if you use deferred secrets in main.

Fixes #935

@tonyandrewmeyer
Copy link
Contributor Author

Although this seems like a fairly small change, and it's in one very specific piece of the code, it's at the heart of running every event, so very thorough review is appreciated. I'm also happy to discuss the performance implications as well as the implications of the behaviour change.

@tonyandrewmeyer
Copy link
Contributor Author

Will move back out of draft once I've investigated the charm failures - no point reviewing until then.

@tonyandrewmeyer tonyandrewmeyer marked this pull request as draft September 13, 2024 05:52
@tonyandrewmeyer
Copy link
Contributor Author

Will move back out of draft once I've investigated the charm failures - no point reviewing until then.

Code like this doesn't work with Harness:

class MyCharm(ops.CharmBase):
    ...
    def _on_config_changed(self, event):
        if self.something():
            event.defer()
        self.do_something()

def test_config_changed():
    h = testing.Harness()
    h.begin()
    MyCharm.something = lambda _: False
    h.charm.on.config_changed.emit()
    MyCharm.something = lambda _: True
    h.charm.on.config_changed.emit()
    assert MyCharm.do_something.called()  # Assume this was patched somewhere

The framework skips the second emit. Arguably, the test should clean up in some way because it's actually going to run the config-changed handler three times (once deferred, then twice when it doesn't defer, once for the deferred notice and once for the new one) - it could be broken into multiple tests or use a fresh Harness or could just call reemit() rather than emit() to get the deferred event rather than a new one.

It seems likely (I'll do a quick check to be sure) that this pattern is too common to have the change in this way - I don't think we can expect people to update their tests to account for this, since it's not really a bug, just not the ideal way to write the test. Maybe the 'skip duplicate defers' will need to be explicitly enabled?

@tonyandrewmeyer
Copy link
Contributor Author

tonyandrewmeyer commented Oct 4, 2024

I did a super-tox (-e unit) run and there were no new failures ... however, since this behaviour isn't active when using Harness, that's not as informative as it would be. I guess I could try doing an -e integration although I'm not sure how much coverage that provides (particularly for deferring).

@tonyandrewmeyer
Copy link
Contributor Author

Ok, this should be good for an initial review now.

@tonyandrewmeyer tonyandrewmeyer marked this pull request as ready for review October 4, 2024 05:50
Copy link
Collaborator

@benhoyt benhoyt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this looks good to me, and it's a nice one to get fixed. Could we do a couple more tests before merging, just in case:

  1. You've called out the performance changes, but for a sanity check, could you please do a couple of measurements? What's a typical time for load the notices given and empty queue? And then if the queue is non-empty, what's the time to load each snapshot? I'm guessing sub-milliseconds in both cases, but would be good to confirm.
  2. I think trying to run super-tox -e integration will be too slow / problematic, but could we at least please spot check tox -e integration on a couple of charms that use defer?

Copy link
Contributor

@dimaqq dimaqq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I may not have thought this through.

Is it possible to get 2 distinct events from Juju that only differ in arguments?

For example, 2 secrets are now available, the secret id is different, but everything else is same (event name, observer function, human name if any).

Comment on lines +860 to +864
except NoSnapshotError:
existing_event_data = {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code branch feels wrong.
If an event cannot be loaded, should it not be discarded instead?
Or do we have saved snapshots with no data?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are definitely empty snapshots - for example, start, config-changed, and stop all have empty snapshots. At the moment, we store an empty snapshot in the table in those cases, but it seems more resilient to not assume that will always be the case. This isn't failing to load the event, it's failing to find a snapshot that goes with the notice.

ops/framework.py Show resolved Hide resolved
Comment on lines +497 to +555
# Emit an event, which will be deferred.
pub.d.emit('foo')
notices = tuple(framework._storage.notices())
assert len(notices) == 2 # One per observer.
assert framework._storage.load_snapshot(notices[0][0]) == {'data': 'foo'}

# Emit the same event, and we'll still just have the single notice.
pub.d.emit('foo')
assert len(tuple(framework._storage.notices())) == 2

# Emit the same event kind but with a different snapshot, and we'll get a new notice.
pub.d.emit('bar')
notices = tuple(framework._storage.notices())
assert len(notices) == 4
assert framework._storage.load_snapshot(notices[2][0]) == {'data': 'bar'}

# Emit a totally different event, and we'll get a new notice.
pub.n.emit()
notices = tuple(framework._storage.notices())
assert len(notices) == 6
assert framework._storage.load_snapshot(notices[2][0]) == {'data': 'bar'}
assert framework._storage.load_snapshot(notices[4][0]) == {}

# Even though these events are far back in the queue, since they're
# duplicates, they will get skipped.
pub.d.emit('foo')
pub.d.emit('bar')
pub.n.emit()
assert len(tuple(framework._storage.notices())) == 6

def notices_for_observer(n: int):
return [
notice for notice in framework._storage.notices() if notice[1].endswith(f'[{n}]')
]

# Stop deferring on the first observer, and all those events will be
# completed and the notices removed, while the second observer will
# still have them queued.
pub.r.emit()
assert len(tuple(framework._storage.notices())) == 6
pub.n.emit()
framework.reemit()
assert len(notices_for_observer(1)) == 0
assert len(notices_for_observer(2)) == 3

# Without the defer active, the first observer always ends up with an
# empty queue, while the second observer's queue continues to skip
# duplicates and add new events.
pub.d.emit('foo')
pub.d.emit('foo')
pub.d.emit('bar')
pub.n.emit()
pub.d.emit('foo')
pub.d.emit('bar')
pub.n.emit()
pub.d.emit('baz')
framework.reemit()
assert len(notices_for_observer(1)) == 0
assert len(notices_for_observer(2)) == 4
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please consider refactoring the setup and splitting this flow into separate tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem here is that what this is testing is the behaviour over multiple types of events, some deferred, some not, some running previously run deferred events. It's all about checking whether the state is correct over a long sequence of changes.

I could split it - for example, each comment could be the start of a new test. But the problem there is that each test would either have to do everything in the previous test to get the database into the correct state (which means it'd just be the same plus multiple shorter versions as well) or I'd need to manually set up the database in all but the first test, to have the expected data. That seems worse and more fragile to me.

I could split it into two tests, one for each type of observer, but part of the test is making sure that when there are two observers that are properly handled separately, and it feels like splitting the test would make it easier to break that.

I definitely agree that the test is very long, and I don't like that.

Would it help if there were some extra helper functions so that something like:

pub.d.emit('foo')
pub.d.emit('foo')
pub.d.emit('foo')
pub.n.emit()

Was something like this?

emit([(d, 'foo')], repeat=3)
emit([(n)])

It would reduce the number of lines, but my feeling is that it would end up less understandable.

@tonyandrewmeyer
Copy link
Contributor Author

Is it possible to get 2 distinct events from Juju that only differ in arguments?

Yes. For example, you can get a secret-removed for two different secrets. That's why we need to compare the snapshots as well as the event itself.

@tonyandrewmeyer
Copy link
Contributor Author

  1. You've called out the performance changes, but for a sanity check, could you please do a couple of measurements? What's a typical time for load the notices given and empty queue? And then if the queue is non-empty, what's the time to load each snapshot? I'm guessing sub-milliseconds in both cases, but would be good to confirm.

With sqlite3, loading empty notices is very fast, yes, even on my fairly slow local machine:

>>> timeit.timeit('list(s.notices())', "import ops\ns=ops.storage.SQLiteStorage('/tmp/empty-unit-state.db')", number=1000000) / 1000000
2.4867908030282704e-06

Loading a snapshot is also fast:

timeit.timeit('s.load_snapshot("NoticeQueueCharm/on/secret_changed[3551]")', "import ops\ns=ops.storage.SQLiteStorage('.unit-state.db')", number=1000000) / 1000000
4.108476707013324e-06

When using Juju for storage, I expect there is considerable variation depending on your Juju setup. But with my local one, it is noticeably slower (because of all the state-get calls presumably):

>>> timeit.timeit('list(s.notices())', "import ops\ns=ops.storage.JujuStorage()", number=1000) / 1000
0.0052693093249108645
>>> timeit.timeit('s.load_snapshot("NoticeQueueCharm/on/secret_changed[3551]")', "import ops\ns=ops.storage.JujuStorage()", number=1000) / 1000
0.005267057369928807

So, on my machine/my Juju, there would be an unnoticeable penalty using sqlite3, and a ~5ms penalty using Juju (which we do discourage). In exchange, performance when there are a lot of deferred events is improved.

  1. I think trying to run super-tox -e integration will be too slow / problematic,

Yes, I've tried this before and don't have a solution for it yet.

but could we at least please spot check tox -e integration on a couple of charms that use defer?

I'm working on this, but nearly all the charms that are using defer() are data platform ones, and running the integration tests (both locally and in GitHub CI) is unfortunately complicated.

@benhoyt
Copy link
Collaborator

benhoyt commented Dec 10, 2024

@tonyandrewmeyer It looks like this PR needs to be merged/rebased, as the diff currently shows 7500 LoC with all of Scenario being added.

@tonyandrewmeyer
Copy link
Contributor Author

@tonyandrewmeyer It looks like this PR needs to be merged/rebased, as the diff currently shows 7500 LoC with all of Scenario being added.

I'll rebase once I've figured out how to test this against some charm integration tests and have also addressed the outstanding review comment.

tonyandrewmeyer and others added 7 commits December 12, 2024 12:58
Docs: Ops: Add top-level intro, clean up ops.module intro in light of
that, flesh out ops.pebble and ops.testing intros a bit more

https://warthogs.atlassian.net/browse/CHARMTECH-183

- [x] "home page" library reference
- [x] ops module
- [x] ops.main entry point
- [x] ops.pebble module
- [x] ops.testing module
…names in ops.testing (canonical#1381)

Add a new optional install `testing`, e.g. `pip install ops[testing]`.
This pulls in a compatible version of `ops-scenario`, and exposes the
Scenario names in the `ops.testing` namespace, alongside Harness.

`pip install ops[harness]` is also supported to ease the (far off,
presumably 3.0) transition to Harness being moved out of the base
install. It currently installs no extra dependencies, so is the same as
`pip install ops` but a forward-looking charm would use `pip install
ops[harness]` (or `pip install ops[harness, testing]`) if using Harness.

Requires ops-scenario 7.0.5, which has the required adjustments to
support insertion into `ops.testing`.

The `ActionFailed` name exists in both `ops._private.harness` and
`scenario.context`. This is handled by adjusting the Harness class to
support the functionality of both and monkeypatching that into Scenario
until Scenario starts using it. It's compatible with both Harness and
Scenario, but will have empty data in an attribute (which attribute
depends on which framework is used).

The `Container` name in `ops.testing`, which is only present for
backwards compatibility, is also overwritten if ops-scenario is
installed. If anyone is actually using `ops.testing.Container` instead
of `ops.Container` then they'll need to fix their code before using
`ops[testing]` (or ops-scenario combined with the release of ops with
this change).

A very basic unit test is added to make sure that Scenario tests work
(all the actual Scenario tests are in the ops-scenario repo) if
ops-scenario/ops[testing] is installed (this is the case for `tox -e
unit`). A test is also added to ensure that all of the Scenario names
are documented, since `automodule` isn't used any more.

Also adjusts the documentation to include the new framework in
ops.testing.

---------

Co-authored-by: Ben Hoyt <benhoyt@gmail.com>
Co-authored-by: Dima Tisnek <dimaqq@gmail.com>
@tonyandrewmeyer
Copy link
Contributor Author

tonyandrewmeyer commented Dec 19, 2024

but could we at least please spot check tox -e integration on a couple of charms that use defer?

hydra-operator has 16 defer() calls, and integration tests pass with this branch. I'm still working on getting integrations tests running from another charm (almost certainly a data platform one, as most of the defer use is there).

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Prevent the deferred events queue to contain duplicates
3 participants