feat: don't store duplicate events in the notice queue #1372
Conversation
Although this seems like a fairly small change, and it's in one very specific piece of the code, it's at the heart of running every event, so very thorough review is appreciated. I'm also happy to discuss the performance implications as well as the implications of the behaviour change.

Will move back out of draft once I've investigated the charm failures - no point reviewing until then.
Code like this doesn't work with Harness:

```python
class MyCharm(ops.CharmBase):
    ...

    def _on_config_changed(self, event):
        if self.something():
            event.defer()
        self.do_something()


def test_config_changed():
    h = testing.Harness(MyCharm)
    h.begin()
    MyCharm.something = lambda _: False
    h.charm.on.config_changed.emit()
    MyCharm.something = lambda _: True
    h.charm.on.config_changed.emit()
    assert MyCharm.do_something.called()  # Assume this was patched somewhere.
```

The framework skips the second emit. Arguably, the test should clean up in some way, because it's actually going to run the config-changed handler three times (once when it defers, then twice when it doesn't defer: once for the deferred notice and once for the new one) - it could be broken into multiple tests, use a fresh Harness, or just call ….

It seems likely (I'll do a quick check to be sure) that this pattern is too common to have the change in this way - I don't think we can expect people to update their tests to account for this, since it's not really a bug, just not the ideal way to write the test. Maybe the 'skip duplicate defers' will need to be explicitly enabled?
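As an aside (this isn't from the PR - just an illustrative sketch of the "multiple tests" option mentioned above), the check could be split so that each Harness instance only emits the event once, so nothing relies on a duplicate notice being re-run:

```python
# Illustrative sketch only: split the test so each Harness gets its own notice queue.
# As in the original example, assume do_something was patched with a mock elsewhere.
def test_config_changed_defers():
    h = testing.Harness(MyCharm)
    h.begin()
    MyCharm.something = lambda _: True  # handler will defer
    h.charm.on.config_changed.emit()
    # assert on the deferred behaviour here


def test_config_changed_runs():
    h = testing.Harness(MyCharm)
    h.begin()
    MyCharm.something = lambda _: False  # handler will not defer
    h.charm.on.config_changed.emit()
    assert MyCharm.do_something.called()
```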
I did a super-tox (…).
(force-pushed from e8d1ec8 to 1701f87)
Ok, this should be good for an initial review now.
Thanks, this looks good to me, and it's a nice one to get fixed. Could we do a couple more tests before merging, just in case:
- You've called out the performance changes, but as a sanity check, could you please do a couple of measurements? What's a typical time to load the notices given an empty queue? And then, if the queue is non-empty, what's the time to load each snapshot? I'm guessing sub-milliseconds in both cases, but it would be good to confirm.
- I think trying to run super-tox `-e integration` will be too slow / problematic, but could we at least please spot-check `tox -e integration` on a couple of charms that use defer?
Sorry I may not have thought this through.
Is it possible to get 2 distinct events from Juju that only differ in arguments?
For example, 2 secrets are now available, the secret id is different, but everything else is the same (event name, observer function, human name if any).
```python
except NoSnapshotError:
    existing_event_data = {}
```
This code branch feels wrong.
If an event cannot be loaded, should it not be discarded instead?
Or do we have saved snapshots with no data?
There are definitely empty snapshots - for example, `start`, `config-changed`, and `stop` all have empty snapshots. At the moment, we store an empty snapshot in the table in those cases, but it seems more resilient to not assume that will always be the case. This isn't failing to load the event, it's failing to find a snapshot that goes with the notice.
```python
# Emit an event, which will be deferred.
pub.d.emit('foo')
notices = tuple(framework._storage.notices())
assert len(notices) == 2  # One per observer.
assert framework._storage.load_snapshot(notices[0][0]) == {'data': 'foo'}

# Emit the same event, and we'll still just have the single notice.
pub.d.emit('foo')
assert len(tuple(framework._storage.notices())) == 2

# Emit the same event kind but with a different snapshot, and we'll get a new notice.
pub.d.emit('bar')
notices = tuple(framework._storage.notices())
assert len(notices) == 4
assert framework._storage.load_snapshot(notices[2][0]) == {'data': 'bar'}

# Emit a totally different event, and we'll get a new notice.
pub.n.emit()
notices = tuple(framework._storage.notices())
assert len(notices) == 6
assert framework._storage.load_snapshot(notices[2][0]) == {'data': 'bar'}
assert framework._storage.load_snapshot(notices[4][0]) == {}

# Even though these events are far back in the queue, since they're
# duplicates, they will get skipped.
pub.d.emit('foo')
pub.d.emit('bar')
pub.n.emit()
assert len(tuple(framework._storage.notices())) == 6


def notices_for_observer(n: int):
    return [
        notice for notice in framework._storage.notices() if notice[1].endswith(f'[{n}]')
    ]


# Stop deferring on the first observer, and all those events will be
# completed and the notices removed, while the second observer will
# still have them queued.
pub.r.emit()
assert len(tuple(framework._storage.notices())) == 6
pub.n.emit()
framework.reemit()
assert len(notices_for_observer(1)) == 0
assert len(notices_for_observer(2)) == 3

# Without the defer active, the first observer always ends up with an
# empty queue, while the second observer's queue continues to skip
# duplicates and add new events.
pub.d.emit('foo')
pub.d.emit('foo')
pub.d.emit('bar')
pub.n.emit()
pub.d.emit('foo')
pub.d.emit('bar')
pub.n.emit()
pub.d.emit('baz')
framework.reemit()
assert len(notices_for_observer(1)) == 0
assert len(notices_for_observer(2)) == 4
```
Please consider refactoring the setup and splitting this flow into separate tests.
The problem here is that what this is testing is the behaviour over multiple types of events - some deferred, some not, some re-running previously deferred events. It's all about checking whether the state is correct over a long sequence of changes.
I could split it - for example, each comment could be the start of a new test. But the problem there is that each test would either have to do everything in the previous test to get the database into the correct state (which means it'd just be the same plus multiple shorter versions as well) or I'd need to manually set up the database in all but the first test, to have the expected data. That seems worse and more fragile to me.
I could split it into two tests, one for each type of observer, but part of the test is making sure that when there are two observers they are properly handled separately, and it feels like splitting the test would make it easier to break that.
I definitely agree that the test is very long, and I don't like that.
Would it help if there were some extra helper functions, so that something like:

```python
pub.d.emit('foo')
pub.d.emit('foo')
pub.d.emit('foo')
pub.n.emit()
```

was something like this?

```python
emit([(d, 'foo')], repeat=3)
emit([(n)])
```

It would reduce the number of lines, but my feeling is that it would end up less understandable.
Yes. For example, you can get a `secret-removed` for two different secrets. That's why we need to compare the snapshots as well as the event itself.
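To make that concrete, here's a rough sketch of the kind of comparison involved (illustrative only, not the PR's code: `is_duplicate` and `_event_kind` are made-up names, and `storage` stands in for the framework's SQLite/Juju storage):

```python
def _event_kind(event_path: str) -> str:
    # e.g. 'NoticeQueueCharm/on/secret_removed[7]' -> 'secret_removed'
    return event_path.split('/')[-1].split('[', 1)[0]


def is_duplicate(storage, new_event_path, new_observer_path, new_method_name,
                 new_snapshot_data) -> bool:
    """Illustrative: a stored notice only counts as a duplicate of a new event if the
    observer, the event kind, and the snapshot data all match."""
    for event_path, observer_path, method_name in storage.notices():
        if (observer_path, method_name) != (new_observer_path, new_method_name):
            continue
        if _event_kind(event_path) != _event_kind(new_event_path):
            continue
        # Two secret-removed notices for different secrets have different snapshot
        # data (the secret id differs), so they are *not* duplicates of each other.
        # (The real code also copes with notices that have no stored snapshot - see
        # the `except NoSnapshotError` fragment above.)
        if storage.load_snapshot(event_path) == new_snapshot_data:
            return True
    return False
```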
With sqlite3, loading empty notices is very fast, yes, even on my fairly slow local machine:

```
>>> timeit.timeit('list(s.notices())', "import ops\ns=ops.storage.SQLiteStorage('/tmp/empty-unit-state.db')", number=1000000) / 1000000
2.4867908030282704e-06
```

Loading a snapshot is also fast:

```
>>> timeit.timeit('s.load_snapshot("NoticeQueueCharm/on/secret_changed[3551]")', "import ops\ns=ops.storage.SQLiteStorage('.unit-state.db')", number=1000000) / 1000000
4.108476707013324e-06
```

When using Juju for storage, I expect there is considerable variation depending on your Juju setup. But with my local one, it is noticeably slower (because of all the …):

```
>>> timeit.timeit('list(s.notices())', "import ops\ns=ops.storage.JujuStorage()", number=1000) / 1000
0.0052693093249108645
>>> timeit.timeit('s.load_snapshot("NoticeQueueCharm/on/secret_changed[3551]")', "import ops\ns=ops.storage.JujuStorage()", number=1000) / 1000
0.005267057369928807
```

So, on my machine/my Juju, there would be an unnoticeable penalty using sqlite3, and a ~5ms penalty using Juju (which we do discourage). In exchange, performance when there are a lot of deferred events is improved.
Yes, I've tried this before and don't have a solution for it yet.
I'm working on this, but nearly all the charms that are using `defer()` are data platform ones, and running the integration tests (both locally and in GitHub CI) is unfortunately complicated.
@tonyandrewmeyer It looks like this PR needs to be merged/rebased, as the diff currently shows 7500 LoC with all of Scenario being added.

I'll rebase once I've figured out how to test this against some charm integration tests and have also addressed the outstanding review comment.
Docs: Ops: Add top-level intro, clean up the ops module intro in light of that, flesh out ops.pebble and ops.testing intros a bit more. https://warthogs.atlassian.net/browse/CHARMTECH-183

- [x] "home page" library reference
- [x] ops module
- [x] ops.main entry point
- [x] ops.pebble module
- [x] ops.testing module
…names in ops.testing (canonical#1381)

Add a new optional install `testing`, e.g. `pip install ops[testing]`. This pulls in a compatible version of `ops-scenario`, and exposes the Scenario names in the `ops.testing` namespace, alongside Harness.

`pip install ops[harness]` is also supported to ease the (far off, presumably 3.0) transition to Harness being moved out of the base install. It currently installs no extra dependencies, so it is the same as `pip install ops`, but a forward-looking charm would use `pip install ops[harness]` (or `pip install ops[harness, testing]`) if using Harness.

Requires ops-scenario 7.0.5, which has the required adjustments to support insertion into `ops.testing`.

The `ActionFailed` name exists in both `ops._private.harness` and `scenario.context`. This is handled by adjusting the Harness class to support the functionality of both and monkeypatching that into Scenario until Scenario starts using it. It's compatible with both Harness and Scenario, but will have empty data in an attribute (which attribute depends on which framework is used).

The `Container` name in `ops.testing`, which is only present for backwards compatibility, is also overwritten if ops-scenario is installed. If anyone is actually using `ops.testing.Container` instead of `ops.Container` then they'll need to fix their code before using `ops[testing]` (or ops-scenario combined with the release of ops with this change).

A very basic unit test is added to make sure that Scenario tests work (all the actual Scenario tests are in the ops-scenario repo) if ops-scenario/ops[testing] is installed (this is the case for `tox -e unit`). A test is also added to ensure that all of the Scenario names are documented, since `automodule` isn't used any more.

Also adjusts the documentation to include the new framework in ops.testing.

Co-authored-by: Ben Hoyt <benhoyt@gmail.com>
Co-authored-by: Dima Tisnek <dimaqq@gmail.com>
(force-pushed from 84b2830 to 8220ff9)
hydra-operator has 16 …
When a new event is emitted, check if there is already an exact duplicate (the same observer, same event type, and same event snapshot) in the storage from a deferral in a previous run. If a duplicate does exist, then don't store a new notice and snapshot.
There are performance implications - in particular, if there are events of the same kind already in the queue (e.g. `secret-changed` events but for different secrets) then there is a cost to doing the comparison.

There is also a (deliberate) behaviour change that does impact event ordering. For example, consider this sequence of events:
1. `config-changed`, deferred
2. `secret-changed`, deferred
3. `config-changed`

Currently, this would result in:

1. `ConfigChangedEvent`
2. `ConfigChangedEvent`, `SecretChangedEvent`
3. `ConfigChangedEvent`, `SecretChangedEvent`, `ConfigChangedEvent`

With this change, this would result in:

1. `ConfigChangedEvent`
2. `ConfigChangedEvent`, `SecretChangedEvent`
3. `ConfigChangedEvent`, `SecretChangedEvent`
More generally, there could currently be duplicate (notice+snapshot) events intermixed throughout the queue, and each run they will be re-emitted in the order in which they originally occurred. With this change, they will be re-emitted in the order in which they originally occurred, except if they have already been emitted this run. The particularly noticeable change is that the Juju event that triggered the run may not be the last event (if it was a duplicate of one in the queue).
We could potentially do this differently - for example, updating the sequence so that when a duplicate occurs it moves the event to the end of the queue (by dropping and adding the notice+snapshot, or explicitly setting the `sequence` field for SQL, and by just reordering the list for Juju). This would add complexity and have a performance penalty, however, and it seems more correct to have the original order.

For unit tests:

- With Harness, tests commonly emit the same event multiple times without a `reemit()` call like there would be in production, and the test expects the handler to be called more than once. For this reason, the skipping is disabled for Harness.
- With Scenario, if you're `ctx.run`-ing then you need to manage that in the list of deferred events passed into the State. We could add a consistency check to alert if there are duplicates in that list (this would be easier to do when the Scenario code is in this repo). However, the Scenario behaviour does still change: if the event is deferred in the `ctx.run` and there's already a match in the state's deferred list, then the new (deferred) event does not get added, and the output state doesn't change (which is what we want). We get this behaviour automatically because Scenario mimics the runtime behaviour more closely, actually running the framework emitting/reemitting. So: Scenario tests are both safer, and can be used to match the new behaviour.

This can be tested manually with a charm that optionally defers events, e.g. with the code below.
charm.py and charmcraft.yaml content
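The collapsed charm.py/charmcraft.yaml content isn't reproduced here. As a rough, hedged sketch of the kind of charm described (the `NoticeQueueCharm` name and the `opt` option come from the description; everything else is illustrative), charm.py might look something like this, with charmcraft.yaml declaring a string `opt` config option:

```python
#!/usr/bin/env python3
"""Illustrative sketch of a manual-test charm that optionally defers events."""
import logging

import ops

logger = logging.getLogger(__name__)


class NoticeQueueCharm(ops.CharmBase):
    def __init__(self, framework: ops.Framework):
        super().__init__(framework)
        framework.observe(self.on.config_changed, self._on_config_changed)
        framework.observe(self.on.secret_changed, self._on_secret_changed)

    def _defer_requested(self) -> bool:
        # The 'opt' string config option controls whether handlers defer.
        return str(self.config.get('opt', '')).startswith('defer')

    def _on_config_changed(self, event: ops.ConfigChangedEvent):
        if self._defer_requested():
            logger.info('Deferring config-changed')
            event.defer()
            return
        logger.info('Handling config-changed')

    def _on_secret_changed(self, event: ops.SecretChangedEvent):
        if self._defer_requested():
            logger.info('Deferring secret-changed for %s', event.secret.id)
            event.defer()
            return
        logger.info('Handling secret-changed for %s', event.secret.id)


if __name__ == '__main__':  # pragma: nocover
    ops.main(NoticeQueueCharm)
```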
If you want to see the queue while you're doing this, you can use code like this:
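(The snippet itself isn't included above; the following is a rough equivalent, assuming the charm uses local SQLite storage and that you've copied the unit's `.unit-state.db` somewhere you can read it, using the same `ops.storage` API as the timing measurements earlier in the thread.)

```python
# Illustrative: dump the notice queue (and each notice's snapshot, if any) from a
# unit's local state database.
import ops

storage = ops.storage.SQLiteStorage('.unit-state.db')
for event_path, observer_path, method_name in storage.notices():
    try:
        snapshot = storage.load_snapshot(event_path)
    except Exception:  # no snapshot stored for this notice; treat it as empty
        snapshot = {}
    print(f'{event_path} -> {observer_path}.{method_name}() {snapshot}')
```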
If you set `opt` to anything not starting with "defer" then you should get a `config-changed` event every time. If you set it to something starting with "defer", then it will run exactly once each time you set the config (remember to change the value, or Juju will skip the event) - with ops@main you'll instead get a `config-changed` event for every time that you have changed the config, each time you change it (i.e. the queue will build up).

You can also check it with an event that has a snapshot, such as `secret-changed`. If the config is set to defer, every time you change the content of the first secret, you'll get one `secret-changed` event (but with ops@main, each time you'll get multiple, depending on how many times you've done it). If you also change the second secret, you'll get two `secret-changed` events, one for each secret (because the snapshots differ).

You can intermix the different events and should always have exactly zero or one notice per (event type + snapshot) in the queue. If you change the `opt` value back to something not starting with "defer", then you should see all the events complete and have an empty queue.

Scenario test that shows the behaviour
Note that this requires fixing a small bug in Secret snapshotting (PR). That's unrelated to this change - it also occurs if you use deferred secrets in main.
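That collapsed test isn't reproduced here either. As a simplified, illustrative sketch only (using `config-changed` rather than secrets, with made-up charm and test names, and assuming ops-scenario 7 via `ops[testing]`), a Scenario test around this behaviour might look something like the following; with this change in place, the second run should not add a second copy of the already-queued event:

```python
import ops
from ops import testing  # with ops[testing], this exposes the Scenario names


class DeferringCharm(ops.CharmBase):
    def __init__(self, framework: ops.Framework):
        super().__init__(framework)
        framework.observe(self.on.config_changed, self._on_config_changed)

    def _on_config_changed(self, event: ops.ConfigChangedEvent):
        event.defer()


def test_duplicate_deferred_config_changed_not_added():
    ctx = testing.Context(DeferringCharm, meta={'name': 'test'})
    # First run: the handler defers, so the event lands in the output state's
    # deferred list.
    state_1 = ctx.run(ctx.on.config_changed(), testing.State())
    assert len(state_1.deferred) == 1
    # Second run with that state: the handler defers again, but the new notice is an
    # exact duplicate of the queued one, so the deferred list should not grow.
    state_2 = ctx.run(ctx.on.config_changed(), state_1)
    assert len(state_2.deferred) == 1
```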
Fixes #935