feat: don't store duplicate events in the notice queue #1372
Conversation
Although this seems like a fairly small change, and it's in one very specific piece of the code, it's at the heart of running every event, so very thorough review is appreciated. I'm also happy to discuss the performance implications as well as the implications of the behaviour change.

Will move back out of draft once I've investigated the charm failures - no point reviewing until then.
Code like this doesn't work with Harness:

```python
class MyCharm(ops.CharmBase):
    ...

    def _on_config_changed(self, event):
        if self.something():
            event.defer()
        self.do_something()


def test_config_changed():
    h = testing.Harness(MyCharm)
    h.begin()
    MyCharm.something = lambda _: False
    h.charm.on.config_changed.emit()
    MyCharm.something = lambda _: True
    h.charm.on.config_changed.emit()
    assert MyCharm.do_something.called()  # Assume this was patched somewhere.
```

The framework skips the second emit. Arguably, the test should clean up in some way, because it's actually going to run the config-changed handler three times (once when it defers, then twice when it doesn't defer: once for the deferred notice and once for the new one) - it could be broken into multiple tests, use a fresh Harness, or just call ….

It seems likely (I'll do a quick check to be sure) that this pattern is too common to have the change in this way - I don't think we can expect people to update their tests to account for this, since it's not really a bug, just not the ideal way to write the test. Maybe the 'skip duplicate defers' will need to be explicitly enabled?
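As an aside (this isn't from the PR - just an illustrative sketch of the "multiple tests" option mentioned above), the check could be split so that each Harness instance only emits the event once, so nothing relies on a duplicate notice being re-run:

```python
# Illustrative sketch only: split the test so each Harness gets its own notice queue.
# As in the original example, assume do_something was patched with a mock elsewhere.
def test_config_changed_defers():
    h = testing.Harness(MyCharm)
    h.begin()
    MyCharm.something = lambda _: True  # handler will defer
    h.charm.on.config_changed.emit()
    # assert on the deferred behaviour here


def test_config_changed_runs():
    h = testing.Harness(MyCharm)
    h.begin()
    MyCharm.something = lambda _: False  # handler will not defer
    h.charm.on.config_changed.emit()
    assert MyCharm.do_something.called()
```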
I did a super-tox (…).
(force-pushed from e8d1ec8 to 1701f87)
Ok, this should be good for an initial review now.
Thanks, this looks good to me, and it's a nice one to get fixed. Could we do a couple more tests before merging, just in case:
- You've called out the performance changes, but as a sanity check, could you please do a couple of measurements? What's a typical time to load the notices given an empty queue? And then, if the queue is non-empty, what's the time to load each snapshot? I'm guessing sub-milliseconds in both cases, but it would be good to confirm.
- I think trying to run super-tox `-e integration` will be too slow / problematic, but could we at least please spot-check `tox -e integration` on a couple of charms that use defer?
Sorry I may not have thought this through.
Is it possible to get 2 distinct events from Juju that only differ in arguments?
For example, 2 secrets are now available, the secret id is different, but everything else is the same (event name, observer function, human name if any).
```python
except NoSnapshotError:
    existing_event_data = {}
```
This code branch feels wrong.
If an event cannot be loaded, should it not be discarded instead?
Or do we have saved snapshots with no data?
There are definitely empty snapshots - for example, `start`, `config-changed`, and `stop` all have empty snapshots. At the moment, we store an empty snapshot in the table in those cases, but it seems more resilient to not assume that will always be the case. This isn't failing to load the event, it's failing to find a snapshot that goes with the notice.
```python
# Emit an event, which will be deferred.
pub.d.emit('foo')
notices = tuple(framework._storage.notices())
assert len(notices) == 2  # One per observer.
assert framework._storage.load_snapshot(notices[0][0]) == {'data': 'foo'}

# Emit the same event, and we'll still just have the single notice.
pub.d.emit('foo')
assert len(tuple(framework._storage.notices())) == 2

# Emit the same event kind but with a different snapshot, and we'll get a new notice.
pub.d.emit('bar')
notices = tuple(framework._storage.notices())
assert len(notices) == 4
assert framework._storage.load_snapshot(notices[2][0]) == {'data': 'bar'}

# Emit a totally different event, and we'll get a new notice.
pub.n.emit()
notices = tuple(framework._storage.notices())
assert len(notices) == 6
assert framework._storage.load_snapshot(notices[2][0]) == {'data': 'bar'}
assert framework._storage.load_snapshot(notices[4][0]) == {}

# Even though these events are far back in the queue, since they're
# duplicates, they will get skipped.
pub.d.emit('foo')
pub.d.emit('bar')
pub.n.emit()
assert len(tuple(framework._storage.notices())) == 6


def notices_for_observer(n: int):
    return [
        notice for notice in framework._storage.notices() if notice[1].endswith(f'[{n}]')
    ]


# Stop deferring on the first observer, and all those events will be
# completed and the notices removed, while the second observer will
# still have them queued.
pub.r.emit()
assert len(tuple(framework._storage.notices())) == 6
pub.n.emit()
framework.reemit()
assert len(notices_for_observer(1)) == 0
assert len(notices_for_observer(2)) == 3

# Without the defer active, the first observer always ends up with an
# empty queue, while the second observer's queue continues to skip
# duplicates and add new events.
pub.d.emit('foo')
pub.d.emit('foo')
pub.d.emit('bar')
pub.n.emit()
pub.d.emit('foo')
pub.d.emit('bar')
pub.n.emit()
pub.d.emit('baz')
framework.reemit()
assert len(notices_for_observer(1)) == 0
assert len(notices_for_observer(2)) == 4
```
Please consider refactoring the setup and splitting this flow into separate tests.
The problem here is that what this is testing is the behaviour over multiple types of events - some deferred, some not, some re-running previously deferred events. It's all about checking whether the state is correct over a long sequence of changes.
I could split it - for example, each comment could be the start of a new test. But the problem there is that each test would either have to do everything in the previous test to get the database into the correct state (which means it'd just be the same plus multiple shorter versions as well) or I'd need to manually set up the database in all but the first test, to have the expected data. That seems worse and more fragile to me.
I could split it into two tests, one for each type of observer, but part of the test is making sure that when there are two observers they are properly handled separately, and it feels like splitting the test would make it easier to break that.
I definitely agree that the test is very long, and I don't like that.
Would it help if there were some extra helper functions, so that something like:

```python
pub.d.emit('foo')
pub.d.emit('foo')
pub.d.emit('foo')
pub.n.emit()
```

was something like this?

```python
emit([(d, 'foo')], repeat=3)
emit([(n)])
```

It would reduce the number of lines, but my feeling is that it would end up less understandable.
Yes. For example, you can get a `secret-removed` for two different secrets. That's why we need to compare the snapshots as well as the event itself.
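To make that concrete, here's a rough sketch of the kind of comparison involved (illustrative only, not the PR's code: `is_duplicate` and `_event_kind` are made-up names, and `storage` stands in for the framework's SQLite/Juju storage):

```python
def _event_kind(event_path: str) -> str:
    # e.g. 'NoticeQueueCharm/on/secret_removed[7]' -> 'secret_removed'
    return event_path.split('/')[-1].split('[', 1)[0]


def is_duplicate(storage, new_event_path, new_observer_path, new_method_name,
                 new_snapshot_data) -> bool:
    """Illustrative: a stored notice only counts as a duplicate of a new event if the
    observer, the event kind, and the snapshot data all match."""
    for event_path, observer_path, method_name in storage.notices():
        if (observer_path, method_name) != (new_observer_path, new_method_name):
            continue
        if _event_kind(event_path) != _event_kind(new_event_path):
            continue
        # Two secret-removed notices for different secrets have different snapshot
        # data (the secret id differs), so they are *not* duplicates of each other.
        # (The real code also copes with notices that have no stored snapshot - see
        # the `except NoSnapshotError` fragment above.)
        if storage.load_snapshot(event_path) == new_snapshot_data:
            return True
    return False
```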
With sqlite3, loading empty notices is very fast, yes, even on my fairly slow local machine:

```
>>> timeit.timeit('list(s.notices())', "import ops\ns=ops.storage.SQLiteStorage('/tmp/empty-unit-state.db')", number=1000000) / 1000000
2.4867908030282704e-06
```

Loading a snapshot is also fast:

```
>>> timeit.timeit('s.load_snapshot("NoticeQueueCharm/on/secret_changed[3551]")', "import ops\ns=ops.storage.SQLiteStorage('.unit-state.db')", number=1000000) / 1000000
4.108476707013324e-06
```

When using Juju for storage, I expect there is considerable variation depending on your Juju setup. But with my local one, it is noticeably slower (because of all the …):

```
>>> timeit.timeit('list(s.notices())', "import ops\ns=ops.storage.JujuStorage()", number=1000) / 1000
0.0052693093249108645
>>> timeit.timeit('s.load_snapshot("NoticeQueueCharm/on/secret_changed[3551]")', "import ops\ns=ops.storage.JujuStorage()", number=1000) / 1000
0.005267057369928807
```

So, on my machine/my Juju, there would be an unnoticeable penalty using sqlite3, and a ~5ms penalty using Juju (which we do discourage). In exchange, performance when there are a lot of deferred events is improved.
Yes, I've tried this before and don't have a solution for it yet.
I'm working on this, but nearly all the charms that are using `defer()` are data platform ones, and running the integration tests (both locally and in GitHub CI) is unfortunately complicated.
@tonyandrewmeyer It looks like this PR needs to be merged/rebased, as the diff currently shows 7500 LoC with all of Scenario being added.

I'll rebase once I've figured out how to test this against some charm integration tests and have also addressed the outstanding review comment.
Docs: Ops: Add top-level intro, clean up the ops module intro in light of that, flesh out ops.pebble and ops.testing intros a bit more. https://warthogs.atlassian.net/browse/CHARMTECH-183

- [x] "home page" library reference
- [x] ops module
- [x] ops.main entry point
- [x] ops.pebble module
- [x] ops.testing module
…names in ops.testing (canonical#1381)

Add a new optional install `testing`, e.g. `pip install ops[testing]`. This pulls in a compatible version of `ops-scenario`, and exposes the Scenario names in the `ops.testing` namespace, alongside Harness.

`pip install ops[harness]` is also supported to ease the (far off, presumably 3.0) transition to Harness being moved out of the base install. It currently installs no extra dependencies, so it is the same as `pip install ops`, but a forward-looking charm would use `pip install ops[harness]` (or `pip install ops[harness, testing]`) if using Harness.

Requires ops-scenario 7.0.5, which has the required adjustments to support insertion into `ops.testing`.

The `ActionFailed` name exists in both `ops._private.harness` and `scenario.context`. This is handled by adjusting the Harness class to support the functionality of both and monkeypatching that into Scenario until Scenario starts using it. It's compatible with both Harness and Scenario, but will have empty data in an attribute (which attribute depends on which framework is used).

The `Container` name in `ops.testing`, which is only present for backwards compatibility, is also overwritten if ops-scenario is installed. If anyone is actually using `ops.testing.Container` instead of `ops.Container` then they'll need to fix their code before using `ops[testing]` (or ops-scenario combined with the release of ops with this change).

A very basic unit test is added to make sure that Scenario tests work (all the actual Scenario tests are in the ops-scenario repo) if ops-scenario/ops[testing] is installed (this is the case for `tox -e unit`). A test is also added to ensure that all of the Scenario names are documented, since `automodule` isn't used any more.

Also adjusts the documentation to include the new framework in ops.testing.

Co-authored-by: Ben Hoyt <benhoyt@gmail.com>
Co-authored-by: Dima Tisnek <dimaqq@gmail.com>
(force-pushed from 84b2830 to 8220ff9)
hydra-operator has 16 …
When a new event is emitted, check if there is already an exact duplicate (the same observer, same event type, and same event snapshot) in the storage from a deferral in a previous run. If a duplicate does exist, then don't store a new notice and snapshot.
There are performance implications - in particular, if there are events of the same kind already in the queue (e.g. `secret-changed` events but for different secrets) then there is a cost to doing the comparison.

There is also a (deliberate) behaviour change that does impact event ordering. For example, consider this sequence of events:
1. `config-changed`, deferred
2. `secret-changed`, deferred
3. `config-changed`

Currently, this would result in:

1. `ConfigChangedEvent`
2. `ConfigChangedEvent`, `SecretChangedEvent`
3. `ConfigChangedEvent`, `SecretChangedEvent`, `ConfigChangedEvent`

With this change, this would result in:

1. `ConfigChangedEvent`
2. `ConfigChangedEvent`, `SecretChangedEvent`
3. `ConfigChangedEvent`, `SecretChangedEvent`
More generally, there could currently be duplicate (notice+snapshot) events intermixed throughout the queue, and each run they will be re-emitted in the order in which they originally occurred. With this change, they will be re-emitted in the order in which they originally occurred, except if they have already been emitted this run. The particularly noticeable change is that the Juju event that triggered the run may not be the last event (if it was a duplicate of one in the queue).
We could potentially do this differently - for example, updating the sequence so that when a duplicate occurs it moves the event to the end of the queue (by dropping and adding the notice+snapshot, or explicitly setting the `sequence` field for SQL, and by just reordering the list for Juju). This would add complexity and have a performance penalty, however, and it seems more correct to have the original order.

For unit tests:

- With Harness, tests commonly emit the same event multiple times without a `reemit()` call like there would be in production, and the test expects the handler to be called more than once. For this reason, the skipping is disabled for Harness.
- With Scenario, if you're `ctx.run`-ing then you need to manage that in the list of deferred events passed into the State. We could add a consistency check to alert if there are duplicates in that list (this would be easier to do when the Scenario code is in this repo). However, the Scenario behaviour does still change: if the event is deferred in the `ctx.run` and there's already a match in the state's deferred list, then the new (deferred) event does not get added, and the output state doesn't change (which is what we want). We get this behaviour automatically because Scenario mimics the runtime behaviour more closely, actually running the framework emitting/reemitting. So: Scenario tests are both safer, and can be used to match the new behaviour.

This can be tested manually with a charm that optionally defers events, e.g. with the code below.
charm.py and charmcraft.yaml content
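The collapsed charm.py/charmcraft.yaml content isn't reproduced here. As a rough, hedged sketch of the kind of charm described (the `NoticeQueueCharm` name and the `opt` option come from the description; everything else is illustrative), charm.py might look something like this, with charmcraft.yaml declaring a string `opt` config option:

```python
#!/usr/bin/env python3
"""Illustrative sketch of a manual-test charm that optionally defers events."""
import logging

import ops

logger = logging.getLogger(__name__)


class NoticeQueueCharm(ops.CharmBase):
    def __init__(self, framework: ops.Framework):
        super().__init__(framework)
        framework.observe(self.on.config_changed, self._on_config_changed)
        framework.observe(self.on.secret_changed, self._on_secret_changed)

    def _defer_requested(self) -> bool:
        # The 'opt' string config option controls whether handlers defer.
        return str(self.config.get('opt', '')).startswith('defer')

    def _on_config_changed(self, event: ops.ConfigChangedEvent):
        if self._defer_requested():
            logger.info('Deferring config-changed')
            event.defer()
            return
        logger.info('Handling config-changed')

    def _on_secret_changed(self, event: ops.SecretChangedEvent):
        if self._defer_requested():
            logger.info('Deferring secret-changed for %s', event.secret.id)
            event.defer()
            return
        logger.info('Handling secret-changed for %s', event.secret.id)


if __name__ == '__main__':  # pragma: nocover
    ops.main(NoticeQueueCharm)
```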
If you want to see the queue while you're doing this, you can use code like this:
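(The snippet itself isn't included above; the following is a rough equivalent, assuming the charm uses local SQLite storage and that you've copied the unit's `.unit-state.db` somewhere you can read it, using the same `ops.storage` API as the timing measurements earlier in the thread.)

```python
# Illustrative: dump the notice queue (and each notice's snapshot, if any) from a
# unit's local state database.
import ops

storage = ops.storage.SQLiteStorage('.unit-state.db')
for event_path, observer_path, method_name in storage.notices():
    try:
        snapshot = storage.load_snapshot(event_path)
    except Exception:  # no snapshot stored for this notice; treat it as empty
        snapshot = {}
    print(f'{event_path} -> {observer_path}.{method_name}() {snapshot}')
```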
If you set `opt` to anything not starting with "defer" then you should get a `config-changed` event every time. If you set it to something starting with "defer", then it will run exactly once each time you set the config (remember to change the value, or Juju will skip the event) - with ops@main you'll instead get a `config-changed` event for every time that you have changed the config, each time you change it (i.e. the queue will build up).

You can also check it with an event that has a snapshot, such as `secret-changed`. If the config is set to defer, every time you change the content of the first secret, you'll get one `secret-changed` event (but with ops@main, each time you'll get multiple, depending on how many times you've done it). If you also change the second secret, you'll get two `secret-changed` events, one for each secret (because the snapshots differ).

You can intermix the different events and should always have exactly zero or one notice per (event type + snapshot) in the queue. If you change the `opt` value back to something not starting with "defer", then you should see all the events complete and have an empty queue.

Scenario test that shows the behaviour
Note that this requires fixing a small bug in Secret snapshotting (PR). That's unrelated to this change - it also occurs if you use deferred secrets in main.
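That collapsed test isn't reproduced here either. As a simplified, illustrative sketch only (using `config-changed` rather than secrets, with made-up charm and test names, and assuming ops-scenario 7 via `ops[testing]`), a Scenario test around this behaviour might look something like the following; with this change in place, the second run should not add a second copy of the already-queued event:

```python
import ops
from ops import testing  # with ops[testing], this exposes the Scenario names


class DeferringCharm(ops.CharmBase):
    def __init__(self, framework: ops.Framework):
        super().__init__(framework)
        framework.observe(self.on.config_changed, self._on_config_changed)

    def _on_config_changed(self, event: ops.ConfigChangedEvent):
        event.defer()


def test_duplicate_deferred_config_changed_not_added():
    ctx = testing.Context(DeferringCharm, meta={'name': 'test'})
    # First run: the handler defers, so the event lands in the output state's
    # deferred list.
    state_1 = ctx.run(ctx.on.config_changed(), testing.State())
    assert len(state_1.deferred) == 1
    # Second run with that state: the handler defers again, but the new notice is an
    # exact duplicate of the queued one, so the deferred list should not grow.
    state_2 = ctx.run(ctx.on.config_changed(), state_1)
    assert len(state_2.deferred) == 1
```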
Fixes #935