Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Prevent the deferred events queue to contain duplicates #935

Open
PietroPasotti opened this issue May 23, 2023 · 10 comments · May be fixed by #1372
Open

Prevent the deferred events queue to contain duplicates #935

PietroPasotti opened this issue May 23, 2023 · 10 comments · May be fixed by #1372
Assignees
Labels
25.04 feature New feature or request

Comments

@PietroPasotti
Copy link
Contributor

This is a bit of a large topic, but the broad question is: suppose my charm has deferred a foo event, should we disallow (or enable people to guard against) a situation where two or more foo deferred events are in the queue?

A typical use case for event deferrals is a 'missing preconditions' case: an event can't be processed because some preconditions are missing. If the same event is triggered again before the preconditions are met, we now have two copies of that event in the deferral queue. When at some point in the future the precondition is eventually met and the queue flushes, we'll be executing that handler multiple times quickly in succession, that may result in the worst case into bugs (or downtime) such as when a flurry of pebble service restarts hits a workload container.

Main options I see for addressing this, from more to less aggressive, are:

  • raise an AlreadyDeferredError from Event.defer() if the same event was already deferred (on the same handler).
  • (or silently skip adding the duplicate event to the queue, but not a big fan of that idea)
  • allow the charm to inspect the deferral queue: the typical pattern would be if not self.framework.has_deferred(event): event.defer()

Do we have a 'use case' for not guarding against this, i.e. for leaving things as they are, and allow the same event to occur multiple times in a deferral queue?

@jameinel
Copy link
Member

jameinel commented May 23, 2023 via email

@benhoyt
Copy link
Collaborator

benhoyt commented May 24, 2023

At first glance what John is saying sounds good, however, @jameinel, wouldn't that approach also be nudging people to an "on any event, reconcile everything" approach to charming? Because you'd have to check for all preconditions on every event. Whereas with defer you get the specific event that was interrupted.

In any case, it makes sense to me to avoid duplicates in the queue. I don't see why it would ever be a good thing to defer two of the same event (unless the charm is written in a very odd way). I think we should probably skip adding the deferred event, but log that it happened. It seems like this wouldn't be too hard to fix (though the _reemit code is not easy to follow!).

@PietroPasotti Out of interest, did you run into this issue in "real life" because a charm was queueing up multiple duplicate events and that was causing problems? Or is it more of a "this could be an issue in theory" kind of thing?

@benhoyt benhoyt added feature New feature or request small item and removed small item labels May 24, 2023
@benhoyt
Copy link
Collaborator

benhoyt commented May 24, 2023

Per voice discussion with John, usually in cases where you'd defer, you need to check all the preconditions anyway. For example, if you need both relation data and config data to write out your workload's configuration file, you have to check for both of those things.

@jameinel
Copy link
Member

jameinel commented May 24, 2023 via email

@PietroPasotti
Copy link
Contributor Author

@PietroPasotti https://github.com/PietroPasotti Out of interest, did you run into this issue in "real life" because a charm was queueing up multiple duplicate events and that was causing problems? Or is it more of a "this could be an issue in theory" kind of thing?

Someone came to me with this piece of code:

def _on_config_changed(self, event) -> None:
     <do something>
      if self.controller_container.can_connect():
           self._restart_controller_service()

      else:
          self.unit.status = MaintenanceStatus("Waiting for kserve-controller to be reachable")
          event.defer()

and I saw a risk that if the config were to change multiple times before pebble was ready, we might queue multiple deferred config-changed and end up with a flurry of container service restarts as soon as the container can connect.

@sed-i
Copy link
Contributor

sed-i commented May 28, 2023

I agree with @jameinel.
Iirc, we do not (should not) use defer() because:

  1. Deferral isn't a juju concept so on upgrade all deferred events are lost (all your pending logic is lost). For example, the upgrade sequence does not emit any relation events, so charm authors would need to remember to redo all the logic that is exercised by deferrable events, every upgrade.
  2. Events are, by definition, most relevant when they are emitted. Deferring an event introduces a somewhat non-deterministic delay in processing: in the worst case it will be re-emitted on the next update status, which can be anywhere between 10sec and 1hr (I think).
  3. If your charm is at the mercy of update-status anyway, then you could put your logic there, instead of defer(), which would be functionally equivalent.
  4. The availability of the defer() functionality encourages charm authors to use defer() for inter-event flow control, which, as mentioned above, breaks on upgrade.
  5. A deferred event is always the first to be run next charm re-init, so it is not possible to have any new logic run between defer() and re-emit, making deferred events not as useful.
  6. Very quickly a defer() pattern puts a charm in error state because "two objects claiming to be X have been created".

@benhoyt
Copy link
Collaborator

benhoyt commented Sep 29, 2023

We could add checking for this and eliminate the duplicate event, but I'm not sure it's worth it, given we're planning to re-evaluate or even deprecate defer (part of next cycle's roadmap). So I'm closing this specific issue and we'll focus on #966.

@benhoyt benhoyt closed this as not planned Won't fix, can't repro, duplicate, stale Sep 29, 2023
@benhoyt benhoyt changed the title (should we) prevent the deferred events queue to contain duplicates? Prevent the deferred events queue to contain duplicates May 17, 2024
@benhoyt benhoyt added the 24.10 label May 17, 2024
@benhoyt
Copy link
Collaborator

benhoyt commented May 17, 2024

We didn't end up deprecating defer, and this wasn't resolved in #1205, so re-opening this. This would be done internally by, when defer() is called, checking that the existing deferral queue doesn't contain that event. And the content compared would be the event key (observer "path" or whatever) as well as the data ("snapshot").

@benhoyt benhoyt reopened this May 17, 2024
@tonyandrewmeyer
Copy link
Contributor

tonyandrewmeyer commented Sep 6, 2024

The complicated part of this is that calling defer() doesn't save the event+snapshot to the queue. All events get saved to the queue as part of emit(), and then each notice is dropped if the event+handler was not deferred, and the snapshot dropped if no event+handler was deferred.

This is unchanged from the "initial import" commit. It seems like this is done for consistency, rather than to protect against the charm crashing and losing an event (Juju will re-emit the event in that case and since we don't commit the storage it's not saved anyway).

We could skip storing the notice and snapshot if there's already one in storage - but we'll need to do that when the following Juju event comes along, rather than on defer(), and it will require some adjustments (loading whatever snapshot is in the storage to compare against, looping through the notices to see whether one is already there) that will end up applying to all events, which raises performance questions.

Alternatively, we could do a more significant adjustment; for example, having a "happy path" that when there is no defer() completely skipping the notice+snapshot system, and only doing a notice+snapshot store on defer(). This seems more straightforward and would loosen some of the constraints that the framework has (if you don't defer) and should be a performance improvement (no need to write to the sqlite database for every event, for one thing), but would make defer even less consistent than it currently is and would need consideration for Scenario, which uses the notice+snapshot system to trigger events.

@benhoyt
Copy link
Collaborator

benhoyt commented Sep 27, 2024

@tonyandrewmeyer is close to making the draft PR ready for review.

@benhoyt benhoyt added 25.04 and removed 24.10 labels Jan 9, 2025
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
25.04 feature New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants