
Worker Autotuning #88

Merged (9 commits) into temporalio:master on Apr 2, 2024
Conversation

Sushisource (Member)

* The current number of active polls for a task type must be <= the number of free slots for that task type at all
times. Otherwise, we can end up receiving a task that won't have anywhere to go until a slot frees. For example,
in Core this is expressed by using a permit reservation system.
* Max outstanding workflow tasks must be <= max cached workflows.
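For illustration only, a minimal sketch of the permit-reservation idea described in the first bullet: a poll is only issued once a slot permit is already held, so a received task always has somewhere to go. Everything here (`poll_task`, `Task`, the semaphore) is hypothetical, not Core's actual code.

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

// Hypothetical stand-ins for a real task type and a real server long-poll.
struct Task;
async fn poll_task() -> Task { Task }
async fn process(_task: Task) { /* run the workflow/activity task */ }

async fn poll_loop(free_slots: Arc<Semaphore>) {
    loop {
        // Reserve a slot *before* polling, so active polls <= free slots at all times.
        let permit = free_slots
            .clone()
            .acquire_owned()
            .await
            .expect("semaphore closed");
        let task = poll_task().await;
        // The permit travels with the task and is only released when processing ends.
        tokio::spawn(async move {
            process(task).await;
            drop(permit);
        });
    }
}

#[tokio::main]
async fn main() {
    // e.g. at most 5 outstanding workflow task slots
    poll_loop(Arc::new(Semaphore::new(5))).await;
}
```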

I don't think this is true? You can run a worker with no cache today

Member Author

Yeah that's true - but practically speaking it means your cache's peak size will equal the number of concurrent workflow tasks. In a real-world setting you probably care about cache size as it relates to peak memory usage, and since users are (or should be) always using the cache, the constraint makes sense in that context.


## Open questions

### Async/fallible `mark_slot_used` and `release_slot`


Is this question relevant to all SDKs, or just to a Rust SDK implementation of a SlotSupplier?

Member Author

All of them

}
impl<T: SlotKind> PauseableSlotSupplier<T> {
    pub fn new(supplier: impl SlotSupplier<SlotKind=T>) -> Self { /* ... */ }
    pub fn pause(&self) { /* ... */ }


There is an interesting question about when pause returns. Do you return when you stop handing out new slots, or when you have no more slots "in flight", i.e. reserved but not yet marked in use?

Member

IMO it should return immediately; pause is a state, not an actionable request, for the most part.

Member Author

Yeah, I am imagining it returns immediately. You're saying "become paused" (like initiating shutdown of a worker).
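As an illustration of "pause is a state", a minimal sketch of a pauseable wrapper, simplified from the proposal's signatures (no `SlotKind`, and the trait here is a stand-in): `pause`/`resume` just flip a flag and return immediately, while `reserve_slot` waits out the pause before delegating.

```rust
use tokio::sync::watch;

// Simplified stand-in for the proposal's SlotSupplier trait (reservation only).
trait SlotSupplier {
    async fn reserve_slot(&self);
}

struct PauseableSlotSupplier<S> {
    inner: S,
    paused: watch::Sender<bool>,
}

impl<S> PauseableSlotSupplier<S> {
    fn new(inner: S) -> Self {
        let (paused, _rx) = watch::channel(false);
        Self { inner, paused }
    }
    /// Returns immediately; it only records the new state.
    fn pause(&self) {
        let _ = self.paused.send_replace(true);
    }
    fn resume(&self) {
        let _ = self.paused.send_replace(false);
    }
}

impl<S: SlotSupplier> SlotSupplier for PauseableSlotSupplier<S> {
    async fn reserve_slot(&self) {
        // Wait until unpaused (resolves immediately if not paused), then delegate.
        let mut rx = self.paused.subscribe();
        let _ = rx.wait_for(|paused| !*paused).await;
        self.inner.reserve_slot().await;
    }
}
```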

A possible future:

```rust
trait WorkflowCacheSizer {

```

I would like us to explore the idea of a fully pluggable workflow cache instead of just a pluggable cache sizer. I suspect there are performance wins for certain workflows in using a different caching strategy than we currently use. I see you cover some of it with `eviction_hint`, but I think just letting users provide a cache implementation would be simpler.

Member

Sorry if I'm late to the game here, but what do you mean by "fully pluggable"? Do you (basically) mean user-written, or that we should provide multiple implementations? If so, could you give a hint as to what a different (from the current) version might look like?


By fully pluggable I mean user-written, and we provide a few implementations. Right now we have a bounded LRU cache; a different version I would like to experiment with is a cache that accounts for how much work it would take to replay the workflow. One way I imagine it could do that is, when it needs to evict a workflow, to weigh the history size when choosing which workflow to evict.

Member

Concur, the difference between providing a cache interface and a can-cache interface is minimal, but the former has lots of upside. Let's just expose a get/put/delete cache interface that also has a "can allow" call.

For Core-based langs, however, they will only expose the "can" (and eviction callbacks if they want), because the stateful cache impl has to be in Rust IMO, with just hooks for checking.

Member Author

I don't want a provided implementation to be literally responsible for storing the data, though; that seems like an easy way to create confusing-to-diagnose bugs and more work for the implementor.

What would actual storage provide that eviction_hint doesn't? The example you gave @Quinn-With-Two-Ns seems to be serviced by the eviction hint.

@cretz (Member), Feb 26, 2024

The only reason is flexibility. It's basically a question of allowing replacement of the cache impl vs just the cache size. The former is more flexible I think and easy enough to build a sizer on top of, but I guess you may have to abstract the sizer anyways then. So a double-abstraction could exist: a worker accepts a custom cache implementation, and the LRU implementation could accept a dynamic sizer.

I don't have a strong opinion and can understand why just the sizer is enough. But I would change the names on some of these things a bit and I question the "hint" call (bringing up in different comment).


I agree there is a concern around making users store the data and potential bugs, but I think interceptors and data converters have equal if not more potential for introducing bugs.

I think you could get something similar with `eviction_hint`, but it would be less flexible and efficient, since you can't utilize a custom data structure to store the cached workflows.

@Sushisource (Member Author), Feb 27, 2024

> but I think interceptors and data converters have equal if not more potential for introducing bugs.

This is true, but it also doesn't supply a reason to introduce additional dangerous things. Those are necessary, but the user actually storing the cached data for us certainly isn't.

}

struct WorkflowSlotInfo {
    workflow_type: String,


Would we expose whether these are eager workflows/activities here? I think it would be useful for some metrics.

Member Author

Yeah, certainly. We can put whatever we like in here.

@dandavison

I really appreciate how well-written this proposal is. (I'm just personally lacking a bit of necessary background.)

runtime. Autotuning of these values seeks to allow them to change at runtime according to available resources on
the worker. It is not feasible for us to choose *optimal* values for these at runtime, because we cannot know a priori
how many resources any given workflow or activity task will consume - only the user can (maybe) know this.
However, we can provide reasonable default autotuning implementations while also allowing users to provide their own.
Member

It might be nice to say something here about what this might look like. For instance, we could take the approach that we tune for maximum concurrency (while avoiding OOMs or other errors), or total throughput per worker, or minimum latency (for some measure of latency and some form of "minimum"). Any (or even each) of these seems like a valid strategy, as might some utility function that combines them. Do we plan to offer the user a choice? Pick one of these (or something I didn't mention) and do just that for now? What's our true goal here?

Even though I have placed my comment here in the slot/cache tuning section, it's really more of a global comment. We can probably do some stuff that improves worker performance in ways that are Pareto optimal, i.e. improve some dimension of performance without any cost in other dimensions. Maybe that should be our first goal. But most (I might even say all) performance tuning eventually requires a definition of performance that expresses the specific goal directly, because there are always tradeoffs.

Member Author

Yeah I'll add a little more context.

But, yes, I think the idea here is we can add some default implementations that are a nice improvement for most people most of the time over the current fixed implementations. Beyond that, I agree, it starts to get specific, and people with those needs can implement exactly what they want.

Member Author

Note that the "default implementations" section does make some reference to those.


This proposal has focused specifically on worker autotuning and user customization thereof. However, it's worth
considering how this effort will fit into future work we know we want to do, namely around autoscaling and on-demand
workers.
Member

Another future option worth considering (IMO) would be for some workers (at least new ones) to learn from already-running workers' learned configuration, to speed up the learning process.

Member Author

Yeah, that'd be neat. I can make reference to that here, but it's certainly something for later.

@cretz (Member) left a comment

Left some comments. Like with many of these proposals, two recommendations (take 'em or leave 'em):

  • Set up a meeting to knock out each contentious point
  • Once many things are resolved, close (without merging) the early, noisier PRs where there's a lot of chatter, and make a final PR for final +1's to merge

/// Blocks until a slot is available.
/// May also return an error if the backing user implementation encounters an error.
/// Returned errors will bubble up and cause the worker to shut down.
async fn reserve_slot(&self) -> Result<(), anyhow::Error>;
Member

While I know this is just explanatory, I think a "slot info" should be provided with contextual information about the reservation request. Even if it's empty today (though it's unlikely to be; it should at least have namespace, task queue, worker build ID, and maybe other worker-identifying things), it'll help future-proof the API.

Member Author

Sure, that's a good point. It can't (well, shouldn't) be exactly the same type as in the other places though, as it'd mean 90% of the fields on it become optional. Perfectly reasonable to have a new type here though with the fields we know like you mentioned, and then room to add more fields later on.

Member

Definitely; a "reservation context" or "reservation info" type that's just for the reserve call and has no other reuse is acceptable IMO.
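A sketch of what such a reservation-info type could carry; the field set is illustrative (things the worker already knows about itself), not the final API.

```rust
/// Hypothetical context passed to reserve/try-reserve calls. More fields can be
/// added later without breaking implementations that ignore them.
#[derive(Debug, Clone)]
pub struct SlotReservationInfo {
    pub namespace: String,
    pub task_queue: String,
    pub worker_identity: String,
    pub worker_build_id: Option<String>,
}
```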

pub struct WorkerConfig {
    // ... existing fields
    #[builder(default)]
    pub workflow_task_slot_and_cache: Option<Arc<dyn WorkflowSlotAndCacheManager>>,
Member

I think the slot supplier and cache should be provided as separate options, even if the implementer chooses to have the same struct impl both.

Member Author

I'm OK with that but curious to hear why you think so.

I made it this way because I liked the idea of the user explicitly combining the workflow things into one piece and providing that, since it makes it clear they are somewhat related to each other, but it's not the most important reason.

@cretz (Member), Feb 27, 2024

This may be language specific, but many may not care about cache size in their impl.

I'm ok keeping them together, but it becomes harder to justify the separate interfaces then if they must always both be implemented anyways. So now I'm wondering if we want to change the naming a bit from SlotSupplier to a more general purpose Limiter, Tuner, Controller, TaskManager, etc. So:

  • ActivitySlotSupplier is instead ActivityController and has the slot supplier stuff
  • WorkflowSlotSupplierAndCacheManager is instead WorkflowController and has slot supplier stuff and cache manager stuff

Basically this lifts the low-level notion of "slot supplying" and "cache managing" to a general-purpose controller for limiting/controlling work. I think it will also be clearer to users to write controller = new ResourceBasedWorkerController() and then myWorkerConfig.WorkflowController = controller. They don't have to care what slots are, or that the interface is limited to slot supplying only, or whatever.

Or maybe it's all under one large WorkerTaskController or WorkerTuner or something interface, and you just have to set one option. That'd be nice, and it's what we did with interceptors, which users appreciate since they don't have to set separate ones. Yeah, I'm thinking I now support the interceptor model of one interface for it all (but it can have getters for sub-interfaces or whatever if needed, though that may not be needed).

With this change, if there is another dynamic config option we want to expose to users, we can now add it without making users worry about all the different fine-grained interfaces that could be implemented.
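A sketch of the single-interface shape being suggested here; the names (`WorkerTuner`, the getter methods) and the sub-trait split are assumptions for illustration, not the decided API.

```rust
use std::sync::Arc;

// Simplified stand-ins for the per-concern traits discussed in the proposal.
trait WorkflowSlotSupplier: Send + Sync {}
trait ActivitySlotSupplier: Send + Sync {}
trait WorkflowCacheSizer: Send + Sync {}

/// One object to set on worker options, in the spirit of the interceptor model:
/// the worker pulls the per-concern pieces from it via getters.
trait WorkerTuner: Send + Sync {
    fn workflow_slot_supplier(&self) -> Arc<dyn WorkflowSlotSupplier>;
    fn activity_slot_supplier(&self) -> Arc<dyn ActivitySlotSupplier>;
    fn workflow_cache_sizer(&self) -> Arc<dyn WorkflowCacheSizer>;
}
```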


Not a fan of combining slot and cache management. Slots are per worker, whereas the cache is per process in Go and per worker factory in Java. I would also like to leave the door open on how we map the cache to workers in the SDK, ideally so we can have a per-worker cache in Go if desired.

/// Tries to immediately reserve a slot, returning true if a slot is available.
/// May also return an error if the backing user implementation encounters an error.
/// Returned errors will bubble up and cause the worker to shut down.
fn try_reserve_slot(&self) -> Result<bool, anyhow::Error>;
Member

Just to understand, can I get some examples of where we'd use a non-blocking reserve attempt vs. a blocking one? I can only think of blocking use cases.


Eager workflows/activities
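To spell that answer out, a sketch of the two call sites (simplified trait, hypothetical names): the poller can happily block on a reservation, while eager workflow/activity dispatch must answer immediately or decline the eager task.

```rust
// Simplified stand-in trait used only to contrast the two reservation paths.
trait SlotSupplier {
    /// Poller path: waiting here is fine.
    async fn reserve_slot(&self);
    /// Eager path: must answer right now.
    fn try_reserve_slot(&self) -> bool;
}

/// Normal polling: reserve first (possibly waiting), then poll and process.
async fn poll_path(supplier: &impl SlotSupplier) {
    supplier.reserve_slot().await;
    // ... issue poll, process task, release slot ...
}

/// Eager dispatch: the task is offered synchronously (e.g. on a completion response),
/// so either a slot is free right now or the eager task is declined.
fn eager_path(supplier: &impl SlotSupplier) -> bool {
    supplier.try_reserve_slot()
}
```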

/// Called when deciding what workflow should be evicted from the cache. May return the run id of a workflow to evict.
/// If None, or an invalid run id is returned, the default LRU eviction strategy is used. The id returned must
/// be one of the workflows provided in `evictable` (the ones for which there is no actively processing WFT).
fn eviction_hint(&self, evictable: &[WorkflowSlotInfo]) -> Option<&str>;
@cretz (Member), Feb 26, 2024

What is the use case where a user may want to implement this vs a can_evict_workflow() -> bool? Specifically, if a user doesn't control the cache, why is it up to them to provide a run ID as a response? Do we expect them to use this to make some workflows cached longer than others? I think that's a bit more flexibility than is needed and supports a general purpose cache interface if we need that kind of flexibility today.

@Sushisource (Member Author), Feb 26, 2024

Yeah, the point is exactly to allow them to control the caching "algorithm" without having to actually store the data.

They could pick LRU, or they could pick smallest-history-first like in @Quinn-With-Two-Ns's example, or whatever they want.

They can simply return None here (which we can even make the default impl) and we'll use LRU.
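For example, a hedged sketch of a "smallest history first" hint; `run_id` and `history_size_bytes` are assumed fields on `WorkflowSlotInfo` for illustration, and in the real trait this would be the body of `eviction_hint`.

```rust
// Simplified stand-in; the proposal's WorkflowSlotInfo would carry more fields.
struct WorkflowSlotInfo {
    run_id: String,
    history_size_bytes: u64, // assumed field for this example
}

/// Evict the cached workflow that is cheapest to replay (smallest history).
/// Returning None falls back to the default LRU strategy.
fn eviction_hint(evictable: &[WorkflowSlotInfo]) -> Option<&str> {
    evictable
        .iter()
        .min_by_key(|info| info.history_size_bytes)
        .map(|info| info.run_id.as_str())
}

fn main() {
    let evictable = vec![
        WorkflowSlotInfo { run_id: "run-a".into(), history_size_bytes: 10_240 },
        WorkflowSlotInfo { run_id: "run-b".into(), history_size_bytes: 512 },
    ];
    assert_eq!(eviction_hint(&evictable), Some("run-b"));
}
```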

@cretz (Member), Feb 27, 2024

> without having to actually store the data

But that means they have to have access to the cache to even know what's in there (or have their own cache of IDs).

> They could pick LRU

Can they? Do you call this interface every time the cache is accessed? Without full control of the caching accesses, one cannot do an advanced implementation. I think we should either choose always-LRU or allow custom cache impls. I think the hybrid is too confusing.

@Sushisource (Member Author), Feb 27, 2024

Ah yeah good point. Seemingly all we'd need to do though is add a get or just_fetched call, and that'd be sufficient (returns nothing, just notifies impl of access). I don't want it to be a real cache implementation because actually storing things won't work well across the language boundary, plus just the intrinsic danger of them returning the wrong item, or losing things that shouldn't be lost, etc. With the facade / sizer even if they screw up we can fall back to something that still operates correctly.

@cretz (Member), Feb 27, 2024

Right, if we don't want a real cache impl, let's not even let them choose which thing to evict. Let's keep that complication out. So now I'm basically thinking cache is a simple:

type CacheController interface {
    ShouldCacheAsNew(WorkflowInfo) bool
}

And if that's false, then it evicts one before caching. Don't need anything more than that (at least not for now).

Member

^ I still think we should consider this simplified form that just lets the user decide whether it should evict upon cache request. Or maybe something like a:

type CacheController interface {
    ShouldCache(WorkflowInfo) CacheResult
}

const (
    CacheAsNew CacheResult = iota
    EvictThenCache
    NoCache
)

@Sushisource
Member Author

OK I resolved a bunch of comments that I addressed, feel free to start a new thread if there's something that wasn't. I just needed to reduce the visual clutter a bit.

Main bigger questions that are still open I think:

  • Should the cache sizer be an actual cache (I'm pretty strongly inclined no here, but, we could make it a facade for one)
  • Providing a context object to reserve/try reserve. Pretty sure we need one, at least to start, with task_queue / worker_id etc., but do we feel that we should also put a shutdown() on it, or anything else?
  • Should the traits have optionally-implementable methods for the things that would emit known metrics like max cache size and total slots?

Also could set up a meeting for these, but I want to give any other people who are interested some time to add comments too

@Sushisource
Member Author

Resolutions to the above:

  • Keep cache sizer to just the "should_allow" kind of function to start with
  • Yes, provide task queue & build id (if set) initially
  • No. Have implementers report the metric. Provide a helper for easy access to the well-defined metrics.

/// if waiting for some time in the blocking version of `reserve_slot`.
pub trait SlotReservationContext<SK: SlotKind>: Send + Sync {
    /// Returns information about currently in-use slots
    fn used_slots(&self) -> &[SK::Info];
@Quinn-With-Two-Ns, Mar 19, 2024

Shouldn't the slot supplier know this?

Member Author

It can know it, but, I don't really see a reason to force the implementor to track it themselves when we already have to track it ourselves at least in some capacity as discussed here: #88 (comment)

Certainly in Core this made dramatically more sense to provide automatically because the SlotSupplier gets wrapped by a struct in Core that has to track all this stuff anyway, and I anticipate creating a similar wrapper in Go/Java.

If it turns out to be a huge pain to provide as I'm working through Java, it can be left off - but I imagine it's going to be pretty straightforward to do.


I think just because we can provide something doesn't mean we should. We can always add information based on user feedback; it is much harder to remove. My main concern is the overhead of keeping all this data updated from potentially 10,000s of goroutines in Go when a user doesn't care about the information.

Member

Same, I prefer the lightest API surface needed to do an acceptable job (may be only slot kind and task queue name atm).

@Sushisource (Member Author), Mar 20, 2024

I definitely have known reasons to want this though, and the Stripe guys even explicitly said they would want and use this information inside their implementation. I'm pretty clear on it having value.

The overhead is tiny. It's just keeping things in a list or a map; even if that's 10k things, that's not hard - there's no meaningful compute going on. It's already got to be tracked somewhere anyway in order to provide activity/workflow context to users, etc.

@Sushisource (Member Author), Mar 20, 2024

Well, sure, but this isn't custom information - it's information that's extracted from the task which we'll be storing anyway. It's just making it available to them rather than asking them to do that work themselves. I don't think there's any strong reason they should record it over us. It's just making their lives easier for something that's very likely to come up in many or most implementations.


> Well, sure, but this isn't custom information

Right, but I think users will also want to track custom information for each slot if they already want this level of detail.

> I don't think there's any strong reason they should record it over us

Because it adds overhead to every implementation that doesn't need it (like our implementations), and it adds more complexity to the SDK and the API. I am not sure this is very likely to come up in real implementations, since the ones we have proposed (fixed, resource based, pausable) do not need it. I'm totally open to adding a wrapper that keeps track of it; otherwise I think we need to do benchmarking to show this doesn't increase latency or memory overhead. I apologize for being such a stickler here, but since we have more low-latency use cases coming into Temporal I am very paranoid about introducing potential performance regressions.

Member Author

The resource-based one does need it. It needs to know how many slots are used, since you need to know that you're not going to hand out more slots than there is room in the cache (or even just to define a configurable hard limit).

This is zero cost in Core, for example; the info is all tracked already. Fetching system resources like memory usage for that implementation is dramatically more expensive by comparison.

We can benchmark... but I feel that's a little odd here given that there are virtually zero changes we actually benchmark, and I don't see any reason to believe this would add substantial overhead compared to anything else we do. I mean, it's literally just stuffing some things (which are likely just references to objects that already exist elsewhere) into a collection, which isn't very expensive to do.


> The resource-based one does need it. It needs to know how many slots are used, since you need to know that you're not going to hand out more slots than there is room in the cache (or even just to define a configurable hard limit).

The resource-based one just needs to know the number; it doesn't need a whole map of every slot and details about it.

> but I feel that's a little odd here given that there are virtually zero changes we actually benchmark

I get that for Core it is zero cost, but for Go and Java this is new. I thought in Slack we already talked about benchmarking these changes, so I didn't even think this was a new requirement? I assumed as part of benchmarking we would test the old SDK implementation against the new fixed slot supplier implementation?

@Sushisource (Member Author), Mar 20, 2024

Yeah, of course, but isolating the cost of providing this particular function will require its own independent round of testing - which is fine, but further than we have gone before for any micro-benchmark kind of thing.

> The resource-based one just needs to know the number; it doesn't need a whole map of every slot and details about it.

Yep - but this is more generally useful and (I would bet quite readily) not meaningfully more costly.

In any case, none of this is really a hard blocker either way on this proposal. We need the context object and it'll have basic stuff like get_task_queue. Whether or not this specific function lives on it we can decide after seeing how much effort and overhead it costs, though I'm pretty sure those are both small.
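For context, a rough sketch of the kind of SDK-side tracking wrapper being referred to: the SDK records which slots are in use around the user's supplier, so `used_slots()` can be offered on the context without the implementor doing any bookkeeping. All names and the flat `SlotInfo` are illustrative.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

// Illustrative, flattened slot info; the proposal keys this per slot kind.
#[derive(Clone, Debug)]
struct SlotInfo {
    task_queue: String,
    workflow_type: Option<String>,
}

/// SDK-internal bookkeeping around a user-written supplier (akin to the wrapper
/// Core already has). It hands out a permit id on use and forgets it on release.
struct UsedSlotTracker {
    used: Mutex<HashMap<u64, SlotInfo>>,
    next_id: AtomicU64,
}

impl UsedSlotTracker {
    fn new() -> Self {
        Self { used: Mutex::new(HashMap::new()), next_id: AtomicU64::new(0) }
    }
    fn mark_used(&self, info: SlotInfo) -> u64 {
        let id = self.next_id.fetch_add(1, Ordering::Relaxed);
        self.used.lock().unwrap().insert(id, info);
        id
    }
    fn release(&self, permit_id: u64) {
        self.used.lock().unwrap().remove(&permit_id);
    }
    /// What a reservation context's `used_slots()` could be backed by.
    fn used_slots(&self) -> Vec<SlotInfo> {
        self.used.lock().unwrap().values().cloned().collect()
    }
}
```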

struct LocalActivitySlotsInfo {
    used_slots: Vec<LocalActivitySlotInfo>,

pub enum SlotReleaseReason {


If the worker is shutting down, is that an Error or NeverUsed?

Member Author

I would say NeverUsed, since if it is used, then we're waiting until that task completes before finishing shutdown. (or, if we're not, then it'd be an Error... but we should be waiting for tasks to finish)

@Quinn-With-Two-Ns, Mar 20, 2024

Fine with me, might be worth documenting so we are consistent across SDKs.
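To document the convention just agreed on, a sketch of how the enum could read; the variant set beyond the quoted `pub enum SlotReleaseReason {` line is an assumption.

```rust
pub enum SlotReleaseReason {
    /// The slot was reserved but never marked used, e.g. the poll returned nothing or
    /// the worker began shutting down before a task arrived (per the thread above,
    /// shutdown of an unused slot is NeverUsed, not Error).
    NeverUsed,
    /// The task the slot was used for finished processing.
    TaskComplete,
    /// Something went wrong, e.g. the poller received an invalid task.
    Error(anyhow::Error),
}
```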

/// Blocks until a slot is available. In languages with explicit cancel mechanisms, this should be cancellable and
/// return a boolean indicating whether a slot was actually obtained or not. In Rust, the future can simply
/// be dropped if the reservation is no longer desired.
async fn reserve_slot(&self, ctx: &dyn SlotReservationContext<Self::SlotKind>);
Member

I am now wondering, for future proofing, if we should let implementers return a slot object and have that same object be given back on mark_slot_used and release_slot. I can see use cases where users want to put state on the slot. It would also change the try_reserve_slot response to an Option instead of a bool.

Member Author

This is what I originally had, and is what I have in Rust because internally I store a struct that auto-releases on drop.

As for user data though, if we're saying elsewhere we'd prefer to not have something until there's a demonstrated need for it, I think that's probably even more applicable here. I don't know what the immediate use case is for this, but I do have known use cases for the used_slots on context.

Member Author

Any language with real destructors, yeah, definitely.

@cretz (Member), Mar 20, 2024

> As for user data though, if we're saying elsewhere we'd prefer to not have something until there's a demonstrated need for it, I think that's probably even more applicable here

Less applicable, because this is future proofing. We cannot easily add a return type to this later. The immediate use case would be something like etcd lease info. But just an opaque object that we give back to them is nice.

Can we at least make sure that a user can uniquely identify the slot they reserved at mark-used/release time? I don't think there's a way today.

@Sushisource (Member Author), Mar 20, 2024

I can add it back. I'm just a little worried about how extensive the resulting changes might be in Go/Java, where there is no permit object like in Core. We'd have to track one everywhere like Core does, which ultimately is nice, but could be quite a lot of code churn.

I think any kind of uniquely identifying a particular slot/permit would have the same consequence. It's useful though, in the abstract.

@Quinn-With-Two-Ns what do you think?
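For reference, a sketch of the permit-returning shape under discussion: `reserve_slot` hands back an opaque permit (optionally carrying supplier state such as a lease handle), and the same object comes back at mark-used/release time. The names and the `Any`-based user data are illustrative.

```rust
use std::any::Any;

/// Opaque permit returned by a reservation; uniquely identifies the slot and can
/// carry whatever state the supplier wants to see again later.
pub struct SlotPermit {
    pub user_data: Option<Box<dyn Any + Send>>,
}

// Simplified supplier trait shape with the permit threaded through.
trait SlotSupplier: Send + Sync {
    async fn reserve_slot(&self) -> SlotPermit;
    /// Non-blocking variant returns Option instead of bool once a permit exists.
    fn try_reserve_slot(&self) -> Option<SlotPermit>;
    fn mark_slot_used(&self, permit: &SlotPermit);
    fn release_slot(&self, permit: SlotPermit);
}
```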

/// Blocks until a slot is available. In languages with explicit cancel mechanisms, this should be cancellable and
/// return a boolean indicating whether a slot was actually obtained or not. In Rust, the future can simply
/// be dropped if the reservation is no longer desired.
async fn reserve_slot(&self, ctx: &dyn SlotReservationContext<Self::SlotKind>);
Member

Can we add a cancellation mechanism to this? (I forget if Rust implicitly supports cancellation of async like Python, but this at least needs to be called out for other langs)

Member Author

The comment addresses exactly that. There's no need for an explicit cancel in Rust.
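Concretely, cancelling in Rust is just dropping the future; a tiny sketch using a timeout (the trait here is a stand-in, not the proposal's exact signature):

```rust
use std::time::Duration;

// Minimal stand-in so the sketch is self-contained.
trait SlotSupplier {
    async fn reserve_slot(&self);
}

/// If the timeout wins the race, the reserve future is dropped, which abandons
/// the reservation attempt; no explicit cancel call is needed.
async fn reserve_with_timeout(supplier: &impl SlotSupplier) -> bool {
    tokio::select! {
        _ = supplier.reserve_slot() => true,
        _ = tokio::time::sleep(Duration::from_secs(5)) => false,
    }
}
```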


## A neat example
Member

This is neat; many of our users keep asking how they can limit activities, so being able to acquire a zookeeper/etcd lock for specific activity types is helpful. But we're gonna have to provide helpers if we want to advertise support for this - it's too complicated for most users.
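A hedged sketch of the kind of helper that comment asks for: an activity slot supplier that defers reservation to a remote semaphore. `DistributedSemaphore` is a placeholder trait, not a real etcd/ZooKeeper client API; a shipped helper would wrap a concrete client behind something like it.

```rust
// Placeholder for a remote coordination client (etcd lease, ZooKeeper lock, ...).
trait DistributedSemaphore: Send + Sync {
    /// Waits until a lease/permit is acquired and returns its id.
    async fn acquire(&self) -> anyhow::Result<String>;
    async fn release(&self, lease_id: String) -> anyhow::Result<()>;
}

/// Activity slot supplier that only hands out a slot while holding a remote lease.
struct RemoteLimitedSlotSupplier<D: DistributedSemaphore> {
    sem: D,
}

impl<D: DistributedSemaphore> RemoteLimitedSlotSupplier<D> {
    /// Reservation is the place where blocking on the remote system is acceptable.
    async fn reserve_slot(&self) -> anyhow::Result<String> {
        self.sem.acquire().await
    }
    /// Per the comments above, release likely wants to be async too so the lease
    /// can be returned without blocking a worker thread.
    async fn release_slot(&self, lease_id: String) -> anyhow::Result<()> {
        self.sem.release(lease_id).await
    }
}
```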

///
/// Users' implementation of this can choose to emit metrics, or otherwise leverage the information provided by the
/// `info` parameter to be better able to make future decisions about whether a slot should be handed out.
fn mark_slot_used(&self, info: <Self::SlotKind as SlotKind>::Info<'_>);
Member

After looking at the "neat example" below, it's clear this can be a long-blocking call. It should be documented as such, made async, allow cancellation/interruption, and maybe change the name. "marking" isn't usually considered a blocking operation. Maybe apply_slot or start_slot or something.

Member Author

Well, I say it could be, but I also explain why that's not necessary and you can defer the async-ness to the reservation. Ideally I think you'd keep this as nonblocking. Places where it gets used really, really shouldn't have to sit and wait.

@cretz (Member), Mar 20, 2024

Hrmm, yeah I first thought the "neat example" blocked in this call, but I see it doesn't. It means slot reservation may be unnecessarily expensive for things that never actually use the slot. Basically it means remote semaphores are based on polls instead of usage. But this is just a product of how we obtain work from the server. We should at least document that people should not block in "mark used", since we don't give them async to do so.

Though now I wonder if someone would want to separate their remote polling semaphores from their usage semaphores. It makes sense where you want one-activity-running-anywhere-globally semantics. So maybe the API should allow that w/ async and just document that the "mark used" is subject to task timeout. Then again, I guess that can be in the activity. Now I'm starting to think about providing contextual data from the slot supplier to the activity impl; I'm overthinking for sure.

Member Author

I feel pretty strongly that marking used can't ever block at all (and yeah I can mention that in the comment).

The reasoning is that it happened, and the implementer has no choice but to accept that fact. If that then means, later, that they take a long time to hand out the next permit, that's acceptable; but they can't block on mark-used because a) it gums things up, and b) they can't change anything anyway - it's too late.

@cretz (Member), Mar 20, 2024

Works for me (can consider this resolved but can leave discussion open for others if you want). I do think release should be allowed to be async though.
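To make the non-blocking contract concrete, a sketch in which `mark_slot_used` only records the event (a queue push) and anything slow runs on a background task; slow or remote work belongs in reserve/release instead. Names are illustrative.

```rust
use tokio::sync::mpsc;

struct SlotUsage {
    run_id: String,
}

struct RecordingSlotSupplier {
    usage_tx: mpsc::UnboundedSender<SlotUsage>,
}

impl RecordingSlotSupplier {
    /// Never blocks: an unbounded send is just a queue push.
    fn mark_slot_used(&self, usage: SlotUsage) {
        let _ = self.usage_tx.send(usage);
    }
}

/// Must be called from within a Tokio runtime; the background task drains usage
/// records and does any slow bookkeeping (metrics, remote state) off the hot path.
fn recording_supplier() -> RecordingSlotSupplier {
    let (usage_tx, mut usage_rx) = mpsc::unbounded_channel::<SlotUsage>();
    tokio::spawn(async move {
        while let Some(usage) = usage_rx.recv().await {
            // e.g. update metrics for usage.run_id
            let _ = usage.run_id;
        }
    });
    RecordingSlotSupplier { usage_tx }
}
```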

/// Frees a slot. This is always called when a slot is no longer needed, even if it was never marked as used.
/// EX: If the poller reserves a slot, and then receives an invalid task from the server for whatever reason, this
/// method would be called with [SlotReleaseReason::Error].
fn release_slot(&self, info: SlotReleaseReason);
Member

If people are allowed to use remote tooling like zookeeper/etcd, this should be async

async fn reserve_slot(&self, ctx: &dyn SlotReservationContext<Self::SlotKind>);

/// Tries to immediately reserve a slot, returning true if a slot is available.
fn try_reserve_slot(&self, ctx: &dyn SlotReservationContext<Self::SlotKind>) -> bool;
Member

If people are allowed to use remote tooling like zookeeper/etcd, this should be async

@cretz (Member) left a comment

LGTM!


## Default implementations

This section is important - could we move it up higher? Given these are our proposed defaults, I'd like more eyeballs on this sooner.


## Drawbacks

* Gives users a big footgun. This is mitigated by the fact that most users are probably just going to use one of our

Anything we can do to mitigate? For example, letting users fall back to a default implementation if something goes wrong?

Member Author

Not really possible - there's no way to distinguish between the user's implementation intentionally not handing out slots and it not handing out slots because it's broken.

@Sushisource merged commit 963303d into temporalio:master on Apr 2, 2024
2 checks passed
@Sushisource deleted the worker-autotune branch April 2, 2024 18:08