
Worker Autotuning #88

Merged (9 commits) into temporalio:master on Apr 2, 2024
Conversation

Sushisource (Member)

* The current number of active polls for a task type must be <= the number of free slots for that task type at all
times. Otherwise, we can end up receiving a task that won't have anywhere to go until a slot frees. For example,
in Core this is expressed by using a permit reservation system.
* Max outstanding workflow tasks must be <= max cached workflows.
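For illustration only, a minimal sketch of the permit-reservation idea described in the first bullet: a poll is only issued once a slot permit is already held, so a received task always has somewhere to go. Everything here (`poll_task`, `Task`, the semaphore) is hypothetical, not Core's actual code.

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

// Hypothetical stand-ins for a real task type and a real server long-poll.
struct Task;
async fn poll_task() -> Task { Task }
async fn process(_task: Task) { /* run the workflow/activity task */ }

async fn poll_loop(free_slots: Arc<Semaphore>) {
    loop {
        // Reserve a slot *before* polling, so active polls <= free slots at all times.
        let permit = free_slots
            .clone()
            .acquire_owned()
            .await
            .expect("semaphore closed");
        let task = poll_task().await;
        // The permit travels with the task and is only released when processing ends.
        tokio::spawn(async move {
            process(task).await;
            drop(permit);
        });
    }
}

#[tokio::main]
async fn main() {
    // e.g. at most 5 outstanding workflow task slots
    poll_loop(Arc::new(Semaphore::new(5))).await;
}
```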

I don't think this is true? You can run a worker with no cache today

Member Author

Yeah that's true - but practically speaking it means your cache's peak size will equal the number of concurrent workflow tasks. In a real-world setting you probably care about cache size as it relates to peak memory usage, and since users are (or should be) always using the cache, the constraint makes sense in that context.


## Open questions

### Async/fallible `mark_slot_used` and `release_slot`


Is this question relevant to all SDKs, or just to a Rust SDK implementation of a SlotSupplier?

Member Author

All of them

}
impl<T: SlotKind> PauseableSlotSupplier<T> {
    pub fn new(supplier: impl SlotSupplier<SlotKind=T>) -> Self { /* ... */ }
    pub fn pause(&self) { /* ... */ }


There is an interesting question about when pause returns. Do you return when you stop handing out new slots, or when you have no more slots "in flight", i.e. reserved but not yet marked in use?

Member

IMO it should return immediately; pause is a state, not an actionable request, for the most part.

Member Author

Yeah, I am imagining it returns immediately. You're saying "become paused" (like initiating shutdown of a worker).
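As an illustration of "pause is a state", a minimal sketch of a pauseable wrapper, simplified from the proposal's signatures (no `SlotKind`, and the trait here is a stand-in): `pause`/`resume` just flip a flag and return immediately, while `reserve_slot` waits out the pause before delegating.

```rust
use tokio::sync::watch;

// Simplified stand-in for the proposal's SlotSupplier trait (reservation only).
trait SlotSupplier {
    async fn reserve_slot(&self);
}

struct PauseableSlotSupplier<S> {
    inner: S,
    paused: watch::Sender<bool>,
}

impl<S> PauseableSlotSupplier<S> {
    fn new(inner: S) -> Self {
        let (paused, _rx) = watch::channel(false);
        Self { inner, paused }
    }
    /// Returns immediately; it only records the new state.
    fn pause(&self) {
        let _ = self.paused.send_replace(true);
    }
    fn resume(&self) {
        let _ = self.paused.send_replace(false);
    }
}

impl<S: SlotSupplier> SlotSupplier for PauseableSlotSupplier<S> {
    async fn reserve_slot(&self) {
        // Wait until unpaused (resolves immediately if not paused), then delegate.
        let mut rx = self.paused.subscribe();
        let _ = rx.wait_for(|paused| !*paused).await;
        self.inner.reserve_slot().await;
    }
}
```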

A possible future:

```rust
trait WorkflowCacheSizer {

```

I would like us to explore the idea of a fully pluggable workflow cache instead of just a pluggable cache sizer. I suspect there are performance wins for certain workflows in using a different caching strategy than we currently use. I see you cover some of it with `eviction_hint`, but I think just letting users provide a cache implementation would be simpler.

Member

Sorry if I'm late to the game here, but what do you mean by "fully pluggable"? Do you (basically) mean user-written, or that we should provide multiple implementations? If so, could you give a hint as to what a different (from the current) version might look like?


By fully pluggable I mean user-written, and we provide a few implementations. Right now we have a bounded LRU cache; a different version I would like to experiment with is a cache that accounts for how much work it would take to replay the workflow. One way I imagine it could do that is, when it needs to evict a workflow, to weigh the history size when choosing which workflow to evict.

Member

Concur, the difference between providing a cache interface and a can-cache interface is minimal, but the former has lots of upside. Let's just expose a get/put/delete cache interface that also has a "can allow" call.

For Core-based langs, however, they will only expose the "can" (and eviction callbacks if they want), because the stateful cache impl has to be in Rust IMO, with just hooks for checking.

Member Author

I don't want a provided implementation to be literally responsible for storing the data, though; that seems like an easy way to create confusing-to-diagnose bugs and more work for the implementor.

What would actual storage provide that eviction_hint doesn't? The example you gave @Quinn-With-Two-Ns seems to be serviced by the eviction hint.

@cretz (Member), Feb 26, 2024

The only reason is flexibility. It's basically a question of allowing replacement of the cache impl vs just the cache size. The former is more flexible I think and easy enough to build a sizer on top of, but I guess you may have to abstract the sizer anyways then. So a double-abstraction could exist: a worker accepts a custom cache implementation, and the LRU implementation could accept a dynamic sizer.

I don't have a strong opinion and can understand why just the sizer is enough. But I would change the names on some of these things a bit and I question the "hint" call (bringing up in different comment).


I agree there is a concern around making users store the data and potential bugs, but I think interceptors and data converters have equal if not more potential for introducing bugs.

I think you could get something similar with `eviction_hint`, but it would be less flexible and efficient, since you can't utilize a custom data structure to store the cached workflows.

@Sushisource (Member Author), Feb 27, 2024

> but I think interceptors and data converters have equal if not more potential for introducing bugs.

This is true, but it also doesn't supply a reason to introduce additional dangerous things. Those are necessary, but the user actually storing the cached data for us certainly isn't.

}

struct WorkflowSlotInfo {
    workflow_type: String,


Would we expose whether these are eager workflows/activities here? I think it would be useful for some metrics.

Member Author

Yeah, certainly. We can put whatever we like in here.

@dandavison

I really appreciate how well-written this proposal is. (I'm just personally lacking a bit of necessary background.)

runtime. Autotuning of these values seeks to allow them to change at runtime according to available resources on
the worker. It is not feasible for us to choose *optimal* values for these at runtime, because we cannot know a priori
how many resources any given workflow or activity task will consume - only the user can (maybe) know this.
However, we can provide reasonable default autotuning implementations while also allowing users to provide their own.
Member

It might be nice to say something here about what this might look like. For instance, we could take the approach that we tune for maximum concurrency (while avoiding OOMs or other errors), or total throughput per worker, or minimum latency (for some measure of latency and some form of "minimum"). Any (or even each) of these seems like a valid strategy, as might some utility function that combines them. Do we plan to offer the user a choice? Pick one of these (or something I didn't mention) and do just that for now? What's our true goal here?

Even though I have placed my comment here in the slot/cache tuning section, it's really more of a global comment. We can probably do some stuff that improves worker performance in ways that are Pareto optimal, i.e. improve some dimension of performance without any cost in other dimensions. Maybe that should be our first goal. But most (I might even say all) performance tuning eventually requires a definition of performance that expresses the specific goal directly, because there are always tradeoffs.

Member Author

Yeah I'll add a little more context.

But, yes, I think the idea here is we can add some default implementations that are a nice improvement for most people most of the time over the current fixed implementations. Beyond that, I agree, it starts to get specific, and people with those needs can implement exactly what they want.

Member Author

Note that the "default implementations" section does make some reference to those.


This proposal has focused specifically on worker autotuning and user customization thereof. However, it's worth
considering how this effort will fit into future work we know we want to do, namely around autoscaling and on-demand
workers.
Member

Another future option worth considering (IMO) would be for some workers (at least new ones) to learn from already-running workers' learned configuration, to speed up the learning process.

Member Author

Yeah, that'd be neat. I can make reference to that here, but it's certainly something for later.

@cretz (Member) left a comment

Left some comments. Like with many of these proposals, two recommendations (take 'em or leave 'em):

  • Set up a meeting to knock out each contentious point
  • Once many things are resolved, close (without merging) the early, noisier PRs where there's a lot of chatter, and make a final PR for final +1's to merge

/// Blocks until a slot is available.
/// May also return an error if the backing user implementation encounters an error.
/// Returned errors will bubble up and cause the worker to shut down.
async fn reserve_slot(&self) -> Result<(), anyhow::Error>;
Member

While I know this is just explanatory, I think a "slot info" should be provided with contextual information about the reservation request. Even if it's empty today (though it's unlikely to be; it should at least have namespace, task queue, worker build ID, and maybe other worker-identifying things), it'll help future-proof the API.

Member Author

Sure, that's a good point. It can't (well, shouldn't) be exactly the same type as in the other places though, as it'd mean 90% of the fields on it become optional. Perfectly reasonable to have a new type here though with the fields we know like you mentioned, and then room to add more fields later on.

Member

Definitely; a "reservation context" or "reservation info" type that's just for the reserve call and has no other reuse is acceptable IMO.
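A sketch of what such a reservation-info type could carry; the field set is illustrative (things the worker already knows about itself), not the final API.

```rust
/// Hypothetical context passed to reserve/try-reserve calls. More fields can be
/// added later without breaking implementations that ignore them.
#[derive(Debug, Clone)]
pub struct SlotReservationInfo {
    pub namespace: String,
    pub task_queue: String,
    pub worker_identity: String,
    pub worker_build_id: Option<String>,
}
```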

pub struct WorkerConfig {
    // ... existing fields
    #[builder(default)]
    pub workflow_task_slot_and_cache: Option<Arc<dyn WorkflowSlotAndCacheManager>>,
Member

I think the slot supplier and cache should be provided as separate options, even if the implementer chooses to have the same struct impl both.

Member Author

I'm OK with that but curious to hear why you think so.

I made it this way because I liked the idea of the user explicitly combining the workflow things into one piece and providing that, since it makes it clear they are somewhat related to each other, but it's not the most important reason.

@cretz (Member), Feb 27, 2024

This may be language specific, but many may not care about cache size in their impl.

I'm ok keeping them together, but it becomes harder to justify the separate interfaces then if they must always both be implemented anyways. So now I'm wondering if we want to change the naming a bit from SlotSupplier to a more general purpose Limiter, Tuner, Controller, TaskManager, etc. So:

  • ActivitySlotSupplier is instead ActivityController and has the slot supplier stuff
  • WorkflowSlotSupplierAndCacheManager is instead WorkflowController and has slot supplier stuff and cache manager stuff

Basically this lifts the low-level notion of "slot supplying" and "cache managing" to a general-purpose controller for limiting/controlling work. I think it will also be clearer to users to write controller = new ResourceBasedWorkerController() and then myWorkerConfig.WorkflowController = controller. They don't have to care what slots are, or that the interface is limited to slot supplying only, or whatever.

Or maybe it's all under one large WorkerTaskController or WorkerTuner or something interface, and you just have to set one option. That'd be nice, and it's what we did with interceptors, which users appreciate since they don't have to set separate ones. Yeah, I'm thinking I now support the interceptor model of one interface for it all (but it can have getters for sub-interfaces or whatever if needed, though that may not be needed).

With this change, if there is another dynamic config option we want to expose to users, we can now add it without making users worry about all the different fine-grained interfaces that could be implemented.
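A sketch of the single-interface shape being suggested here; the names (`WorkerTuner`, the getter methods) and the sub-trait split are assumptions for illustration, not the decided API.

```rust
use std::sync::Arc;

// Simplified stand-ins for the per-concern traits discussed in the proposal.
trait WorkflowSlotSupplier: Send + Sync {}
trait ActivitySlotSupplier: Send + Sync {}
trait WorkflowCacheSizer: Send + Sync {}

/// One object to set on worker options, in the spirit of the interceptor model:
/// the worker pulls the per-concern pieces from it via getters.
trait WorkerTuner: Send + Sync {
    fn workflow_slot_supplier(&self) -> Arc<dyn WorkflowSlotSupplier>;
    fn activity_slot_supplier(&self) -> Arc<dyn ActivitySlotSupplier>;
    fn workflow_cache_sizer(&self) -> Arc<dyn WorkflowCacheSizer>;
}
```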


Not a fan of combining slot and cache management. Slots are per worker, whereas the cache is per process in Go and per worker factory in Java. I would also like to leave the door open on how we map the cache to workers in the SDK, ideally so we can have a per-worker cache in Go if desired.

/// Tries to immediately reserve a slot, returning true if a slot is available.
/// May also return an error if the backing user implementation encounters an error.
/// Returned errors will bubble up and cause the worker to shut down.
fn try_reserve_slot(&self) -> Result<bool, anyhow::Error>;
Member

Just to understand, can I get some examples of where we'd use a non-blocking reserve attempt vs. a blocking one? I can only think of blocking use cases.


Eager workflows/activities
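To spell that answer out, a sketch of the two call sites (simplified trait, hypothetical names): the poller can happily block on a reservation, while eager workflow/activity dispatch must answer immediately or decline the eager task.

```rust
// Simplified stand-in trait used only to contrast the two reservation paths.
trait SlotSupplier {
    /// Poller path: waiting here is fine.
    async fn reserve_slot(&self);
    /// Eager path: must answer right now.
    fn try_reserve_slot(&self) -> bool;
}

/// Normal polling: reserve first (possibly waiting), then poll and process.
async fn poll_path(supplier: &impl SlotSupplier) {
    supplier.reserve_slot().await;
    // ... issue poll, process task, release slot ...
}

/// Eager dispatch: the task is offered synchronously (e.g. on a completion response),
/// so either a slot is free right now or the eager task is declined.
fn eager_path(supplier: &impl SlotSupplier) -> bool {
    supplier.try_reserve_slot()
}
```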

/// Called when deciding what workflow should be evicted from the cache. May return the run id of a workflow to evict.
/// If None, or an invalid run id is returned, the default LRU eviction strategy is used. The id returned must
/// be one of the workflows provided in `evictable` (the ones for which there is no actively processing WFT).
fn eviction_hint(&self, evictable: &[WorkflowSlotInfo]) -> Option<&str>;
@cretz (Member), Feb 26, 2024

What is the use case where a user may want to implement this vs a can_evict_workflow() -> bool? Specifically, if a user doesn't control the cache, why is it up to them to provide a run ID as a response? Do we expect them to use this to make some workflows cached longer than others? I think that's a bit more flexibility than is needed and supports a general purpose cache interface if we need that kind of flexibility today.

@Sushisource (Member Author), Feb 26, 2024

Yeah, the point is exactly to allow them to control the caching "algorithm" without having to actually store the data.

They could pick LRU, or they could pick smallest-history-first like in @Quinn-With-Two-Ns's example, or whatever they want.

They can simply return None here (which we can even make the default impl) and we'll use LRU.
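For example, a hedged sketch of a "smallest history first" hint; `run_id` and `history_size_bytes` are assumed fields on `WorkflowSlotInfo` for illustration, and in the real trait this would be the body of `eviction_hint`.

```rust
// Simplified stand-in; the proposal's WorkflowSlotInfo would carry more fields.
struct WorkflowSlotInfo {
    run_id: String,
    history_size_bytes: u64, // assumed field for this example
}

/// Evict the cached workflow that is cheapest to replay (smallest history).
/// Returning None falls back to the default LRU strategy.
fn eviction_hint(evictable: &[WorkflowSlotInfo]) -> Option<&str> {
    evictable
        .iter()
        .min_by_key(|info| info.history_size_bytes)
        .map(|info| info.run_id.as_str())
}

fn main() {
    let evictable = vec![
        WorkflowSlotInfo { run_id: "run-a".into(), history_size_bytes: 10_240 },
        WorkflowSlotInfo { run_id: "run-b".into(), history_size_bytes: 512 },
    ];
    assert_eq!(eviction_hint(&evictable), Some("run-b"));
}
```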

@cretz (Member), Feb 27, 2024

> without having to actually store the data

But that means they have to have access to the cache to even know what's in there (or have their own cache of IDs).

> They could pick LRU

Can they? Do you call this interface every time the cache is accessed? Without full control of the caching accesses, one cannot do an advanced implementation. I think we should either choose always-LRU or allow custom cache impls. I think the hybrid is too confusing.

@Sushisource (Member Author), Feb 27, 2024

Ah yeah good point. Seemingly all we'd need to do though is add a get or just_fetched call, and that'd be sufficient (returns nothing, just notifies impl of access). I don't want it to be a real cache implementation because actually storing things won't work well across the language boundary, plus just the intrinsic danger of them returning the wrong item, or losing things that shouldn't be lost, etc. With the facade / sizer even if they screw up we can fall back to something that still operates correctly.

@cretz (Member), Feb 27, 2024

Right, if we don't want a real cache impl, let's not even let them choose which thing to evict. Let's keep that complication out. So now I'm basically thinking cache is a simple:

type CacheController interface {
    ShouldCacheAsNew(WorkflowInfo) bool
}

And if that's false, then it evicts one before caching. Don't need anything more than that (at least not for now).

Member

^ I still think we should consider this simplified form that just lets the user decide whether it should evict upon cache request. Or maybe something like a:

type CacheController interface {
    ShouldCache(WorkflowInfo) CacheResult
}

const (
    CacheAsNew CacheResult = iota
    EvictThenCache
    NoCache
)

@Sushisource
Member Author

OK I resolved a bunch of comments that I addressed, feel free to start a new thread if there's something that wasn't. I just needed to reduce the visual clutter a bit.

Main bigger questions that are still open I think:

  • Should the cache sizer be an actual cache (I'm pretty strongly inclined no here, but, we could make it a facade for one)
  • Providing a context object to reserve/try reserve. Pretty sure we need one, at least to start, with task_queue / worker_id etc., but do we feel that we should also put a shutdown() on it, or anything else?
  • Should the traits have optionally-implementable methods for the things that would emit known metrics like max cache size and total slots?

Also could set up a meeting for these, but I want to give any other people who are interested some time to add comments too

@Sushisource
Member Author

Resolutions to the above:

  • Keep cache sizer to just the "should_allow" kind of function to start with
  • Yes, provide task queue & build id (if set) initially
  • No. Have implementers report the metric. Provide a helper for easy access to the well-defined metrics.

/// if waiting for some time in the blocking version of `reserve_slot`.
pub trait SlotReservationContext<SK: SlotKind>: Send + Sync {
    /// Returns information about currently in-use slots
    fn used_slots(&self) -> &[SK::Info];
@Quinn-With-Two-Ns, Mar 19, 2024

Shouldn't the slot supplier know this?

Member Author

It can know it, but, I don't really see a reason to force the implementor to track it themselves when we already have to track it ourselves at least in some capacity as discussed here: #88 (comment)

Certainly in Core this made dramatically more sense to provide automatically because the SlotSupplier gets wrapped by a struct in Core that has to track all this stuff anyway, and I anticipate creating a similar wrapper in Go/Java.

If it turns out to be a huge pain to provide as I'm working through Java, it can be left off - but I imagine it's going to be pretty straightforward to do.


I think just because we can provide something doesn't mean we should. We can always add information based on user feedback; it is much harder to remove. My main concern is the overhead of keeping all this data updated from potentially 10,000s of goroutines in Go when a user doesn't care about the information.

Member

Same, I prefer the lightest API surface needed to do an acceptable job (may be only slot kind and task queue name atm).

@Sushisource (Member Author), Mar 20, 2024

I definitely have known reasons to want this though, and the Stripe guys even explicitly said they would want and use this information inside their implementation. I'm pretty clear on it having value.

The overhead is tiny. It's just keeping things in a list or a map; even if that's 10k things, that's not hard - there's no meaningful compute going on. It's already got to be tracked somewhere anyway in order to provide activity/workflow context to users, etc.

@Sushisource (Member Author), Mar 20, 2024

Well, sure, but this isn't custom information - it's information that's extracted from the task which we'll be storing anyway. It's just making it available to them rather than asking them to do that work themselves. I don't think there's any strong reason they should record it over us. It's just making their lives easier for something that's very likely to come up in many or most implementations.


> Well, sure, but this isn't custom information

Right, but I think users will also want to track custom information for each slot if they already want this level of detail.

> I don't think there's any strong reason they should record it over us

Because it adds overhead to every implementation that doesn't need it (like our implementations), and it adds more complexity to the SDK and the API. I am not sure this is very likely to come up in real implementations, since the ones we have proposed (fixed, resource based, pausable) do not need it. I'm totally open to adding a wrapper that keeps track of it; otherwise I think we need to do benchmarking to show this doesn't increase latency or memory overhead. I apologize for being such a stickler here, but since we have more low-latency use cases coming into Temporal I am very paranoid about introducing potential performance regressions.

Member Author

The resource-based one does need it. It needs to know how many slots are used, since you need to know that you're not going to hand out more slots than there is room in the cache (or even just to define a configurable hard limit).

This is zero cost in Core, for example; the info is all tracked already. Fetching system resources like memory usage for that implementation is dramatically more expensive by comparison.

We can benchmark... but I feel that's a little odd here given that there are virtually zero changes we actually benchmark, and I don't see any reason to believe this would add substantial overhead compared to anything else we do. I mean, it's literally just stuffing some things (which are likely just references to objects that already exist elsewhere) into a collection, which isn't very expensive to do.


> The resource-based one does need it. It needs to know how many slots are used, since you need to know that you're not going to hand out more slots than there is room in the cache (or even just to define a configurable hard limit).

The resource-based one just needs to know the number; it doesn't need a whole map of every slot and details about it.

> but I feel that's a little odd here given that there are virtually zero changes we actually benchmark

I get that for Core it is zero cost, but for Go and Java this is new. I thought in Slack we already talked about benchmarking these changes, so I didn't even think this was a new requirement? I assumed as part of benchmarking we would test the old SDK implementation against the new fixed slot supplier implementation?

@Sushisource (Member Author), Mar 20, 2024

Yeah, of course, but isolating the cost of providing this particular function will require its own independent round of testing - which is fine, but further than we have gone before for any micro-benchmark kind of thing.

> The resource-based one just needs to know the number; it doesn't need a whole map of every slot and details about it.

Yep - but this is more generally useful and (I would bet quite readily) not meaningfully more costly.

In any case, none of this is really a hard blocker either way on this proposal. We need the context object and it'll have basic stuff like get_task_queue. Whether or not this specific function lives on it we can decide after seeing how much effort and overhead it costs, though I'm pretty sure those are both small.
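For context, a rough sketch of the kind of SDK-side tracking wrapper being referred to: the SDK records which slots are in use around the user's supplier, so `used_slots()` can be offered on the context without the implementor doing any bookkeeping. All names and the flat `SlotInfo` are illustrative.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

// Illustrative, flattened slot info; the proposal keys this per slot kind.
#[derive(Clone, Debug)]
struct SlotInfo {
    task_queue: String,
    workflow_type: Option<String>,
}

/// SDK-internal bookkeeping around a user-written supplier (akin to the wrapper
/// Core already has). It hands out a permit id on use and forgets it on release.
struct UsedSlotTracker {
    used: Mutex<HashMap<u64, SlotInfo>>,
    next_id: AtomicU64,
}

impl UsedSlotTracker {
    fn new() -> Self {
        Self { used: Mutex::new(HashMap::new()), next_id: AtomicU64::new(0) }
    }
    fn mark_used(&self, info: SlotInfo) -> u64 {
        let id = self.next_id.fetch_add(1, Ordering::Relaxed);
        self.used.lock().unwrap().insert(id, info);
        id
    }
    fn release(&self, permit_id: u64) {
        self.used.lock().unwrap().remove(&permit_id);
    }
    /// What a reservation context's `used_slots()` could be backed by.
    fn used_slots(&self) -> Vec<SlotInfo> {
        self.used.lock().unwrap().values().cloned().collect()
    }
}
```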

struct LocalActivitySlotsInfo {
    used_slots: Vec<LocalActivitySlotInfo>,

pub enum SlotReleaseReason {


If the worker is shutting down, is that an Error or NeverUsed?

Member Author

I would say NeverUsed, since if it is used, then we're waiting until that task completes before finishing shutdown. (or, if we're not, then it'd be an Error... but we should be waiting for tasks to finish)

@Quinn-With-Two-Ns, Mar 20, 2024

Fine with me, might be worth documenting so we are consistent across SDKs.
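To document the convention just agreed on, a sketch of how the enum could read; the variant set beyond the quoted `pub enum SlotReleaseReason {` line is an assumption.

```rust
pub enum SlotReleaseReason {
    /// The slot was reserved but never marked used, e.g. the poll returned nothing or
    /// the worker began shutting down before a task arrived (per the thread above,
    /// shutdown of an unused slot is NeverUsed, not Error).
    NeverUsed,
    /// The task the slot was used for finished processing.
    TaskComplete,
    /// Something went wrong, e.g. the poller received an invalid task.
    Error(anyhow::Error),
}
```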

/// Blocks until a slot is available. In languages with explicit cancel mechanisms, this should be cancellable and
/// return a boolean indicating whether a slot was actually obtained or not. In Rust, the future can simply
/// be dropped if the reservation is no longer desired.
async fn reserve_slot(&self, ctx: &dyn SlotReservationContext<Self::SlotKind>);
Member

I am now wondering, for future proofing, if we should let implementers return a slot object and have that same object be given back on mark_slot_used and release_slot. I can see use cases where users want to put state on the slot. It would also change the try_reserve_slot response to an Option instead of a bool.

Member Author

This is what I originally had, and is what I have in Rust because internally I store a struct that auto-releases on drop.

As for user data though, if we're saying elsewhere we'd prefer to not have something until there's a demonstrated need for it, I think that's probably even more applicable here. I don't know what the immediate use case is for this, but I do have known use cases for the used_slots on context.

Member Author

Any language with real destructors, yeah, definitely.

@cretz (Member), Mar 20, 2024

> As for user data though, if we're saying elsewhere we'd prefer to not have something until there's a demonstrated need for it, I think that's probably even more applicable here

Less applicable, because this is future proofing. We cannot easily add a return type to this later. The immediate use case would be something like etcd lease info. But just an opaque object that we give back to them is nice.

Can we at least make sure that a user can uniquely identify the slot they reserved at mark-used/release time? I don't think there's a way today.

@Sushisource (Member Author), Mar 20, 2024

I can add it back. I'm just a little worried about how extensive the resulting changes might be in Go/Java, where there is no permit object like in Core. We'd have to track one everywhere like Core does, which ultimately is nice, but could be quite a lot of code churn.

I think any kind of uniquely identifying a particular slot/permit would have the same consequence. It's useful though, in the abstract.

@Quinn-With-Two-Ns what do you think?
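For reference, a sketch of the permit-returning shape under discussion: `reserve_slot` hands back an opaque permit (optionally carrying supplier state such as a lease handle), and the same object comes back at mark-used/release time. The names and the `Any`-based user data are illustrative.

```rust
use std::any::Any;

/// Opaque permit returned by a reservation; uniquely identifies the slot and can
/// carry whatever state the supplier wants to see again later.
pub struct SlotPermit {
    pub user_data: Option<Box<dyn Any + Send>>,
}

// Simplified supplier trait shape with the permit threaded through.
trait SlotSupplier: Send + Sync {
    async fn reserve_slot(&self) -> SlotPermit;
    /// Non-blocking variant returns Option instead of bool once a permit exists.
    fn try_reserve_slot(&self) -> Option<SlotPermit>;
    fn mark_slot_used(&self, permit: &SlotPermit);
    fn release_slot(&self, permit: SlotPermit);
}
```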

/// Blocks until a slot is available. In languages with explicit cancel mechanisms, this should be cancellable and
/// return a boolean indicating whether a slot was actually obtained or not. In Rust, the future can simply
/// be dropped if the reservation is no longer desired.
async fn reserve_slot(&self, ctx: &dyn SlotReservationContext<Self::SlotKind>);
Member

Can we add a cancellation mechanism to this? (I forget if Rust implicitly supports cancellation of async like Python, but this at least needs to be called out for other langs)

Member Author

The comment addresses exactly that. There's no need for an explicit cancel in Rust.
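Concretely, cancelling in Rust is just dropping the future; a tiny sketch using a timeout (the trait here is a stand-in, not the proposal's exact signature):

```rust
use std::time::Duration;

// Minimal stand-in so the sketch is self-contained.
trait SlotSupplier {
    async fn reserve_slot(&self);
}

/// If the timeout wins the race, the reserve future is dropped, which abandons
/// the reservation attempt; no explicit cancel call is needed.
async fn reserve_with_timeout(supplier: &impl SlotSupplier) -> bool {
    tokio::select! {
        _ = supplier.reserve_slot() => true,
        _ = tokio::time::sleep(Duration::from_secs(5)) => false,
    }
}
```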


## A neat example
Member

This is neat; many of our users keep asking how they can limit activities, so being able to acquire a zookeeper/etcd lock for specific activity types is helpful. But we're gonna have to provide helpers if we want to advertise support for this - it's too complicated for most users.
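A hedged sketch of the kind of helper that comment asks for: an activity slot supplier that defers reservation to a remote semaphore. `DistributedSemaphore` is a placeholder trait, not a real etcd/ZooKeeper client API; a shipped helper would wrap a concrete client behind something like it.

```rust
// Placeholder for a remote coordination client (etcd lease, ZooKeeper lock, ...).
trait DistributedSemaphore: Send + Sync {
    /// Waits until a lease/permit is acquired and returns its id.
    async fn acquire(&self) -> anyhow::Result<String>;
    async fn release(&self, lease_id: String) -> anyhow::Result<()>;
}

/// Activity slot supplier that only hands out a slot while holding a remote lease.
struct RemoteLimitedSlotSupplier<D: DistributedSemaphore> {
    sem: D,
}

impl<D: DistributedSemaphore> RemoteLimitedSlotSupplier<D> {
    /// Reservation is the place where blocking on the remote system is acceptable.
    async fn reserve_slot(&self) -> anyhow::Result<String> {
        self.sem.acquire().await
    }
    /// Per the comments above, release likely wants to be async too so the lease
    /// can be returned without blocking a worker thread.
    async fn release_slot(&self, lease_id: String) -> anyhow::Result<()> {
        self.sem.release(lease_id).await
    }
}
```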

///
/// Users' implementation of this can choose to emit metrics, or otherwise leverage the information provided by the
/// `info` parameter to be better able to make future decisions about whether a slot should be handed out.
fn mark_slot_used(&self, info: <Self::SlotKind as SlotKind>::Info<'_>);
Member

After looking at the "neat example" below, it's clear this can be a long-blocking call. It should be documented as such, made async, allow cancellation/interruption, and maybe change the name. "marking" isn't usually considered a blocking operation. Maybe apply_slot or start_slot or something.

Member Author

Well, I say it could be, but I also explain why that's not necessary and you can defer the async-ness to the reservation. Ideally I think you'd keep this as nonblocking. Places where it gets used really, really shouldn't have to sit and wait.

@cretz (Member), Mar 20, 2024

Hrmm, yeah I first thought the "neat example" blocked in this call, but I see it doesn't. It means slot reservation may be unnecessarily expensive for things that never actually use the slot. Basically it means remote semaphores are based on polls instead of usage. But this is just a product of how we obtain work from the server. We should at least document that people should not block in "mark used", since we don't give them async to do so.

Though now I wonder if someone would want to separate their remote polling semaphores from their usage semaphores. It makes sense where you want one-activity-running-anywhere-globally semantics. So maybe the API should allow that w/ async and just document that the "mark used" is subject to task timeout. Then again, I guess that can be in the activity. Now I'm starting to think about providing contextual data from the slot supplier to the activity impl; I'm overthinking for sure.

Member Author

I feel pretty strongly that marking used can't ever block at all (and yeah I can mention that in the comment).

The reasoning is that it happened, and the implementer has no choice but to accept that fact. If that then means, later, that they take a long time to hand out the next permit, that's acceptable; but they can't block on mark-used because a) it gums things up, and b) they can't change anything anyway - it's too late.

@cretz (Member), Mar 20, 2024

Works for me (can consider this resolved but can leave discussion open for others if you want). I do think release should be allowed to be async though.
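To make the non-blocking contract concrete, a sketch in which `mark_slot_used` only records the event (a queue push) and anything slow runs on a background task; slow or remote work belongs in reserve/release instead. Names are illustrative.

```rust
use tokio::sync::mpsc;

struct SlotUsage {
    run_id: String,
}

struct RecordingSlotSupplier {
    usage_tx: mpsc::UnboundedSender<SlotUsage>,
}

impl RecordingSlotSupplier {
    /// Never blocks: an unbounded send is just a queue push.
    fn mark_slot_used(&self, usage: SlotUsage) {
        let _ = self.usage_tx.send(usage);
    }
}

/// Must be called from within a Tokio runtime; the background task drains usage
/// records and does any slow bookkeeping (metrics, remote state) off the hot path.
fn recording_supplier() -> RecordingSlotSupplier {
    let (usage_tx, mut usage_rx) = mpsc::unbounded_channel::<SlotUsage>();
    tokio::spawn(async move {
        while let Some(usage) = usage_rx.recv().await {
            // e.g. update metrics for usage.run_id
            let _ = usage.run_id;
        }
    });
    RecordingSlotSupplier { usage_tx }
}
```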

/// Frees a slot. This is always called when a slot is no longer needed, even if it was never marked as used.
/// EX: If the poller reserves a slot, and then receives an invalid task from the server for whatever reason, this
/// method would be called with [SlotReleaseReason::Error].
fn release_slot(&self, info: SlotReleaseReason);
Member

If people are allowed to use remote tooling like zookeeper/etcd, this should be async

async fn reserve_slot(&self, ctx: &dyn SlotReservationContext<Self::SlotKind>);

/// Tries to immediately reserve a slot, returning true if a slot is available.
fn try_reserve_slot(&self, ctx: &dyn SlotReservationContext<Self::SlotKind>) -> bool;
Member

If people are allowed to use remote tooling like zookeeper/etcd, this should be async

@cretz (Member) left a comment

LGTM!


## Default implementations

This section is important - could we move it up higher? Given these are our proposed defaults, I'd like more eyeballs on this sooner.


## Drawbacks

* Gives users a big footgun. This is mitigated by the fact that most users are probably just going to use one of our

Anything we can do to mitigate? For example, letting users fall back to a default implementation if something goes wrong?

Member Author

Not really possible - there's no way to distinguish between the user's implementation intentionally not handing out slots and it not handing out slots because it's broken.

@Sushisource merged commit 963303d into temporalio:master on Apr 2, 2024
2 checks passed
@Sushisource deleted the worker-autotune branch April 2, 2024 18:08