Worker Autotuning #88
Conversation
* The current number of active polls for a task type must be <= the number of free slots for that task type at all
  times. Otherwise, we can end up receiving a task that won't have anywhere to go until a slot frees. For example,
  in Core this is expressed by using a permit reservation system.
* Max outstanding workflow tasks must be <= max cached workflows.
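The permit-reservation idea mentioned for Core could be sketched roughly as below. This is a generic illustration of the invariant (a poll is only issued while holding a permit, so active polls never exceed free slots), not Core's actual implementation; the `Permits` type and method names are assumptions.

```rust
use std::sync::Mutex;

/// Illustrative permit counter gating poll requests on free slots.
pub struct Permits {
    free: Mutex<usize>,
}

impl Permits {
    pub fn new(max_slots: usize) -> Self {
        Permits { free: Mutex::new(max_slots) }
    }

    /// Reserve a permit before issuing a poll; `false` means no free slot,
    /// so no new poll may be started.
    pub fn try_acquire(&self) -> bool {
        let mut free = self.free.lock().unwrap();
        if *free > 0 {
            *free -= 1;
            true
        } else {
            false
        }
    }

    /// Return the permit once the task occupying the slot completes.
    pub fn release(&self) {
        *self.free.lock().unwrap() += 1;
    }
}
```

With two slots, a third poll attempt is refused until a task finishes and its permit is released.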
I don't think this is true? You can run a worker with no cache today
Yeah, that's true - but practically speaking it means your cache's peak size will equal the number of concurrent workflow tasks. In a real-world setting you probably care about cache size as it relates to peak memory usage, and since users are (or should be) always using the cache, the constraint makes sense in that context.
## Open questions

### Async/fallible `mark_slot_used` and `release_slot`
Is this question relevant to all SDKs, or just a Rust SDK implementation of a `SlotSupplier`?
All of them
all-sdk/autotuning.md
Outdated
```rust
}
impl<T: SlotKind> PauseableSlotSupplier<T> {
    pub fn new(supplier: impl SlotSupplier<SlotKind = T>) -> Self { /* ... */ }
    pub fn pause(&self) { /* ... */ }
```
There is an interesting question on when `pause` returns. Do you return when you stop handing out new slots, or when you have no more slots "in flight", i.e. reserved but not yet marked in use?
IMO it should return immediately; pause is a state, not an actionable request, for the most part.
Yeah, I am imagining it returns immediately. You're saying "become paused" (like initiating shutdown of a worker).
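A minimal sketch of the "pause is a state" semantics discussed here, assuming hypothetical names (`PauseableSupplier`, `try_reserve`) rather than the proposal's final API: `pause` just records the state and returns immediately, and reservation attempts observe it.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

pub struct PauseableSupplier {
    paused: AtomicBool,
}

impl PauseableSupplier {
    pub fn new() -> Self {
        PauseableSupplier { paused: AtomicBool::new(false) }
    }

    /// Returns immediately: this only records the paused state (like
    /// initiating worker shutdown). Already-reserved slots are unaffected.
    pub fn pause(&self) {
        self.paused.store(true, Ordering::SeqCst);
    }

    pub fn unpause(&self) {
        self.paused.store(false, Ordering::SeqCst);
    }

    /// A reservation attempt simply fails while paused.
    pub fn try_reserve(&self) -> bool {
        !self.paused.load(Ordering::SeqCst)
    }
}
```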
A possible future:

```rust
trait WorkflowCacheSizer {
```
I would like us to explore the idea of a fully pluggable workflow cache instead of just a pluggable cache sizer. I suspect there are performance wins for certain workflows in using a different caching strategy than we currently use. I see you cover some of it with `eviction_hint`, but I think just letting users provide a cache implementation would be simpler.
sorry if I'm late to the game here, but what do you mean by "fully pluggable"? Do you (basically) mean user-written, or that we should provide multiple implementations? If so could you give a hint what a different (from current) version might look like?
By fully pluggable I mean user written, and we provide a few implementations. Right now we have a bounded LRU cache; a different version I would like to experiment with is a cache that accounts for how much work it would take to replay the workflow. One way I imagine it could do that is, when it needs to evict a workflow, to weight the history size in deciding which workflow to evict.
Concur; the difference between providing a cache interface and a can-cache interface is minimal, but the former has lots of upside. Let's just expose a get/put/delete cache interface that also has a "can allow" call.
For Core langs, however, they will only expose the "can" (and eviction callbacks if they want), because the stateful cache impl has to be in Rust IMO, with just hooks for checking.
I don't want a provided implementation to be literally responsible for storing the data, though; that seems like an easy way to create confusing-to-diagnose bugs & more work for the implementor.
What would actual storage provide that `eviction_hint` doesn't? The example you gave @Quinn-With-Two-Ns seems to be serviced by the eviction hint.
The only reason is flexibility. It's basically a question of allowing replacement of the cache impl vs just the cache size. The former is more flexible I think, and easy enough to build a sizer on top of, but I guess you may have to abstract the sizer anyways then. So a double abstraction could exist: a worker accepts a custom cache implementation, and the LRU implementation could accept a dynamic sizer.
I don't have a strong opinion and can understand why just the sizer is enough. But I would change the names on some of these things a bit, and I question the "hint" call (bringing it up in a different comment).
I agree there is a concern around making users store the data, and potential bugs, but I think interceptors and data converters have equal if not more potential for introducing bugs.
I think you could get something similar with `eviction_hint`, but it would be less flexible and efficient, since you can't utilize a custom data structure to store the cached workflows.
> but I think interceptors and data converters have equal if not more potential for introducing bugs.

This is true, but it also doesn't supply a reason to introduce additional dangerous things. Those are necessary, but the user actually storing the cached data for us certainly isn't.
all-sdk/autotuning.md
Outdated
```rust
}

struct WorkflowSlotInfo {
    workflow_type: String,
```
Would we expose if these are eager workflows/activities here? I think it would be useful for some metrics
Yeah, certainly. We can put whatever we like in here.
I really appreciate how well-written this proposal is. (I'm just personally lacking a bit of necessary background.)
runtime. Autotuning of these values seeks to allow them to change at runtime according to available resources on
the worker. It is not feasible for us to choose *optimal* values for these at runtime, because we cannot know a priori
how many resources any given workflow or activity task will consume - only the user can (maybe) know this.
However, we can provide reasonable default autotuning implementations while also allowing users to provide their own.
It might be nice to say something here about what this might look like. For instance, we could take the approach that we tune for maximum concurrency (maintaining OOM or other error avoidance), or total throughput per worker, or minimum latency (for some measure of latency and some form of "minimum"). Any (or even each) of these seem like valid strategies, as might some utility function that combined them. Do we plan to offer the user a choice? Pick one of these (or something I didn't mention) and do just that for now? What's our true goal here?
Even though I have placed my comment here in the slot/cache tuning section, it's really more of a global comment. We can probably do some stuff that improves worker performance in ways that are pareto optimal ... i.e. improve some dimension of performance without any cost in other dimensions. Maybe that should be our first goal. But most (I might even say all) performance tuning eventually requires a definition of performance that expresses the specific goal directly because there are always tradeoffs.
Yeah I'll add a little more context.
But, yes, I think the idea here is we can add some default implementations that are a nice improvement for most people most of the time over the current fixed implementations. Beyond that, I agree, it starts to get specific, and people with those needs can implement exactly what they want.
Note that the "default implementations" section does make some reference to those.
This proposal has focused specifically on worker autotuning and user customization thereof. However, it's worth
considering how this effort will fit into future work we know we want to do, namely around autoscaling and on-demand
workers.
Another future option worth considering (imo) would be where some workers (at least new ones) would learn from already running workers about their learned configuration to speed up the learning process.
Yeah, that'd be neat. I can make reference to that here but certainly something for later.
Left some comments. Like with many of these proposals, two recommendations (take 'em or leave 'em):
- Set up a meeting to knock out each contentious point
- Close (without merge) the early, noisier PRs where there's a lot of chatter once many things are resolved, and make a final PR for final +1's before merging
all-sdk/autotuning.md
Outdated
```rust
/// Blocks until a slot is available.
/// May also return an error if the backing user implementation encounters an error.
/// Returned errors will bubble up and cause the worker to shut down.
async fn reserve_slot(&self) -> Result<(), anyhow::Error>;
```
While I know this is just explanatory, I think a "slot info" should be provided with contextual information about the reservation request. Even if it's empty today (though unlikely to be, it should at least have namespace and task queue and worker build ID and maybe other worker identifiable things), it'll help future proof the API.
Sure, that's a good point. It can't (well, shouldn't) be exactly the same type as in the other places though, as it'd mean 90% of the fields on it become optional. Perfectly reasonable to have a new type here though with the fields we know like you mentioned, and then room to add more fields later on.
Definitely; a "reservation context" or "reservation info" that is just for the reserve call and has no other reuse is acceptable IMO.
```rust
pub struct WorkerConfig {
    // ... existing fields
    #[builder(default)]
    pub workflow_task_slot_and_cache: Option<Arc<dyn WorkflowSlotAndCacheManager>>,
```
I think the slot supplier and cache should be provided as separate options, even if the implementer chooses to have the same struct impl both.
I'm OK with that but curious to hear why you think so.
I made it this way because I liked the idea of the user explicitly combining the workflow things into one piece and providing that, since it makes it clear they are somewhat related to each other, but it's not the most important reason.
This may be language specific, but many may not care about cache size in their impl.
I'm OK keeping them together, but it becomes harder to justify the separate interfaces then, if they must always both be implemented anyways. So now I'm wondering if we want to change the naming a bit from `SlotSupplier` to a more general-purpose `Limiter`, `Tuner`, `Controller`, `TaskManager`, etc. So:

* `ActivitySlotSupplier` is instead `ActivityController` and has the slot supplier stuff
* `WorkflowSlotSupplierAndCacheManager` is instead `WorkflowController` and has slot supplier stuff and cache manager stuff

Basically this lifts the low-level notion of "slot supplying" and "cache managing" to a general-purpose controller for limiting/controlling work. I think it will also be clearer to users: `controller = new ResourceBasedWorkerController()` + `myWorkerConfig.WorkflowController = controller`. They don't have to care what slots are, or that the interface is limited to slot supplying only, or whatever.
Or maybe it's all under one large `WorkerTaskController` or `WorkerTuner` interface and you just have to set one option. That'd be nice, and that's what we did with interceptors, which users appreciate since they don't have to set separate ones. Yeah, I'm thinking I now support the interceptor model of one interface for it all (but it can have getters for sub-interfaces if needed, though that may not be needed).
With this change, if there is another dynamic config option we want to expose to users, we can now add it without making users worry about all the different fine-grained interfaces that could be implemented.
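A rough sketch of the single combined-interface ("WorkerTuner") shape floated above. All trait and type names here are hypothetical illustrations of the idea, not the proposal's API: one option on the worker exposes the per-task-type suppliers via getters, mirroring the interceptor model.

```rust
/// Stand-in for the slot supplier surface; illustrative only.
pub trait SlotSupplierLike {
    fn try_reserve_slot(&self) -> bool;
}

/// One interface set on the worker, with getters for sub-interfaces,
/// leaving room for future dynamic-config knobs without new worker options.
pub trait WorkerTuner {
    fn workflow_task_slot_supplier(&self) -> &dyn SlotSupplierLike;
    fn activity_task_slot_supplier(&self) -> &dyn SlotSupplierLike;
}

/// Trivial supplier that always has capacity, for demonstration.
pub struct FixedSupplier;
impl SlotSupplierLike for FixedSupplier {
    fn try_reserve_slot(&self) -> bool {
        true
    }
}

pub struct SimpleTuner {
    pub wf: FixedSupplier,
    pub act: FixedSupplier,
}
impl WorkerTuner for SimpleTuner {
    fn workflow_task_slot_supplier(&self) -> &dyn SlotSupplierLike {
        &self.wf
    }
    fn activity_task_slot_supplier(&self) -> &dyn SlotSupplierLike {
        &self.act
    }
}
```

The worker would then take a single `WorkerTuner` option rather than separate fine-grained ones.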
Not a fan of combining slot and cache management. Slots are per worker, whereas the cache is per process in Go and per worker factory in Java. I would also like to leave the door open on how we map the cache to workers in the SDK, ideally, so we can have a per-worker cache in Go if desired.
all-sdk/autotuning.md
Outdated
```rust
/// Tries to immediately reserve a slot, returning true if a slot is available.
/// May also return an error if the backing user implementation encounters an error.
/// Returned errors will bubble up and cause the worker to shut down.
fn try_reserve_slot(&self) -> Result<bool, anyhow::Error>;
```
Just to understand, can I get some examples where we'd use non-blocking reserve attempt vs blocking? I can only think about blocking use cases.
Eager workflows/activities
```rust
/// Called when deciding what workflow should be evicted from the cache. May return the run id of a workflow to evict.
/// If None, or an invalid run id, is returned, the default LRU eviction strategy is used. The id returned must
/// be one of the workflows provided in `evictable` (the ones for which there is no actively processing WFT).
fn eviction_hint(&self, evictable: &[WorkflowSlotInfo]) -> Option<&str>;
```
What is the use case where a user may want to implement this vs a `can_evict_workflow() -> bool`? Specifically, if a user doesn't control the cache, why is it up to them to provide a run ID as a response? Do we expect them to use this to make some workflows cached longer than others? I think that's a bit more flexibility than is needed, and it argues for a general-purpose cache interface if we need that kind of flexibility today.
Yeah, the point exactly is to allow them to control the caching "algorithm" without having to actually store the data.
They could pick LRU, or they could pick smallest history first like in @Quinn-With-Two-Ns's example, or whatever they want.
They can simply return `None` here (which we can even make the default impl) and we'll use LRU.
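As an illustration of that flexibility, a "smallest history first" `eviction_hint` might look like the sketch below. The `WorkflowSlotInfo` fields shown (`run_id`, `history_size_bytes`) are assumptions for the example, since the real struct's fields aren't fixed in this thread.

```rust
/// Assumed shape of the per-slot info for this example only.
pub struct WorkflowSlotInfo {
    pub run_id: String,
    pub history_size_bytes: u64,
}

/// Returns the run id of the evictable workflow with the smallest history
/// (cheapest to replay later), or `None` to fall back to the default LRU
/// eviction strategy.
pub fn eviction_hint(evictable: &[WorkflowSlotInfo]) -> Option<&str> {
    evictable
        .iter()
        .min_by_key(|w| w.history_size_bytes)
        .map(|w| w.run_id.as_str())
}
```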
> without having to actually store the data

But that means they have to have access to the cache to even know what's in there (or have their own cache of IDs).

> They could pick LRU

Can they? Do you call this interface every time the cache is accessed? Without full control of the cache accesses, one cannot do an advanced implementation. I think we should either choose always-LRU or allow custom cache impls. I think the hybrid is too confusing.
Ah yeah, good point. Seemingly all we'd need to do though is add a `get` or `just_fetched` call, and that'd be sufficient (returns nothing, just notifies the impl of access). I don't want it to be a real cache implementation, because actually storing things won't work well across the language boundary, plus there's the intrinsic danger of them returning the wrong item, or losing things that shouldn't be lost, etc. With the facade/sizer, even if they screw up we can fall back to something that still operates correctly.
Right, if we don't want a real cache impl, let's not even let them choose which thing to evict. Let's keep that complication out. So now I'm basically thinking cache is a simple:
```go
type CacheController interface {
    ShouldCacheAsNew(WorkflowInfo) bool
}
```
And if that's false, then it evicts one before caching. Don't need anything more than that (at least not for now).
^ I still think we should consider this simplified form that just lets the user decide whether it should evict upon cache request. Or maybe something like a:
```go
type CacheController interface {
    ShouldCache(WorkflowInfo) CacheResult
}

const (
    CacheAsNew CacheResult = iota
    EvictThenCache
    NoCache
)
```
OK I resolved a bunch of comments that I addressed, feel free to start a new thread if there's something that wasn't. I just needed to reduce the visual clutter a bit. Main bigger questions that are still open I think:
Also could set up a meeting for these, but I want to give any other people who are interested some time to add comments too.
Resolutions to the above:
```rust
/// if waiting for some time in the blocking version of `reserve_slot`.
pub trait SlotReservationContext<SK: SlotKind>: Send + Sync {
    /// Returns information about currently in-use slots
    fn used_slots(&self) -> &[SK::Info];
```
Shouldn't the slot supplier know this?
It can know it, but I don't really see a reason to force the implementor to track it themselves when we already have to track it ourselves, at least in some capacity, as discussed here: #88 (comment)
Certainly in Core this made dramatically more sense to provide automatically, because the `SlotSupplier` gets wrapped by a struct in Core that has to track all this stuff anyway, and I anticipate creating a similar wrapper in Go/Java.
If it turns out to be a huge pain to provide as I'm working through Java, it can be left off - but I imagine it's going to be pretty straightforward to do.
I think just because we can provide something doesn't mean we should. We can always add information based on user feedback; it is much harder to remove. My main concern is adding the overhead of keeping track of all this data, updated from potentially 10,000s of goroutines in Go, when a user doesn't care about the information.
Same, I prefer the lightest API surface needed to do an acceptable job (may be only slot kind and task queue name atm).
I definitely have known reasons to want this though, and the stripe guys even explicitly said they would want and use this information inside their implementation. I'm pretty clear on it having value.
The overhead is tiny. It's just keeping things in a list or a map, even if that's 10k things that's not hard, there's no meaningful compute going on. It's already got to be tracked somewhere anyway in order to provide activity/wf context to users, etc.
Well, sure, but this isn't custom information - it's information that's extracted from the task which we'll be storing anyway. It's just making it available to them rather than asking them to do that work themselves. I don't think there's any strong reason they should record it over us. It's just making their lives easier for something that's very likely to come up in many or most implementations.
> Well, sure, but this isn't custom information

Right, but I think users will also want to track custom information for each slot if they already want this level of detail.

> I don't think there's any strong reason they should record it over us

Because it adds overhead to every implementation that doesn't need it, like our implementations, and it adds more complexity to the SDK and the API. I am not sure this is very likely to come up in real implementations, since the ones we have proposed (fixed, resource based, pausable) do not need it. I'm totally open to adding a wrapper that keeps track of it, or I think we need to do benchmarking to show this doesn't increase latency or memory overhead. I apologize for being such a stickler here, but since we have more low-latency use cases coming into Temporal, I am very paranoid about introducing potential performance regressions.
The resource based one does need it. It needs to know how many slots are used since you need to know that you're not going to hand out more slots than there is room in the cache (or even just to define a configurable hard limit).
This is zero cost in Core, for example. The info is all tracked already. Fetching system resources like memory usage for that implementation is dramatically more expensive for example.
We can benchmark... but I feel that's a little odd here given there are virtually zero changes we do actually benchmark and I don't see any reason to believe this would add substantial overhead compared to anything else we do. I mean it's literally just stuffing some things (which are likely just references to objects that already exist elsewhere) in a collection, which isn't very expensive to do.
> The resource based one does need it. It needs to know how many slots are used since you need to know that you're not going to hand out more slots than there is room in the cache (or even just to define a configurable hard limit).

The resource based one just needs to know the number; it doesn't need a whole map of every slot and details about it.

> but I feel that's a little odd here given there are virtually zero changes we do actually benchmark

I get that for Core it is zero cost, but for Go and Java this is new. I thought in Slack we already talked about benchmarking these changes, so I didn't think this was a new requirement? I assumed that as part of benchmarking we would test the old SDK implementation against the new fixed slot supplier implementation?
Yeah, of course, but isolating this particular function will require its own independent round of testing - which is fine, but further than we have gone before for any micro-benchmark kind of thing.

> The resource based one just needs to know the number; it doesn't need a whole map of every slot and details about it.

Yep - but this is more generally useful and (I would bet quite readily) not meaningfully more costly.
In any case, none of this is really a hard blocker either way on this proposal. We need the context object, and it'll have basic stuff like `get_task_queue`. Whether or not this specific function lives on it we can decide after seeing how much effort and overhead it costs, though I'm pretty sure both are small.
```rust
struct LocalActivitySlotsInfo {
    used_slots: Vec<LocalActivitySlotInfo>,
}

pub enum SlotReleaseReason {
```
If the worker is shutting down, is that an `Error` or `NeverUsed`?
I would say `NeverUsed`, since if it is used, then we're waiting until that task completes before finishing shutdown (or, if we're not, then it'd be an `Error`... but we should be waiting for tasks to finish).
Fine with me, might be worth documenting so we are consistent across SDKs.
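The shutdown behavior agreed on above (an unused slot releases as `NeverUsed`; used slots finish their task first) could be sketched like this. The enum variants come from the proposal; the mapping function and its name are illustrative only.

```rust
pub enum SlotReleaseReason {
    TaskComplete,
    NeverUsed,
    Error(String),
}

/// Illustrative mapping for a slot held when the worker shuts down: a slot
/// that was never marked used releases as `NeverUsed`; a used slot's task is
/// awaited, so it releases as a normal completion.
pub fn shutdown_release_reason(was_used: bool) -> SlotReleaseReason {
    if was_used {
        // The worker waits for in-flight tasks before finishing shutdown.
        SlotReleaseReason::TaskComplete
    } else {
        SlotReleaseReason::NeverUsed
    }
}
```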
all-sdk/autotuning.md
Outdated
```rust
/// Blocks until a slot is available. In languages with explicit cancel mechanisms, this should be cancellable and
/// return a boolean indicating whether a slot was actually obtained or not. In Rust, the future can simply
/// be dropped if the reservation is no longer desired.
async fn reserve_slot(&self, ctx: &dyn SlotReservationContext<Self::SlotKind>);
```
I am now wondering, for future proofing, if we should let implementers return a slot object and have that same object be given back on `mark_slot_used` and `release_slot`. I can see use cases where users want to put state on the slot. It would also change the `try_reserve_slot` response to an `Option` instead of a bool.
This is what I originally had, and is what I have in Rust because internally I store a struct that auto-releases on drop.
As for user data though, if we're saying elsewhere we'd prefer not to have something until there's a demonstrated need for it, I think that's probably even more applicable here. I don't know what the immediate use case is for this, but I do have known use cases for the `used_slots` on context.
Any language with real destructors, yeah, definitely.
> As for user data though, if we're saying elsewhere we'd prefer to not have something until there's a demonstrated need for it, I think that's probably even more applicable here

Less applicable, because this is future proofing. We cannot easily add a return type to this later. The immediate use case would be something like etcd lease info. But just an opaque object that we give back to them is nice.

Can we at least make sure that a user can uniquely identify the slot they reserved at mark-used/release time? I don't think there's a way today.
I can add it back. I'm just a little worried about how extensive the resulting changes might be in Go/Java, where there is no permit object like in Core. We'd have to track one everywhere like Core does, which ultimately is nice, but could be quite a lot of code churn.

I think any kind of unique identification of a particular slot/permit would have the same consequence. It's useful, though, in the abstract.

@Quinn-With-Two-Ns what do you think?
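As a rough sketch of the permit-returning shape being debated here: reservation hands out an opaque permit that is given back on release, so the implementer can uniquely identify the slot and hang state off it. All names (`SlotPermit`, `FixedSizeSupplier`, the trait itself) are hypothetical, the real `reserve_slot` would be async, and the real context argument is omitted:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Opaque permit handed out on reservation and passed back on release,
/// letting implementers uniquely identify the slot and attach state to it.
pub struct SlotPermit {
    pub id: usize,
}

pub trait SlotSupplierSketch {
    /// Returns `Some(permit)` if a slot is immediately available
    /// (an `Option` instead of the proposal's current bool).
    fn try_reserve_slot(&self) -> Option<SlotPermit>;
    /// The same permit object is given back when the slot is freed.
    fn release_slot(&self, permit: SlotPermit);
}

/// Trivial fixed-size implementation, just to show the shape of the API.
pub struct FixedSizeSupplier {
    max: usize,
    outstanding: AtomicUsize,
    next_id: AtomicUsize,
}

impl FixedSizeSupplier {
    pub fn new(max: usize) -> Self {
        Self {
            max,
            outstanding: AtomicUsize::new(0),
            next_id: AtomicUsize::new(0),
        }
    }
}

impl SlotSupplierSketch for FixedSizeSupplier {
    fn try_reserve_slot(&self) -> Option<SlotPermit> {
        // Optimistically claim a slot, backing out if we exceeded the max.
        let prev = self.outstanding.fetch_add(1, Ordering::SeqCst);
        if prev >= self.max {
            self.outstanding.fetch_sub(1, Ordering::SeqCst);
            return None;
        }
        Some(SlotPermit { id: self.next_id.fetch_add(1, Ordering::SeqCst) })
    }

    fn release_slot(&self, _permit: SlotPermit) {
        self.outstanding.fetch_sub(1, Ordering::SeqCst);
    }
}
```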
all-sdk/autotuning.md (outdated)
```rust
/// Blocks until a slot is available. In languages with explicit cancel mechanisms, this should be cancellable and
/// return a boolean indicating whether a slot was actually obtained or not. In Rust, the future can simply
/// be dropped if the reservation is no longer desired.
async fn reserve_slot(&self, ctx: &dyn SlotReservationContext<Self::SlotKind>);
```
Can we add a cancellation mechanism to this? (I forget if Rust implicitly supports cancellation of async like Python, but this at least needs to be called out for other langs)
The comment addresses exactly that. There's no need for an explicit cancel in Rust.
```rust
/// Called when deciding what workflow should be evicted from the cache. May return the run id of a workflow to evict.
/// If None, or an invalid run id is returned, the default LRU eviction strategy is used. The id returned must
/// be one of the workflows provided in `evictable` (the ones for which there is no actively processing WFT).
fn eviction_hint(&self, evictable: &[WorkflowSlotInfo]) -> Option<&str>;
```
^ I still think we should consider this simplified form that just lets the user decide whether it should evict upon cache request. Or maybe something like:

```go
type CacheController interface {
	ShouldCache(WorkflowInfo) CacheResult
}

const (
	CacheAsNew CacheResult = iota
	EvictThenCache
	NoCache
)
```
````markdown
}
```

## A neat example
````
This is neat; many of our users keep asking how they can limit activities, so being able to acquire a zookeeper/etcd lock for specific activity types is helpful. But we're going to have to provide helpers if we want to advertise support for this; it's too complicated for most users.
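As a sketch of what such a helper might wrap, here is a purely local per-activity-type limiter. A real one would consult zookeeper/etcd rather than an in-process counter, and every name here (`PerActivityTypeLimiter`, its methods) is hypothetical:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Caps concurrent executions per activity type; a local stand-in for the
/// zookeeper/etcd-backed locking discussed above.
pub struct PerActivityTypeLimiter {
    limits: HashMap<String, usize>,
    in_use: Mutex<HashMap<String, usize>>,
}

impl PerActivityTypeLimiter {
    pub fn new(limits: HashMap<String, usize>) -> Self {
        Self { limits, in_use: Mutex::new(HashMap::new()) }
    }

    /// Nonblocking reservation keyed on activity type, mirroring how
    /// `try_reserve_slot` could use type info from the reservation context.
    pub fn try_reserve(&self, activity_type: &str) -> bool {
        let mut in_use = self.in_use.lock().unwrap();
        let used = in_use.entry(activity_type.to_string()).or_insert(0);
        // Activity types without a configured limit are unlimited here.
        let limit = self.limits.get(activity_type).copied().unwrap_or(usize::MAX);
        if *used < limit {
            *used += 1;
            true
        } else {
            false
        }
    }

    pub fn release(&self, activity_type: &str) {
        let mut in_use = self.in_use.lock().unwrap();
        if let Some(used) = in_use.get_mut(activity_type) {
            *used = used.saturating_sub(1);
        }
    }
}
```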
all-sdk/autotuning.md (outdated)
```rust
///
/// Users' implementation of this can choose to emit metrics, or otherwise leverage the information provided by the
/// `info` parameter to be better able to make future decisions about whether a slot should be handed out.
fn mark_slot_used(&self, info: <Self::SlotKind as SlotKind>::Info<'_>);
```
After looking at the "neat example" below, it's clear this can be a long-blocking call. It should be documented as such, made async, allow cancellation/interruption, and maybe change the name. "Marking" isn't usually considered a blocking operation. Maybe `apply_slot` or `start_slot` or something.
Well, I say it could be, but I also explain why that's not necessary and you can defer the async-ness to the reservation. Ideally I think you'd keep this as nonblocking. Places where it gets used really, really shouldn't have to sit and wait.
Hrmm, yeah, I first thought the "neat example" blocked in this call, but I see it doesn't. It means slot reservation may be unnecessarily expensive for things that never actually use the slot. Basically, remote semaphores end up gating polling instead of usage. But this is just a product of how we obtain work from the server. We should at least document that people should not block in "mark used", since we don't give them async to do so.

Though now I wonder if someone would want to separate their remote polling semaphores from their usage semaphores. It makes sense in cases where you want one-activity-running-anywhere-globally. So maybe the API should allow that with async and just document that "mark used" is subject to task timeout. Then again, I guess that can live in the activity. Now I'm starting to think about providing contextual data from the slot supplier to the activity impl; I'm overthinking for sure.
I feel pretty strongly that marking used can't ever block at all (and yeah, I can mention that in the comment).

The reasoning is that the usage already happened, and the implementer has no choice but to accept that fact. If that then means they take a long time to hand out the next permit, that's acceptable, but they can't block on mark used because (a) it gums things up, and (b) they can't change anything anyway; it's too late.
Works for me (can consider this resolved but can leave discussion open for others if you want). I do think release should be allowed to be async though.
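One way an implementation could honor "mark used is synchronous, release may be slow" is to hand releases off to a background worker, keeping the hot path nonblocking. A minimal local sketch, with hypothetical names and a plain thread standing in for slow remote etcd/zookeeper calls:

```rust
use std::sync::mpsc;
use std::thread;

/// Keeps `mark_slot_used` and `release_slot` nonblocking by deferring the
/// slow part of release (e.g. revoking a remote lease) to a worker thread.
pub struct DeferredReleaser {
    tx: mpsc::Sender<u64>, // slot ids queued for release
    handle: thread::JoinHandle<Vec<u64>>,
}

impl DeferredReleaser {
    pub fn new() -> Self {
        let (tx, rx) = mpsc::channel::<u64>();
        // Worker thread: in a real supplier this is where remote
        // release calls would happen, off the task-processing path.
        let handle = thread::spawn(move || {
            let mut released = Vec::new();
            for slot_id in rx {
                released.push(slot_id);
            }
            released
        });
        Self { tx, handle }
    }

    /// Must return immediately: the usage already happened.
    pub fn mark_slot_used(&self, _slot_id: u64) {}

    /// Also returns immediately; the slow work runs on the worker thread.
    pub fn release_slot(&self, slot_id: u64) {
        let _ = self.tx.send(slot_id);
    }

    /// Drain the worker and return everything it released (for the example).
    pub fn shutdown(self) -> Vec<u64> {
        let DeferredReleaser { tx, handle } = self;
        drop(tx); // closing the channel ends the worker loop
        handle.join().unwrap()
    }
}
```

An SDK exposing an async `release_slot` could do the same thing internally, just with a task instead of a thread.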
all-sdk/autotuning.md (outdated)
```rust
/// Frees a slot. This is always called when a slot is no longer needed, even if it was never marked as used.
/// EX: If the poller reserves a slot, and then receives an invalid task from the server for whatever reason, this
/// method would be called with [SlotReleaseReason::Error].
fn release_slot(&self, info: SlotReleaseReason);
```
If people are allowed to use remote tooling like zookeeper/etcd, this should be async
all-sdk/autotuning.md (outdated)
```rust
async fn reserve_slot(&self, ctx: &dyn SlotReservationContext<Self::SlotKind>);

/// Tries to immediately reserve a slot, returning true if a slot is available.
fn try_reserve_slot(&self, ctx: &dyn SlotReservationContext<Self::SlotKind>) -> bool;
```
If people are allowed to use remote tooling like zookeeper/etcd, this should be async
LGTM!
````markdown
W->>S: `release_slot`
```

## Default implementations
````
This section is important - could we move it up higher? Given these are our proposed defaults, I'd like more eyeballs on this sooner.

````markdown
## Drawbacks

* Gives users a big footgun. This is mitigated by the fact that most users are probably just going to use one of our
````
Anything we can do to mitigate? For example, letting users fallback to a default implementation if something goes wrong?
Not really possible - there's no way to distinguish between the user's implementation intentionally not handing out slots and it not handing out slots because it's broken.
Rendered: https://github.com/Sushisource/proposals/blob/worker-autotune/all-sdk/autotuning.md