Added local queue scheduling and "next_task" optimization #22
Conversation
…om being put there
…t_task field cannot be stolen. This can lead to deadlocks because `run()` is a future that may not be constantly polled, and thus there's no guarantee that local queues will make progress.
@nullchinchilla Would you be open to rebasing/cutting down on this PR? These optimizations are important and I would be open to reviewing it.
I've actually decided on a different course, since I've realized that local scheduling and an unstealable `next_task` cell can cause issues (such as deadlocks if we nest `smol::block_on`). You can check my latest executor work in the "smolscale" crate, which uses `smol::Task` as well and is fully compatible with the smol-rs ecosystem, but, to be easier to optimize, forces a global executor.
Thanks for letting me know! I think that this crate should act more as a "reference" executor that aims to implement features rather than be as optimal as possible. I'll close this for now.
Two major changes significantly improve performance:

- When `Executor::run()` is called, a handle to the local queue and ticker is cached into TLS. This lets tasks schedule onto a thread-local queue rather than always going to the global queue.
- A `next_task` optimization (see https://tokio.rs/blog/2019-10-scheduler) greatly reduces context-switch costs in message-passing patterns. We avoid putting the same task into `next_task` twice in a row to prevent starvation.

Through both unit testing and production deployment in https://github.com/geph-official/geph4, whose QUIC-like `sosistab` protocol is structured in an actor-like fashion that greatly stresses the scheduler, I see significant improvements in real-world throughput (up to 30%, and this is on a server dominated by cryptography CPU usage) and massive improvements in microbenchmarks (up to 10x faster in the `yield_now` benchmark and similar context-switch benchmarks). I see no downsides: the code should gracefully fall back to pushing to the global queue in case e.g. nesting `Executor`s invalidates the TLS cache.

I also added `criterion` benchmarks.
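The two scheduling mechanisms described above can be sketched roughly as follows. This is a minimal, single-threaded illustration, not the PR's actual code: the names `TaskId`, `LocalQueue`, `schedule`, and `next_runnable` are hypothetical, the real executor stores runnable task handles rather than integers, and it uses concurrent work-stealing queues instead of a `Mutex`-guarded deque.

```rust
use std::cell::RefCell;
use std::collections::VecDeque;
use std::sync::Mutex;

// Stand-in for a runnable task handle.
type TaskId = u64;

// Global injector queue shared by all workers (sketch: a Mutex'd deque).
static GLOBAL: Mutex<VecDeque<TaskId>> = Mutex::new(VecDeque::new());

thread_local! {
    // Per-worker state cached into TLS while `Executor::run()` is active.
    // `None` means the cache is invalid (e.g. nested executors), in which
    // case scheduling falls back to the global queue.
    static LOCAL: RefCell<Option<LocalQueue>> = RefCell::new(None);
}

struct LocalQueue {
    // LIFO "next task" slot: the most recently woken task runs next,
    // which keeps caches hot in message-passing ping-pong patterns.
    next_task: Option<TaskId>,
    // FIFO queue for everything else on this worker.
    fifo: VecDeque<TaskId>,
    // The task last placed into `next_task`; refusing to re-insert the
    // same task twice in a row prevents it from starving the FIFO.
    last_next: Option<TaskId>,
}

fn schedule(task: TaskId) {
    LOCAL.with(|slot| {
        match slot.borrow_mut().as_mut() {
            // TLS cache valid: prefer the next_task slot, then the FIFO.
            Some(local) => {
                if local.next_task.is_none() && local.last_next != Some(task) {
                    local.next_task = Some(task);
                    local.last_next = Some(task);
                } else {
                    local.fifo.push_back(task);
                }
            }
            // No cached local queue: gracefully fall back to the global queue.
            None => GLOBAL.lock().unwrap().push_back(task),
        }
    });
}

fn next_runnable() -> Option<TaskId> {
    LOCAL.with(|slot| {
        if let Some(local) = slot.borrow_mut().as_mut() {
            // Drain next_task first, then the local FIFO.
            if let Some(t) = local.next_task.take() {
                return Some(t);
            }
            if let Some(t) = local.fifo.pop_front() {
                return Some(t);
            }
        }
        // Local queues empty or uncached: pull from the global queue.
        GLOBAL.lock().unwrap().pop_front()
    })
}
```

Under this sketch, waking task 1, then task 1 again, then task 2 yields the order 1, 1, 2: the second wake of task 1 is diverted to the FIFO by the `last_next` check, so task 2 cannot be starved by a task that keeps rescheduling itself into the hot slot.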