executor: rewrite the work-stealing thread pool #1657
Conversation
This patch is a ground-up rewrite of the existing work-stealing thread pool. The goal is to reduce overhead while simplifying code when possible. At a high level, the following architectural changes were made:

- The local run queues were switched to bounded circular buffer queues.
- Cross-thread synchronization was reduced.
- Task constructs were refactored to use a single allocation and always include a join handle (#887).
- The logic for putting workers to sleep and waking them up was simplified.

Local run queues

Move away from crossbeam's implementation of the Chase-Lev deque. That implementation carried unnecessary overhead because it supports capabilities that the work-stealing thread pool does not need. Instead, a fixed-size circular buffer is used for the local queue. When the local queue is full, half of the tasks it contains are moved to the global run queue.

Reduce cross-thread synchronization

This is done via many small improvements. Primarily, an upper bound is placed on the number of concurrent stealers; limiting the number of stealers results in lower contention. Secondly, the rate at which workers are notified and woken up is throttled, which also reduces contention by preventing many threads from racing to steal work.

Refactor the task structure

Now that Tokio is able to target a Rust version that supports `std::alloc` as well as `std::task`, the pool can optimize how the task structure is laid out. A single allocation per task is now required, and a join handle is always provided, enabling the spawner to retrieve the result of the task (#887).

Simplify logic

Where possible, complexity is reduced in the implementation by using locks and other simpler constructs in cold paths. The set of sleeping workers is now represented as a `Mutex<VecDeque<usize>>`. Instead of optimizing access to this structure, we reduce how often the pool must access it.

Finally, `threadpool::blocking` has been (temporarily) removed. This capability will come back later, but the original implementation was more complicated than necessary.
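For readers skimming the description, here is a rough, self-contained sketch of the overflow policy described above for the local run queues. It is illustrative only: the actual queue in this PR is a fixed-size, lock-free circular buffer, whereas this sketch uses `VecDeque` and a `Mutex` purely to show when tasks spill into the global injection queue.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

/// Illustrative stand-in for the fixed-size local queue: when it fills up,
/// half of its tasks are moved to the shared injection (global) queue.
struct LocalQueue<T> {
    buffer: VecDeque<T>, // stand-in for the fixed-size circular buffer
    capacity: usize,
}

impl<T> LocalQueue<T> {
    fn new(capacity: usize) -> Self {
        LocalQueue { buffer: VecDeque::with_capacity(capacity), capacity }
    }

    fn push(&mut self, task: T, injection: &Mutex<VecDeque<T>>) {
        if self.buffer.len() == self.capacity {
            // Local queue is full: move half of the queued tasks to the
            // global injection queue so other workers can pick them up.
            let mut global = injection.lock().unwrap();
            for _ in 0..self.capacity / 2 {
                if let Some(t) = self.buffer.pop_front() {
                    global.push_back(t);
                }
            }
        }
        self.buffer.push_back(task);
    }

    fn pop(&mut self) -> Option<T> {
        self.buffer.pop_front()
    }
}

fn main() {
    let injection = Mutex::new(VecDeque::new());
    let mut local = LocalQueue::new(4);

    for task in 0..6 {
        local.push(task, &injection);
    }

    // Two tasks spilled into the injection queue when the local queue filled.
    assert_eq!(injection.lock().unwrap().len(), 2);
    assert_eq!(local.pop(), Some(2));
}
```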
This is great overall and the associated blog post is a really good explanation. I commented on some non-blocking nits.
I tried out the branch with some tokio-postgres benchmarks and hit this panic:
@sfackler can you give me repro steps?
I hit it running the benchmarks. It seems like you can probably run into it by just dropping a runtime with a spawned future still alive, though.
@sfackler thanks... looking at your backtrace, it looks like there is a bug when the reactor notifies tasks (which is internal to the thread pool) while dropping... I don't think I thought of testing that case. ❤️ the QA
This is probably not super important, but let's make it technically correct.
Co-Authored-By: Doug Wilson <[email protected]>
Co-Authored-By: Eliza Weisman <[email protected]>
Co-Authored-By: Eliza Weisman <[email protected]>
Co-Authored-By: Taiki Endo <[email protected]>
Co-Authored-By: Taiki Endo <[email protected]>
I have reproed @sfackler's reported bug. It is indeed a shutdown bug. That is what I get for not using loom to test the shutdown process :'( Adding a loom shutdown test catches it.
I’m sorry to barge in like that. I’d just like to cross-post what I wrote on Reddit over here, where it might be a better fit. 🙂 Nice work! In particular I’d like to mention the
@lhecker thanks for the references. There are definitely still areas to improve on in this PR. I’ll be taking a look. Feel free to leave whatever other feedback you might have.
When shutting down the scheduler, the first step is to close the global (injection) queue. Once this is done, it is possible for a task to be notified from *outside* of a scheduler thread. This will attempt to push the task onto the injection queue but fail. When this happens, the task must be explicitly shut down or bad things will happen.
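To make the race described in this commit message concrete, here is a minimal, self-contained loom model of the same shape of problem: one thread closing a queue while another thread tries to push into it. This is not the test added in this PR, and `Injection` here is a stand-in type, not Tokio's injection queue.

```rust
// Minimal loom model: a "close" path racing with a "push" path.
use loom::sync::{Arc, Mutex};
use loom::thread;

struct Injection {
    // `None` means the queue has been closed as part of shutdown.
    queue: Mutex<Option<Vec<u32>>>,
}

#[test]
fn close_races_with_push() {
    loom::model(|| {
        let inj = Arc::new(Injection {
            queue: Mutex::new(Some(Vec::new())),
        });

        let pusher = {
            let inj = inj.clone();
            thread::spawn(move || {
                let mut q = inj.queue.lock().unwrap();
                match q.as_mut() {
                    // Normal path: the queue is still open.
                    Some(q) => q.push(1),
                    // The queue was closed first: in the real scheduler this
                    // is where the task must be explicitly shut down instead
                    // of being silently dropped.
                    None => {}
                }
            })
        };

        // Shutdown path: close the injection queue.
        *inj.queue.lock().unwrap() = None;

        pusher.join().unwrap();
    });
}
```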
@sfackler I fixed the bug you reported. By adding a loom test, I found some additional bugs related to shutdown. I applied a temporary fix, but it needs some cleanup.
pub(super) executor: CausalCell<Option<NonNull<S>>>,

/// Pointer to next task, used for misc task linked lists.
pub(crate) queue_next: UnsafeCell<*const Header<S>>,
Should this be changed to an `Option<NonNull<Header<S>>>` as well, if only for consistency with the owned list?
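For anyone following along, the two representations are interchangeable (and `Option<NonNull<T>>` has the same size as `*const T` thanks to the null niche); the `Option<NonNull<_>>` form mainly encodes the null check in the type. A standalone illustration, not code from this PR, with `Header` as a placeholder struct:

```rust
use std::ptr::NonNull;

// Placeholder type for illustration only; not Tokio's task header.
struct Header {
    id: usize,
}

fn to_option(raw: *const Header) -> Option<NonNull<Header>> {
    // `NonNull::new` returns `None` for a null pointer, so "is this null?"
    // becomes explicit at the type level.
    NonNull::new(raw as *mut Header)
}

fn to_raw(ptr: Option<NonNull<Header>>) -> *const Header {
    ptr.map_or(std::ptr::null(), |p| p.as_ptr() as *const Header)
}

fn main() {
    let header = Header { id: 1 };
    let opt = to_option(&header as *const Header);
    assert!(opt.is_some());
    assert!(to_option(std::ptr::null()).is_none());
    let _raw = to_raw(opt);
}
```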
let (pool, mut w, mock_park) = pool!(!2);
(pool, w.remove(0), w.remove(0), mock_park)
}};
(! $n:expr) => {{
TIOLI: I think typically @ is used in macros for "inner" match arms?
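For readers not familiar with the convention being referenced: prefixing a macro's internal rules with `@` keeps them from colliding with the public entry points. A generic illustration of the idiom, unrelated to the `pool!` macro itself:

```rust
macro_rules! count {
    // Internal rules are prefixed with `@accum` so callers never match them
    // by accident; this is the convention mentioned above.
    (@accum $acc:expr;) => { $acc };
    (@accum $acc:expr; $head:expr $(, $tail:expr)*) => {
        count!(@accum $acc + 1; $($tail),*)
    };
    // Public entry point.
    ($($item:expr),*) => {
        count!(@accum 0; $($item),*)
    };
}

fn main() {
    assert_eq!(count!(), 0);
    assert_eq!(count!(1, 2, 3), 3);
}
```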
for (_, join) in map.iter_mut() {
    let mut cx = Context::from_waker(noop_waker_ref());
    match Pin::new(join).poll(&mut cx) {
Nit/TIOLI: Could be simplified to
if let Ready(n) = Pin::new(join).poll(&mut cx) {
    num = Some(n.unwrap());
    break;
}
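For completeness, a compilable stand-alone version of that suggestion, using the `futures` crate; the `ready` futures here are stand-ins for the actual join handles in the test:

```rust
use std::collections::HashMap;
use std::pin::Pin;
use std::task::{Context, Poll::Ready};

use futures::future::{ready, Ready as ReadyFuture};
use futures::task::noop_waker_ref;
use futures::Future;

fn main() {
    // Stand-in for the map of join handles in the test.
    let mut map: HashMap<u32, ReadyFuture<Result<u32, ()>>> = HashMap::new();
    map.insert(0, ready(Ok(42)));

    let mut num = None;
    let mut cx = Context::from_waker(noop_waker_ref());

    for (_, join) in map.iter_mut() {
        // `if let` replaces the `match` with a single-armed pattern.
        if let Ready(n) = Pin::new(join).poll(&mut cx) {
            num = Some(n.unwrap());
            break;
        }
    }

    assert_eq!(num, Some(42));
}
```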
Co-Authored-By: Eliza Weisman <[email protected]>
Co-Authored-By: Eliza Weisman <[email protected]>
This article goes into details of the implementation and would be helpful reading when reviewing this PR: https://tokio.rs/blog/2019-10-scheduler/
Results

The thread pool benchmarks have improved significantly (old thread pool vs. new thread pool).

Real-world benchmarks improve significantly as well. This is testing the hyper hello world server using `wrk -t1 -c50 -d10` (old scheduler vs. new scheduler).