Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DNM: Fix various Threading() bugs #761

Merged
merged 11 commits into from
Jul 26, 2024
Merged

DNM: Fix various Threading() bugs #761

merged 11 commits into from
Jul 26, 2024

Conversation

odow
Copy link
Owner

@odow odow commented Jul 24, 2024

No description provided.

@odow odow force-pushed the od/threaded-default branch 2 times, most recently from f1661e6 to 420e1e6 Compare July 25, 2024 02:42
Comment on lines 365 to 384
Threads.@spawn begin
try
while keep_iterating
result = iteration(model, options)
lock(options.lock) do
options.post_iteration_callback(result)
log_iteration(options)
return
end
if result.has_converged
lock(convergence_lock) do
keep_iterating = false
status = result.status
return
end
end
end
finally
lock(convergence_lock) do
keep_iterating = false

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand the intent here correctly, you intend to iterate until one of the threads has either converged or errored? When the first thread to finish finishes it will set result.has_converged to True and the others will check this before their next iteration.

I believe there are edge cases where this would not happen, I have written the below as if it is absolute truth to keep it brief, but add the caveat here - I dont fully understand the julia memory model.

The TLDR is that you are reading keep_iterating from many threads and setting it from another without protecting this access with a shared lock.

In general with multi threaded code, there is a notion of happens before relationships which defines a relationship about what changes made by another thread are visible. These happens before relationships are provided by synronisation primitives like locks.

Most code running in different threads is not ordered, if there are two threads A & B, if thread A changes a global variable x from value true to false, then without synchronisation, regardless of the passage of wall clock time, it is not defined whether thread B will read x as true or false. One of the reasons for this is that A has written this change back to main memory, but B may be reading the value from a cache (say L1 cache).

If thread A aquires a lock, and then changes x to false, and then releases the lock, B still may read either value, unless it has aquired the SAME lock as A, in which case you can say the read of x from thread B happened after the write from thread A, and so x will be the value set by A.

Given all of this, and that the setting of keep_iterating and status take next to no time, I think it is best to remove the convergence lock and simply use options.lock.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I'm aware of this. I had it as-is because it doesn't really matter if we read the "wrong" value of keep_iterating.

src/algorithm.jl Outdated
if VERSION >= v"1.7"
first_thread = first(Threads.threadpooltids(Threads.threadpool()))
end
if Threads.threadid() == first_thread

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this relevant here https://julialang.org/blog/2023/07/PSA-dont-use-threadid/

I believe that Threads.threadid() could change throughout the course of a threads execution and so multiple threads or no threads may have the id first_thread at this point.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't particularly care about this one, since it is only for profiling with TimerOutputs. Perhaps I should just turn this off if nthreads() > 1.

src/algorithm.jl Outdated
@@ -302,11 307,13 @@ function attempt_numerical_recovery(model::PolicyGraph, node::Node)
end
if !_has_primal_solution(node)
model.ext[:numerical_issue] = true
@info "Writing cuts to the file `model.cuts.json`"
write_cuts_to_file(model, "model.cuts.json")
suffix = Threads.threadid() == 1 ? "" : "_pid=$(Threads.threadid())"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, I dont think this is safe, you could end up with multiple threads writing to the same file in either an overlapping or overwritting fashion. Maybe just a uuid instead of threadid would be safer?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, tasks can be moved between threads, but if two tasks are on the same thread, they can't be running concurrently? But I guess the risk is that the task might be swapped between this line and the next?

Regardless, a UUID is a better idea.

src/algorithm.jl Outdated
has_converged,
status,
cuts,
model.ext[:numerical_issue],
Copy link

@dannyopts dannyopts Jul 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could be writing to model.ext[:numerical_issue] from other threads at this point (for example at the top of iteration you write to this field without the options.log), also I think you could have overwritten the state from another thread between it being set to true within forward or backward pass and the time that you return it here.

I think something similar is true for model.ext[:total_solves] which you write to in solve_subproblem without holding the same shared lock across all threads.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can fix total_solves easily.

The :numerical_issue is a bit more complicated, but to be honest, it doesn't really matter. It's just for reporting a very subtle thing in the log.

@@ -302,11 307,13 @@ function attempt_numerical_recovery(model::PolicyGraph, node::Node)
end
if !_has_primal_solution(node)
model.ext[:numerical_issue] = true

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need a lock here before writing to the model?

src/algorithm.jl Outdated
@@ -1399,7 1433,7 @@ function simulate(
custom_recorders = Dict{Symbol,Function}(),
duality_handler::Union{Nothing,AbstractDualityHandler} = nothing,
skip_undefined_variables::Bool = false,
parallel_scheme::AbstractParallelScheme = Serial(),
parallel_scheme::AbstractParallelScheme = Threaded(),

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docs still say "Defaults to [Serial]"

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "DNM" in the title means "Do Not Merge" 😄 I don't plan to make this the default anytime soon, so I'll be reverting this line before merging.

@dannyopts
Copy link

Hey, I hope you dont mind me putting some comments on. Hopefully they are useful, if not then applogies :)

Copy link

codecov bot commented Jul 26, 2024

Codecov Report

Attention: Patch coverage is 97.75281% with 4 lines in your changes missing coverage. Please review.

Project coverage is 93.40%. Comparing base (f5d24b7) to head (dff315d).

Files Patch % Lines
src/algorithm.jl 96.33% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #761       /-   ##
==========================================
- Coverage   93.40%   93.40%   -0.01%     
==========================================
  Files          27       27              
  Lines        3458     3486       28     
==========================================
  Hits         3230     3256       26     
- Misses        228      230        2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@odow odow changed the title DNM: test default threading DNM: Fix various Threading() bugs Jul 26, 2024
@odow
Copy link
Owner Author

odow commented Jul 26, 2024

Merging for now as an improvement. I need to fix some other things, but I'll do them in separate PRs.

@odow odow merged commit 73dd799 into master Jul 26, 2024
6 of 7 checks passed
@odow odow deleted the od/threaded-default branch July 26, 2024 04:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants