Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Allow NUTS to do eager evaluation on forward and backward trajectory in parallel #3103

Open
wants to merge 11 commits into
base: develop
Choose a base branch
from

Conversation

SteveBronder
Copy link
Collaborator

Submission Checklist

  • Run unit tests: ./runTests.py src/test/unit
  • Run cpplint: make cpplint
  • Declare copyright holder and open-source license: see below

Summary

This is a WIP to allow NUTS to perform calculation of the forward and backwards trajectory in parallel.

Intended Effect

How to Verify

Side Effects

Documentation

Copyright and Licensing

Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company): Sebastian Weber and Steve Bronder

By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:

Copy link
Collaborator Author

@SteveBronder SteveBronder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wds15 had a couple Qs on the impl, it"s extremely cool! I"ve always wanted to dive into the tbb graph stuff more and how you did things here helped that stuff make a lot more since. Big favor, but could you write out the actual graph? Like with a pen and paper, google slide, latex, whatever is most convenient for you. I"m just worried I"m missing some key component of the graph structure and want to sanity check I"m reading this correctly

src/stan/mcmc/hmc/nuts/base_parallel_nuts.hpp Show resolved Hide resolved
Comment on lines +342 to +343
n_leapfrog += std::get<5>(subtree_result);
sum_metro_prob += std::get<6>(subtree_result);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the forward and backward update at the same time is there a chance for a race condition to happen here? This comment applies to most of the stuff here made outside of scope and assigned to. Maybe we should just have vector for these things and sum them up at the end?

Comment on lines +420 to +431
if (fwd_direction[0]) {
fwd_builder[0].try_put(tbb::flow::continue_msg());
// the first turn is fwd, so kick off the bck walker if needed
if (num_bck != 0) {
bck_builder[0].try_put(tbb::flow::continue_msg());
}
} else {
bck_builder[0].try_put(tbb::flow::continue_msg());
if (num_fwd != 0) {
fwd_builder[0].try_put(tbb::flow::continue_msg());
}
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Q] Could we just have one vector here to store all the builders? It looks like we have two because it lets us just do

tbb::flow::make_edge(*(fwd_iter - 1), *fwd_iter);

But could we just have one vector and keep track of which the index for the last forward or backward iteration. So the line that uses the edge above would turn into something like

// Before loop
tbb::concurrent_vector<tree_builder_t> builder;
builder.reserve(this->max_depth_);
// ...
        if (fwd_idx != 0) {
          // in this case this is not the starting node, we
          // connect this with its predecessor
          tbb::flow::make_edge(*(builder[fwd_idx]), *fwd_iter);
          // I _think_ this is safe as long as we reserve max_depth_ size
          //  of the vector before running 
          fwd_idx = std::distance(*(builder[0]), *fwd_iter);
        }

checks.back());
}
if (depth != 0) {
tbb::flow::make_edge(checks[depth - 1], checks.back());
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If each check is dependent on one another and each check is also dependent on the forward or backward iteration, are we essentially doing all these checks serially? If so could we just make it a function_node with a concurrency of 1?

double accept_prob
= sum_metro_prob / static_cast<double>(this->n_leapfrog_);

this->z_.ps_point::operator=(z_sample);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: Would be nice to have a move operator for ps point so this doesn"t make a copy

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also doesn"t need an explicit static cast. C++ automatically promotes the lower type to the higher type in binary arithmetic.

And shouldn"t we be able to write line 447 as this->z.ps_point = z_sample; or even just z.ps_point = z_sample;? I thought that was the whole point of implementing operator=().

@wds15
Copy link
Contributor

wds15 commented Feb 17, 2022

I think I did pencil the graph structure down somewhere…let me search for it.

@bob-carpenter
Copy link
Contributor

I only saw this in the email preview (in the same file as @SteveBronder is commenting on), I"d have coded this

fwd_idx == 0 ? true : valid_subtree_fwd[fwd_idx - 1];

as

fwd_idx == 0 || valid_subtree_fwd[fwd_idx - 1];

It"s the same behavior because of the short-circuiting of ||.

@wds15
Copy link
Contributor

wds15 commented Feb 18, 2022

Here is what my iPad has in my notes app on this:

Screenshot 2022-02-18 at 14 33 36

(ignore for now the red arrows..I explain them below)

So you see the left and the right legs of the trajectory which have yellow path"s. Each blob is a point on the trajectory where something happens. Specifically at these points the state is being send along the arrows to the middle process in black, which is the "abort checker process". The blue and green blobs fire off that we explore 2^n (n being the current depth) steps deeper the trajectory in that direction. The checker process in the middle recieves the states which must be compared to evaluate the U-turn criterion.

The idea is that you lay out the full tree to it"s maximum depth first (putting lambda"s into the leaves doing the work when called) and then you let the TBB run the tree for you in full automation. The execution is canceled once the U-turn criterion is met. At that stage execution of the tree is stopped and the identified state is returned.

If I recall correctly, the red connections are only wired up if you want a serial execution of things. So I intentionally wire-up the graph differently in the case of running this sequentially. This is to help the TBB figure out easier what to do when running the graph. I recall that this was good for single-core performance.

Ok... now that"s what is behind this. Does this help already or should I try to annotate the source with reference to the figure?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants