Implement `xm.rendezvous` with XLA collective communication #4181
Conversation
Force-pushed from 78ecc43 to 585982d
I tested this on v4-8 and v4-4096 and it seems to work successfully on both accelerator types.
Anything in the PJRT README we should update?

I'll update the README after #4193.
We have found that `gloo` doesn't scale effectively to large pod sizes, and it's not easily possible to use `torch.distributed` in a multithreaded context such as TPU v3. `xm.rendezvous` will now call `xm.mark_step` to sync results from XLA. Also remove initialization based on `XRT_MESH_SERVICE_ADDRESS`, since host 0 is not predictable anyway.

`xm.rendezvous` must be called from all replicas per XLA requirements: "Computing the result of AllReduce requires having one input from each replica, so if one replica executes an AllReduce node more times than another, then the former replica will wait forever."
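The quoted requirement is the key behavioral change for callers: a rendezvous guarded by a rank check will now hang, because the replica that skips the call never contributes its collective input. A minimal thread-based sketch of the safe pattern, using `threading.Barrier` as a stand-in for the XLA collective (names here are illustrative, not torch_xla APIs):

```python
import threading

WORLD = 4                                # simulated replica count
barrier = threading.Barrier(WORLD)       # stand-in for the collective rendezvous
log = []

def replica(rank):
    if rank == 0:
        log.append("rank 0 did the work")   # only one replica does the work...
    # WRONG: guarding the rendezvous itself (`if rank == 0: barrier.wait()`)
    # would hang forever, since the collective needs one input per replica.
    barrier.wait()                          # ...but ALL replicas join the collective
    # past this point, every replica knows rank 0's work is done

threads = [threading.Thread(target=replica, args=(r,)) for r in range(WORLD)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(log)  # ['rank 0 did the work']
```

Branch on rank for the work itself, but make the rendezvous call unconditional on every replica.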
This covers the vast majority of the usage of `rendezvous` in our experience.

Tested manually on a TPU v4-8 with 1 process and 4 threads to simulate a v3.
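For readers unfamiliar with the semantics being preserved here: `xm.rendezvous` blocks until all participants arrive and returns each participant's payload to every caller, i.e. all-gather semantics over byte payloads. A toy thread-based model of that contract (the `ToyRendezvous` class is illustrative only, not the torch_xla implementation):

```python
import threading

class ToyRendezvous:
    """Toy model of a collective-backed rendezvous: each participant
    contributes a payload, blocks until everyone has arrived, and gets
    back the ordered list of all payloads (all-gather semantics)."""

    def __init__(self, world_size):
        self.payloads = [None] * world_size
        self.barrier = threading.Barrier(world_size)

    def rendezvous(self, rank, payload=b""):
        self.payloads[rank] = payload   # contribute this replica's bytes
        self.barrier.wait()             # block until every replica arrives
        return list(self.payloads)      # every replica sees every payload

WORLD = 4
rdv = ToyRendezvous(WORLD)
results = [None] * WORLD

def worker(rank):
    results[rank] = rdv.rendezvous(rank, f"replica {rank}".encode())

threads = [threading.Thread(target=worker, args=(r,)) for r in range(WORLD)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results[0])  # [b'replica 0', b'replica 1', b'replica 2', b'replica 3']
```

In the multithreaded TPU v3 case, this is exactly the shape `torch.distributed`-based rendezvous could not provide, which motivates backing it with an XLA collective instead.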