Implement xm.rendezvous with XLA collective communication #4181

Merged
merged 6 commits into from
Nov 16, 2022

Conversation

will-cromar
Collaborator

@will-cromar will-cromar commented Nov 9, 2022

We have found that gloo doesn't scale effectively to large pod sizes, and it's not easily possible to use torch.distributed in a multithreaded context such as TPU v3.

  • xm.rendezvous will now call xm.mark_step to sync results from XLA.
  • Support multithreaded contexts like TPU v2/v3.
  • Don't initialize an XLA process group in xm.rendezvous. Also remove initialization based on XRT_MESH_SERVICE_ADDRESS since host 0 is not predictable anyway.
  • Require that the user call xm.rendezvous from all replicas, per XLA requirements: Computing the result of AllReduce requires having one input from each replica, so if one replica executes an AllReduce node more times than another, then the former replica will wait forever. This covers the vast majority of the usage of rendezvous in our experience.
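
To illustrate the last point, here is a minimal sketch of the "every replica must participate" requirement. It does not use torch_xla at all: a `threading.Barrier` stands in for the blocking collective, and `run_replicas`, `skip_ordinal`, and `timeout` are hypothetical names made up for this illustration. A real collective would wait forever when a replica is missing; the simulation uses a timeout so the effect is visible without hanging.

```python
import threading


def run_replicas(num_replicas, skip_ordinal=None, timeout=0.5):
    """Simulate replicas meeting at a rendezvous point.

    Each replica runs in its own thread. The barrier only releases once
    every replica has arrived, which mimics a collective that needs one
    input from each replica. Returns one outcome string per replica.
    """
    barrier = threading.Barrier(num_replicas)
    outcomes = [None] * num_replicas

    def replica(ordinal):
        if ordinal == skip_ordinal:
            # This replica never reaches the rendezvous.
            outcomes[ordinal] = "skipped"
            return
        try:
            # Blocks until all replicas arrive, or the timeout expires.
            barrier.wait(timeout=timeout)
            outcomes[ordinal] = "ok"
        except threading.BrokenBarrierError:
            # Raised in every waiting thread once one of them times out.
            outcomes[ordinal] = "timed out"

    threads = [
        threading.Thread(target=replica, args=(i,)) for i in range(num_replicas)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcomes


if __name__ == "__main__":
    # All four replicas call the rendezvous: everyone proceeds.
    print(run_replicas(4))
    # Replica 2 skips the call: the other three never get released.
    print(run_replicas(4, skip_ordinal=2, timeout=0.2))
```

The four-thread setup mirrors the manual test configuration mentioned in this description (1 process, 4 threads); the thread count is otherwise arbitrary.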

Tested manually on a TPU v4-8 with 1 process and 4 threads to simulate a v3.

@will-cromar will-cromar force-pushed the wcromar/xla-rendezvous branch from 78ecc43 to 585982d Compare November 10, 2022 20:53
@will-cromar will-cromar changed the title [WIP] Implement xm.rendezvous with XLA collective communication Implement xm.rendezvous with XLA collective communication Nov 10, 2022
@will-cromar will-cromar requested a review from JackCaoG November 10, 2022 21:00
@will-cromar will-cromar marked this pull request as ready for review November 10, 2022 21:00
@AlexWertheim
Contributor

I tested this on v4-8 and v4-4096, and it works on both accelerator types. Using gloo on v4-4096 resulted in "connection refused" and "connection closed by peer" errors when calling xm.rendezvous; after discussion with @will-cromar and @JackCaoG, we suspect there are limits on the number of active TCP connections between devices.

@will-cromar will-cromar requested a review from JackCaoG November 15, 2022 17:16
Collaborator

@JackCaoG JackCaoG left a comment

Anything in the PJRT README we should update?

@will-cromar
Collaborator Author

I'll update the README after #4193.

@will-cromar will-cromar merged commit a9eb345 into master Nov 16, 2022