Implement `xm.rendezvous` with XLA collective communication #4181

will-cromar · 2022-11-09T22:24:19Z

We have found that gloo doesn't scale effectively to large pod sizes, and it's not easily possible to use torch.distributed in a multithreaded context such as TPU v3.

- `xm.rendezvous` will now call `xm.mark_step` to sync results from XLA.
- Support multithreaded contexts like TPU v2/v3.
- Don't initialize an XLA process group in `xm.rendezvous`. Also remove initialization based on `XRT_MESH_SERVICE_ADDRESS`, since host 0 is not predictable anyway.
- Require that the user calls `xm.rendezvous` from all replicas, per XLA requirements: Computing the result of AllReduce requires having one input from each replica, so if one replica executes an AllReduce node more times than another, then the former replica will wait forever. This covers the vast majority of the usage of rendezvous in our experience.
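To illustrate the last requirement, here is a rough sketch in plain Python rather than torch_xla code. It assumes a toy sum all-reduce built on `threading.Barrier`; the names `SimulatedAllReduce`, `rendezvous`, and `run_replicas` are hypothetical and exist only for this example. Each thread stands in for one replica, and the timeout is an artifact of the simulation so that the demo terminates (a real collective would wait forever).

```python
import threading


class SimulatedAllReduce:
    """Toy sum all-reduce shared by `world_size` threads (not an XLA API)."""

    def __init__(self, world_size):
        self.world_size = world_size
        self._slots = [0] * world_size
        self._barrier = threading.Barrier(world_size)

    def all_reduce(self, ordinal, value, timeout=None):
        self._slots[ordinal] = value
        # Block until every replica has written its input.
        self._barrier.wait(timeout)
        total = sum(self._slots)
        # Block again so no replica starts a new round while others still read.
        self._barrier.wait(timeout)
        return total


def rendezvous(collective, ordinal, timeout=None):
    """Return True once all replicas have contributed to the reduction."""
    return collective.all_reduce(ordinal, 1, timeout) == collective.world_size


def run_replicas(world_size, participating, timeout=0.5):
    """Run one thread per participating ordinal; map ordinal -> success."""
    collective = SimulatedAllReduce(world_size)
    results = {}

    def worker(ordinal):
        try:
            results[ordinal] = rendezvous(collective, ordinal, timeout)
        except threading.BrokenBarrierError:
            # A real collective would hang here; the timeout only ends the demo.
            results[ordinal] = False

    threads = [threading.Thread(target=worker, args=(i,)) for i in participating]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_replicas(4, range(4)))  # every replica calls rendezvous
    print(run_replicas(4, range(3)))  # replica 3 never calls rendezvous
```

When all four simulated replicas call `rendezvous`, each one returns successfully. When one replica skips the call, the other three never receive its input and only return because the simulated timeout breaks the barrier.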

Tested manually on a TPU v4-8 with 1 process and 4 threads to simulate a v3.

torch_xla/experimental/pjrt.py

AlexWertheim · 2022-11-11T18:08:10Z

I tested this on v4-8 and v4-4096, and it seems to work on both accelerator types. Using gloo on v4-4096 resulted in "connection refused" and "connection closed by peer" errors when calling xm.rendezvous; after discussion with @will-cromar and @JackCaoG, we suspect there are limits on the number of active TCP connections between devices.

test/pjrt/test_mesh_service.py

JackCaoG

Anything in the PJRT README we should update?

will-cromar · 2022-11-16T19:42:03Z

I'll update the README after #4193.

ronghanghu reviewed Nov 10, 2022

torch_xla/experimental/pjrt.py (resolved)

will-cromar added 5 commits November 10, 2022 20:53

Implement pjrt.rendezvous with XLA collective ops.

3c1555c

Update tests.

9fe6013

Formatting

bd2ca0c

Handle some edge cases better

f66a6e8

formatting

585982d

will-cromar force-pushed the wcromar/xla-rendezvous branch from 78ecc43 to 585982d on November 10, 2022 20:53

will-cromar changed the title ~~[WIP] Implement xm.rendezvous with XLA collective communication~~ Implement xm.rendezvous with XLA collective communication Nov 10, 2022

will-cromar requested a review from JackCaoG November 10, 2022 21:00

will-cromar marked this pull request as ready for review November 10, 2022 21:00

will-cromar mentioned this pull request Nov 11, 2022

Experimental TPU implementation of DistributedDataParallel #4193

Merged

JackCaoG reviewed Nov 11, 2022

test/pjrt/test_mesh_service.py (resolved)

Add mesh service test to CI

9120768

will-cromar requested a review from JackCaoG November 15, 2022 17:16

JackCaoG approved these changes Nov 15, 2022

will-cromar added the runtime label Nov 16, 2022

will-cromar merged commit a9eb345 into master Nov 16, 2022
