[SPMD] Preserve parameter sharding with output data sharding #4721
Conversation
Yeah, we need at least two devices to create an HLO sharding. Added the safeguard.
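For illustration, a minimal user-level sketch of that kind of guard (the helper name is hypothetical and this is not the PR's actual safeguard, which lives in the C++ sharding code):

```python
import torch_xla.core.xla_model as xm
import torch_xla.experimental.xla_sharding as xs

def maybe_mark_sharding(tensor, mesh, partition_spec):
    # Hypothetical helper: only attach an XLA sharding annotation when
    # there are at least two devices; an HLO sharding over a single
    # device is meaningless, so leave the tensor as-is in that case.
    if len(xm.get_xla_supported_devices()) < 2:
        return tensor
    return xs.mark_sharding(tensor, mesh, partition_spec)
```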
Thanks!
LGTM, thanks!
@@ -179,7 +184,11 @@ class PjRtComputationClient : public ComputationClient {
   }

   void Assign(const Data& data) override {
     XLA_ERROR() << __FUNCTION__ << " not supported.";
Nice! We can retry the simple MpDeviceLoader hack for SPMD once this lands; this was the blocker.
[SPMD] Persist tensor sharding with XLA sharding propagation (#4721)
This addresses the same problem as #4696 with an alternative solution: we shard the replicated output while handling the computation results. This avoids a post-traversal pass that replaces the original data node with a sharded one, and is therefore more efficient. Key changes include:
- New `ShardingUtil::OutputHandler`.
- New `XLAShardingTest.OutputHandler` test for unit testing; `test_optimizer_step_with_sharding` already checks the validity of the change with a simple e2e example.
- Added `std::optional<xla::Shape>` to `ShardingSpec`.
- Added `std::optional<xla::OpSharding>` to `PjRtShardedData`.
- Added a `std::vector<XLATensor::ShardingSpecPtr>` param to `XLAGraphExecutor::ScheduleSyncTensorsGraph`, since the async function now calls `ShardingUtil::OutputHandler`.
- `XLAGraphExecutor::CollectShardingSpecs` is called before `ScheduleSyncTensorsGraph`.
- New `WrapDataShards` and `GetDataSharding` APIs in `ComputationClient`.
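As a rough end-to-end illustration of the behavior this preserves, in the spirit of `test_optimizer_step_with_sharding` (a sketch, not the PR's test code; it assumes an SPMD-enabled runtime and the `torch_xla.experimental.xla_sharding` API of this era, and uses the private `_get_xla_sharding_spec` accessor to inspect the result):

```python
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch_xla
import torch_xla.core.xla_model as xm
import torch_xla.experimental.xla_sharding as xs

device = xm.xla_device()
num_devices = len(xm.get_xla_supported_devices())

# 1D device mesh; shard the linear weight's first dimension across it.
mesh = xs.Mesh(np.arange(num_devices), (num_devices,))
model = nn.Linear(128, 64).to(device)
xs.mark_sharding(model.weight, mesh, (0, None))

optimizer = optim.SGD(model.parameters(), lr=0.1)
optimizer.zero_grad()
loss = model(torch.randn(16, 128, device=device)).sum()
loss.backward()
optimizer.step()
xm.mark_step()  # executes the graph; outputs are handled by OutputHandler

# With this change, the parameter should keep its sharding spec after the
# in-place optimizer update instead of coming back as replicated data.
print(torch_xla._XLAC._get_xla_sharding_spec(model.weight))
```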