How To Build And Run PyTorch For TPU

We also provide pre-built Docker images and wheels; if you'd like to consume those directly, refer to the Using Pre Built Releases section.

To build from source:

  • Clone the PyTorch repo as per its instructions:

    git clone --recursive https://github.com/pytorch/pytorch
    cd pytorch/
  • Clone the PyTorch/XLA repo:

    git clone --recursive https://github.com/pytorch/xla.git

Building docker image

  • We provide a Dockerfile in docker/ that you can use to build Docker images as follows:

    docker build -t torch-xla -f docker/Dockerfile .

Building with script

  • To build and install torch and torch_xla:

    xla/scripts/build_torch_wheels.sh

Building manually

  • If a file named xla/.torch_commit_id exists, use its content to check out the PyTorch commit ID:

    git checkout $(cat xla/.torch_commit_id)
  • Apply PyTorch patches:

    xla/scripts/apply_patches.sh
  • Install the Lark parser used for automatic code generation:

    pip install lark-parser
  • Currently PyTorch does not build with GCC 6.x, 7.x, and 8.x (various kinds of ICEs). Clang 7.x is known to work, so install that in your VM:

    sudo apt-get install clang-7 clang++-7
    export CC=clang-7 CXX=clang++-7

    You may need to add the following line to your /etc/apt/sources.list file:

    deb http://deb.debian.org/debian/ testing main

    And run the following command before trying again to install CLANG:

    sudo apt-get update
  • Build PyTorch from source following the regular instructions.

    python setup.py install
  • Install Bazel following the instructions. You should install version >= 0.24.1.

  • Build the PyTorch/XLA source:

    cd xla/
    python setup.py install
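
Once both builds complete, a quick check that the packages import correctly can save time before wiring up a device. This is a minimal sketch, not part of the repository's test suite:

# Sketch: verify the freshly built packages import.
# An ImportError on torch_xla usually means the build/install step above did not finish.
import torch
import torch_xla

print(torch.__version__)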

To run the tests, follow one of the options below:

  • Run on local CPU using the XRT client:

    export XRT_DEVICE_MAP="CPU:0;/job:localservice/replica:0/task:0/device:XLA_CPU:0"
    export XRT_WORKERS="localservice:0;grpc://localhost:40934"

    Select any free TCP port you prefer instead of 40934 (totally arbitrary).

  • Run on Cloud TPU using the XRT client; use one of the following:

    • Set the XRT_TPU_CONFIG environment variable:

      export XRT_TPU_CONFIG="tpu_worker;0;<IP of the TPU node>:8470"
    • Create a $HOME/.pytorch_tpu.conf file with the following content:

      worker: tpu_worker <IP of the TPU node>:8470

Note that the IP of the TPU node can change if the TPU node is reset. If PyTorch seems to hang at startup, verify that the IP of your TPU node is still the same as the one you have configured.
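
One quick way to confirm that the configured endpoint is reachable is to set the configuration and touch the device from a Python prompt. The sketch below assumes the torch_xla.core.xla_model helper module shipped with the wheels, and the IP placeholder must be replaced with your TPU node address:

# Sketch: configure XRT programmatically, then force a small computation on the device.
import os
os.environ["XRT_TPU_CONFIG"] = "tpu_worker;0;<IP of the TPU node>:8470"  # set before importing torch_xla

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
print(torch.ones(2, 2, device=device).cpu())  # this call stalls if the TPU node is unreachable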

If you are planning to build from source, and hence to use the latest PyTorch/TPU code base, it is suggested that you select the Nightly builds when you create your Cloud TPU instance.

Then run test/run_tests.sh and test/cpp/run_tests.sh to verify the setup is working.


Using Pre Built Releases

Pre Built Docker Images (recommended)

Docker images with torch and torch_xla preinstalled in the pytorch conda environment are distributed under gcr.io/tpu-pytorch/xla. These images have two types of tags, which take the form:

  • gcr.io/tpu-pytorch/xla:nightly
  • gcr.io/tpu-pytorch/xla:nightly_YYYYMMDD (e.g. gcr.io/tpu-pytorch/xla:nightly_20190531)

With these images you can, for example, train MNIST on TPUs by following these steps. First, pull the distributed Docker image:

docker pull gcr.io/tpu-pytorch/xla:nightly

After pulling the image you can either:

  • Run the container with a command:
docker run --shm-size 16G -e XRT_TPU_CONFIG="tpu_worker;0;<IP of the TPU node>:8470" gcr.io/tpu-pytorch/xla:nightly python pytorch/xla/test/test_train_mnist.py
  • Run the script in an interactive shell:
docker run -it --shm-size 16G gcr.io/tpu-pytorch/xla:nightly
(pytorch) root@CONTAINERID:/# export XRT_TPU_CONFIG="tpu_worker;0;<IP of the TPU node>:8470"
(pytorch) root@CONTAINERID:/# python pytorch/xla/test/test_train_mnist.py
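
As a rough sketch of what a script like test_train_mnist.py does internally, the model and a batch of data are placed on the XLA device and the optimizer step is routed through torch_xla. The helper names below (xm.xla_device, xm.optimizer_step) come from the torch_xla.core.xla_model module used by the bundled tests; the tiny model and random data are stand-ins, not the actual MNIST pipeline.

# Sketch: a single training step on the XLA device (not the bundled MNIST test).
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm

device = xm.xla_device()                               # TPU (or XLA CPU) device
model = nn.Linear(784, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

data = torch.randn(64, 784, device=device)             # stand-in for a flattened MNIST batch
target = torch.randint(0, 10, (64,), device=device)

optimizer.zero_grad()
loss = loss_fn(model(data), target)
loss.backward()
xm.optimizer_step(optimizer)                            # applies the step on the XLA device
print(loss.item())                                      # forces execution and fetches the loss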

Pre Built PyTorch TPU Wheels

It is recommended to use Conda environments to isolate PyTorch/TPU packages from other packages. To install Anaconda, follow the instructions. Then create an environment dedicated to PyTorch/TPU and activate it (activation should happen every time you want to work in that environment):

conda create --name pytorch_tpu --clone base
source activate pytorch_tpu

Install the gsutil package to allow access to GCS (Google Cloud Storage) following the instructions.

Then run:

scripts/update_torch_wheels.sh

The same script can be run again when you want to update the PyTorch/TPU wheels.

Debugging

Sometimes bad things happen and a deeper look into the PyTorch/TPU stack is necessary. In order to do that, PyTorch/TPU has a series of environment variables and function calls which can help in understanding its internal behavior.

Note that the information in this section is subject to removal in future releases of the PyTorch/TPU software, since much of it is specific to a given internal implementation which might change.

The PyTorch/TPU stack keeps a series of metrics and counters during its execution, and the following API returns a string representation of them:

torch_xla._XLAC._xla_metrics_report()

Printing out that information can help during the debug phases and while reporting issues.
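
For example, a training script can dump the report at the end of a run, or every few hundred steps, and attach it to a bug report. A minimal sketch using only the call above:

# Sketch: dump the metrics report to stdout or to a file.
import torch_xla

def dump_metrics(path=None):
    report = torch_xla._XLAC._xla_metrics_report()
    if path is None:
        print(report)
    else:
        with open(path, "w") as f:
            f.write(report)

# e.g. dump_metrics("/tmp/xla_metrics.txt") at the end of the training loop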

The information included in the metrics report covers things like how many times we issue XLA compilations and how long they take, how many times we execute and for how long, how many device data handles we create/destroy, etc. This information is reported in terms of percentiles of the samples. An example is:

Metric: CompileTime
  TotalSamples: 202
  Counter: 06m09s401ms746.001us
  ValueRate: 778ms572.062us / second
  Rate: 0.425201 / second
  Percentiles: 1%=001ms32.778us; 5%=001ms61.283us; 10%=001ms79.236us; 20%=001ms110.973us; 50%=001ms228.773us; 80%=001ms339.183us; 90%=001ms434.305us; 95%=002ms921.063us; 99%=21s102ms853.173us

The PyTorch/TPU stack also has counters, which are named integer variables that track internal software status. Example:

Counter: CachedSyncTensors
  Value: 395

Counters are also useful to understand which operations the PyTorch/TPU stack is routing back to the CPU engine of PyTorch. Things which look like a C++ namespace are part of this category:

Counter: aten::nonzero
  Value: 33
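
Since these counters appear in the same report, a quick way to spot operations that fall back to the CPU engine is to filter the report text for aten:: entries. A sketch using only the metrics-report call shown earlier:

# Sketch: list counters for operators routed back to PyTorch's CPU engine.
import torch_xla

report = torch_xla._XLAC._xla_metrics_report()
for line in report.splitlines():
    if "Counter: aten::" in line:
        print(line.strip())  # e.g. "Counter: aten::nonzero"; its value follows on the next line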

There are also a number of environment variables which control the behavior of the PyTorch/TPU software stack. Setting such variables will cause different degrees of performance degradation, so they should only be enabled for debugging.

  • XLA_IR_DEBUG: Enables the Python stack trace to be captured when creating IR nodes, hence allowing one to understand which PyTorch operation was responsible for generating that IR.

  • XLA_HLO_DEBUG: Enables the Python stack frames captured when XLA_IR_DEBUG is active to be propagated to the XLA HLO metadata.

  • XLA_SAVE_TENSORS_FILE: The path to a file which will be used to dump the IR graphs during execution. Note that the file can become really big if the option is left enabled and the PyTorch program is left running for a long time. The graphs are appended to the file, so to have a clean sheet from run to run, the file should be explicitly removed.

  • XLA_SAVE_TENSORS_FMT: The format of the graphs stored within the XLA_SAVE_TENSORS_FILE file. Can be text (the default), dot (the Graphviz format) or hlo.

  • GET_TENSORS_OPBYOP: Enables pure OpByOp dispatch. The PyTorch/TPU software tries to fuse together many PyTorch operations into a single computation graph, but sometimes, either for debugging, or in case the PyTorch code has a very dynamic nature (in shapes or graph terms), it is better to force the execution in OpByOp mode (every IR node is lowered into a separate XLA computation and chain-executed). This environment variable, if set to 1, enables OpByOp during the "get tensors" operation (the operation used by PyTorch/TPU to fetch intermediate values back from the TPU device into PyTorch CPU tensors).

  • SYNC_TENSORS_OPBYOP: The same as GET_TENSORS_OPBYOP, but for the "sync tensors" operation (the operation used at the end of a step to flush pending IR computations and materialize them into TPU device data).

  • XLA_USE_BF16: If set to 1, transforms all PyTorch Float values into bfloat16 when sending them to the TPU device.

  • XLA_USE_32BIT_LONG: If set to 1, maps the PyTorch Long type to the XLA 32-bit integer type. On the versions of the TPU hardware available at the time of writing, 64-bit integer computations are expensive, so setting this flag might help. The user should verify that truncating to 32-bit values is a valid operation according to the use of PyTorch Long values in the program.
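
These variables must be visible to the process before torch_xla initializes, so the most reliable place to set them is the shell or the very top of the launcher script. A sketch of a debug configuration (the particular combination of flags is only an example):

# Sketch: enable a debugging configuration before torch_xla is imported.
# All of these settings slow execution down; use them only while debugging.
import os

os.environ["XLA_IR_DEBUG"] = "1"                         # capture Python stack traces for IR nodes
os.environ["XLA_HLO_DEBUG"] = "1"                        # propagate them into the XLA HLO metadata
os.environ["XLA_SAVE_TENSORS_FILE"] = "/tmp/xla_graphs.txt"
os.environ["XLA_SAVE_TENSORS_FMT"] = "dot"               # Graphviz output instead of the default text

import torch_xla  # import only after the environment is configured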
