Unable to run osu-micro-benchmarks collectives with OMPI-v5.0.5 #12717

Closed
goutham-kuncham opened this issue Jul 26, 2024 · 13 comments

@goutham-kuncham

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

OMPI Version: v5.0.5
UCX Version: v1.17.0

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Installed from git clone

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

 8ab6d680b90afd6e61766220a8724065a1b554a7 3rd-party/openpmix (v5.0.3)
 b68a0acb32cfc0d3c19249e5514820555bcf438b 3rd-party/prrte (v3.0.6)
 dfff67569fb72dbf8d73a1dcf74d091dad93f71b config/oac (heads/main)

Please describe the system on which you are running

  • Operating system/version: RHEL CentOS 7
  • Computer hardware: arch - x86_64 (CPU: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz) GPU: Tesla V100-PCIE-32GB
  • Network type: InfiniBand

Details of the problem

Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.

I am unable to run the GPU (CUDA) versions of the osu-micro-benchmarks collectives (specifically, I am interested in osu_reduce, osu_allreduce, and osu_allgather).

Below is my configuration and run commands:

OMPI configure:
 ./configure --prefix=$PWD/build --with-ucx=UCX/ucx-v1.17.0/build --with-cuda=/opt/cuda/11.2 --with-cuda-libdir=/opt/cuda/11.2/lib64/stubs/ --enable-mca-no-build=btl-uct
OMB configure:
./configure --prefix=$PWD/build CC=ompi/build/bin/mpicc CXX=ompi/build/bin/mpicxx --enable-cuda --with-cuda-include=/opt/cuda/11.2/include --with-cuda-libpath=/opt/cuda/11.2/lib64
Run command:
mpirun -np 4 -hostfile hosts  $OMB_HOME/collective/osu_allreduce -d cuda
output:
kuncham.2@gpu06:~$ mpirun -np 4 -hostfile hosts  $OMB_HOME/collective/osu_allreduce -d cuda
[gpu06:8992 :0:8992] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x2ae0d9a00000)
[gpu08:14399:0:14399] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x2b1e3da00000)
[gpu09:14950:0:14950] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x2b6339a00000)
[gpu10:23047:0:23047] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x2adac3a00000)

# OSU MPI-CUDA Allreduce Latency Test v7.4
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
==== backtrace (tid:  23047) ====
 0 0x00000000001574f0 __memcpy_ssse3_back()  :0
 1 0x000000000003006b coll_ml_allreduce_small()  ???:0
 2 0x0000000000028b7c _coll_ml_allreduce()  coll_ml_allreduce.c:0
 3 0x0000000000141014 mca_coll_hcoll_allreduce()  ???:0
 4 0x00000000000a87fa MPI_Allreduce()  ???:0
 5 0x0000000000403503 main()  /home/kuncham.2/OMB-DIST/omb-ompi-v5.0.5-ucx-cuda/c/mpi/collective/blocking/osu_allreduce.c:164
 6 0x0000000000022555 __libc_start_main()  ???:0
 7 0x0000000000403ced _start()  ???:0
=================================
==== backtrace (tid:   8992) ====
 0 0x00000000001574f0 __memcpy_ssse3_back()  :0
 1 0x000000000003006b coll_ml_allreduce_small()  ???:0
 2 0x0000000000028b7c _coll_ml_allreduce()  coll_ml_allreduce.c:0
 3 0x0000000000141014 mca_coll_hcoll_allreduce()  ???:0
 4 0x00000000000a87fa MPI_Allreduce()  ???:0
 5 0x0000000000403503 main()  /home/kuncham.2/OMB-DIST/omb-ompi-v5.0.5-ucx-cuda/c/mpi/collective/blocking/osu_allreduce.c:164
 6 0x0000000000022555 __libc_start_main()  ???:0
 7 0x0000000000403ced _start()  ???:0
=================================
==== backtrace (tid:  14399) ====
 0 0x00000000001574f0 __memcpy_ssse3_back()  :0
 1 0x000000000003006b coll_ml_allreduce_small()  ???:0
 2 0x0000000000028b7c _coll_ml_allreduce()  coll_ml_allreduce.c:0
 3 0x0000000000141014 mca_coll_hcoll_allreduce()  ???:0
 4 0x00000000000a87fa MPI_Allreduce()  ???:0
 5 0x0000000000403503 main()  /home/kuncham.2/OMB-DIST/omb-ompi-v5.0.5-ucx-cuda/c/mpi/collective/blocking/osu_allreduce.c:164
 6 0x0000000000022555 __libc_start_main()  ???:0
 7 0x0000000000403ced _start()  ???:0
=================================
==== backtrace (tid:  14950) ====
 0 0x00000000001574f0 __memcpy_ssse3_back()  :0
 1 0x000000000003006b coll_ml_allreduce_small()  ???:0
 2 0x0000000000028b7c _coll_ml_allreduce()  coll_ml_allreduce.c:0
 3 0x0000000000141014 mca_coll_hcoll_allreduce()  ???:0
 4 0x00000000000a87fa MPI_Allreduce()  ???:0
 5 0x0000000000403503 main()  /home/kuncham.2/OMB-DIST/omb-ompi-v5.0.5-ucx-cuda/c/mpi/collective/blocking/osu_allreduce.c:164
 6 0x0000000000022555 __libc_start_main()  ???:0
 7 0x0000000000403ced _start()  ???:0
=================================
--------------------------------------------------------------------------
    This help section is empty because PRRTE was built without Sphinx.
--------------------------------------------------------------------------

$ cat hosts
gpu06 slots=1
gpu08 slots=1
gpu09 slots=1
gpu10 slots=1

 
kuncham.2@gpu06:~$ mpirun -np 4 -hostfile hosts hostname
gpu06.cluster
gpu09.cluster
gpu08.cluster
gpu10.cluster
osu_gather is working:

kuncham.2@gpu06:~$ mpirun -np 4 -hostfile hosts  $OMB_HOME/collective/osu_gather -d cuda

# OSU MPI-CUDA Gather Latency Test v7.4
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
1                       8.19
2                       8.08
4                       8.06
8                       8.04
16                      8.11
32                      8.22
64                      8.26
128                     8.30
256                     9.50
512                     8.45
1024                    8.57
2048                    8.83
4096                   10.29
8192                   13.40
16384                  21.52
32768                  36.77
65536                  67.72
131072                136.87
262144                416.93
524288                306.03
1048576               404.60

@yosefe
Contributor

yosefe commented Jul 27, 2024

The error is coming from the hcoll library, because it fails to detect GPU memory. It seems the CUDA version on the system does not match the CUDA version supported by the HPC-X package (and the hcoll library in it) that is being used.
Since hcoll is closed source, it cannot be rebuilt; however, the ompi and ucx components from HPC-X can be rebuilt against the system's CUDA version using the hpcx-rebuild.sh script.
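
For reference, a quick way to check whether the Open MPI/UCX build reports CUDA support and what the installed hcoll links against (the /opt/mellanox/hcoll path is the usual MLNX_OFED install location and is an assumption here; adjust as needed):

# Does the Open MPI build report CUDA support?
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
# Was UCX configured with CUDA?
ucx_info -v | grep -i cuda
# Which hcoll package is installed, and does it link a CUDA runtime?
rpm -qa | grep -i hcoll
ldd /opt/mellanox/hcoll/lib/libhcoll.so | grep -i cuda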

@goutham-kuncham
Author

@yosefe Thanks for the comments.

The CUDA version installed on this machine is v11.2.

I haven't installed HPC-X. I just cloned ucx-v1.17.0 and OMPI-v5.0.5 from git:
git clone https://github.com/open-mpi/ompi.git
git clone https://github.com/openucx/ucx.git

These are my config commands.

UCX configure:
../contrib/configure-release --prefix=$PWD/build --with-cuda=/opt/cuda/11.2 --with-gdrcopy=/usr/local

@yosefe
Contributor

yosefe commented Jul 27, 2024

@goutham-kuncham Perhaps there is an hcoll installed on the system from MLNX_OFED, since the backtrace shows mca_coll_hcoll_allreduce?
Was the CUDA version on the machine changed after the MLNX_OFED installation?
Can you please try building Open MPI without hcoll (--without-hcoll) or disabling it at runtime (-mca coll ^hcoll)?
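
Spelled out against the configure and run commands from the original report, the two options look roughly like this (all flags other than the hcoll ones are carried over unchanged):

# Option 1: rebuild Open MPI without the hcoll component
./configure --prefix=$PWD/build --with-ucx=UCX/ucx-v1.17.0/build \
    --with-cuda=/opt/cuda/11.2 --with-cuda-libdir=/opt/cuda/11.2/lib64/stubs/ \
    --enable-mca-no-build=btl-uct --without-hcoll

# Option 2: keep the existing build and disable hcoll only at runtime
mpirun -np 4 -hostfile hosts -mca coll ^hcoll $OMB_HOME/collective/osu_allreduce -d cuda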

@goutham-kuncham
Author

@yosefe It seems to work if I disable hcoll at runtime.

kuncham.2@gpu02:~$ mpirun -np 2 -hostfile hosts --map-by node -mca coll ^hcoll $OMB_HOME/collective/osu_allreduce -d cuda

# OSU MPI-CUDA Allreduce Latency Test v7.4
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
1                      20.06
2                      20.92
4                      20.93
8                      19.94
16                     20.30
32                     20.33
64                     20.53
128                    20.75
256                    21.21
512                    21.47
1024                   22.44
2048                   23.65
4096                   26.99
8192                   30.52
16384                  40.51
32768                  56.65
65536                  85.29
131072                145.63
262144                260.48
524288                480.49
1048576               910.55

But when I run the benchmark with validation, it fails.

kuncham.2@gpu02:~$ mpirun -np 2 -hostfile hosts --map-by node -mca coll ^hcoll $OMB_HOME/collective/osu_allreduce -c -d cuda

# OSU MPI-CUDA Allreduce Latency Test v7.4
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)        Validation
1                      20.31              Pass
2                      21.24              Pass
4                      21.38              Pass
8                      20.27              Pass
16                     20.67              Fail
DATA VALIDATION ERROR: /home/kuncham.2/OMB-DIST/omb-ompi-v5.0.5-ucx-cuda/build/libexec/osu-micro-benchmarks/mpi/collective/osu_allreduce exited with status 1 on message size 16.

@yosefe
Contributor

yosefe commented Jul 28, 2024

@goutham-kuncham Does the data validation error happen only with CUDA memory?
What is the output of "ofed_info -s"?
Can you try without UCX: mpirun -mca pml ob1 -mca btl self,vader,tcp ...
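
One way to confirm which PML and BTLs actually get selected when forcing ob1 is to add the standard verbose MCA parameters (a debugging sketch; the exact output format varies between Open MPI versions):

mpirun -np 2 -hostfile hosts --map-by node \
    -mca pml ob1 -mca btl self,vader,tcp \
    -mca pml_base_verbose 10 -mca btl_base_verbose 10 \
    $OMB_HOME/collective/osu_allreduce -c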

@goutham-kuncham
Author

@yosefe

does the data validation error happen only with cuda memory?

I got the same validation failure when I run the CPU benchmark as well, with hcoll disabled.

kuncham.2@gpu02:~$ mpirun -np 2 -hostfile hosts --map-by node -mca coll ^hcoll  $OMB_HOME/collective/osu_allreduce -c

# OSU MPI Allreduce Latency Test v7.4
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)        Validation
1                       1.53              Pass
2                       2.58              Pass
4                       2.57              Pass
8                       1.51              Pass
16                      1.51              Fail
DATA VALIDATION ERROR: /home/kuncham.2/OMB-DIST/omb-ompi-v5.0.5-ucx-cuda/build/libexec/osu-micro-benchmarks/mpi/collective/osu_allreduce exited with status 1 on message size 16.
--------------------------------------------------------------------------
    This help section is empty because PRRTE was built without Sphinx.

However, when I re-enable hcoll, the CPU benchmark validation passes:

kuncham.2@gpu02:~$ mpirun -np 2 -hostfile hosts --map-by node $OMB_HOME/collective/osu_allreduce -c

# OSU MPI Allreduce Latency Test v7.4
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)        Validation
1                       1.38              Pass
2                       1.36              Pass
4                       1.36              Pass
8                       1.37              Pass
16                      1.37              Pass
32                      1.40              Pass
64                      1.52              Pass
128                     1.56              Pass
256                     1.92              Pass
512                     2.06              Pass
1024                    2.30              Pass
2048                    3.07              Pass
4096                    3.93              Pass
8192                    6.85              Pass
16384                   9.20              Pass
32768                  12.94              Pass
65536                  20.11              Pass
131072                 34.17              Pass
262144                 60.40              Pass
524288                103.48              Pass
1048576               188.12              Pass

what is the output of "ofed_info -s"?

MLNX_OFED_LINUX-5.0-2.1.8.0:

can you try without ucx:

I got the same behavior with and without hcoll:

kuncham.2@gpu02:~$ mpirun -np 2 -hostfile hosts --map-by node -mca coll ^hcoll -mca pml ob1 -mca btl self,vader,tcp $OMB_HOME/collective/osu_allreduce -d cuda

# OSU MPI-CUDA Allreduce Latency Test v7.4
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
--------------------------------------------------------------------------
WARNING: Open MPI accepted a TCP connection from what appears to be a
another Open MPI process but cannot find a corresponding process
entry for that peer.

This attempted connection will be ignored; your MPI job may or may not
continue properly.

  Local host: gpu02
  PID:        10184
--------------------------------------------------------------------------
2 more processes have sent help message help-mpi-btl-tcp.txt / server accept cannot find guid
1 more process has sent help message help-mpi-btl-tcp.txt / server accept cannot find guid
^C[gpu03:00000] *** An error occurred in Socket closed
[gpu03:00000] *** reported by process [1460731905,1]
[gpu03:00000] *** on a NULL communicator
[gpu03:00000] *** Unknown error
[gpu03:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[gpu03:00000] ***    and MPI will try to terminate your MPI job as well)
--------------------------------------------------------------------------
    This help section is empty because PRRTE was built without Sphinx.
--------------------------------------------------------------------------
^CAbort is in progress...hit ctrl-c again to forcibly terminate

@yosefe
Contributor

yosefe commented Jul 29, 2024

Regarding the TCP issue, can you try setting the network device using -mca btl_tcp_if_include <dev>?

@goutham-kuncham
Author

@yosefe Sorry, I missed that.

I get the same validation failure after setting the device. I tried both en0 and ib0:

kuncham.2@gpu02:~$ mpirun -np 2 -hostfile hosts --map-by node -mca coll ^hcoll -mca pml ob1 -mca btl self,vader,tcp -mca btl_tcp_if_include en0 $OMB_HOME/collective/osu_allreduce -c -d cuda

# OSU MPI-CUDA Allreduce Latency Test v7.4
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)        Validation
1                      56.32              Pass
2                     104.52              Pass
4                     104.30              Pass
8                      56.00              Pass
16                     56.30              Fail
DATA VALIDATION ERROR: /home/kuncham.2/OMB-DIST/omb-ompi-v5.0.5-ucx-cuda/build/libexec/osu-micro-benchmarks/mpi/collective/osu_allreduce exited with status 1 on message size 16.
--------------------------------------------------------------------------
    This help section is empty because PRRTE was built without Sphinx.
--------------------------------------------------------------------------

@yosefe
Contributor

yosefe commented Jul 29, 2024

So it seems to be some issue with an Open MPI collective component; does it happen on older Open MPI versions?
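
One way to narrow down which collective component is involved is to list what was built and then restrict the run to the basic fallback components (a sketch; the component names assume a default Open MPI 5.x build, so check the ompi_info output first):

# List the collective components that were actually built
ompi_info | grep "MCA coll"

# Restrict the run to the basic fallback components only
mpirun -np 2 -hostfile hosts --map-by node \
    -mca coll basic,libnbc,self \
    $OMB_HOME/collective/osu_allreduce -c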

@tmh97

tmh97 commented Aug 14, 2024

I've noticed a very similar data validation issue at 16B for the osu collectives!

I am using Open MPI 5.0.3 with the OPX OFI provider.

This 16B data validation failure only occurs with OSU 7.4, not OSU 7.3. I've tested on both AMD and Intel CPUs, and both reproduce it.

Also, it only occurs for me with the MPI_TYPE of MPI_CHAR, which is the default for the osu collective tests. If I use the -T option to select an MPI_TYPE of mpi_float or mpi_int, I do not have this issue.

This issue occurs for me with or without the -d cuda option.

I have been debating whether or not to report this to the MVAPICH community.
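
For reference, the datatype workaround described above, applied to the failing command from earlier in this thread (the -T values are taken from the comment above; check the benchmark's help output for the exact set your OMB version accepts):

mpirun -np 2 -hostfile hosts --map-by node -mca coll ^hcoll \
    $OMB_HOME/collective/osu_allreduce -c -d cuda -T mpi_float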

@tmh97

tmh97 commented Aug 14, 2024

@wenduwan woops, my comment removed the State-Awaiting user info label. Sorry about that!


It looks like this issue is expecting a response, but hasn't gotten one yet. If there are no responses in the next 2 weeks, we'll assume that the issue has been abandoned and will close it.

@github-actions github-actions bot added the Stale label Aug 29, 2024

Per the above comment, it has been a month with no reply on this issue. It looks like this issue has been abandoned.

I'm going to close this issue. If I'm wrong and this issue is not abandoned, please feel free to re-open it. Thank you!

@github-actions github-actions bot closed this as not planned on Sep 12, 2024