Thanks. I can't spot anything wrong with your software setup. Since the performance difference between CUDA-aware MPI and regular MPI on a single node is about 2x, while CUDA-aware MPI is faster for 2 processes on two nodes, I suspect an issue with the GPU affinity handling: either ENV_LOCAL_RANK is defined the wrong way (but you seem to have that right), or CUDA_VISIBLE_DEVICES is set in a funky way on the system you are using. As this code has not been updated for quite some time, can you try https://github.com/NVIDIA/multi-gpu-programming-models (also a Jacobi solver, but a simpler code that I regularly use in tutorials)?
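For reference, a minimal sketch of the local-rank-based affinity handling I have in mind is below. The environment variable name is an assumption (Open MPI exports OMPI_COMM_WORLD_LOCAL_RANK; other MPIs use different names), so treat this as an illustration of the pattern rather than the exact code in either repository:

```c
/* Hedged sketch: map one GPU per local MPI rank before MPI_Init, so that
 * CUDA-aware MPI binds to the intended device. OMPI_COMM_WORLD_LOCAL_RANK
 * is an assumption (Open MPI); MVAPICH2 uses MV2_COMM_WORLD_LOCAL_RANK. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char *argv[])
{
    /* Read the local rank from the environment; default to 0 if unset. */
    const char *local_rank_str = getenv("OMPI_COMM_WORLD_LOCAL_RANK");
    int local_rank = local_rank_str ? atoi(local_rank_str) : 0;

    int num_devices = 0;
    cudaGetDeviceCount(&num_devices);

    /* One GPU per local rank. A skewed CUDA_VISIBLE_DEVICES setting on the
     * system changes what device indices mean here, which is exactly the
     * failure mode suspected above. */
    cudaSetDevice(local_rank % num_devices);

    MPI_Init(&argc, &argv);
    /* ... Jacobi solver with halo exchanges would go here ... */
    MPI_Finalize();
    return 0;
}
```

If two ranks end up on the same GPU (or on GPUs far from their NIC), CUDA-aware MPI can easily lose to regular MPI on a single node, which would match the 2x gap you are seeing.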
I also checked the math for the bandwidth: the formula used does not account for caches (see https://github.com/NVIDIA-develo ... ple/src/Host.c#L291), which explains why you are seeing unrealistically large memory bandwidths.
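To illustrate the effect (the exact formula at the truncated link above is not visible to me, so the per-point access counts and `float` type below are assumptions): a naive estimate for a 5-point Jacobi sweep counts every stencil load and the store, but caches serve most neighbor reuse, so DRAM actually moves roughly one read and one write per grid point:

```c
/* Hedged sketch: naive vs. cache-aware effective-bandwidth estimates for an
 * nx-by-ny Jacobi sweep repeated iters times over `seconds` of runtime. */
#include <stdio.h>

double naive_gbps(int nx, int ny, int iters, double seconds)
{
    /* 5 loads + 1 store per point; overestimates, ignores cache reuse. */
    double bytes = 6.0 * sizeof(float) * (double)nx * ny * iters;
    return bytes / seconds / 1e9;
}

double cache_aware_gbps(int nx, int ny, int iters, double seconds)
{
    /* ~1 load + 1 store per point is what the DRAM actually sees. */
    double bytes = 2.0 * sizeof(float) * (double)nx * ny * iters;
    return bytes / seconds / 1e9;
}

int main(void)
{
    printf("naive: %.1f GB/s, cache-aware: %.1f GB/s\n",
           naive_gbps(8192, 8192, 1000, 10.0),
           cache_aware_gbps(8192, 8192, 1000, 10.0));
    return 0;
}
```

With counts like these, the naive formula reports 3x the real DRAM traffic, so a reported bandwidth above the hardware's peak is expected rather than a sign of a measurement bug.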