<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/90343>90343</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            Initialising LLM on multiple GPUs stuck at "Started a local Ray instance"
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            new issue
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          timbmg
      </td>
    </tr>
</table>

<pre>
    I am trying to run an LLM on multiple GPUs (2x H100), but initialisation gets stuck after starting the local Ray instance. I have also tried setting `NCCL_P2P_DISABLE=1`, as suggested in other issues, but this did not solve the problem.

```python3
import os
from vllm import LLM, SamplingParams

llm = LLM(
    model="CohereForAI/c4ai-command-r-v01",
    max_model_len=2**15,  # limit context to 32k
    gpu_memory_utilization=0.95,
    dtype="float16",
    tensor_parallel_size=2,
)

# Output stops after this log line:
# 2024-04-27 15:26:35,737       INFO worker.py:1749 -- Started a local Ray instance.
```
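
For reference, here is a minimal sketch of one way the `NCCL_P2P_DISABLE=1` setting can be applied from Python. It assumes the variables must be in the environment before `vllm` is imported so that worker processes inherit them. The helper name `nccl_env` is hypothetical, and `NCCL_DEBUG=INFO` is an extra NCCL logging switch that the original attempt did not mention.

```python
import os


def nccl_env(disable_p2p=True):
    """Build NCCL-related environment variables (hypothetical helper)."""
    env = {"NCCL_DEBUG": "INFO"}  # assumption: verbose NCCL logs help locate the hang
    if disable_p2p:
        env["NCCL_P2P_DISABLE"] = "1"  # the setting tried in this report
    return env


# Apply the variables to the current process before any GPU library is loaded.
os.environ.update(nccl_env())

# Import vLLM only after the environment is prepared (left commented out here):
# from vllm import LLM
```

The same variables can also be exported in the shell before launching the script, which avoids any dependence on import order.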

`nvidia-smi` gives:
```
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 PCIe               On  | 00000000:61:00.0 Off |                    0 |
| N/A   31C    P0              47W / 350W |      2MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H100 PCIe               On  | 00000000:E1:00.0 Off |                    0 |
| N/A   30C    P0              46W / 350W |      2MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
```

```
torch==2.2.1
vllm==0.4.1
vllm-nccl-cu12==2.18.1.0.4.0
```
</pre>