<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/90343>90343</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
Initialising LLM on multiple GPUs stuck at "Started a local Ray instance"
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
timbmg
</td>
</tr>
</table>
<pre>
I am trying to run an LLM on multiple GPUs (2x H100); however, it seems to get stuck after starting the local Ray instance. I have also tried setting `NCCL_P2P_DISABLE=1` as suggested in other issues, but this did not solve my problem.
```python3
import os
from vllm import LLM, SamplingParams
llm = LLM(
    model="CohereForAI/c4ai-command-r-v01",
    max_model_len=2**15,  # limit context to 32k
    gpu_memory_utilization=0.95,
    dtype="float16",
    tensor_parallel_size=2
)
# 2024-04-27 15:26:35,737 INFO worker.py:1749 -- Started a local Ray instance.
```
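A minimal sketch of one way to apply `NCCL_P2P_DISABLE=1` from Python rather than from the shell, assuming the variable has to be in the environment before vLLM is imported and that the Ray workers inherit the parent process environment. The helper name `nccl_env` is hypothetical, and `NCCL_DEBUG=INFO` is an addition not mentioned above; it is a standard NCCL setting that makes NCCL log its initialisation, which may help show where the hang occurs.
```python3
import os


def nccl_env(disable_p2p=True):
    """Return NCCL environment settings for debugging a multi-GPU hang.

    Hypothetical helper, not part of vLLM, Ray, or NCCL.
    """
    env = {"NCCL_DEBUG": "INFO"}  # ask NCCL to log its initialisation steps
    if disable_p2p:
        env["NCCL_P2P_DISABLE"] = "1"  # the workaround suggested in other issues
    return env


# Apply the settings before importing vllm, so that any process started
# afterwards (assumption: including the local Ray workers) sees them.
os.environ.update(nccl_env())

# from vllm import LLM, SamplingParams  # import only after this point
```
Calling `nccl_env(disable_p2p=False)` keeps the extra logging without changing the peer-to-peer behaviour, which makes it possible to compare the two runs.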
`nvidia-smi` gives:
```
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 PCIe               On  | 00000000:61:00.0 Off |                    0 |
| N/A   31C    P0              47W / 350W |      2MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H100 PCIe               On  | 00000000:E1:00.0 Off |                    0 |
| N/A   30C    P0              46W / 350W |      2MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
```
```
torch==2.2.1
vllm==0.4.1
vllm-nccl-cu12==2.18.1.0.4.0
```
</pre>