
PytorchEngine multi-node support v2 #3147

Open
wants to merge 70 commits into main
Conversation

Collaborator

@grimoire grimoire commented Feb 17, 2025

1. Build the ray cluster:

# on the driver node
ray start --head --port 6379

# on every other node
ray start --address=$DRIVER_ADDR:6379

2. Run the pipeline on the driver node; the model should be placed at the same path on every node:

from lmdeploy import pipeline, PytorchEngineConfig
with pipeline(model_path, backend_config=PytorchEngineConfig(tp=16)) as pipe:
    ...

Users can choose the distributed_executor_backend when using a single node.
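For illustration, a hedged sketch of what that selection might look like. The option name distributed_executor_backend comes from this PR; the accepted values are assumed here to mirror vLLM's 'mp' and 'ray', so treat them as placeholders rather than confirmed API:

```python
from lmdeploy import pipeline, PytorchEngineConfig

# Single node: pick the executor backend explicitly.
# ('ray' is assumed to be required once multiple nodes are involved;
# 'mp' is assumed to be the single-node multiprocessing option.)
config = PytorchEngineConfig(tp=2, distributed_executor_backend='ray')
```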

Note

It is encouraged to build the cluster inside docker containers with --network host, as vLLM does, to ensure every node has the same environment. Automatic model downloading is not supported for now; models should be pre-downloaded to the same path on each node.
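Since automatic downloading is unavailable, a small stdlib helper (hypothetical, not part of lmdeploy) can sanity-check that a pre-downloaded checkpoint is present at the expected path before launching the pipeline on each node:

```python
from pathlib import Path

def model_present(model_path: str) -> bool:
    """Return True if model_path is a directory containing a config.json,
    i.e. it looks like a pre-downloaded Hugging Face style checkpoint.
    This is an illustrative check, not an lmdeploy API."""
    p = Path(model_path)
    return p.is_dir() and (p / "config.json").is_file()
```

Running this on every node before `ray start` makes "model not found" failures surface early, instead of mid-way through distributed engine initialization.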

@grimoire grimoire marked this pull request as ready for review February 21, 2025 04:04
@lvhan028
Collaborator

Fantastic job!
@RunningLeon Please help check if it works for dsv3.

@lvhan028 lvhan028 requested a review from RunningLeon February 21, 2025 04:38
@lvhan028
Collaborator

cc @jinminxi104

@lvhan028
Collaborator

Please prepare a guide for multi-node deployment, covering both the offline pipeline and online serving.

@grimoire
Collaborator Author

Please prepare a guide for multi-node deployment, covering both the offline pipeline and online serving.

https://github.com/InternLM/lmdeploy/blob/f1a8a08064087ee360c42e264f69d76cde6582d2/docs/en/advance/pytorch_multinodes.md

@grimoire
Collaborator Author

Automatic model downloading is not yet supported on the ray backend. Users should put the model at the same path on each node.

@RunningLeon
Collaborator

RunningLeon commented Feb 24, 2025

Fantastic job! @RunningLeon Please help check if it works for dsv3.

@lvhan028 Tested OK with deepseek-ai/DeepSeek-R1 + tp16. The conversation output from serving is normal after commit 95b2249.

@lvhan028
Collaborator

Please merge the latest main so that I can request a full test.
