
PytorchEngine multi-node support v2 #3147

Open
wants to merge 70 commits into main
Conversation

Collaborator

@grimoire grimoire commented Feb 17, 2025

1. Build the ray cluster:

# on the driver node
ray start --head --port 6379

# on every other node
ray start --address=$DRIVER_ADDR:6379

2. Run the pipeline on the driver node; the model should be placed at the same path on every node:

from lmdeploy import pipeline, PytorchEngineConfig
with pipeline(model_path, backend_config=PytorchEngineConfig(tp=16)) as pipe:
    ...

Users can choose the distributed_executor_backend when using a single node.
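For illustration, a hedged sketch of what that selection might look like. The option name distributed_executor_backend comes from this PR; the accepted values are assumed here to mirror vLLM's 'mp' and 'ray', so treat them as placeholders rather than confirmed API:

```python
from lmdeploy import pipeline, PytorchEngineConfig

# Single node: pick the executor backend explicitly.
# ('ray' is assumed to be required once multiple nodes are involved;
# 'mp' is assumed to be the single-node multiprocessing option.)
config = PytorchEngineConfig(tp=2, distributed_executor_backend='ray')
```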

Note

It is encouraged to build the cluster inside docker containers with --network host, as vLLM does, to ensure every node has the same environment. Automatic model downloading is not supported for now; models should be pre-downloaded to the same path on each node.
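Since automatic downloading is unavailable, a small stdlib helper (hypothetical, not part of lmdeploy) can sanity-check that a pre-downloaded checkpoint is present at the expected path before launching the pipeline on each node:

```python
from pathlib import Path

def model_present(model_path: str) -> bool:
    """Return True if model_path is a directory containing a config.json,
    i.e. it looks like a pre-downloaded Hugging Face style checkpoint.
    This is an illustrative check, not an lmdeploy API."""
    p = Path(model_path)
    return p.is_dir() and (p / "config.json").is_file()
```

Running this on every node before `ray start` makes "model not found" failures surface early, instead of mid-way through distributed engine initialization.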

@grimoire grimoire marked this pull request as ready for review February 21, 2025 04:04
@lvhan028
Collaborator

Fantastic job!
@RunningLeon Please help check if it works for dsv3.

@lvhan028 lvhan028 requested a review from RunningLeon February 21, 2025 04:38
@lvhan028
Collaborator

cc @jinminxi104

@lvhan028
Collaborator

Please prepare a guide for multi-node deployment, covering both the offline pipeline and online serving.

@grimoire
Collaborator Author

Please prepare a guide for multi-node deployment, covering both the offline pipeline and online serving.

https://github.com/InternLM/lmdeploy/blob/f1a8a08064087ee360c42e264f69d76cde6582d2/docs/en/advance/pytorch_multinodes.md

@grimoire
Collaborator Author

Automatic model downloading is not yet supported on the ray backend. Users should put the model at the same path on each node.

@RunningLeon
Collaborator

RunningLeon commented Feb 24, 2025

Fantastic job! @RunningLeon Please help check if it works for dsv3.

@lvhan028 Tested OK with deepseek-ai/DeepSeek-R1 + tp16. The conversation output from serving is normal after commit 95b2249.

@lvhan028
Collaborator

Please merge the latest main so that I can request a full test.
