
Llama-3.1-8B-trtllm

TensorRT-LLM on DRIVE Orin

TensorRT-LLM is a toolbox for optimizing Large Language Model (LLM) inference. It offers cutting-edge optimizations such as custom attention kernels, plugins, and various quantization techniques, enabling efficient inference on NVIDIA GPUs. In this repository, we demonstrate how to deploy a Large Language Model (LLM) on the DRIVE Orin platform for developers who are interested in using TensorRT-LLM. In particular, we detail the deployment of the Llama-3.1-8B model on Orin with TensorRT-LLM 0.13. This repository is for evaluation purposes only.

Orin Environment:

Please make sure you have the following environment set up on your Orin device. You can follow the DRIVE OS Installation Guide to prepare the device. For the required DRIVE OS and TensorRT versions, please refer to the details on the NVIDIA DRIVE Downloads site.

  • DRIVE OS 6.0.9.0
  • Python 3.8
  • CUDA 11.4
  • Ubuntu 20.04
  • TensorRT 10.4.0.11
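
If you want to verify these versions on your device, the following commands are one quick way to check (a rough sketch; exact package names and paths may vary with your DRIVE OS image):

grep VERSION= /etc/os-release        # expect Ubuntu 20.04
python3 --version                    # expect Python 3.8.x
nvcc --version                       # expect CUDA 11.4
dpkg -l | grep -i tensorrt           # expect TensorRT 10.4.0.11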

Generate LLaMA-3.1-8B Model with INT4_AWQ and INT8-KV-Cache on x86

In this example, we recommend quantizing the Llama-3.1-8B model for deployment. To quantize the model, please follow this link to install TensorRT-LLM on an x86 system with at least 16 GB of GPU memory, using the following sample commands:

python -m venv trtllm
source trtllm/bin/activate
pip install tensorrt_llm==0.16 

Please check that your venv has the following dependencies:

  • TensorRT-LLM 0.16
  • ModelOpt 0.19.0
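
As a quick sanity check, you can query the installed versions inside the venv. Note that the ModelOpt package is assumed here to be installed under the name nvidia-modelopt; adjust if your environment differs:

pip show tensorrt_llm | grep Version        # expect 0.16.x
pip show nvidia-modelopt | grep Version     # expect 0.19.0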

After confirming the above setup, follow the commands below to quantize the Llama model:

cd $PWD/TensorRT-LLM/examples/llama 
python convert_checkpoint.py --model_dir $input_model --output_dir $output_model --dtype float16 --use_weight_only --weight_only_precision=int4_awq --int8_kv_cache
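
For reference, $input_model and $output_model in the command above could be set as follows (illustrative values that match the folder names used later in this guide):

export input_model=./Llama-3.1-8B                       # downloaded Hugging Face checkpoint
export output_model=./Llama-3.1-8B_int4_awq_kv_int8     # quantized TensorRT-LLM checkpoint output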

Build TensorRT-LLM from Source

After generating the quantized Llama-3.1-8B model on the x86 system, please copy the Llama-3.1-8B and Llama-3.1-8B_int4_awq_kv_int8 folders from the host to your working directory on the target DRIVE Orin. Then run the following command to build TensorRT-LLM from source.

./setup_from_source.sh

The overall working directory after running setup_from_source.sh will be as follows:

work_dir
├── batch_manager 
├── executor
├── build_from_scoure_changes.patch 
├── setup_from_source.sh 
├── TensorRT-LLM
├── TensorRT-10.4.0.11
├── Llama-3.1-8B
├── Llama-3.1-8B_int4_awq_kv_int8 
├── trtllm_0.13 (virtual environment)
├── nccl
├── json_modifier.py

Note that on DRIVE Orin, TensorRT-LLM v0.13 is currently used for inference.
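
Before building the engine, activate the on-target virtual environment (assuming trtllm_0.13 from the tree above is the environment created by setup_from_source.sh):

source trtllm_0.13/bin/activate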

Build LLM Engine and Sample Inference on Drive Orin

After successfully building TensorRT-LLM from source on the target DRIVE Orin, the following command will build the corresponding LLM engine:

trtllm-build --checkpoint_dir $path_engine --output_dir $path_engine/1-gpu/ --gemm_plugin auto --max_batch_size 1 
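
Here, $path_engine points to the quantized checkpoint directory copied from the x86 host, for example (illustrative value):

export path_engine=./Llama-3.1-8B_int4_awq_kv_int8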

Using the above engine, you can run LLM inference with the following commands:

cd TensorRT-LLM/examples/
python run.py --max_output_len 128 --engine_dir $path_engine/1-gpu/ --tokenizer_dir $path_model 
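
In these commands, $path_model is the original Llama-3.1-8B folder containing the Hugging Face tokenizer files, for example (illustrative value):

export path_model=./Llama-3.1-8B

When no prompt is given, run.py uses its built-in default prompt, which is what produces the sample output below; a custom prompt can typically be passed with --input_text.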

Here is a sample inference output for the Llama-3.1-8B model:

[11/21/2024-16:10:58] [TRT-LLM] [I] Load engine takes: 62.147863149642944 sec
Input [Text 0]: "<|begin_of_text|>Born in north-east France, Soyer trained as a"
Output [Text 0 Beam 0]: " chef in Paris before moving to London in 1848. He was appointed chef to the Prince of Wales in 1850, and later became chef to Queen Victoria. He was a pioneer of French cuisine in England, and his cookery books were very popular. He died in 1868."