
ChatLLM.cpp

Chinese Version | Japanese Version

License: MIT

Pure C++ inference of models ranging from under 1B to over 300B parameters, for real-time chatting with RAG on your computer (CPU & GPU), based on @ggerganov's ggml.

| Supported Models | Download Quantized Models |

What's New:

  • 2025-03-23: Llama-3.3-Nemotron-Super-49B-v1
  • 2025-03-16: DeepHermes-Mistral
  • 2025-03-13: Gemma-3 (Language model)
  • 2025-03-12: Reka-Flash-3
  • 2025-03-10: Instella
  • 2025-03-05: Baichuan-M1
  • 2025-03-03: HunYuan Dense
  • 2025-02-27: Granite-3.2, Phi-4 Mini
  • 2025-02-24: Moonlight
  • 2025-02-21: Distributed inference
  • 2025-02-19: MoE CPU offloading, tool calling with Watt-tool
  • 2025-02-17: ggml updated again
  • 2025-02-10: GPU acceleration 🔥
  • 2025-01-25: MiniCPM Embedding & ReRanker
  • 2025-01-21: DeepSeek-R1-Distill-Llama & Qwen
  • 2025-01-15: InternLM3
  • 2025-01-13: OLMo-2
  • 2025-01-11: Phi-4
  • 2025-01-06: (Naive) Beam search
  • 2024-12-09: Reversed role
  • 2024-11-21: Continued generation
  • 2024-11-01: Generation steering
  • 2024-07-14: ggml updated
  • 2024-06-15: Tool calling
  • 2024-05-29: ggml is forked instead of being used as a submodule
  • 2024-05-14: OpenAI API, CodeGemma Base & Instruct supported
  • 2024-05-08: Layer shuffling

Features

  • Accelerated memory-efficient CPU/GPU inference with int4/int8 quantization, optimized KV cache and parallel computing;

  • Object-oriented design that factors out the similarities among different Transformer-based models;

  • Streaming generation with typewriter effect;

  • Continuous chatting (content length is virtually unlimited)

    Two methods are available: Restart and Shift. See the --extending option (a usage sketch follows this list).

  • Retrieval Augmented Generation (RAG) 🔥

  • LoRA;

  • Python/JavaScript/C/Nim Bindings, web demo, and more possibilities.
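
A usage sketch for the context-extension methods mentioned above: the values passed to --extending are assumed here to be the lower-case method names, so verify them with ./build/bin/main -h.

# Shift: keep chatting past the context window by shifting out old KV-cache entries (value name assumed)
./build/bin/main -m model.bin -i --extending shift

# Restart: start a fresh context once the window fills up (value name assumed)
./build/bin/main -m model.bin -i --extending restart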

Quick Start

As simple as main_nim -i -m :model_id. Check it out.
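
For example, with a model id taken from the "Download Quantized Models" list (the id below is illustrative; substitute one that is actually listed):

# Downloads the quantized model on demand and starts an interactive chat
main_nim -i -m :qwen2.5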

Usage

Preparation

Clone the ChatLLM.cpp repository to your local machine:

git clone --recursive https://github.com/foldl/chatllm.cpp.git && cd chatllm.cpp

If you forgot the --recursive flag when cloning the repository, run the following command in the chatllm.cpp folder:

git submodule update --init --recursive

Quantize Model

Some quantized models can be downloaded on demand.

Install dependencies of convert.py:

pip install -r requirements.txt

Use convert.py to transform models into quantized GGML format. For example, to convert the fp16 base model to q8_0 (quantized int8) GGML model, run:

# For models such as ChatGLM-6B, ChatGLM2-6B, InternLM, LLaMA, LLaMA-2, Baichuan-2, etc.
python3 convert.py -i path/to/model -t q8_0 -o quantized.bin

# For some models such as CodeLLaMA, the model type should be provided with `-a`.
# Find `-a ...` option for each model in `docs/models.md`.
python3 convert.py -i path/to/model -t q8_0 -o quantized.bin -a CodeLlaMA

Use -l to specify the path of the LoRA model to be merged, such as:

python3 convert.py -i path/to/model -l path/to/lora/model -o quantized.bin

Note: Currently, only the HF format is supported (with a few exceptions); the format of the generated .bin files is different from the one (GGUF) used by llama.cpp.
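
Putting the steps together, a typical flow looks like this (paths are illustrative; the main executable is produced in the Build step below):

# Convert an HF chat model to q8_0
python3 convert.py -i path/to/Llama-2-7b-chat-hf -t q8_0 -o llama2.bin

# After building (next section), chat with the quantized file
./build/bin/main -m llama2.bin -i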

Build

There are several options for building this project.

  • Using CMake:

    cmake -B build
    cmake --build build -j --config Release

    The executable is ./build/bin/main.
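
    GPU acceleration (see What's New, 2025-02-10) is enabled at configure time through ggml's backend switches. The exact option name depends on your backend and this fork's CMake setup, so treat the following as a sketch and check the project's documentation; it assumes ggml's usual CUDA switch:

    # Assumption: the bundled ggml exposes its standard GGML_CUDA option
    cmake -B build -DGGML_CUDA=ON
    cmake --build build -j --config Release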

Run

Now you may chat with a quantized model by running:

./build/bin/main -m chatglm-ggml.bin                            # ChatGLM-6B
# 你好👋!我是人工智能助手 ChatGLM-6B,很高兴见到你,欢迎问我任何问题。 (Hello 👋! I am the AI assistant ChatGLM-6B, nice to meet you; feel free to ask me anything.)
./build/bin/main -m llama2.bin  --seed 100                      # Llama-2-Chat-7B
# Hello! I'm here to help you with any questions or concerns ....

To run the model in interactive mode, add the -i flag. For example:

# On Windows
.\build\bin\Release\main -m model.bin -i

# On Linux (or WSL)
rlwrap ./build/bin/main -m model.bin -i

In interactive mode, your chat history will serve as the context for the next-round conversation.

Run ./build/bin/main -h to explore more options!

Acknowledgements

  • This project started as a refactoring of ChatGLM.cpp, without which this project would not have been possible.

  • Thanks to those who have released their model sources and checkpoints.

Note

This is a hobby project for learning DL & GGML and is under active development. PRs for new features won't be accepted, while PRs for bug fixes are warmly welcome.