Sambal Shikhar, Mohammed Irfan K, Sahal Shaji Mullappilly, Fahad Khan, Jean Lahoud, Rao Muhammad Anwer, Salman Khan, Hisham Cholakkal
Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI), UAE
LLMVoX is a lightweight 30M-parameter, LLM-agnostic, autoregressive streaming Text-to-Speech (TTS) system designed to convert text outputs from Large Language Models into high-fidelity streaming speech with low latency. Our approach achieves significantly lower Word Error Rate compared to speech-enabled LLMs while operating at comparable latency and speech quality.
Key features:
- Lightweight & Fast: Only 30M parameters, delivering speech with end-to-end latency as low as 300ms
- LLM-Agnostic: Plugs into any existing LLM or Vision-Language Model without requiring fine-tuning or architectural modifications
- Multi-Queue Streaming: Enables continuous, low-latency speech generation and infinite-length dialogues
- Multilingual Support: Easily adaptable to new languages with only dataset adaptation
# System requirements
# - CUDA 11.7 or higher
# - Flash Attention 2.0+ compatible GPU (Ampere architecture or newer)
# Clone the repository
git clone https://github.com/mbzuai-oryx/LLMVoX.git
cd LLMVoX
# Create and activate a conda environment
conda create -n llmvox python=3.9
conda activate llmvox
# Install PyTorch with CUDA 11.8 support
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# Install Flash Attention
pip install flash-attn --no-build-isolation
# Install remaining dependencies
pip install -r requirements.txt
# Add path to wavtokenizer to avoid importing errors
export PYTHONPATH=./WavTokenizer/:$PYTHONPATH
# Download checkpoints (if not already in the repository)
mkdir -p CHECKPOINTS
# Download wavtokenizer_large_speech_320_24k.ckpt and ckpt_english_tiny.pt
# and place them in the CHECKPOINTS directory
Download the necessary model checkpoints from Hugging Face:
Hugging Face Repository: MBZUAI/LLMVoX
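If you prefer to script the download, the huggingface_hub package can fetch both files directly. The snippet below is a minimal sketch that assumes the two checkpoint files sit at the top level of the MBZUAI/LLMVoX repository; adjust the filenames if the repository layout differs.

```python
# Sketch: fetch the two LLMVoX checkpoints into CHECKPOINTS/.
# Assumes both files are stored at the top level of MBZUAI/LLMVoX.
from huggingface_hub import hf_hub_download

for filename in ["wavtokenizer_large_speech_320_24k.ckpt", "ckpt_english_tiny.pt"]:
    hf_hub_download(repo_id="MBZUAI/LLMVoX", filename=filename, local_dir="CHECKPOINTS")
```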
LLMVoX requires a few base paths to be set correctly in the inference configuration file at configs/inference_config.py:
- `wavtokenizer_model_path`: Path to the pretrained WavTokenizer model checkpoint
- `llmvox_checkpoint_path`: Path to the trained LLMVoX model checkpoint
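As an illustration (the exact layout of configs/inference_config.py may differ in your checkout), the two paths typically point at the files placed in CHECKPOINTS/:

```python
# Illustrative values only -- check configs/inference_config.py for the actual structure.
wavtokenizer_model_path = "CHECKPOINTS/wavtokenizer_large_speech_320_24k.ckpt"
llmvox_checkpoint_path = "CHECKPOINTS/ckpt_english_tiny.pt"
```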
LLMVoX supports voice-based conversations through its streaming server. Here's how to configure and use the voice chat functionality:
python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct"
LLMVoX uses a multi-queue approach with two TTS model replicas. You can specify which GPUs to use:
# Run TTS models on separate GPUs
python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" --tts_device_1 1 --tts_device_2 2
# Or run both on the same GPU (if memory allows)
python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" --tts_device_1 0 --tts_device_2 0
# Specify GPU for LLM separately
python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" --llm_device "cuda:0" --tts_device_1 1 --tts_device_2 2
Control the balance between latency and quality:
# Lower latency setup (faster initial response but potentially lower quality)
python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" --initial_dump_size_1 5 --initial_dump_size_2 40 --max_dump_size 320
# Higher quality setup (slightly higher latency but better speech)
python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" --initial_dump_size_1 20 --initial_dump_size_2 320 --max_dump_size 2560
# Default balanced setup
python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" --initial_dump_size_1 10 --initial_dump_size_2 160 --max_dump_size 1280
- `initial_dump_size_1`: Number of speech tokens in the first chunk (smaller = faster first response)
- `initial_dump_size_2`: Initial chunk size for the second TTS model (can be larger since it runs while the first chunk plays)
- `max_dump_size`: Maximum chunk size the system will scale up to (larger = better quality)
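The exact scaling policy is implemented inside the streaming server; purely as an illustration of the idea (a doubling schedule is assumed here, not taken from the code), the chunk size grows from the initial value toward max_dump_size so that early chunks arrive quickly and later chunks carry more context:

```python
# Hypothetical sketch of a chunk-size schedule; the real policy in
# streaming_server.py may differ.
def chunk_schedule(initial_size: int, max_size: int, growth: int = 2):
    size = initial_size
    while True:
        yield size
        size = min(size * growth, max_size)

# Default balanced setup: 10 -> 20 -> 40 -> ... -> 1280 speech tokens per chunk.
sizes = chunk_schedule(10, 1280)
print([next(sizes) for _ in range(8)])  # [10, 20, 40, 80, 160, 320, 640, 1280]
```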
Different LLMs use different end-of-sequence tokens:
# For LLaMA models
python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" --eos_token "<|eot_id|>" --llm_max_tokens 1000
# For Mistral models
python streaming_server.py --chat_type voice --llm_checkpoint "mistralai/Mistral-7B-Instruct-v0.2" --eos_token "<|im_end|>" --llm_temperature 0.7
# For other models (check your model's documentation)
python streaming_server.py --chat_type voice --llm_checkpoint "your-model-name" --eos_token "<|end|>"
LLMVoX uses Whisper for converting speech to text:
# Use a larger Whisper model for better transcription
python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" --asr_model "medium" --asr_device "cuda:3"
# Use a smaller model for faster processing
python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" --asr_model "tiny" --asr_device "cuda:0"
Control the LLM's response style:
# For concise responses
python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" --system_prompt "You are a friendly voicebot that answers questions in a concise way and do not use abbreviation. Keep responses brief."
# For more detailed explanations
python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" --system_prompt "You are a helpful AI assistant that provides detailed, thorough explanations. Avoid abbreviations when speaking."
Here's a complete example with all key parameters configured:
python streaming_server.py \
--chat_type voice \
--llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" \
--llm_device "cuda:0" \
--tts_device_1 1 \
--tts_device_2 2 \
--asr_model "small" \
--asr_device "cuda:3" \
--initial_dump_size_1 10 \
--initial_dump_size_2 160 \
--max_dump_size 1280 \
--max_audio_length 8000 \
--eos_token "<|eot_id|>" \
--system_prompt "You are a friendly voicebot that answers questions concisely without abbreviations."
When you run voice chat:
- The ASR model transcribes your speech input
- The LLM generates a response text stream
- Two LLMVoX instances alternate processing text chunks at sentence boundaries
- Initial chunks are smaller for faster response, while later chunks are larger for better quality
- Audio is played in real-time while the rest of the response is still being generated
This multi-queue architecture enables both low latency (as fast as 300ms) and high-quality speech output.
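The outline below is a deliberately simplified, hypothetical sketch of that flow; the object and method names are placeholders, not the actual streaming_server.py API.

```python
import queue
import threading

def run_voice_turn(asr, llm, tts_a, tts_b, player):
    """Hypothetical outline of one voice-chat turn following the steps above."""
    text_q, audio_q = queue.Queue(), queue.Queue()

    def produce_text():
        prompt = asr.transcribe_microphone()           # 1. ASR transcribes the spoken input
        for sentence in llm.stream_sentences(prompt):  # 2. LLM streams a text response
            text_q.put(sentence)
        text_q.put(None)

    def produce_audio():
        tts_models = [tts_a, tts_b]                    # 3. two LLMVoX replicas alternate
        i = 0
        while (sentence := text_q.get()) is not None:
            audio_q.put(tts_models[i % 2].synthesize(sentence))
            i += 1
        audio_q.put(None)

    threading.Thread(target=produce_text, daemon=True).start()
    threading.Thread(target=produce_audio, daemon=True).start()
    while (chunk := audio_q.get()) is not None:        # 4-5. play audio while generation continues
        player.write(chunk)
```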
# Basic text chat with LLaMA 3.1 8B
python streaming_server.py --chat_type text --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" --llm_device "cuda:0"
# Customize LLM generation parameters
python streaming_server.py --chat_type text --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" --llm_temperature 0.5 --llm_top_p 0.9 --llm_top_k 30
# Using Qwen 2.5 VL as the vision-language model
python streaming_server.py --chat_type visual_speech --llm_checkpoint "Qwen/Qwen2.5-VL-7B-Instruct" --llm_device "cuda:0" --asr_model "small" --eos_token "<|im_end|>"
# Using Phi-4-multimodal-instruct, which accepts speech, image, and text input
python streaming_server.py --chat_type multimodal --llm_checkpoint "microsoft/Phi-4-multimodal-instruct" --llm_device "cuda:0" --system_prompt "Answer the question in short responses." --eos_token "<|end|>"
# Using LLaVA
python streaming_server.py --chat_type multimodal --llm_checkpoint "llava-hf/llava-1.5-7b-hf" --llm_device "cuda:0"
Once the streaming server is running, the client-side API reference is as follows.
| Endpoint | Purpose | Required Parameters |
|---|---|---|
| `/tts` | Text-to-speech | `text`: String to convert |
| `/voicechat` | Voice conversations | `audio_base64`: Encoded speech; `source_language`: Input language for Whisper ASR; `target_language`: Output language for Whisper ASR |
| `/multimodalchat` | Voice + multiple images | `audio_base64`: Encoded speech; `image_list`: Array of base64 images |
| `/vlmschat` | Voice + single image | `audio_base64`: Encoded speech; `image_base64`: Single image; `source_language`: Input language for Whisper ASR; `target_language`: Output language for Whisper ASR |
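The usage snippets below call a small api_call helper that is not part of the repository; one possible implementation, assuming the server accepts JSON POST bodies on these endpoints and streams raw audio bytes back, is:

```python
import requests

SERVER_IP = "localhost"  # address of the running streaming server (assumed)
PORT = 5000              # the value passed via --api_port (adjust to your setup)

def api_call(endpoint: str, payload: dict, out_path: str = "response_audio.raw"):
    """POST `payload` to the given endpoint and save the streamed audio bytes."""
    url = f"http://{SERVER_IP}:{PORT}/{endpoint}"
    with requests.post(url, json=payload, stream=True) as resp:
        resp.raise_for_status()
        with open(out_path, "wb") as f:
            for chunk in resp.iter_content(chunk_size=None):
                if chunk:
                    f.write(chunk)
```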
# Text-to-Speech
api_call("tts", {"text": "Hello world"})
# Voice Chat
api_call("voicechat", {
"audio_base64": audio_data,
"source_language": "English",
"target_language": "English"
})
# Multimodal Chat (voice + images) for models like Phi-4-multimodal-instruct
api_call("multimodalchat", {
"audio_base64": audio_data,
"image_list": [image_base64]
})
# Visual Language Model (voice + image)
api_call("vlmschat", {
"audio_base64": audio_data,
"image_base64": image_base64,
"source_language": "English",
"target_language": "English"
})
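For the endpoints that take audio_base64, image_base64, or image_list, the values are base64-encoded files. A minimal way to build the audio_data and image_base64 variables used above (assuming a WAV recording and a JPEG image on disk):

```python
import base64

with open("question.wav", "rb") as f:
    audio_data = base64.b64encode(f.read()).decode("utf-8")

with open("photo.jpg", "rb") as f:
    image_base64 = base64.b64encode(f.read()).decode("utf-8")
```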
import threading
import queue
import requests
import torch
import torchaudio
import numpy as np
from pyaudio import PyAudio, paFloat32
import time
def tts(text, server_ip, port, audio_outpath='output_audio.wav'):
endpoint = f'http://{server_ip}:{port}/tts'
# Audio playback parameters
sample_rate = 24000
channels = 1
chunk_size = 10000
frames_per_buffer = 10000 # Must match the server's chunk size if possible
audio_queue = queue.Queue() # Buffer for audio playback
audio_chunks = [] # Buffer to store audio for saving
start=time.time()
def stream_audio():
"""Streams audio from the server and enqueues it for playback and saving."""
try:
with requests.post(endpoint, json={"text": text}, stream=True) as stream:
stream.raise_for_status() # Raise an error for bad status codes
for chunk in stream.iter_content(chunk_size=None):
if chunk:
try:
# Enqueue the chunk for playback
print(f"Got chunk at {time.time()-start}")
audio_queue.put(chunk, timeout=1)
# Store the chunk for saving
audio_chunks.append(chunk)
except queue.Full:
print("Audio queue is full. Dropping chunk.")
except requests.exceptions.RequestException as e:
print(f"RequestException: {e}")
finally:
# Signal the end of streaming
audio_queue.put(None)
def play_audio():
"""Plays audio chunks from the queue using PyAudio."""
p = PyAudio()
try:
player = p.open(format=paFloat32,
channels=channels,
rate=sample_rate,
output=True,)
while True:
chunk = audio_queue.get()
if chunk is None:
print("End of streaming.")
break # End of streaming
if not chunk:
print("Received an empty chunk. Skipping.")
continue # Skip empty chunks
try:
print("Playing chunk")
player.write(chunk)
except Exception as e:
print(f"Error during playback: {e}")
break
finally:
player.stop_stream()
player.close()
p.terminate()
# Start streaming and playback in separate threads
stream_thread = threading.Thread(target=stream_audio, daemon=True)
play_thread = threading.Thread(target=play_audio, daemon=True)
stream_thread.start()
play_thread.start()
# Wait for both threads to finish
stream_thread.join()
play_thread.join()
# After streaming is complete, process and save the audio
if audio_chunks:
print("Saving audio to file...")
# Concatenate all audio chunks into a single bytes object
audio_bytes = b''.join(audio_chunks)
# Convert bytes to a NumPy array of float32
audio_np = np.frombuffer(audio_bytes, dtype=np.float32)
# Optional: Normalize if necessary
# Uncomment the following lines if you need to normalize the audio
# max_val = np.max(np.abs(audio_np))
# if max_val > 1.0:
# audio_np = audio_np / max_val
# Reshape the array to (channels, samples)
if channels > 1:
audio_np = audio_np.reshape(-1, channels).T # Transpose to (channels, samples)
else:
audio_np = audio_np.reshape(1, -1) # (1, samples)
# Convert float32 to int16 for 'PCM_S' encoding
# Ensure the float32 data is in the range [-1.0, 1.0]
audio_np = np.clip(audio_np, -1.0, 1.0)
audio_int16 = (audio_np * 32767).astype(np.int16)
# Convert NumPy array to PyTorch tensor
audio_tensor = torch.from_numpy(audio_int16)
# Save the audio using torchaudio with specified settings
try:
torchaudio.save(
audio_outpath,
audio_tensor,
sample_rate=sample_rate,
encoding='PCM_S',
bits_per_sample=16,
format='wav'
)
print(f"Audio successfully saved to {audio_outpath}")
except Exception as e:
print(f"Error saving audio: {e}")
else:
print("No audio chunks received. Nothing to save.")
if __name__ == '__main__':
SERVER_IP = "localhost"  # address of the running streaming server
PORT = 5000  # port passed via --api_port (adjust to your setup)
text = "hello how are you?"
tts(text, SERVER_IP, PORT, audio_outpath="output.wav")
You can easily create a text streamer for your custom model and integrate it with LLMVoX in streaming_server.py. Refer to the inference folder for streamer templates for LLMs, VLMs, and multimodal LLMs.
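As a rough, hypothetical illustration of what such a streamer can look like for a Hugging Face causal LM (the actual templates in the inference folder may differ), text can be yielded incrementally with transformers.TextIteratorStreamer and fed to LLMVoX as it arrives:

```python
# Illustrative sketch only; see the inference folder for the real templates.
from threading import Thread
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

def stream_text(model_name: str, prompt: str, max_new_tokens: int = 256):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    Thread(target=model.generate,
           kwargs=dict(**inputs, streamer=streamer, max_new_tokens=max_new_tokens)).start()
    for piece in streamer:  # yield text pieces as they are generated
        yield piece
```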
This local demo UI is built using PyQt5.
# Run a streaming server
python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" --eos_token "<|eot_id|>" --llm_max_tokens 1000 --api_port PORT
# Run the local demo UI
python run_ui.py --ip STREAMING_SERVER_IP --port PORT
- Voice Chat
- Text Chat
- Visual Speech
- Dataset: VoiceAssistant-400K
- Format: JSON file with entries mapping text to audio files:
[
{
"speech_folder": "/path/to/audio/files",
"speech_file": "audio1.wav",
"answer_text": "Text transcript",
"id": "unique_id_1"
}
]
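Before training, it can be useful to sanity-check the dataset JSON against the audio folder. A minimal sketch, assuming the format shown above:

```python
import json
from pathlib import Path

def check_dataset(json_path: str):
    """Report entries whose audio file is missing or whose transcript is empty."""
    entries = json.loads(Path(json_path).read_text())
    missing = [e["id"] for e in entries
               if not (Path(e["speech_folder"]) / e["speech_file"]).is_file()]
    empty = [e["id"] for e in entries if not e.get("answer_text", "").strip()]
    print(f"{len(entries)} entries, {len(missing)} missing audio, {len(empty)} empty transcripts")

check_dataset("/path/to/dataset.json")
```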
The training configuration is in configs/train_config.py:
- Model Architecture:
  - `n_layer`: Number of transformer layers (default: 4)
  - `n_head`: Number of attention heads (default: 8)
  - `n_embd`: Embedding dimension (default: 768)
  - `block_size`: Context length (default: 8192)
- Training Settings:
  - `gradient_accumulation_steps`: Gradient accumulation steps before each update (default: 4)
  - `batch_size`: Batch size per GPU (default: 2)
  - `learning_rate`: Peak learning rate (default: 3e-4)
  - `max_iters`: Maximum iterations (default: 2600000)
- Paths:
  - `data_path`: Path to the dataset JSON
  - `speech_data_folder`: Path to the audio files
  - `out_dir`: Output directory for checkpoints
  - `encoder_model_path`: Path to the ByT5 model from CharsiuG2P used for multilingual grapheme-to-phoneme conversion (provides phoneme embeddings for words)
python train.py \
--n_layer=4 \
--n_head=8 \
--n_embd=768 \
--block_size=8192 \
--dropout=0.0 \
--bias=False \
--data_path="/path/to/dataset.json" \
--speech_data_folder="/path/to/audio_files" \
--encoder_model_path="charsiu/g2p_multilingual_byT5_tiny_16_layers_100" \
--tokenizer_path="google/byt5-small" \
--wav_config_path="WavTokenizer/configs/wavtokenizer_smalldata_frame75_3s_nq1_code4096_dim512_kmeans200_attn.yaml" \
--wav_model_path="/path/to/wavtokenizer_large_speech_320_24k.ckpt" \
--out_dir="my_llmvox_model" \
--batch_size=2 \
--gradient_accumulation_steps=4 \
--learning_rate=3e-4 \
--weight_decay=1e-1 \
--warmup_iters=50000 \
--lr_decay_iters=2600000 \
--min_lr=3e-6 \
--eval_interval=1000 \
--compile=True \
--wandb_log=True \
--wandb_project="speech_stream" \
--wandb_run_name="llmvox_training_run"
This comprehensive command shows all configurable parameters for training an LLMVoX model. Adjust values based on your hardware capabilities and specific requirements.
# Single GPU
python train.py --batch_size=8 --learning_rate=5e-5 --n_layer=6
# Distributed Training
torchrun --standalone --nproc_per_node=4 train.py --batch_size=16
If you find our work useful, please consider citing:
@article{shikhar2025llmvox,
title={LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM},
author={Shikhar, Sambal and Kurpath, Mohammed Irfan and Mullappilly, Sahal Shaji and Lahoud, Jean and Khan, Fahad and Anwer, Rao Muhammad and Khan, Salman and Cholakkal, Hisham},
journal={arXiv preprint arXiv:2503.04724},
year={2025}
}
LLMVoX is released under the CC-BY-NC-SA 4.0 License. For more details, please refer to the LICENSE file included in this repository.
We thank the reviewers and colleagues who provided valuable feedback on this work. We also acknowledge the open-source contributions that made this project possible:
- Andrej Karpathy's NanoGPT - Training code for LLMVoX is based on this repository
- WavTokenizer - For audio tokenization
- Whisper - Used for ASR in our pipeline
- Neural G2P - For the multilingual phoneme embeddings