Official PyTorch implementation of YOLOE.
Comparison of performance, training cost, and inference efficiency between YOLOE (Ours) and YOLO-Worldv2 in terms of open text prompts.
YOLOE: Real-Time Seeing Anything.
Ao Wang*, Lihao Liu*, Hui Chen, Zijia Lin, Jungong Han, and Guiguang Ding\
We introduce YOLOE(ye), a highly efficient, unified, and open object detection and segmentation model that, like the human eye, works under different prompt mechanisms, such as text prompts, visual inputs, and a prompt-free paradigm.
Abstract
Object detection and segmentation are widely employed in computer vision applications, yet conventional models like the YOLO series, while efficient and accurate, are limited by predefined categories, hindering adaptability in open scenarios. Recent open-set methods leverage text prompts, visual cues, or a prompt-free paradigm to overcome this, but often compromise between performance and efficiency due to high computational demands or deployment complexity. In this work, we introduce YOLOE, which integrates detection and segmentation across diverse open prompt mechanisms within a single highly efficient model, achieving real-time seeing anything. For text prompts, we propose the Re-parameterizable Region-Text Alignment (RepRTA) strategy. It refines pretrained textual embeddings via a re-parameterizable lightweight auxiliary network and enhances visual-textual alignment with zero inference and transferring overhead. For visual prompts, we present the Semantic-Activated Visual Prompt Encoder (SAVPE). It employs decoupled semantic and activation branches to bring improved visual embedding and accuracy with minimal complexity. For the prompt-free scenario, we introduce the Lazy Region-Prompt Contrast (LRPC) strategy. It utilizes a built-in large vocabulary and specialized embedding to identify all objects, avoiding costly language model dependency. Extensive experiments show YOLOE's exceptional zero-shot performance and transferability with high inference efficiency and low training cost. Notably, on LVIS, with 3$\times$ less training cost and 1.4$\times$ inference speedup, YOLOE-v8-S surpasses YOLO-Worldv2-S by 3.5 AP. When transferring to COCO, YOLOE-v8-L achieves 0.6 AP$^b$ and 0.4 AP$^m$ gains over closed-set YOLOv8-L with nearly 4$\times$ less training time. Code and models will be publicly available.

- Fixed AP is reported on the LVIS `minival` set with text (T) / visual (V) prompts.
- Training time is measured for text prompts on 8 Nvidia RTX 4090 GPUs.
- FPS is measured on T4 with TensorRT and iPhone 12 with CoreML, respectively.
- For training data, OG denotes Objects365v1 and GoldG.
- After re-parameterization, YOLOE becomes a standard YOLO model (YOLOv8 / YOLO11) with zero inference and transferring overhead.
Model | Prompt | Params | Data | Time | FPS (T4 / iPhone 12) | AP | AP$_r$ | AP$_c$ | AP$_f$ | Log |
---|---|---|---|---|---|---|---|---|---|---|
YOLOE-v8-S | T / V | 12M / 13M | OG | 12.0h | 305.8 / 64.3 | 27.9 / 26.2 | 22.3 / 21.3 | 27.8 / 27.7 | 29.0 / 25.7 | T / V |
YOLOE-v8-M | T / V | 27M / 30M | OG | 17.0h | 156.7 / 41.7 | 32.6 / 31.0 | 26.9 / 27.0 | 31.9 / 31.7 | 34.4 / 31.1 | T / V |
YOLOE-v8-L | T / V | 45M / 50M | OG | 22.5h | 102.5 / 27.2 | 35.9 / 34.2 | 33.2 / 33.2 | 34.8 / 34.6 | 37.3 / 34.1 | T / V |
YOLOE-11-S | T / V | 10M / 12M | OG | 13.0h | 301.2 / 73.3 | 27.5 / 26.3 | 21.4 / 22.5 | 26.8 / 27.1 | 29.3 / 26.4 | T / V |
YOLOE-11-M | T / V | 21M / 27M | OG | 18.5h | 168.3 / 39.2 | 33.0 / 31.4 | 26.9 / 27.1 | 32.5 / 31.9 | 34.5 / 31.7 | T / V |
YOLOE-11-L | T / V | 26M / 32M | OG | 23.5h | 130.5 / 35.1 | 35.2 / 33.7 | 29.1 / 28.1 | 35.0 / 34.6 | 36.5 / 33.8 | T / V |
- The models are the same as those in the zero-shot detection evaluation above.
- Standard AP$^m$ is reported on the LVIS `val` set with text (T) / visual (V) prompts.
Model | Prompt | AP$^m$ | AP$^m_r$ | AP$^m_c$ | AP$^m_f$ |
---|---|---|---|---|---|
YOLOE-v8-S | T / V | 17.7 / 16.8 | 15.5 / 13.5 | 16.3 / 16.7 | 20.3 / 18.2 |
YOLOE-v8-M | T / V | 20.8 / 20.3 | 17.2 / 17.0 | 19.2 / 20.1 | 24.2 / 22.0 |
YOLOE-v8-L | T / V | 23.5 / 22.0 | 21.9 / 16.5 | 21.6 / 22.1 | 26.4 / 24.3 |
YOLOE-11-S | T / V | 17.6 / 17.1 | 16.1 / 14.4 | 15.6 / 16.8 | 20.5 / 18.6 |
YOLOE-11-M | T / V | 21.1 / 21.0 | 17.2 / 18.3 | 19.6 / 20.6 | 24.4 / 22.6 |
YOLOE-11-L | T / V | 22.6 / 22.5 | 19.3 / 20.5 | 20.9 / 21.7 | 26.0 / 24.1 |
- The models are the same as those in the zero-shot detection evaluation above, except for the specialized prompt embedding.
- Fixed AP is reported on the LVIS `minival` set, and FPS is measured on an Nvidia T4 GPU with PyTorch.
Model | Params | AP | AP$_r$ | AP$_c$ | AP$_f$ | FPS | Log |
---|---|---|---|---|---|---|---|
YOLOE-v8-S | 13M | 21.0 | 19.1 | 21.3 | 21.0 | 95.8 | PF |
YOLOE-v8-M | 29M | 24.7 | 22.2 | 24.5 | 25.3 | 45.9 | PF |
YOLOE-v8-L | 47M | 27.2 | 23.5 | 27.0 | 28.0 | 25.3 | PF |
YOLOE-11-S | 11M | 20.6 | 18.4 | 20.2 | 21.3 | 93.0 | PF |
YOLOE-11-M | 24M | 25.5 | 21.6 | 25.5 | 26.1 | 42.5 | PF |
YOLOE-11-L | 29M | 26.3 | 22.7 | 25.8 | 27.5 | 34.9 | PF |
- During transferring, YOLOE-v8 / YOLOE-11 is exactly the same as YOLOv8 / YOLO11.
- For linear probing, only the last conv in the classification head is trainable; for full tuning, all parameters are trainable.
Model | Epochs | AP$^b$ | AP$^b_{50}$ | AP$^b_{75}$ | AP$^m$ | AP$^m_{50}$ | AP$^m_{75}$ | Log |
---|---|---|---|---|---|---|---|---|
Linear probing | ||||||||
YOLOE-v8-S | 10 | 35.6 | 51.5 | 38.9 | 30.3 | 48.2 | 32.0 | LP |
YOLOE-v8-M | 10 | 42.2 | 59.2 | 46.3 | 35.5 | 55.6 | 37.7 | LP |
YOLOE-v8-L | 10 | 45.4 | 63.3 | 50.0 | 38.3 | 59.6 | 40.8 | LP |
YOLOE-11-S | 10 | 37.0 | 52.9 | 40.4 | 31.5 | 49.7 | 33.5 | LP |
YOLOE-11-M | 10 | 43.1 | 60.6 | 47.4 | 36.5 | 56.9 | 39.0 | LP |
YOLOE-11-L | 10 | 45.1 | 62.8 | 49.5 | 38.0 | 59.2 | 40.6 | LP |
Full tuning | ||||||||
YOLOE-v8-S | 160 | 45.0 | 61.6 | 49.1 | 36.7 | 58.3 | 39.1 | FT |
YOLOE-v8-M | 80 | 50.4 | 67.0 | 55.2 | 40.9 | 63.7 | 43.5 | FT |
YOLOE-v8-L | 80 | 53.0 | 69.8 | 57.9 | 42.7 | 66.5 | 45.6 | FT |
YOLOE-11-S | 160 | 46.2 | 62.9 | 50.0 | 37.6 | 59.3 | 40.1 | FT |
YOLOE-11-M | 80 | 51.3 | 68.3 | 56.0 | 41.5 | 64.8 | 44.3 | FT |
YOLOE-11-L | 80 | 52.6 | 69.7 | 57.5 | 42.4 | 66.2 | 45.2 | FT |
A `conda` virtual environment is recommended.
conda create -n yoloe python=3.10 -y
conda activate yoloe
pip install -r requirements.txt
pip install -e .
pip install -e lvis-api
pip install -e ml-mobileclip
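After installation, a quick sanity check of the environment can be useful. This is a generic PyTorch check, not part of the repo's scripts:

```python
# Optional sanity check that the environment sees PyTorch and the GPU.
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```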
# Launch the local web demo
python app.py
# Please visit http://127.0.0.1:7860
# Prediction with text prompts
python predict.py
# Prediction with visual prompts
python predict_vp.py
# Prompt-free prediction
python predict_pf.py
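As an illustration only, a text-prompt prediction typically follows the ultralytics-style API this repo builds on. The checkpoint name, class list, and image path below are placeholders, and the exact method names may differ from the released predict.py:

```python
# Minimal sketch of text-prompt prediction (placeholder checkpoint/image paths).
from ultralytics import YOLOE

model = YOLOE("pretrain/yoloe-v8l-seg.pt")            # placeholder checkpoint
names = ["person", "bus", "traffic light"]            # open-vocabulary text prompts
model.set_classes(names, model.get_text_pe(names))    # bind text prompts to the head

results = model.predict("assets/example.jpg")         # placeholder image path
results[0].show()
```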
- Please download LVIS following here or lvis.yaml.
- We use this `minival.txt` with background images for evaluation.
# For evaluation with visual prompts, please obtain the referring data.
python tools/generate_lvis_visual_prompt_data.py
- For text prompts, run `python val.py`.
- For visual prompts, run `python val_vp.py`.
- For prompt-free evaluation, run `python val_pe_free.py`.
# Compute fixed AP for the open-ended (prompt-free) evaluation
python tools/eval_open_ended.py --json ../datasets/lvis/annotations/lvis_v1_minival.json --pred runs/detect/val/predictions.json --fixed
# Evaluation on COCO
python val_coco.py
The training includes three stages (the corresponding commands follow the data preparation steps below):
- YOLOE is trained with text prompts for detection and segmentation for 30 epochs.
- Only the visual prompt encoder (SAVPE) is trained with visual prompts for 2 epochs.
- Only the specialized prompt embedding for the prompt-free scenario is trained for 1 epoch.
Images | Raw Annotations | Processed Annotations |
---|---|---|
Objects365v1 | objects365_train.json | objects365_train_segm.json |
GQA | final_mixed_train_no_coco.json | final_mixed_train_no_coco_segm.json |
Flickr30k | final_flickr_separateGT_train.json | final_flickr_separateGT_train_segm.json |
For annotations, you can directly use our preprocessed ones or use the following script to obtain the processed annotations with segmentation masks.
# Generate segmentation data
python tools/generate_sam_masks.py --img-path ../datasets/Objects365v1/images/train --json-path ../datasets/Objects365v1/annotations/objects365_train.json --batch
python tools/generate_sam_masks.py --img-path ../datasets/flickr/full_images/ --json-path ../datasets/flickr/annotations/final_flickr_separateGT_train.json
python tools/generate_sam_masks.py --img-path ../datasets/mixed_grounding/gqa/images --json-path ../datasets/mixed_grounding/annotations/final_mixed_train_no_coco.json
# Generate objects365v1 labels
python tools/generate_objects365v1.py
Then, please generate the data and embedding cache for training.
# Generate grounding segmentation cache
python tools/generate_grounding_cache.py --img-path ../datasets/flickr/full_images/ --json-path ../datasets/flickr/annotations/final_flickr_separateGT_train_segm.json
python tools/generate_grounding_cache.py --img-path ../datasets/mixed_grounding/gqa/images --json-path ../datasets/mixed_grounding/annotations/final_mixed_train_no_coco_segm.json
# Generate train label embeddings
python tools/generate_label_embedding.py
python tools/generate_global_neg_cat.py
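For reference, the label embedding cache is built from a text encoder (the repo installs ml-mobileclip during setup). Below is a minimal sketch of encoding category names with MobileCLIP; the model variant, checkpoint path, category list, and output file are placeholders, and the actual logic lives in tools/generate_label_embedding.py:

```python
# Sketch: cache normalized text embeddings for category names with MobileCLIP.
# Variant, checkpoint path, categories, and output file are placeholders.
import torch
import mobileclip

model, _, _ = mobileclip.create_model_and_transforms(
    "mobileclip_s0", pretrained="pretrain/mobileclip_s0.pt"
)
tokenizer = mobileclip.get_tokenizer("mobileclip_s0")

names = ["person", "bicycle", "car"]  # placeholder category names
with torch.no_grad():
    tokens = tokenizer([f"a photo of a {n}" for n in names])
    text_embeds = model.encode_text(tokens)
    text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)

torch.save({"names": names, "embeddings": text_embeds}, "label_embeddings.pt")
```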
# Stage 1: training with text prompts (detection and segmentation)
python train_seg.py
# Stage 2: training the visual prompt encoder (SAVPE) with visual prompts
python train_seg_vp.py
# Stage 3: training the specialized prompt embedding for prompt-free use
python train_pe_free.py
After pretraining, YOLOE-v8 / YOLOE-11 can be re-parameterized into the same architecture as YOLOv8 / YOLO11, with zero overhead for transferring.
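Conceptually, re-parameterization folds the refined prompt embeddings into the classification head as fixed 1x1 convolution weights, so the deployed model has the same structure as a standard YOLO head. The sketch below only illustrates this idea; the sizes and tensors are assumptions, not the repo's re-parameterization code:

```python
# Illustrative only: fold normalized prompt embeddings into a 1x1 conv, turning
# the open-vocabulary classification head into a plain fixed conv head.
import torch
import torch.nn as nn

num_classes, embed_dim = 80, 512                      # assumed sizes
prompt_embed = torch.randn(num_classes, embed_dim)    # stand-in for refined prompt embeddings
prompt_embed = prompt_embed / prompt_embed.norm(dim=-1, keepdim=True)

cls_head = nn.Conv2d(embed_dim, num_classes, kernel_size=1, bias=False)
with torch.no_grad():
    cls_head.weight.copy_(prompt_embed.view(num_classes, embed_dim, 1, 1))

visual_feat = torch.randn(1, embed_dim, 20, 20)       # dummy per-pixel visual embeddings
logits = cls_head(visual_feat)                        # (1, num_classes, 20, 20) region-prompt similarities
```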
For linear probing, only the last conv, i.e., the prompt embedding, is trainable; a generic sketch of this freezing pattern follows the command below.
python train_pe.py
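The following is a generic PyTorch sketch of linear probing, assuming a toy stand-in model rather than the actual YOLOE head layout used in train_pe.py:

```python
# Generic sketch of linear probing: freeze everything, then unfreeze only the
# final conv (standing in for the prompt embedding head). The toy model is a
# placeholder, not the real detection head.
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.SiLU(),
    nn.Conv2d(64, 512, 3, padding=1), nn.SiLU(),
    nn.Conv2d(512, 80, 1),  # "last conv": the trainable prompt embedding
)

for p in model.parameters():
    p.requires_grad = False
for p in model[-1].parameters():  # unfreeze only the last conv
    p.requires_grad = True

print([n for n, p in model.named_parameters() if p.requires_grad])
```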
For full tuning, all parameters are trainable, which yields better performance.
python train_pe_all.py
After re-parameterization, YOLOE-v8 / YOLOE-11 can be exported in the same format as YOLOv8 / YOLO11.
python export.py
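For reference, export can follow the ultralytics-style export API this repo builds on (the FPS notes above use TensorRT on T4 and CoreML on iPhone 12). The checkpoint name below is a placeholder, and export.py remains the authoritative script:

```python
# Sketch: export a re-parameterized checkpoint via the ultralytics-style export API.
from ultralytics import YOLOE

model = YOLOE("pretrain/yoloe-v8l-seg-coco.pt")  # placeholder re-parameterized checkpoint
model.export(format="onnx", half=True)           # alternatives: format="engine" (TensorRT) or "coreml"
```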
The code base is built with ultralytics, YOLO-World, and GenerateU.
Thanks for the great implementations!
If our code or models help your work, please cite our paper: