
Official PyTorch implementation of YOLOE.


Figure: Comparison of performance, training cost, and inference efficiency between YOLOE (Ours) and YOLO-Worldv2 in terms of open text prompts.

YOLOE: Real-Time Seeing Anything.
Ao Wang*, Lihao Liu*, Hui Chen, Zijia Lin, Jungong Han, and Guiguang Ding

We introduce YOLOE (ye), a highly efficient, unified, and open object detection and segmentation model that, like the human eye, sees anything under different prompt mechanisms: text prompts, visual inputs, and a prompt-free paradigm.


Abstract

Object detection and segmentation are widely employed in computer vision applications, yet conventional models like YOLO series, while efficient and accurate, are limited by predefined categories, hindering adaptability in open scenarios. Recent open-set methods leverage text prompts, visual cues, or prompt-free paradigm to overcome this, but often compromise between performance and efficiency due to high computational demands or deployment complexity. In this work, we introduce YOLOE, which integrates detection and segmentation across diverse open prompt mechanisms within a single highly efficient model, achieving real-time seeing anything. For text prompts, we propose Re-parameterizable Region-Text Alignment (RepRTA) strategy. It refines pretrained textual embeddings via a re-parameterizable lightweight auxiliary network and enhances visual-textual alignment with zero inference and transferring overhead. For visual prompts, we present Semantic-Activated Visual Prompt Encoder (SAVPE). It employs decoupled semantic and activation branches to bring improved visual embedding and accuracy with minimal complexity. For prompt-free scenario, we introduce Lazy Region-Prompt Contrast (LRPC) strategy. It utilizes a built-in large vocabulary and specialized embedding to identify all objects, avoiding costly language model dependency. Extensive experiments show YOLOE's exceptional zero-shot performance and transferability with high inference efficiency and low training cost. Notably, on LVIS, with 3$\times$ less training cost and 1.4$\times$ inference speedup, YOLOE-v8-S surpasses YOLO-Worldv2-S by 3.5 AP. When transferring to COCO, YOLOE-v8-L achieves 0.6 AP$^b$ and 0.4 AP$^m$ gains over closed-set YOLOv8-L with nearly 4$\times$ less training time. Code and models will be publicly available.

Performance

Zero-shot detection evaluation

  • Fixed AP is reported on LVIS minival set with text (T) / visual (V) prompts.
  • Training time is for text prompts based on 8 Nvidia RTX4090 GPUs.
  • FPS is measured on T4 with TensorRT and iPhone 12 with CoreML, respectively.
  • For training data, OG denotes Objects365v1 and GoldG.
  • After re-parameterization, YOLOE becomes the same as the original YOLOs, with zero inference and transferring overhead.
| Model | Prompt | Params | Data | Time | FPS (T4 / iPhone) | AP | AP_r | AP_c | AP_f | Log |
|---|---|---|---|---|---|---|---|---|---|---|
| YOLOE-v8-S | T / V | 12M / 13M | OG | 12.0h | 305.8 / 64.3 | 27.9 / 26.2 | 22.3 / 21.3 | 27.8 / 27.7 | 29.0 / 25.7 | T / V |
| YOLOE-v8-M | T / V | 27M / 30M | OG | 17.0h | 156.7 / 41.7 | 32.6 / 31.0 | 26.9 / 27.0 | 31.9 / 31.7 | 34.4 / 31.1 | T / V |
| YOLOE-v8-L | T / V | 45M / 50M | OG | 22.5h | 102.5 / 27.2 | 35.9 / 34.2 | 33.2 / 33.2 | 34.8 / 34.6 | 37.3 / 34.1 | T / V |
| YOLOE-11-S | T / V | 10M / 12M | OG | 13.0h | 301.2 / 73.3 | 27.5 / 26.3 | 21.4 / 22.5 | 26.8 / 27.1 | 29.3 / 26.4 | T / V |
| YOLOE-11-M | T / V | 21M / 27M | OG | 18.5h | 168.3 / 39.2 | 33.0 / 31.4 | 26.9 / 27.1 | 32.5 / 31.9 | 34.5 / 31.7 | T / V |
| YOLOE-11-L | T / V | 26M / 32M | OG | 23.5h | 130.5 / 35.1 | 35.2 / 33.7 | 29.1 / 28.1 | 35.0 / 34.6 | 36.5 / 33.8 | T / V |

Zero-shot segmentation evaluation

| Model | Prompt | AP^m | AP_r^m | AP_c^m | AP_f^m |
|---|---|---|---|---|---|
| YOLOE-v8-S | T / V | 17.7 / 16.8 | 15.5 / 13.5 | 16.3 / 16.7 | 20.3 / 18.2 |
| YOLOE-v8-M | T / V | 20.8 / 20.3 | 17.2 / 17.0 | 19.2 / 20.1 | 24.2 / 22.0 |
| YOLOE-v8-L | T / V | 23.5 / 22.0 | 21.9 / 16.5 | 21.6 / 22.1 | 26.4 / 24.3 |
| YOLOE-11-S | T / V | 17.6 / 17.1 | 16.1 / 14.4 | 15.6 / 16.8 | 20.5 / 18.6 |
| YOLOE-11-M | T / V | 21.1 / 21.0 | 17.2 / 18.3 | 19.6 / 20.6 | 24.4 / 22.6 |
| YOLOE-11-L | T / V | 22.6 / 22.5 | 19.3 / 20.5 | 20.9 / 21.7 | 26.0 / 24.1 |

Prompt-free evaluation

  • The model is the same as in the zero-shot detection evaluation above, except for the specialized prompt embedding.
  • Fixed AP is reported on the LVIS minival set and FPS is measured on an Nvidia T4 GPU with PyTorch.
| Model | Params | AP | AP_r | AP_c | AP_f | FPS | Log |
|---|---|---|---|---|---|---|---|
| YOLOE-v8-S | 13M | 21.0 | 19.1 | 21.3 | 21.0 | 95.8 | PF |
| YOLOE-v8-M | 29M | 24.7 | 22.2 | 24.5 | 25.3 | 45.9 | PF |
| YOLOE-v8-L | 47M | 27.2 | 23.5 | 27.0 | 28.0 | 25.3 | PF |
| YOLOE-11-S | 11M | 20.6 | 18.4 | 20.2 | 21.3 | 93.0 | PF |
| YOLOE-11-M | 24M | 25.5 | 21.6 | 25.5 | 26.1 | 42.5 | PF |
| YOLOE-11-L | 29M | 26.3 | 22.7 | 25.8 | 27.5 | 34.9 | PF |

Downstream transfer on COCO

  • During transfer, YOLOE-v8 / YOLOE-11 is exactly the same as YOLOv8 / YOLO11.
  • For linear probing, only the last conv in the classification head is trainable; for full tuning, all parameters are trainable.
| Model | Epochs | AP^b | AP_50^b | AP_75^b | AP^m | AP_50^m | AP_75^m | Log |
|---|---|---|---|---|---|---|---|---|
| Linear probing |  |  |  |  |  |  |  |  |
| YOLOE-v8-S | 10 | 35.6 | 51.5 | 38.9 | 30.3 | 48.2 | 32.0 | LP |
| YOLOE-v8-M | 10 | 42.2 | 59.2 | 46.3 | 35.5 | 55.6 | 37.7 | LP |
| YOLOE-v8-L | 10 | 45.4 | 63.3 | 50.0 | 38.3 | 59.6 | 40.8 | LP |
| YOLOE-11-S | 10 | 37.0 | 52.9 | 40.4 | 31.5 | 49.7 | 33.5 | LP |
| YOLOE-11-M | 10 | 43.1 | 60.6 | 47.4 | 36.5 | 56.9 | 39.0 | LP |
| YOLOE-11-L | 10 | 45.1 | 62.8 | 49.5 | 38.0 | 59.2 | 40.6 | LP |
| Full tuning |  |  |  |  |  |  |  |  |
| YOLOE-v8-S | 160 | 45.0 | 61.6 | 49.1 | 36.7 | 58.3 | 39.1 | FT |
| YOLOE-v8-M | 80 | 50.4 | 67.0 | 55.2 | 40.9 | 63.7 | 43.5 | FT |
| YOLOE-v8-L | 80 | 53.0 | 69.8 | 57.9 | 42.7 | 66.5 | 45.6 | FT |
| YOLOE-11-S | 160 | 46.2 | 62.9 | 50.0 | 37.6 | 59.3 | 40.1 | FT |
| YOLOE-11-M | 80 | 51.3 | 68.3 | 56.0 | 41.5 | 64.8 | 44.3 | FT |
| YOLOE-11-L | 80 | 52.6 | 69.7 | 57.5 | 42.4 | 66.2 | 45.2 | FT |

Installation

A conda virtual environment is recommended.

conda create -n yoloe python=3.10 -y
conda activate yoloe

pip install -r requirements.txt
pip install -e .
pip install -e lvis-api
pip install -e ml-mobileclip
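
To confirm the environment is set up correctly, a quick check like the following can be used (illustrative only; exact versions depend on requirements.txt):

# Quick sanity check of the installed environment (illustrative only).
import torch
import ultralytics  # installed in editable mode by `pip install -e .`

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("ultralytics:", ultralytics.__version__)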

Demo

python app.py
# Please visit http://127.0.0.1:7860

Prediction

Text prompt

python predict.py
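
predict.py is the reference script. For orientation, a minimal sketch of text-prompted inference, assuming the bundled ultralytics fork exposes a YOLOE class with get_text_pe / set_classes; the checkpoint path and class names below are placeholders:

# Minimal text-prompt inference sketch (paths, class names, and API details are
# assumptions; see predict.py for the exact pipeline used in this repo).
from ultralytics import YOLOE

model = YOLOE("pretrain/yoloe-v8l-seg.pt")           # pretrained YOLOE checkpoint
names = ["person", "bus", "dog"]                     # open-vocabulary text prompts
model.set_classes(names, model.get_text_pe(names))   # bind text embeddings to the head

results = model.predict("ultralytics/assets/bus.jpg", conf=0.25)
results[0].save("out.jpg")                           # save the visualization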

Visual prompt

python predict_vp.py
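
predict_vp.py is the reference for visual prompting; conceptually, an example box marks the target category and SAVPE turns it into a prompt embedding. A rough sketch with hypothetical argument names:

# Visual-prompt inference sketch (argument names and predictor wiring are
# assumptions; see predict_vp.py for the reference implementation).
import numpy as np
from ultralytics import YOLOE

model = YOLOE("pretrain/yoloe-v8l-seg.pt")

# One example box per prompted category, given as [x1, y1, x2, y2] in pixels.
prompts = dict(bboxes=np.array([[221.5, 405.8, 344.9, 857.5]]),
               cls=np.array([0]))                    # prompt index for each box

# A dedicated visual-prompt predictor may also need to be passed; see predict_vp.py.
results = model.predict("ultralytics/assets/bus.jpg", visual_prompts=prompts)
results[0].save("out_vp.jpg")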

Prompt free

python predict_pf.py

Validation

Data

# For evaluation with visual prompts, please obtain the referring data.
python tools/generate_lvis_visual_prompt_data.py

Zero-shot evaluation on LVIS

  • For text prompts, python val.py.
  • For visual prompts, python val_vp.py

Prompt-free evaluation

python val_pe_free.py
python tools/eval_open_ended.py --json ../datasets/lvis/annotations/lvis_v1_minival.json --pred runs/detect/val/predictions.json --fixed

Downstream transfer on COCO

python val_coco.py

Training

The training includes three stages:

  • YOLOE is trained with text prompts for detection and segmentation for 30 epochs.
  • Only the visual prompt encoder (SAVPE) is trained with visual prompts for 2 epochs.
  • Only the specialized prompt embedding for the prompt-free setting is trained for 1 epoch.

Data

| Images | Raw Annotations | Processed Annotations |
|---|---|---|
| Objects365v1 | objects365_train.json | objects365_train_segm.json |
| GQA | final_mixed_train_no_coco.json | final_mixed_train_no_coco_segm.json |
| Flickr30k | final_flickr_separateGT_train.json | final_flickr_separateGT_train_segm.json |

For annotations, you can directly use our preprocessed ones or use the following script to obtain the processed annotations with segmentation masks.

# Generate segmentation data
python tools/generate_sam_masks.py --img-path ../datasets/Objects365v1/images/train --json-path ../datasets/Objects365v1/annotations/objects365_train.json --batch
python tools/generate_sam_masks.py --img-path ../datasets/flickr/full_images/ --json-path ../datasets/flickr/annotations/final_flickr_separateGT_train.json
python tools/generate_sam_masks.py --img-path ../datasets/mixed_grounding/gqa/images --json-path ../datasets/mixed_grounding/annotations/final_mixed_train_no_coco.json

# Generate objects365v1 labels
python tools/generate_objects365v1.py

Then, please generate the data and embedding cache for training.

# Generate grounding segmentation cache
python tools/generate_grounding_cache.py --img-path ../datasets/flickr/full_images/ --json-path ../datasets/flickr/annotations/final_flickr_separateGT_train_segm.json
python tools/generate_grounding_cache.py --img-path ../datasets/mixed_grounding/gqa/images --json-path ../datasets/mixed_grounding/annotations/final_mixed_train_no_coco_segm.json

# Generate train label embeddings
python tools/generate_label_embedding.py
python tools/generate_global_neg_cat.py
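
For reference, the cached label embeddings are MobileCLIP text features for the training category names. A simplified sketch of the kind of computation tools/generate_label_embedding.py performs (the MobileCLIP variant, checkpoint path, and output file below are assumptions):

# Sketch of caching text embeddings with MobileCLIP (variant, checkpoint, and
# output path are assumptions; tools/generate_label_embedding.py is authoritative).
import torch
import mobileclip  # installed via `pip install -e ml-mobileclip`

model, _, _ = mobileclip.create_model_and_transforms(
    "mobileclip_b", pretrained="mobileclip_blt.pt")
tokenizer = mobileclip.get_tokenizer("mobileclip_b")

names = ["person", "bicycle", "car"]  # placeholder category names
with torch.no_grad():
    tokens = tokenizer(names)
    embeddings = model.encode_text(tokens)
    embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)  # L2-normalize

torch.save(dict(zip(names, embeddings)), "label_embeddings.pt")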

Text prompt

python train_seg.py

Visual prompt

python train_seg_vp.py

Prompt free

python train_pe_free.py

Transferring

After pretraining, YOLOE-v8 / YOLOE-11 can be re-parameterized into the same architecture as YOLOv8 / YOLO11, with zero overhead for transferring.
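
Conceptually, re-parameterization folds the (now fixed) prompt embeddings into an ordinary 1x1 conv, so the deployed head is a standard closed-set classification head and no text encoder is needed at inference. A toy illustration of the idea (not the repo's actual code; shapes are arbitrary):

# Toy illustration of folding fixed prompt embeddings into a 1x1 conv
# (conceptual only; not the repo's actual re-parameterization code).
import torch
import torch.nn as nn

num_classes, embed_dim = 80, 512
text_pe = torch.randn(num_classes, embed_dim)        # refined text embeddings (fixed)

# During open-vocabulary training, class logits are similarities between visual
# features and prompt embeddings. Once the embeddings are constants, that matmul
# is exactly a 1x1 conv whose weights are the embeddings, i.e. a closed-set head.
cls_head = nn.Conv2d(embed_dim, num_classes, kernel_size=1, bias=False)
with torch.no_grad():
    cls_head.weight.copy_(text_pe.view(num_classes, embed_dim, 1, 1))

feat = torch.randn(1, embed_dim, 20, 20)             # visual feature map
logits = cls_head(feat)                              # (1, num_classes, 20, 20)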

Linear probing

Only the last conv, i.e., the prompt embedding, is trainable.

python train_pe.py
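
In spirit, train_pe.py keeps everything frozen except that final embedding conv. A generic PyTorch sketch of the freezing pattern (the module name below is hypothetical, not the repo's actual attribute path):

# Generic linear-probing pattern: freeze all weights, then keep only the final
# prompt-embedding conv trainable ("cv3" below is a hypothetical name).
import torch.nn as nn

def freeze_for_linear_probe(model: nn.Module, trainable_key: str = "cv3") -> None:
    for name, param in model.named_parameters():
        # Only parameters belonging to the last classification conv stay trainable.
        param.requires_grad = trainable_key in name
    n_trainable = sum(p.requires_grad for _, p in model.named_parameters())
    print(f"{n_trainable} trainable parameter tensors")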

Full tuning

All parameters are trainable for better performance.

python train_pe_all.py

Export

After re-parameterization, YOLOE-v8 / YOLOE-11 can be exported into the identical format as YOLOv8 / YOLO11.

python export.py
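
Because the re-parameterized model matches the standard YOLOv8 / YOLO11 format, export should follow the usual ultralytics workflow. A hedged example (export.py holds the exact settings used for the reported T4 / iPhone numbers):

# Export sketch (illustrative; see export.py for the exact settings used here).
from ultralytics import YOLOE

model = YOLOE("pretrain/yoloe-v8l-seg.pt")   # or a re-parameterized checkpoint
model.export(format="onnx")                  # ONNX for TensorRT engine building,
                                             # or format="engine" / "coreml" directly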

Acknowledgement

The code base is built with ultralytics, YOLO-World, and GenerateU.

Thanks for the great implementations!

Citation

If our code or models help your work, please cite our paper:
