Open-sourced, Fast and Context-aware Action Grounding from GUI Instructions for GUI/Computer-use Agents

Notifications You must be signed in to change notification settings

AriaUI/Aria-UI

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

14 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Project Logo

📰 News

  • [2025-02-08] We released all context-aware episode training data of Aria-UI! It has around 992K instruction-output pairs. Try it in your exciting projects at 🤗 Aria-UI Context-aware Datasets.

  • [2025-01-23] We released the context-aware version of Aria-UI! Check it out at 🤗 Aria-UI Context-aware Models. It typically brings stronger performance on dynamic agent tasks like AndroidWorld and OSWorld.

  • [2025-01-10] We are excited to release the M3A Agent powered by Aria-UI, for AndroidWorld! Experience enhanced task success rates and seamless integration with the latest in grounding instruction understanding. Check it out under AndroidWorld/.

ubuntu_1.mp4

🌇 Overview

✨ Versatile Grounding Instruction Understanding:
Aria-UI handles diverse grounding instructions and interprets varied formats, so it adapts robustly across dynamic scenarios and when paired with different planning agents.

📍 Context-aware Grounding:
Aria-UI effectively leverages historical input, whether in pure text or text-image-interleaved formats, to improve grounding accuracy.

⚡ Lightweight and Fast:
Aria-UI is a mixture-of-experts model with 3.9B activated parameters per token. It efficiently encodes GUI input of variable sizes and aspect ratios, with ultra-resolution support.

🎉 Superior Performance:
Aria-UI sets new state-of-the-art results on offline and online agent benchmarks.
🏆 1st place on AndroidWorld with 44.8% task success rate and
🥉 3rd place on OSWorld with 15.2% task success rate (Dec. 2024).

Aria-UI Overview

🚀 Quick Start

Installation

pip install transformers==4.45.0 accelerate==0.34.1 sentencepiece==0.2.0 torchvision requests torch Pillow
pip install flash-attn --no-build-isolation
# For better inference performance, you can install grouped-gemm, which may take 3-5 minutes to install
pip install grouped_gemm==0.1.6
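
Because the install command pins exact versions, it can help to confirm that the environment matches before loading the model. The following is a minimal sketch, not part of Aria-UI: `PINNED` and `find_mismatches` are hypothetical names, and the check covers only the three packages pinned above (the optional `grouped_gemm` is left out).

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the pip command above (assumed to be the required ones).
PINNED = {
    "transformers": "4.45.0",
    "accelerate": "0.34.1",
    "sentencepiece": "0.2.0",
}


def find_mismatches(pins, get_version=version):
    """Return {package: installed_version_or_None} for every pin that is not met."""
    problems = {}
    for name, expected in pins.items():
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            problems[name] = installed
    return problems


if __name__ == "__main__":
    for name, installed in find_mismatches(PINNED).items():
        print(f"{name}: expected {PINNED[name]}, found {installed}")
```

An empty result means the three pinned packages are installed at the expected versions.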

Inference with vLLM (strongly recommended)

First, install a version of vLLM that supports Aria-UI (for example, vllm==0.6.6.dev3+g866fa455):

export VLLM_COMMIT=866fa4550d572f4ff3521ccf503e0df2e76591a1 # use full commit hash from the main branch
pip install https://wheels.vllm.ai/${VLLM_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl

Here is a code snippet for running Aria-UI with vLLM.

import ast

from PIL import Image
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_path = "Aria-UI/Aria-UI-base"
def main():
    llm = LLM(
        model=model_path,
        tokenizer_mode="slow",
        dtype="bfloat16",
        trust_remote_code=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(
        model_path, trust_remote_code=True, use_fast=False
    )
    instruction = "Try Aria."
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {
                    "type": "text",
                    "text": "Given a GUI image, what are the relative (0-1000) pixel point coordinates for the element corresponding to the following instruction or description: " + instruction,
                }
            ],
        }
    ]
    message = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
    outputs = llm.generate(
        {
            "prompt_token_ids": message,
            "multi_modal_data": {
                "image": [
                    Image.open("examples/aria.png"),
                ],
                "max_image_size": 980,  # [Optional] The max image patch size, default `980`
                "split_image": True,  # [Optional] whether to split the images, default `True`
            },
        },
        sampling_params=SamplingParams(max_tokens=50, top_k=1, stop=["<|im_end|>"]),
    )
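    # A single prompt was submitted, so `outputs` holds exactly one result.
    generated_tokens = outputs[0].outputs[0].token_ids
    response = tokenizer.decode(generated_tokens, skip_special_tokens=True)
    print(response)
    # Strip any stop token, code fences, and spaces, then parse the coordinate literal.
    coords = ast.literal_eval(response.replace("<|im_end|>", "").replace("```", "").replace(" ", "").strip())
    return coords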
if __name__ == "__main__":
    main()
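
The prompt asks for relative coordinates on a 0-1000 scale, so the parsed result usually has to be mapped back to screen pixels before an agent can click. Here is a minimal sketch of that conversion, assuming `coords` is an (x, y) pair on that scale; `to_pixels` is a hypothetical helper name, not part of Aria-UI.

```python
def to_pixels(coords, width, height, scale=1000):
    """Map relative (0-`scale`) coordinates to absolute pixel coordinates."""
    x_rel, y_rel = coords
    x = round(x_rel / scale * width)
    y = round(y_rel / scale * height)
    # Clamp so that a value of exactly `scale` still lands inside the image.
    x = min(max(x, 0), width - 1)
    y = min(max(y, 0), height - 1)
    return x, y
```

For example, with the PIL image opened above you could call `to_pixels(coords, *image.size)`, since `Image.size` is a `(width, height)` tuple.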

Inference with Transformers (not recommended)

You can also use the original transformers API for Aria-UI. For instance:

import ast
import os

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

model_path = "Aria-UI/Aria-UI-base"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
image_file = "./examples/aria.png"
instruction = "Try Aria."
image = Image.open(image_file).convert("RGB")

messages = [
    {
        "role": "user",
        "content": [
            {"text": None, "type": "image"},
            {"text": instruction, "type": "text"},
        ],
    }
]
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt")
inputs["pixel_values"] = inputs["pixel_values"].to(model.dtype)
inputs = {k: v.to(model.device) for k, v in inputs.items()}
with torch.inference_mode(), torch.amp.autocast("cuda", dtype=torch.bfloat16):
    output = model.generate(
        **inputs,
        max_new_tokens=50,
        stop_strings=["<|im_end|>"],
        tokenizer=processor.tokenizer,
        # do_sample=True,
        # temperature=0.9,
    )
output_ids = output[0][inputs["input_ids"].shape[1] :]
response = processor.decode(output_ids, skip_special_tokens=True)
print(response)

coords = ast.literal_eval(response.replace("<|im_end|>", "").replace("```", "").replace(" ", "").strip())
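
The `ast.literal_eval` call above raises an exception if the model returns anything other than a clean coordinate literal. A more defensive parser might look like the sketch below. It assumes the response contains a single pair such as `(x, y)` or `[x, y]`; `parse_coords` is a hypothetical helper, not part of the Aria-UI code.

```python
import ast
import re

# Matches a two-number pair in round or square brackets, e.g. "(512, 304)" or "[512,304]".
_PAIR = re.compile(r"[\[(]\s*-?\d+(?:\.\d+)?\s*,\s*-?\d+(?:\.\d+)?\s*[\])]")


def parse_coords(response):
    """Return the first (x, y) pair found in `response`, or None if there is none."""
    cleaned = response.replace("<|im_end|>", "").replace("```", "").strip()
    match = _PAIR.search(cleaned)
    if match is None:
        return None
    try:
        x, y = ast.literal_eval(match.group(0))
    except (ValueError, SyntaxError):
        return None  # e.g. mismatched brackets such as "[1, 2)"
    return (x, y)
```

Returning `None` instead of raising lets the calling agent decide whether to retry the query or skip the step.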

Citation

If you find our work helpful, please consider citing:

@article{ariaui,
      title={Aria-UI: Visual Grounding for GUI Instructions}, 
      author={Yuhao Yang and Yue Wang and Dongxu Li and Ziyang Luo and Bei Chen and Chao Huang and Junnan Li},
      year={2024},
      journal={arXiv preprint arXiv:2412.16256},
}

Acknowledgments

We thank Tianbao Xie and Yiheng Xu for their valuable discussions and suggestions.

More demos

mobile.mp4
ubuntu_2.mp4
