GUI grounding, which maps natural-language instructions to actionable UI elements, is a core capability of GUI agents. Prior work largely treats instructions as a static proxy for user intent, overlooking the impact of instruction diversity and quality on grounding performance. Through a careful investigation of existing grounding datasets, we find a 23.3% flaw rate in their instructions and show that inference-time exploitation of instruction diversity yields a substantial relative performance improvement of up to 76%.
In this paper, we introduce the Instruction-as-Reasoning paradigm, which treats instructions as dynamic analytical pathways that offer distinct perspectives, enabling the model to select the most effective pathway during reasoning. To achieve this, we propose a two-stage training framework: supervised fine-tuning (SFT) on synthesized, diverse instructions to instill multi-perspective reasoning, followed by reinforcement learning (RL) to optimize pathway selection and composition.
Our resulting models, UI-Ins-7B and UI-Ins-32B, achieve state-of-the-art results on five challenging grounding benchmarks and exhibit emergent reasoning, selectively composing and synthesizing novel instruction pathways at inference. In particular, UI-Ins-32B attains the best grounding accuracy, scoring 87.3% on UI-I2E-Bench, 57.0% on ScreenSpot-Pro, and 84.9% on MMBench-GUI L2. Furthermore, our model demonstrates strong agentic potential, achieving a 74.1% success rate on AndroidWorld using UI-Ins-7B as the executor. Our in-depth analysis reveals additional insights such as how reasoning can be formulated to enhance rather than hinder grounding performance, and how our method mitigates policy collapse in the SFT+RL framework.
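To make the Instruction-as-Reasoning paradigm concrete, the same target element can be described from several distinct perspectives, and the model reasons over these pathways to select the most reliable one. Below is a purely hypothetical illustration (the perspective names and instruction strings are ours, not samples from the released data):

# Hypothetical instruction variants for a single UI element. Each entry is a
# distinct analytical pathway the model can reason over at inference time.
instruction_variants = {
    "appearance": "Click the magnifying-glass icon in the top-right corner",
    "function": "Open the search box so a query can be typed",
    "user_intent": "Find where to search for 'quarterly report' in this app",
}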
- [Oct 24, 2025] Released the UI-Ins models, data processing code, training code, and evaluation code.
- Set up the SFT environment by following the instructions here
- Set up the RL environment by following the instructions here
We provide a high-quality data processing pipeline; details are available here.
We provide the SFT and RL code of UI-Ins.
We provide the evaluation code; details are available here.
You can run inference with UI-Ins using the following script:
import torch
import re
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
MODEL_PATH = "Qwen/Qwen2.5-VL-7B-Instruct"  # base model shown; point this at your UI-Ins checkpoint
IMAGE_PATH = "path/to/your/image.jpg"
INSTRUCTION = "Click the 'Search' button"
def parse_coordinates(raw_string: str) -> tuple[int, int]:
    """Extract the first [x, y] coordinate pair from the model's raw output."""
    matches = re.findall(r'\[(\d+),\s*(\d+)\]', raw_string)
    if matches:
        return tuple(map(int, matches[0]))
    return -1, -1
print("Loading model...")
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="auto"
).eval()
processor = AutoProcessor.from_pretrained(MODEL_PATH)
image = Image.open(IMAGE_PATH).convert("RGB")
messages = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "You are a helpful assistant."
            },
            {
                "type": "text",
                "text": """You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.\n\n## Output Format\nReturn a json object with a reasoning process in <think></think> tags, a function name and arguments within <tool_call></tool_call> XML tags:\n```\n<think>\n...\n</think>\n<tool_call>\n{"name": "grounding", "arguments": <args-json-object>}\n</tool_call>\n```\n<args-json-object> represents the following item of the action space:\n## Action Space{"action": "click", "coordinate": [x, y]}\nYour task is to accurately locate a UI element based on the instruction. You should first analyze the instruction in <think></think> tags and finally output the function in <tool_call></tool_call> tags.\n"""
            }
        ]
    },
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": INSTRUCTION}
        ]
    }
]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
print("Running inference...")
generated_ids = model.generate(**inputs, max_new_tokens=128)
response_ids = generated_ids[0, len(inputs["input_ids"][0]):]
raw_response = processor.decode(response_ids, skip_special_tokens=True)
point_x, point_y = parse_coordinates(raw_response)
print("\n" + "="*20 + " RESULT " + "="*20)
print(f"Instruction: {INSTRUCTION}")
print(f"Raw Response: {raw_response}")
if point_x != -1:
    # pixel_values is a flattened patch sequence, so its shape does not give the
    # resized image size; recover the dimensions from the patch grid instead
    # (Qwen2.5-VL uses 14x14 pixel patches).
    _, grid_h, grid_w = inputs["image_grid_thw"][0].tolist()
    resized_height, resized_width = grid_h * 14, grid_w * 14
    norm_x = point_x / resized_width
    norm_y = point_y / resized_height
    print(f"✅ Parsed Point (on resized image): ({point_x}, {point_y})")
    print(f"✅ Normalized Point (0.0 to 1.0): ({norm_x:.4f}, {norm_y:.4f})")
else:
    print("❌ Could not parse coordinates from the response.")
print("=" * 48)
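To act on the original screenshot rather than the processor-resized image, you can project the normalized point back onto the original resolution. A minimal follow-on sketch (it reuses `image`, `point_x`, `norm_x`, and `norm_y` from the script above and assumes the normalized point is resolution-independent):

# Project the normalized point back onto the original screenshot so it can
# drive an actual click (sketch; variables come from the script above).
if point_x != -1:
    orig_width, orig_height = image.size
    click_x = int(norm_x * orig_width)
    click_y = int(norm_y * orig_height)
    print(f"Click point on original screenshot: ({click_x}, {click_y})")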
Feel free to contact liangyuchen@ruc.edu.cn if you have any questions.
This repo follows the CC-BY-NC-SA 4.0 license. Please use this repo for non-commercial purposes ONLY.
If you use this repository or find it helpful in your research, please cite it as follows:
@article{chen2025ui,
title={UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning},
author={Chen, Liangyu and Zhou, Hanzhang and Cai, Chenglin and Zhang, Jianan and Tong, Panrong and Kong, Quyu and Zhang, Xu and Liu, Chen and Liu, Yuqi and Wang, Wenxuan and others},
journal={arXiv preprint arXiv:2510.20286},
year={2025}
}