# Qwen3.5-4B GUI Grounding v3 (SFT LoRA + Unfrozen ViT)
LoRA adapter for Qwen3.5-4B fine-tuned on GUI grounding: given a screenshot and a natural language instruction, predict the (x, y) click coordinate of the target UI element.
v3 was an experimental run that unfroze the vision encoder and increased the input resolution. It regressed relative to v2, demonstrating that unfreezing the full ViT on only ~23.5K samples degrades the pretrained visual features.
## Results: ScreenSpot-V2
| Split | Accuracy |
|---|---|
| Overall | 92.7% |
## Training Data

~23.5K samples from three GUI grounding datasets covering desktop, web, and mobile platforms.
## Output Format

```
<|box_start|>(x,y)<|box_end|>
```
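Responses in this format can be extracted with a small regex. A minimal sketch; the helper name is illustrative, and the assumption that coordinates are emitted as integers is not stated in the model card:

```python
import re

# Matches the special-token format above, e.g. "<|box_start|>(412,87)<|box_end|>".
# Pattern and helper name are illustrative; integer coordinates are assumed.
_BOX_RE = re.compile(r"<\|box_start\|>\((\d+)\s*,\s*(\d+)\)<\|box_end\|>")

def parse_click(text: str):
    """Return the (x, y) click point from a model response, or None."""
    m = _BOX_RE.search(text)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))
```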
## Usage

Requires `transformers>=5.2.0` and `peft`.
```python
from transformers import AutoProcessor, Qwen3_5ForConditionalGeneration
from peft import PeftModel
import torch

# Load the base model in bfloat16, then attach the LoRA adapter on top.
base = Qwen3_5ForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3.5-4B", torch_dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(base, "dabism23/qwen35-gui-grounding_v3")

# The processor (tokenizer + image preprocessing) comes from the base model.
processor = AutoProcessor.from_pretrained("Qwen/Qwen3.5-4B")
```
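An end-to-end call might look like the sketch below. The chat-message layout follows the common Qwen-VL convention, and the function names and generation settings are assumptions, not part of this model card; it is not a definitive implementation:

```python
def build_messages(image, instruction):
    """Package a screenshot and instruction in the Qwen-VL-style chat
    format expected by apply_chat_template (assumed layout)."""
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": instruction},
        ],
    }]

def predict_click(model, processor, image_path, instruction):
    # Not executed here: requires the gated weights (see the Access section).
    from PIL import Image  # Pillow, for loading the screenshot
    messages = build_messages(Image.open(image_path), instruction)
    inputs = processor.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt",
    )
    out = model.generate(**inputs, max_new_tokens=32)
    # Decode only the newly generated tokens; keep special tokens so the
    # <|box_start|>(x,y)<|box_end|> span survives for parsing.
    return processor.batch_decode(
        out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=False
    )[0]
```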
## Key Findings
- Unfreezing the full ViT on ~23.5K samples caused overfitting and degraded performance
- Higher input resolution (2M pixels) did not compensate for ViT degradation
- Frozen ViT remains the better approach at this dataset scale
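The frozen-ViT recipe the findings point back to amounts to a simple parameter filter. A framework-agnostic sketch, assuming vision-encoder parameters carry a `visual.` or `vision_tower.` prefix as in Qwen-VL-style checkpoints (an assumption, not confirmed by the card):

```python
# Assumed ViT parameter-name prefixes; adjust to the actual checkpoint.
VISION_PREFIXES = ("visual.", "vision_tower.")

def is_trainable(param_name: str) -> bool:
    """Frozen-ViT recipe: train (or inject LoRA into) everything
    except vision-encoder weights."""
    return not param_name.startswith(VISION_PREFIXES)
```

In a training loop this filter would drive `param.requires_grad` for each named parameter, or restrict which modules LoRA targets.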
## Version History
## Access
Model weights are gated. Request access to download. Training configuration details are included with the model files.