# Qwen3-VL-2B-Instruct-Shqip
`Kushtrim/Qwen3-VL-2B-Instruct-Shqip` is an Albanian-focused vision-language model based on `unsloth/Qwen3-VL-2B-Instruct`, finetuned primarily to improve OCR/transcription and document understanding for Albanian newspaper scans (e.g. *Bujku*).
## What’s new vs the base model
- Better Albanian OCR on scanned newspaper-style pages (noisy layouts, multi-column text, artifacts).
- More consistent responses in Albanian (`sq`) for OCR and document-style prompts.
## Model details
- Type: vision-language (image + text → text)
- Base model: `unsloth/Qwen3-VL-2B-Instruct`
- Primary language: Albanian (`sq`)
- License: Apache-2.0 (same as the base model)
## Intended use
Use this model for:
- OCR / transcription of Albanian text from images (scanned pages, photos of documents).
- Document understanding: summarization, extraction, Q&A over an image of a page.
- General multimodal chat in Albanian (image captioning, visual Q&A), with best results on document-like inputs.
## How it was trained (high level)
- Base: `unsloth/Qwen3-VL-2B-Instruct` (derived from `Qwen/Qwen3-VL-2B-Instruct`)
- Finetuning method: LoRA adapters (SFT-style finetuning)
- Tooling: Unsloth + TRL
- Primary dataset: `Kushtrim/bujku_vl_ocr` (page image + Albanian transcription); the local snapshot used during development contains 34,000 images (train split).
- Training instruction: `Transcribe the text in this image.`
Practical defaults used by the trainer scripts in this folder (may vary by run):

- `max_length`: 2048
- learning rate: 2e-4
- warmup ratio: 0.03
- LoRA: `r=16`, `alpha=16`, dropout 0.0
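For intuition about what `r=16` adds on top of the frozen base weights, here is a back-of-the-envelope sketch. The dimensions below are illustrative examples, not the exact Qwen3-VL-2B projection shapes:

```python
def lora_params(d_in: int, d_out: int, r: int = 16) -> int:
    """Extra parameters LoRA adds to one d_out x d_in linear layer.

    LoRA factorizes the update as B @ A, with A of shape (r, d_in)
    and B of shape (d_out, r), so the added count is r * (d_in + d_out).
    """
    return r * (d_in + d_out)

# Example: a square 2048-dim projection at r=16 adds 65,536 trainable
# parameters, versus ~4.2M frozen parameters in the original weight.
print(lora_params(2048, 2048))  # 65536
```

Because only the adapter matrices are trained, the finetune touches a small fraction of the 2B parameters, which is what makes single-GPU training with Unsloth practical.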
## Usage

### 🤗 Transformers (recommended)
Qwen3-VL support is in recent `transformers` releases. If you hit import errors, install from source:

```bash
pip install git+https://github.com/huggingface/transformers
```
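Before loading the model, you can verify that your installed `transformers` actually exposes the Qwen3-VL class used below (a minimal sketch; the helper name is ours):

```python
import importlib.util

def has_qwen3_vl() -> bool:
    """Return True if transformers is installed and exports
    Qwen3VLForConditionalGeneration (only present in recent releases)."""
    if importlib.util.find_spec("transformers") is None:
        return False
    import transformers
    return hasattr(transformers, "Qwen3VLForConditionalGeneration")

print(has_qwen3_vl())
```

If this prints `False`, upgrade with the source install above before proceeding.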
```python
import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from peft import PeftModel

BASE_ID = "Qwen/Qwen3-VL-2B-Instruct"  # <-- the real base model
ADAPTER_ID = "Kushtrim/Qwen3-VL-2B-Instruct-Shqip"

model = Qwen3VLForConditionalGeneration.from_pretrained(
    BASE_ID,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(model, ADAPTER_ID)

# Optional (often nicer for inference): merge the adapter into the base weights
model = model.merge_and_unload()

processor = AutoProcessor.from_pretrained(
    BASE_ID,
    trust_remote_code=True,
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "BU-19970125_34.png",
            },
            {"type": "text", "text": "Transcribe the text in this image."},
        ],
    }
]

# Prepare inputs for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Generate the transcription and strip the prompt tokens from the output
generated_ids = model.generate(**inputs, max_new_tokens=2048)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
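Transcriptions of multi-column scans often carry stray line breaks inside paragraphs. A minimal post-processing sketch (our own helper, assuming paragraphs are separated by blank lines in the model output):

```python
def normalize_ocr(text: str) -> str:
    """Collapse line breaks and repeated spaces within each paragraph,
    keeping blank-line paragraph boundaries intact."""
    paragraphs = [" ".join(p.split()) for p in text.split("\n\n")]
    return "\n\n".join(p for p in paragraphs if p)

sample = "Gazeta  Bujku\nfaqe e parë\n\nArtikulli i dytë"
print(normalize_ocr(sample))
# Gazeta Bujku faqe e parë
#
# Artikulli i dytë
```

Adjust or skip this step if you need the original line layout preserved, e.g. for column-aware downstream processing.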
## Limitations
- Finetuning focus is OCR/document-style inputs; performance on general visual reasoning may differ from the base model.
- OCR quality depends heavily on image quality (blur, skew, low resolution, heavy compression).
- Historical newspapers may include OCR-hard typography, artifacts, and mixed-language snippets.
## Citation
If you use this model in academic work, please cite the base Qwen3 technical report and clearly reference this Albanian finetuned variant.
