# Qwen3-VL-2B-Instruct-Shqip
`Kushtrim/Qwen3-VL-2B-Instruct-Shqip` is an Albanian-focused vision-language model based on `unsloth/Qwen3-VL-2B-Instruct`, finetuned primarily to improve OCR/transcription and document understanding for Albanian newspaper scans (e.g. *Bujku*).
## What’s new vs the base model
- Better Albanian OCR on scanned newspaper-style pages (noisy layouts, multi-column text, artifacts).
- More consistent responses in Albanian (`sq`) for OCR and document-style prompts.
## Model details
- Type: vision-language (image + text → text)
- Base model: `unsloth/Qwen3-VL-2B-Instruct`
- Primary language: Albanian (`sq`)
- License: Apache-2.0 (same as the base model)
## Intended use
Use this model for:
- OCR / transcription of Albanian text from images (scanned pages, photos of documents).
- Document understanding: summarization, extraction, Q&A over an image of a page.
- General multimodal chat in Albanian (image captioning, visual Q&A), with best results on document-like inputs.
## How it was trained (high level)
- Base: `unsloth/Qwen3-VL-2B-Instruct` (derived from `Qwen/Qwen3-VL-2B-Instruct`)
- Finetuning method: LoRA adapters (SFT-style finetuning)
- Tooling: Unsloth + TRL
- Primary dataset: `Kushtrim/bujku_vl_ocr` (page image + Albanian transcription); the local snapshot used during development contains 34,000 images (train split).
- Training instruction: `Transcribe the text in this image.`
Practical defaults used by the trainer scripts in this folder (may vary by run):

- `max_length`: 2048
- learning rate: 2e-4
- warmup ratio: 0.03
- LoRA: `r=16`, `alpha=16`, dropout 0.0
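For intuition about what `r=16` adds on top of the frozen base weights, here is a back-of-the-envelope sketch. The dimensions below are illustrative examples, not the exact Qwen3-VL-2B projection shapes:

```python
def lora_params(d_in: int, d_out: int, r: int = 16) -> int:
    """Extra parameters LoRA adds to one d_out x d_in linear layer.

    LoRA factorizes the update as B @ A, with A of shape (r, d_in)
    and B of shape (d_out, r), so the added count is r * (d_in + d_out).
    """
    return r * (d_in + d_out)

# Example: a square 2048-dim projection at r=16 adds 65,536 trainable
# parameters, versus ~4.2M frozen parameters in the original weight.
print(lora_params(2048, 2048))  # 65536
```

Because only the adapter matrices are trained, the finetune touches a small fraction of the 2B parameters, which is what makes single-GPU training with Unsloth practical.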
## Usage

### 🤗 Transformers (recommended)
Qwen3-VL support is in recent `transformers` releases. If you hit import errors, install from source:

```bash
pip install git+https://github.com/huggingface/transformers
```
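Before loading the model, you can verify that your installed `transformers` actually exposes the Qwen3-VL class used below (a minimal sketch; the helper name is ours):

```python
import importlib.util

def has_qwen3_vl() -> bool:
    """Return True if transformers is installed and exports
    Qwen3VLForConditionalGeneration (only present in recent releases)."""
    if importlib.util.find_spec("transformers") is None:
        return False
    import transformers
    return hasattr(transformers, "Qwen3VLForConditionalGeneration")

print(has_qwen3_vl())
```

If this prints `False`, upgrade with the source install above before proceeding.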
```python
import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from peft import PeftModel

BASE_ID = "Qwen/Qwen3-VL-2B-Instruct"  # <-- the real base model
ADAPTER_ID = "Kushtrim/Qwen3-VL-2B-Instruct-Shqip"

model = Qwen3VLForConditionalGeneration.from_pretrained(
    BASE_ID,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(model, ADAPTER_ID)

# Optional (often nicer for inference): merge the adapter into the base weights
model = model.merge_and_unload()

processor = AutoProcessor.from_pretrained(
    BASE_ID,
    trust_remote_code=True,
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "BU-19970125_34.png",
            },
            {"type": "text", "text": "Transcribe the text in this image."},
        ],
    }
]

# Prepare inputs for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Generate the transcription and strip the prompt tokens from the output
generated_ids = model.generate(**inputs, max_new_tokens=2048)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
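Transcriptions of multi-column scans often carry stray line breaks inside paragraphs. A minimal post-processing sketch (our own helper, assuming paragraphs are separated by blank lines in the model output):

```python
def normalize_ocr(text: str) -> str:
    """Collapse line breaks and repeated spaces within each paragraph,
    keeping blank-line paragraph boundaries intact."""
    paragraphs = [" ".join(p.split()) for p in text.split("\n\n")]
    return "\n\n".join(p for p in paragraphs if p)

sample = "Gazeta  Bujku\nfaqe e parë\n\nArtikulli i dytë"
print(normalize_ocr(sample))
# Gazeta Bujku faqe e parë
#
# Artikulli i dytë
```

Adjust or skip this step if you need the original line layout preserved, e.g. for column-aware downstream processing.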
## Limitations
- Finetuning focus is OCR/document-style inputs; performance on general visual reasoning may differ from the base model.
- OCR quality depends heavily on image quality (blur, skew, low resolution, heavy compression).
- Historical newspapers may include OCR-hard typography, artifacts, and mixed-language snippets.
## Citation
If you use this model in academic work, please cite the base Qwen3 technical report and clearly reference this Albanian finetuned variant.
