
Qwen3-VL-2B-Instruct-Shqip

Kushtrim/Qwen3-VL-2B-Instruct-Shqip is an Albanian-focused vision-language model based on unsloth/Qwen3-VL-2B-Instruct, finetuned primarily to improve OCR / transcription and document understanding for Albanian newspaper scans (e.g. Bujku).

What’s new vs the base model

  • Better Albanian OCR on scanned newspaper-style pages (noisy layouts, multi-column text, artifacts).
  • More consistent responses in Albanian (sq) for OCR and document-style prompts.

Model details

  • Type: vision-language (image + text → text)
  • Base model: unsloth/Qwen3-VL-2B-Instruct
  • Primary language: Albanian (sq)
  • License: Apache-2.0 (same as the base model)

Intended use

Use this model for:

  • OCR / transcription of Albanian text from images (scanned pages, photos of documents).
  • Document understanding: summarization, extraction, Q&A over an image of a page.
  • General multimodal chat in Albanian (image captioning, visual Q&A), with best results on document-like inputs.

How it was trained (high level)

  • Base: unsloth/Qwen3-VL-2B-Instruct (derived from Qwen/Qwen3-VL-2B-Instruct)
  • Finetuning method: LoRA adapters (SFT-style finetuning)
  • Tooling: Unsloth + TRL
  • Primary dataset: Kushtrim/bujku_vl_ocr (page image + Albanian transcription)
    • The local snapshot used during development contains 34,000 images (train split).
    • Training instruction: Transcribe the text in this image.
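Each training example pairs a page image with its Albanian transcription under the fixed instruction above. A minimal sketch of how such a pair can be formatted into the user/assistant chat structure typically used for SFT — the column names "image" and "text" are assumptions, so check the actual Kushtrim/bujku_vl_ocr schema before use:

```python
# Sketch only: converts one (image, transcription) row into the chat format
# used for SFT-style finetuning. Column names are assumptions.

INSTRUCTION = "Transcribe the text in this image."

def to_chat_example(example):
    """Convert a raw dataset row into a user/assistant message pair."""
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": example["image"]},
                    {"type": "text", "text": INSTRUCTION},
                ],
            },
            {
                "role": "assistant",
                "content": [{"type": "text", "text": example["text"]}],
            },
        ]
    }

# Example with a placeholder row:
row = {"image": "BU-19970125_34.png", "text": "Teksti i transkriptuar..."}
chat = to_chat_example(row)
```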

Practical defaults used by the trainer scripts in this folder (may vary by run):

  • max_length: 2048
  • learning rate: 2e-4
  • warmup ratio: 0.03
  • LoRA: r=16, alpha=16, dropout 0.0
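The LoRA defaults above can be expressed with peft's LoraConfig. This is a sketch, not the actual training script: the target_modules list is an assumption (the attention and MLP projections commonly adapted in Qwen-family models).

```python
from peft import LoraConfig

# LoRA configuration matching the listed defaults (r=16, alpha=16, dropout 0.0).
# target_modules is an assumption; the actual trainer scripts may differ.
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```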

Usage

🤗 Transformers (recommended)

Qwen3-VL support ships in recent releases of transformers. If you hit import errors, install from source:

pip install git+https://github.com/huggingface/transformers

import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from peft import PeftModel

BASE_ID = "Qwen/Qwen3-VL-2B-Instruct"          # <-- the real base model
ADAPTER_ID = "Kushtrim/Qwen3-VL-2B-Instruct-Shqip"

model = Qwen3VLForConditionalGeneration.from_pretrained(
    BASE_ID,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

model = PeftModel.from_pretrained(model, ADAPTER_ID)  # pass token=... if you need an access token

# Optional (often nicer for inference): merge adapter into weights
model = model.merge_and_unload()

processor = AutoProcessor.from_pretrained(
    BASE_ID,
    trust_remote_code=True,
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "BU-19970125_34.png",
            },
            {"type": "text", "text": "Transcribe the text in this image."},
        ],
    }
]

# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Inference: generate the output
generated_ids = model.generate(**inputs, max_new_tokens=2048)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Limitations

  • Finetuning focus is OCR/document-style inputs; performance on general visual reasoning may differ from the base model.
  • OCR quality depends heavily on image quality (blur, skew, low resolution, heavy compression).
  • Historical newspapers may include OCR-hard typography, artifacts, and mixed-language snippets.

Citation

If you use this model in academic work, please cite the base Qwen3 technical report and clearly reference this Albanian finetuned variant.
