Kyrgyz Whisper Medium — LoRA Adapter (PEFT)
This repository contains a LoRA/PEFT adapter for Kyrgyz automatic speech recognition (ASR).
Links
What is this?
This repo provides adapter weights only. For inference, you must load the base model and then attach this adapter via PEFT.
If you want a single, standalone checkpoint, use the merged model linked above.
Dataset
- Training/evaluation dataset: fsicoli/common_voice_22_0 (config: ky)
Results
Evaluation on Common Voice 22.0 Kyrgyz (test split):
WER (normalized): 16.2061
WER_ortho (orthographic): 19.1491
test_loss: 0.1722
Quick check (200 random test samples):
WER: 16.1677
WER_ortho: 19.6021
Note: WER depends on text normalization (punctuation/case), decoding settings, and audio preprocessing.
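To illustrate how much normalization can move the score, here is a minimal sketch that computes WER on raw (orthographic) text and again after a simple normalization. The `normalize` helper and the sample strings are hypothetical; they are not the exact rules or data used to produce the numbers in this card.

```python
import re


def normalize(text: str) -> str:
    """Hypothetical normalizer: lowercase, strip punctuation, collapse spaces."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # \w is Unicode-aware, so Cyrillic letters are kept
    return " ".join(text.split())


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# Illustrative strings only: the hypothesis differs from the reference
# in case and punctuation, not in the words themselves.
reference = "Салам, дүйнө! Кандайсың?"
hypothesis = "салам дүйнө кандайсың"

wer_ortho = wer(reference, hypothesis)
wer_norm = wer(normalize(reference), normalize(hypothesis))
print(f"WER_ortho: {wer_ortho:.2f}  WER: {wer_norm:.2f}")
```

In this toy case every word counts as an error before normalization and none after, which is why the two metrics should always be reported together with the normalization used.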
Training details
LoRA fine-tuning summary:
- LoRA: r=8, lora_alpha=16, lora_dropout=0.1
- Target modules: q_proj, v_proj
- Steps: max_steps=4000
- Best checkpoint by WER: checkpoint-4000 (WER=16.21)
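As a rough back-of-the-envelope check of what these hyperparameters imply, the sketch below computes the LoRA scaling factor and an estimate of the number of trainable adapter parameters. The model dimensions are assumptions about the standard Whisper medium architecture (hidden size 1024, 24 encoder and 24 decoder layers) and are not stated in this card; it also assumes that `q_proj` and `v_proj` are adapted in every attention block, including decoder cross-attention.

```python
# LoRA hyperparameters from the summary above.
r, lora_alpha = 8, 16

# Assumed Whisper medium dimensions (not stated in this card).
d_model, enc_layers, dec_layers = 1024, 24, 24

# The low-rank update B @ A is scaled by alpha / r.
scaling = lora_alpha / r

# Each encoder layer has one attention block (self-attention); each decoder
# layer has two (self-attention and cross-attention).
attn_blocks = enc_layers + 2 * dec_layers
adapted_linears = attn_blocks * 2  # q_proj and v_proj in every block

# One adapted d_model x d_model linear adds A (r x d_model) and B (d_model x r).
params_per_linear = r * (d_model + d_model)
trainable_params = adapted_linears * params_per_linear

print(f"scaling={scaling}, adapted layers={adapted_linears}, "
      f"trainable params={trainable_params:,}")
```

Under these assumptions the adapter holds roughly 2.4 million trainable parameters, a small fraction of the base model.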
Training progress (selected checkpoints):
| Step | Train loss | Val loss | WER_ortho | WER |
|---|---|---|---|---|
| 500 | 0.7980 | 0.7911 | 44.3501 | 42.0754 |
| 1000 | 0.3980 | 0.2043 | 28.9947 | 27.8551 |
| 1500 | 0.1712 | 0.1821 | 20.7479 | 17.7343 |
| 2000 | 0.1734 | 0.1770 | 20.7569 | 17.6977 |
| 2500 | 0.1935 | 0.1743 | 19.7995 | 16.8192 |
| 3000 | 0.3406 | 0.1728 | 19.8988 | 16.9656 |
| 3500 | 0.3192 | 0.1724 | 19.3840 | 16.4074 |
| 4000 | 0.1499 | 0.1722 | 19.1491 | 16.2061 |
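The "best checkpoint by WER" selection can be reproduced from the training progress numbers with a few lines. This is only an illustration of the selection rule using the values reported in this card, not the training script itself.

```python
# (step, WER_ortho, WER) copied from the training progress values above.
rows = [
    (500, 44.3501, 42.0754),
    (1000, 28.9947, 27.8551),
    (1500, 20.7479, 17.7343),
    (2000, 20.7569, 17.6977),
    (2500, 19.7995, 16.8192),
    (3000, 19.8988, 16.9656),
    (3500, 19.3840, 16.4074),
    (4000, 19.1491, 16.2061),
]

# Pick the checkpoint with the lowest normalized WER.
best_step, best_wer_ortho, best_wer = min(rows, key=lambda row: row[2])

# Relative WER reduction between the first logged checkpoint and the best one.
first_wer = rows[0][2]
relative_reduction = 100.0 * (first_wer - best_wer) / first_wer

print(f"best: checkpoint-{best_step} (WER={best_wer:.2f}), "
      f"relative reduction vs step 500: {relative_reduction:.1f}%")
```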
How to use
Install
pip install -U "transformers" "peft" "accelerate" "torch"
Inference (Transformers pipeline + PEFT)
import torch
from peft import PeftConfig, PeftModel
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

adapter_id = "AleksTv/whisper-medium-ky-lora"
# The adapter config records which base checkpoint the adapter was trained on.
peft_cfg = PeftConfig.from_pretrained(adapter_id)
base_id = peft_cfg.base_model_name_or_path

# Use the first GPU in fp16 when available, otherwise the CPU in fp32.
device = 0 if torch.cuda.is_available() else -1
dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the base model without device_map: the pipeline's `device` argument
# handles placement, and combining it with an accelerate device map can
# raise an error when the pipeline is created.
base_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    base_id,
    torch_dtype=dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)
# Attach the LoRA adapter weights to the base model.
model = PeftModel.from_pretrained(base_model, adapter_id)
processor = AutoProcessor.from_pretrained(base_id, trust_remote_code=True)

asr = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    device=device,
)
print(asr("path/to/audio.wav")["text"])
Merge adapter into the base model (standalone weights)
import torch
from peft import PeftConfig, PeftModel
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

adapter_id = "AleksTv/whisper-medium-ky-lora"
# Resolve the base checkpoint from the adapter config.
peft_cfg = PeftConfig.from_pretrained(adapter_id)
base_id = peft_cfg.base_model_name_or_path
dtype = torch.float16 if torch.cuda.is_available() else torch.float32

base_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    base_id,
    torch_dtype=dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)
model = PeftModel.from_pretrained(base_model, adapter_id)
# Fold the LoRA weights into the base weights and drop the PEFT wrappers.
merged = model.merge_and_unload()

# Save the merged weights together with the processor so the output
# directory can be loaded as a standalone checkpoint.
out_dir = "whisper-medium-ky-merged"
merged.save_pretrained(out_dir, safe_serialization=True)
AutoProcessor.from_pretrained(base_id, trust_remote_code=True).save_pretrained(out_dir)
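Conceptually, merging a LoRA adapter folds the low-rank update into each adapted weight matrix: W_merged = W + (lora_alpha / r) * B * A. The toy sketch below checks, on small hand-written matrices, that the merged weight gives the same output as the base weight plus the adapter branch. The matrices and the helper functions are illustrative only; the sketch uses r=1 and lora_alpha=2 so that alpha/r matches the ratio of this adapter (16/8), and it ignores dropout, which is inactive at inference time.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def scale(a, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[s * x for x in row] for row in a]


r, lora_alpha = 1, 2           # toy values with the same alpha / r ratio as 16 / 8
scaling = lora_alpha / r

W = [[1.0, 2.0], [3.0, 4.0]]   # frozen base weight (out x in)
B = [[0.5], [1.0]]             # LoRA B matrix (out x r)
A = [[1.0, -1.0]]              # LoRA A matrix (r x in)
x = [[2.0], [1.0]]             # input as a column vector

# Adapter attached: base output plus the scaled low-rank branch.
y_adapter = add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))

# Adapter merged: a single weight matrix, no extra branch at inference time.
W_merged = add(W, scale(matmul(B, A), scaling))
y_merged = matmul(W_merged, x)

print(y_adapter, y_merged)
```

Because the two outputs agree, the merged checkpoint can be served without PEFT installed, at the cost of no longer being able to detach the adapter.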
Limitations
- Quality may degrade on very noisy audio, far-field microphones, strong accents, code-switching, or long recordings without segmentation.
- For production use, you typically want voice activity detection (VAD) or segmentation before transcription, plus post-processing of the output text.
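As a very simple stand-in for segmentation, long recordings can be cut into overlapping fixed-length windows before transcription. The sketch below only computes the window boundaries; the `chunk_bounds` helper and its default values (16 kHz audio, 30-second windows, 5-second overlap) are assumptions chosen for illustration, and a real VAD would place the cuts at pauses instead.

```python
def chunk_bounds(n_samples, sample_rate=16000, chunk_s=30.0, overlap_s=5.0):
    """Return (start, end) sample indices of overlapping fixed-length windows."""
    chunk = int(chunk_s * sample_rate)
    step = chunk - int(overlap_s * sample_rate)
    if step <= 0:
        raise ValueError("overlap_s must be smaller than chunk_s")
    bounds = []
    start = 0
    while start < n_samples:
        end = min(start + chunk, n_samples)
        bounds.append((start, end))
        if end == n_samples:
            break  # the last window reached the end of the recording
        start += step
    return bounds


# Example: a 70-second recording sampled at 16 kHz.
bounds = chunk_bounds(70 * 16000)
for start, end in bounds:
    print(f"{start / 16000:.0f}s to {end / 16000:.0f}s")
```

Each window can then be transcribed separately and the texts joined, with the overlap used to resolve words cut at a boundary.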
License
Apache-2.0.