# Sori-4B-Base

Korean speech-to-text model combining Qwen3-Omni's Audio Transformer (AuT) with the Qwen3-4B LLM.

GitHub: [SeungyounShin/Sori](https://github.com/SeungyounShin/Sori)
## Architecture

Audio (16 kHz) -> Mel Spectrogram (128 bins) -> Audio Encoder (647M, AuT) -> audio_proj MLP -> Qwen3-4B LLM -> Text
| Component | Params | Source |
|---|---|---|
| Audio Encoder | 647M | Qwen3-Omni AuT (pretrained on 7M+ hours) |
| audio_proj | 12M | 2-layer MLP (2048 -> 2560 -> 2560), trained from scratch |
| LLM | 4B | Qwen3-4B-Instruct (frozen in Stage 1) |
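As a sanity check on the table above, the parameter count of a plain two-layer projection (2048 -> 2560 -> 2560, weights plus biases) works out to roughly 12M. A minimal sketch; the actual `audio_proj` may add an activation or normalization, which would not change the count:

```python
# Parameter count of a 2-layer MLP projector (2048 -> 2560 -> 2560).
# Assumes plain fully connected layers with biases (an assumption, not
# code from the repo).
d_audio, d_hidden, d_model = 2048, 2560, 2560

layer1 = d_audio * d_hidden + d_hidden   # weight + bias
layer2 = d_hidden * d_model + d_model    # weight + bias

total = layer1 + layer2
print(f"audio_proj params: {total / 1e6:.1f}M")  # ~11.8M, reported as 12M
```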
## Quick Start

```bash
git clone https://github.com/SeungyounShin/Sori.git
cd Sori
pip install torch torchaudio transformers peft accelerate safetensors
```
### Transcription

```python
import torch

from modeling_sori_speech import SoriSpeechForConditionalGeneration
from processing_sori_speech import SoriSpeechProcessor
from sori_speech_utils import process_mm_info

model = SoriSpeechForConditionalGeneration.from_pretrained(
    "Seungyoun/Sori-4B-Base",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    trust_remote_code=True,
)
model.eval()

processor = SoriSpeechProcessor.from_pretrained("Seungyoun/Sori-4B")

conversation = [
    {"role": "system", "content": "You are a helpful voice assistant."},
    {"role": "user", "content": [
        {"type": "audio", "audio": "path/to/audio.wav"},
        {"type": "text", "text": "Transcribe the audio."},
    ]},
]

text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=False)
inputs = processor(text=text, audio=audios, return_tensors="pt", padding=True)

# Move tensors to the model device; cast float32 inputs (the mel features) to the model dtype.
inputs = {
    k: v.to(model.device, model.dtype) if v.dtype == torch.float32 else v.to(model.device)
    for k, v in inputs.items()
    if isinstance(v, torch.Tensor)
}

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=256)

# Keep only the assistant turn of the decoded chat transcript.
print(processor.decode(output[0], skip_special_tokens=True).split("assistant")[-1].strip())
```
## Voice-Driven Tool Calling

Since the backbone LLM (Qwen3-4B-Instruct) natively supports tool use, Sori can understand spoken Korean and trigger tool calls, even at Stage 1 with a frozen LLM.
```python
SYSTEM_PROMPT = """You are a helpful voice assistant that can understand Korean speech and respond helpfully.
When the user asks a question, answer it directly. If you need external information, use the available tools.

# Tools

## get_weather
Get current weather information for a city.
Parameters:
- city (string, required): The city name (e.g. "서울", "부산")

## search_web
Search the web for information.
Parameters:
- query (string, required): The search query

To use a tool, respond with:
<tool_call>
{"name": "tool_name", "arguments": {"param": "value"}}
</tool_call>
"""

conversation = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": [
        {"type": "audio", "audio": "weather.mp3"},  # "혹시 지금 서울 날씨가 어떻게 돼?" ("What's the weather in Seoul right now?")
    ]},
]

# ... same inference code as above ...
```
Output:

```text
You seem to be asking about the current weather in Seoul. Let me check that for you.
<tool_call>
{"name": "get_weather", "arguments": {"city": "서울"}}
</tool_call>
```
The model correctly understands the spoken Korean question about Seoul's weather and generates the appropriate `get_weather` tool call. See `inference_if.py` for the full example.
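On the caller side, the emitted `<tool_call>` block has to be extracted from the generated text before dispatching the tool. A hedged sketch of such a parser; the regex and the `parse_tool_call` helper are illustrations, not code from the repo:

```python
import json
import re

# Hypothetical helper: extract the first <tool_call> JSON payload from model output.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_tool_call(text: str):
    """Return (name, arguments) from the first <tool_call> block, or None."""
    match = TOOL_CALL_RE.search(text)
    if match is None:
        return None
    payload = json.loads(match.group(1))
    return payload["name"], payload["arguments"]

output = (
    "You seem to be asking about the current weather in Seoul. Let me check that for you.\n"
    '<tool_call>\n{"name": "get_weather", "arguments": {"city": "서울"}}\n</tool_call>'
)
print(parse_tool_call(output))  # ('get_weather', {'city': '서울'})
```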
## Sample Results (Stage 1, Step 6000)

| Expected | Predicted |
|---|---|
| 먼저 코로나 확진자 현황부터 짚어보죠. | 먼저 코로나 확진자 현황부터 짚혀 보죠. |
| 네 안녕하세요. | 네 안녕하세요? |
| 문의드릴 게 있어서 전화 드렸어요. | 이렇게 네실 게 있어서 전화 드렸어요. |
| 네 고객님 혹시 가명의 이름과 전화번호 말씀 부탁드립니다 | 네, 고객님 혹시 가명의 이름과 전화번호 말씀 부탁드립니다. |
This is a Stage 1 (alignment only) checkpoint where only the 12M audio_proj MLP was trained. The LLM is frozen. Stage 2 (LoRA fine-tuning) will improve accuracy significantly.
## Training

### Two-Stage Approach

Following the methodology of LLaVA and Qwen3-Omni:

**Stage 1 - Alignment (this release):** Train only `audio_proj` to map audio features into the LLM's embedding space.
| Setting | Value |
|---|---|
| Trainable params | audio_proj only (12M / 4.7B = 0.25%) |
| Audio Encoder | Frozen |
| LLM | Frozen |
| Learning rate | 1e-4 |
| Effective batch size | 1024 (8 GPUs x 8 per-GPU batch x 16 grad accum) |
| Loss | Cross-entropy with label masking |
| Steps | 6,000 |
| Hardware | 8x H100 80GB |
**Stage 2 - Fine-tuning (planned):** Unfreeze the LLM with LoRA (r=16, alpha=32) and continue training `audio_proj`.
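The planned Stage 2 setup maps onto a standard `peft` configuration. A sketch under the stated hyperparameters; the target modules and dropout are assumptions, not confirmed by the repo:

```python
from peft import LoraConfig, get_peft_model

# Stage 2 sketch: r and lora_alpha come from the plan above; target_modules
# and lora_dropout are assumptions, not confirmed by the repo.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    lora_dropout=0.05,  # assumed
    task_type="CAUSAL_LM",
)

# model = get_peft_model(model, lora_config)  # wrap the LLM; audio_proj stays trainable
```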
### Dataset

4.1M Korean speech samples:
| Dataset | Samples | Ratio |
|---|---|---|
| Zeroth-STT-Korean | 102K | 2.5% |
| AIHub 012 - Counseling Speech | 831K | 20.0% |
| AIHub 71592 - Job Interview | 76K | 1.8% |
| AIHub 71481 - In-depth Interview | 802K | 19.3% |
| AIHub 464 - Meeting Speech | 2.3M | 56.3% |
### Loss Curve

The loss curve shows two distinct phases:

- Steps 0-2500: loss plateaus around 3.0 while `audio_proj` learns an initial mapping
- Steps 2500+: sharp drop to ~1.0 as the alignment clicks into place; transcription quality jumps dramatically
### Key Technical Detail

The mel spectrogram must match Qwen3-Omni's WhisperFeatureExtractor exactly (Slaney mel scale + log10 + normalization). Using torchaudio defaults (HTK scale + natural log) produces completely wrong features for the pretrained AuT encoder; this was the root cause of an initial training failure.
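The gap between the two conventions is easy to see numerically. A minimal sketch of the standard HTK and Slaney hertz-to-mel formulas (illustrative, not code from the repo):

```python
import math

def hz_to_mel_htk(f: float) -> float:
    """HTK convention (torchaudio default): logarithmic everywhere."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def hz_to_mel_slaney(f: float) -> float:
    """Slaney convention (librosa/Whisper default): linear below 1 kHz, log above."""
    if f < 1000.0:
        return 3.0 * f / 200.0
    return 15.0 + 27.0 * math.log(f / 1000.0) / math.log(6.4)

# The two scales disagree wildly, so mel filterbanks built with the wrong
# convention feed the pretrained encoder features it has never seen.
print(hz_to_mel_htk(1000.0))     # ~1000.0
print(hz_to_mel_slaney(1000.0))  # 15.0
```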
## License

Apache 2.0