# Sori-4B-Base

Korean speech-to-text model combining Qwen3-Omni's Audio Transformer (AuT) with the Qwen3-4B LLM.

GitHub: [SeungyounShin/Sori](https://github.com/SeungyounShin/Sori)

## Architecture

```
Audio (16 kHz) -> Mel Spectrogram (128 bins) -> Audio Encoder (647M, AuT) -> audio_proj MLP -> Qwen3-4B LLM -> Text
```

| Component     | Params | Source                                                   |
|---------------|--------|----------------------------------------------------------|
| Audio Encoder | 647M   | Qwen3-Omni AuT (pretrained on 7M+ hours)                 |
| audio_proj    | 12M    | 2-layer MLP (2048 -> 2560 -> 2560), trained from scratch |
| LLM           | 4B     | Qwen3-4B-Instruct (frozen in Stage 1)                    |
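The audio_proj bridge can be sketched as a plain two-layer MLP. This is an illustrative reconstruction from the stated dimensions (2048 -> 2560 -> 2560); the GELU activation is an assumption, not confirmed by the release.

```python
import torch
import torch.nn as nn

class AudioProjector(nn.Module):
    """Sketch of the 12M-parameter audio_proj bridge (2048 -> 2560 -> 2560).

    The GELU activation is an assumption; the released weights may use a
    different nonlinearity.
    """

    def __init__(self, audio_dim: int = 2048, llm_dim: int = 2560):
        super().__init__()
        self.fc1 = nn.Linear(audio_dim, llm_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(llm_dim, llm_dim)

    def forward(self, audio_features: torch.Tensor) -> torch.Tensor:
        # (batch, frames, audio_dim) -> (batch, frames, llm_dim)
        return self.fc2(self.act(self.fc1(audio_features)))

proj = AudioProjector()
n_params = sum(p.numel() for p in proj.parameters())
print(f"{n_params / 1e6:.1f}M params")  # 11.8M, matching the ~12M in the table
```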

## Quick Start

```bash
git clone https://github.com/SeungyounShin/Sori.git
cd Sori
pip install torch torchaudio transformers peft accelerate safetensors
```

## Transcription

```python
import torch

from modeling_sori_speech import SoriSpeechForConditionalGeneration
from processing_sori_speech import SoriSpeechProcessor
from sori_speech_utils import process_mm_info

model = SoriSpeechForConditionalGeneration.from_pretrained(
    "Seungyoun/Sori-4B-Base",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    trust_remote_code=True,
)
model.eval()
processor = SoriSpeechProcessor.from_pretrained("Seungyoun/Sori-4B-Base")

conversation = [
    {"role": "system", "content": "You are a helpful voice assistant."},
    {"role": "user", "content": [
        {"type": "audio", "audio": "path/to/audio.wav"},
        {"type": "text", "text": "Transcribe the audio."},
    ]},
]

text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=False)
inputs = processor(text=text, audio=audios, return_tensors="pt", padding=True)

# Move tensors to the model's device; cast floating-point inputs (the mel
# features) to the model's dtype so they match the bf16 weights.
inputs = {
    k: v.to(model.device, model.dtype) if v.is_floating_point() else v.to(model.device)
    for k, v in inputs.items()
    if isinstance(v, torch.Tensor)
}

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=256)

# Keep only the assistant's reply from the decoded chat transcript.
print(processor.decode(output[0], skip_special_tokens=True).split("assistant")[-1].strip())
```

## Voice-Driven Tool Calling

Because the backbone LLM (Qwen3-4B-Instruct) natively supports tool use, Sori can understand spoken Korean and trigger tool calls, even at Stage 1 with a frozen LLM.

```python
SYSTEM_PROMPT = """You are a helpful voice assistant that can understand Korean speech and respond helpfully.
When the user asks a question, answer it directly. If you need external information, use the available tools.

# Tools

## get_weather
Get current weather information for a city.
Parameters:
- city (string, required): The city name (e.g. "μ„œμšΈ", "λΆ€μ‚°")

## search_web
Search the web for information.
Parameters:
- query (string, required): The search query

To use a tool, respond with:
<tool_call>
{"name": "tool_name", "arguments": {"param": "value"}}
</tool_call>
"""

conversation = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": [
        # Audio says: "ν˜Ήμ‹œ μ§€κΈˆ μ„œμšΈ 날씨가 μ–΄λ–»κ²ŒλΌ?"
        # ("What's the weather like in Seoul right now?")
        {"type": "audio", "audio": "weather.mp3"},
    ]},
]

# ... same inference code as above ...
```

Output:

```text
You seem to be asking about the current weather in Seoul. Let me check that for you.

<tool_call>
{"name": "get_weather", "arguments": {"city": "μ„œμšΈ"}}
</tool_call>
```

The model correctly understands the spoken Korean question about Seoul's weather and generates the appropriate `get_weather` tool call. See `inference_if.py` for the full example.
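To act on such a response, the caller still has to extract the JSON payload from the `<tool_call>` block. A minimal stdlib-only sketch of that step (the regex and the `extract_tool_calls` helper are illustrative, not part of the Sori API):

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_tool_calls(text: str) -> list[dict]:
    """Pull tool-call JSON payloads out of the model's generated text."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            calls.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            pass  # skip malformed payloads
    return calls

reply = """You seem to be asking about the current weather in Seoul. Let me check that for you.

<tool_call>
{"name": "get_weather", "arguments": {"city": "μ„œμšΈ"}}
</tool_call>"""

print(extract_tool_calls(reply))
# [{'name': 'get_weather', 'arguments': {'city': 'μ„œμšΈ'}}]
```

The extracted dict can then be dispatched to whatever implements `get_weather` or `search_web`, with the result fed back as a tool message.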

## Sample Results (Stage 1, Step 6000)

| Expected | Predicted |
|----------|-----------|
| λ¨Όμ € μ½”λ‘œλ‚˜ ν™•μ§„μž ν˜„ν™©λΆ€ν„° μ§šμ–΄λ³΄μ£ . | λ¨Όμ € μ½”λ‘œλ‚˜ ν™•μ§„μž ν˜„ν™©λΆ€ν„° μ§‘νžˆ 보죠. |
| λ„€ μ•ˆλ…•ν•˜μ„Έμš”. | λ„€ μ•ˆλ…•ν•˜μ„Έμš”? |
| λ¬Έμ˜λ“œλ¦΄ 게 μžˆμ–΄μ„œ μ „ν™” λ“œλ Έμ–΄μš”. | μ΄λ ‡κ²Œ λ“€μ„κ²Œ μžˆμ–΄μ„œ μ „ν™” λ“œλ Έμ–΄μš”. |
| λ„€ κ³ κ°λ‹˜ ν˜Ήμ‹œ 개λͺ… μ „ 이름과 μ „ν™”λ²ˆν˜Έ 말씀 λΆ€νƒλ“œλ¦½λ‹ˆλ‹€ | λ„€, κ³ κ°λ‹˜ ν˜Ήμ‹œ 개λͺ… μ „ 이름과 μ „ν™”λ²ˆν˜Έ 말씀 λΆ€νƒλ“œλ¦½λ‹ˆλ‹€. |

This is a Stage 1 (alignment-only) checkpoint: only the 12M audio_proj MLP was trained, and the LLM is frozen. Stage 2 (LoRA fine-tuning) is expected to improve accuracy significantly.

## Training

### Two-Stage Approach

Following the methodology of LLaVA and Qwen3-Omni:

**Stage 1 - Alignment (this release):** Train only audio_proj to map audio features into the LLM's embedding space.

| Setting              | Value                                             |
|----------------------|---------------------------------------------------|
| Trainable params     | audio_proj only (12M / 4.7B = 0.25%)              |
| Audio Encoder        | Frozen                                            |
| LLM                  | Frozen                                            |
| Learning rate        | 1e-4                                              |
| Effective batch size | 1024 (8 per device x 8 GPUs x 16 grad accum)      |
| Loss                 | Cross-entropy with label masking                  |
| Steps                | 6,000                                             |
| Hardware             | 8x H100 80GB                                      |
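"Cross-entropy with label masking" is the standard trick of setting non-target positions (system prompt, audio placeholder tokens) to `-100` so they are ignored by the loss and only transcript tokens train the projector. A toy illustration, with made-up shapes and values:

```python
import torch
import torch.nn.functional as F

# Toy example of label masking: positions covered by the prompt and audio
# tokens are set to -100, so cross-entropy ignores them and only transcript
# tokens contribute to the Stage 1 loss. Not the real training pipeline.
vocab_size, seq_len = 8, 6
logits = torch.randn(1, seq_len, vocab_size)
labels = torch.tensor([[-100, -100, -100, 4, 2, 7]])  # first 3 positions masked

loss = F.cross_entropy(
    logits.view(-1, vocab_size), labels.view(-1), ignore_index=-100
)

# Identical to averaging cross-entropy over just the unmasked positions:
mask = labels != -100
manual = F.cross_entropy(logits[mask], labels[mask])
assert torch.allclose(loss, manual)
```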

**Stage 2 - Fine-tuning (planned):** Unfreeze the LLM with LoRA (r=16, alpha=32) and continue training audio_proj.

### Dataset

4.1M Korean speech samples:

| Dataset                          | Samples | Ratio |
|----------------------------------|---------|-------|
| Zeroth-STT-Korean                | 102K    | 2.5%  |
| AIHub 012 - Counseling Speech    | 831K    | 20.0% |
| AIHub 71592 - Job Interview      | 76K     | 1.8%  |
| AIHub 71481 - In-depth Interview | 802K    | 19.3% |
| AIHub 464 - Meeting Speech       | 2.3M    | 56.3% |

### Loss Curve

*(Training loss plot)*

The curve shows two distinct phases:

- **Steps 0-2500:** loss plateaus around 3.0 while audio_proj learns the initial mapping
- **Steps 2500+:** sharp drop to ~1.0 as alignment clicks into place; transcription quality jumps dramatically

### Key Technical Detail

The mel spectrogram must match Qwen3-Omni's `WhisperFeatureExtractor` exactly (Slaney mel scale + log10 + normalization). Using torchaudio defaults (HTK scale + natural log) produces completely wrong features for the pretrained AuT encoder; this mismatch was the root cause of the initial training failure.

## License

Apache 2.0
