# Sori-4B-Base

Korean speech-to-text model combining Qwen3-Omni's Audio Transformer (AuT) with the Qwen3-4B LLM.

GitHub: [SeungyounShin/Sori](https://github.com/SeungyounShin/Sori)
## Architecture

Audio (16 kHz) -> Mel Spectrogram (128 bins) -> Audio Encoder (647M, AuT) -> audio_proj MLP -> Qwen3-4B LLM -> Text
| Component | Params | Source |
|---|---|---|
| Audio Encoder | 647M | Qwen3-Omni AuT (pretrained on 7M+ hours) |
| audio_proj | 12M | 2-layer MLP (2048 -> 2560 -> 2560), trained from scratch |
| LLM | 4B | Qwen3-4B-Instruct (frozen in Stage 1) |
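As a sanity check on the table above, the parameter count of a plain two-layer projection (2048 -> 2560 -> 2560, weights plus biases) works out to roughly 12M. A minimal sketch; the actual `audio_proj` may add an activation or normalization, which would not change the count:

```python
# Parameter count of a 2-layer MLP projector (2048 -> 2560 -> 2560).
# Assumes plain fully connected layers with biases (an assumption, not
# code from the repo).
d_audio, d_hidden, d_model = 2048, 2560, 2560

layer1 = d_audio * d_hidden + d_hidden   # weight + bias
layer2 = d_hidden * d_model + d_model    # weight + bias

total = layer1 + layer2
print(f"audio_proj params: {total / 1e6:.1f}M")  # ~11.8M, reported as 12M
```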
## Quick Start

```bash
git clone https://github.com/SeungyounShin/Sori.git
cd Sori
pip install torch torchaudio transformers peft accelerate safetensors
```
### Transcription

```python
import torch

from modeling_sori_speech import SoriSpeechForConditionalGeneration
from processing_sori_speech import SoriSpeechProcessor
from sori_speech_utils import process_mm_info

model = SoriSpeechForConditionalGeneration.from_pretrained(
    "Seungyoun/Sori-4B-Base",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    trust_remote_code=True,
)
model.eval()

processor = SoriSpeechProcessor.from_pretrained("Seungyoun/Sori-4B")

conversation = [
    {"role": "system", "content": "You are a helpful voice assistant."},
    {"role": "user", "content": [
        {"type": "audio", "audio": "path/to/audio.wav"},
        {"type": "text", "text": "Transcribe the audio."},
    ]},
]

text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=False)
inputs = processor(text=text, audio=audios, return_tensors="pt", padding=True)

# Move tensors to the model device; cast float32 inputs (the mel features) to the model dtype.
inputs = {
    k: v.to(model.device, model.dtype) if v.dtype == torch.float32 else v.to(model.device)
    for k, v in inputs.items()
    if isinstance(v, torch.Tensor)
}

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=256)

# Keep only the assistant turn of the decoded chat transcript.
print(processor.decode(output[0], skip_special_tokens=True).split("assistant")[-1].strip())
```
## Voice-Driven Tool Calling

Since the backbone LLM (Qwen3-4B-Instruct) natively supports tool use, Sori can understand spoken Korean and trigger tool calls, even at Stage 1 with a frozen LLM.
```python
SYSTEM_PROMPT = """You are a helpful voice assistant that can understand Korean speech and respond helpfully.
When the user asks a question, answer it directly. If you need external information, use the available tools.

# Tools

## get_weather
Get current weather information for a city.
Parameters:
- city (string, required): The city name (e.g. "서울", "부산")

## search_web
Search the web for information.
Parameters:
- query (string, required): The search query

To use a tool, respond with:
<tool_call>
{"name": "tool_name", "arguments": {"param": "value"}}
</tool_call>
"""

conversation = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": [
        {"type": "audio", "audio": "weather.mp3"},  # "혹시 지금 서울 날씨가 어떻게 돼?" ("What's the weather in Seoul right now?")
    ]},
]

# ... same inference code as above ...
```
Output:

```text
You seem to be asking about the current weather in Seoul. Let me check that for you.
<tool_call>
{"name": "get_weather", "arguments": {"city": "서울"}}
</tool_call>
```
The model correctly understands the spoken Korean question about Seoul's weather and generates the appropriate `get_weather` tool call. See `inference_if.py` for the full example.
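On the caller side, the emitted `<tool_call>` block has to be extracted from the generated text before dispatching the tool. A hedged sketch of such a parser; the regex and the `parse_tool_call` helper are illustrations, not code from the repo:

```python
import json
import re

# Hypothetical helper: extract the first <tool_call> JSON payload from model output.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_tool_call(text: str):
    """Return (name, arguments) from the first <tool_call> block, or None."""
    match = TOOL_CALL_RE.search(text)
    if match is None:
        return None
    payload = json.loads(match.group(1))
    return payload["name"], payload["arguments"]

output = (
    "You seem to be asking about the current weather in Seoul. Let me check that for you.\n"
    '<tool_call>\n{"name": "get_weather", "arguments": {"city": "서울"}}\n</tool_call>'
)
print(parse_tool_call(output))  # ('get_weather', {'city': '서울'})
```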
## Sample Results (Stage 1, Step 6000)

| Expected | Predicted |
|---|---|
| 먼저 코로나 확진자 현황부터 짚어보죠. | 먼저 코로나 확진자 현황부터 짚혀 보죠. |
| 네 안녕하세요. | 네 안녕하세요? |
| 문의드릴 게 있어서 전화 드렸어요. | 이렇게 네실 게 있어서 전화 드렸어요. |
| 네 고객님 혹시 가명의 이름과 전화번호 말씀 부탁드립니다 | 네, 고객님 혹시 가명의 이름과 전화번호 말씀 부탁드립니다. |
This is a Stage 1 (alignment only) checkpoint where only the 12M audio_proj MLP was trained. The LLM is frozen. Stage 2 (LoRA fine-tuning) will improve accuracy significantly.
## Training

### Two-Stage Approach

Following the methodology of LLaVA and Qwen3-Omni:

**Stage 1 - Alignment (this release):** Train only `audio_proj` to map audio features into the LLM's embedding space.
| Setting | Value |
|---|---|
| Trainable params | audio_proj only (12M / 4.7B = 0.25%) |
| Audio Encoder | Frozen |
| LLM | Frozen |
| Learning rate | 1e-4 |
| Effective batch size | 1024 (8 GPUs x 8 per-GPU batch x 16 grad accum) |
| Loss | Cross-entropy with label masking |
| Steps | 6,000 |
| Hardware | 8x H100 80GB |
**Stage 2 - Fine-tuning (planned):** Unfreeze the LLM with LoRA (r=16, alpha=32) and continue training `audio_proj`.
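The planned Stage 2 setup maps onto a standard `peft` configuration. A sketch under the stated hyperparameters; the target modules and dropout are assumptions, not confirmed by the repo:

```python
from peft import LoraConfig, get_peft_model

# Stage 2 sketch: r and lora_alpha come from the plan above; target_modules
# and lora_dropout are assumptions, not confirmed by the repo.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    lora_dropout=0.05,  # assumed
    task_type="CAUSAL_LM",
)

# model = get_peft_model(model, lora_config)  # wrap the LLM; audio_proj stays trainable
```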
### Dataset

4.1M Korean speech samples:
| Dataset | Samples | Ratio |
|---|---|---|
| Zeroth-STT-Korean | 102K | 2.5% |
| AIHub 012 - Counseling Speech | 831K | 20.0% |
| AIHub 71592 - Job Interview | 76K | 1.8% |
| AIHub 71481 - In-depth Interview | 802K | 19.3% |
| AIHub 464 - Meeting Speech | 2.3M | 56.3% |
### Loss Curve

The loss curve shows two distinct phases:

- Steps 0-2500: loss plateaus around 3.0 while `audio_proj` learns an initial mapping
- Steps 2500+: sharp drop to ~1.0 as the alignment clicks into place; transcription quality jumps dramatically
### Key Technical Detail

The mel spectrogram must match Qwen3-Omni's WhisperFeatureExtractor exactly (Slaney mel scale + log10 + normalization). Using torchaudio defaults (HTK scale + natural log) produces completely wrong features for the pretrained AuT encoder; this was the root cause of an initial training failure.
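The gap between the two conventions is easy to see numerically. A minimal sketch of the standard HTK and Slaney hertz-to-mel formulas (illustrative, not code from the repo):

```python
import math

def hz_to_mel_htk(f: float) -> float:
    """HTK convention (torchaudio default): logarithmic everywhere."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def hz_to_mel_slaney(f: float) -> float:
    """Slaney convention (librosa/Whisper default): linear below 1 kHz, log above."""
    if f < 1000.0:
        return 3.0 * f / 200.0
    return 15.0 + 27.0 * math.log(f / 1000.0) / math.log(6.4)

# The two scales disagree wildly, so mel filterbanks built with the wrong
# convention feed the pretrained encoder features it has never seen.
print(hz_to_mel_htk(1000.0))     # ~1000.0
print(hz_to_mel_slaney(1000.0))  # 15.0
```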
## License

Apache 2.0