How to make sure transcription_delay_ms is changed when serving with vLLM?

#10

by goncalomcorreia - opened 16 days ago

16 days ago

I created a Docker container for serving Voxtral-Mini-4B-Realtime through vLLM and edited tekken.json directly inside the container. I then restarted the container. When I inspected it again after the restart, my change to tekken.json was still there, but the delay between transcriptions still seemed to be 480ms instead of 2400ms.

How can I ensure that transcription_delay_ms is actually being changed? In my use case, I do not need the streaming to be as fast as 480ms; a delay of 2400ms is OK.

patrickvonplaten

Mistral AI_ org 12 days ago

Once you've edited tekken.json correctly (for example, in your HF cache where the file was downloaded), the corresponding delay should be applied automatically.
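To double-check which copies of tekken.json exist and what value each one holds, something like the following could help. It is only a minimal sketch: the helper names `find_key` and `report_delay` are made up for illustration, and since the exact position of transcription_delay_ms inside tekken.json is not stated in this thread, the code searches the whole file for the key rather than assuming a location. It only confirms what is on disk, not what the vLLM process actually loaded.

```python
import json
from pathlib import Path


def find_key(node, key, path=""):
    """Yield (json_path, value) for every occurrence of `key` in nested JSON data."""
    if isinstance(node, dict):
        for name, value in node.items():
            child = f"{path}.{name}" if path else name
            if name == key:
                yield child, value
            # Keep descending: the key may also appear deeper in the tree.
            yield from find_key(value, key, child)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_key(item, key, f"{path}[{index}]")


def report_delay(cache_root, key="transcription_delay_ms"):
    """Map each tekken.json found under cache_root to its (json_path, value) hits."""
    results = {}
    for file in sorted(Path(cache_root).expanduser().rglob("tekken.json")):
        with open(file, encoding="utf-8") as handle:
            results[str(file)] = list(find_key(json.load(handle), key))
    return results
```

For example, `report_delay("~/.cache/huggingface/hub")` scans the usual default HF cache location (an assumption; the path differs if HF_HOME or HF_HUB_CACHE is set, or inside a container). If more than one file shows up, or a file has no hit for the key, the edit may not have landed in the copy that the server reads.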

bugtoo

8 days ago

@patrickvonplaten I am also trying to modify transcription_delay_ms. I edited tekken.json in the model cache where it was downloaded, but no matter what value I put in, it always seems to wait 480ms before transcribing (I tried values from as low as 80 up to 2400). I am sure I am editing the correct file: if I change 'streaming_n_left_pad_tokens' to 0, I can see it takes effect, as the TTFT latency drops by about 80ms.
