| --- |
| license: other |
| base_model: facebook/sam-audio-small |
| tags: |
| - onnx |
| - audio |
| - sam-audio |
| - source-separation |
| - audio-visual |
| --- |
| |
| # SAM-Audio ONNX (Small) |
|
|
| ONNX-converted models for [SAM-Audio](https://github.com/facebookresearch/sam-audio) (`facebook/sam-audio-small`), Meta's Segment Anything model for audio source separation. |
|
|
| ## Model Files |
|
|
| | File | Description | Size | |
| |------|-------------|------| |
| | `dacvae_encoder.onnx` | Audio encoder (48kHz → latent) | ~110 MB | |
| | `dacvae_decoder.onnx` | Audio decoder (latent → 48kHz) | ~320 MB | |
| | `t5_encoder.onnx` | Text encoder (T5-base) | ~440 MB | |
| | `dit_single_step.onnx` | DiT denoiser (single ODE step) | ~2 GB | |
| | `vision_encoder.onnx` | Vision encoder (PE-Core-L14-336, CLIP-based) | ~1.2 GB | |
| | `peaframe.onnx` | PEAFrame span predictor (audio-text similarity) | ~5.8 GB | |
| | `tokenizer/` | SentencePiece tokenizer files (T5) | - | |
| | `peaframe_tokenizer/` | ModernBERT tokenizer files (PEAFrame) | - | |
| | `peaframe_config.json` | PEAFrame scaling parameters | - | |
| | `clap_audio_encoder.onnx` | CLAP audio encoder (HTSAT-tiny) | ~118 MB | |
| | `clap_text_encoder.onnx` | CLAP text encoder (RoBERTa-base) | ~481 MB | |
| | `clap_tokenizer/` | RoBERTa tokenizer files (CLAP) | - | |
| | `clap_config.json` | CLAP audio preprocessing parameters | - | |
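
For planning disk space, the approximate sizes in the table can simply be added up. The sketch below is plain arithmetic over the listed figures, taking 1 GB as 1000 MB; tokenizer and config files are left out because the table gives no size for them. The `AUDIO_ONLY` subset is an assumption about which files a text-prompted, audio-only run needs, not something the table states.

```python
# Approximate sizes copied from the table above, in MB (1 GB taken as 1000 MB).
APPROX_SIZES_MB = {
    "dacvae_encoder.onnx": 110,
    "dacvae_decoder.onnx": 320,
    "t5_encoder.onnx": 440,
    "dit_single_step.onnx": 2000,
    "vision_encoder.onnx": 1200,
    "peaframe.onnx": 5800,
    "clap_audio_encoder.onnx": 118,
    "clap_text_encoder.onnx": 481,
}

# Assumed minimal set for audio-only separation (hypothetical grouping).
AUDIO_ONLY = [
    "dacvae_encoder.onnx",
    "dacvae_decoder.onnx",
    "t5_encoder.onnx",
    "dit_single_step.onnx",
]


def total_size_gb(files):
    """Sum the approximate sizes of the given files and return gigabytes."""
    return sum(APPROX_SIZES_MB[name] for name in files) / 1000


print(f"All ONNX files: ~{total_size_gb(APPROX_SIZES_MB):.1f} GB")
print(f"Audio-only subset: ~{total_size_gb(AUDIO_ONLY):.1f} GB")
```

Since PEAFrame accounts for more than half of the total, skipping it when span prediction is not needed saves the most space.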
|
|
| ## Installation |
|
|
| ```bash |
| pip install onnxruntime sentencepiece torchaudio torchvision torchcodec soundfile transformers |
| # For CUDA support, install the GPU build instead of the CPU-only `onnxruntime` package: |
| pip install onnxruntime-gpu |
| ``` |
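
To check which backend an installation will actually use, ONNX Runtime exposes `get_available_providers()`. The sketch below shows one way to prefer CUDA and fall back to CPU; `pick_providers` is a hypothetical helper written for illustration and is not part of `onnx_inference.py`.

```python
def pick_providers(available):
    """Return execution providers in preference order: CUDA first, CPU as fallback."""
    preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    return [name for name in preferred if name in available]


if __name__ == "__main__":
    try:
        import onnxruntime as ort
    except ImportError:
        print("onnxruntime is not installed")
    else:
        providers = pick_providers(ort.get_available_providers())
        print("Using providers:", providers)
        # A session could then be created with the chosen providers, e.g.:
        # session = ort.InferenceSession("dit_single_step.onnx", providers=providers)
```

If only `CPUExecutionProvider` is listed after installing `onnxruntime-gpu`, the CUDA libraries are probably not visible to the process.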
|
|
| ## Usage Examples |
|
|
| ### Audio-Only Separation |
| ```bash |
| python onnx_inference.py \ |
| --audio input.wav \ |
| --text "a person speaking" \ |
| --output separated.wav |
| ``` |
|
|
| ### Video-Guided Separation |
| ```bash |
| python onnx_inference.py \ |
| --video input.mp4 \ |
| --text "the sound of typing" \ |
| --output separated.wav |
| ``` |
|
|
| ### Automatic Span Prediction |
| Use PEAFrame to automatically detect time spans matching your text description: |
| ```bash |
| python onnx_inference.py \ |
| --audio input.wav \ |
| --text "horn" \ |
| --predict-spans \ |
| --output separated.wav |
| ``` |
|
|
| This is ideal for long audio where you want to isolate sounds that appear intermittently. The model will automatically detect when the target sound occurs and focus on those segments. |
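
As a rough illustration of how per-frame similarity scores could become time spans, the sketch below groups consecutive frames that reach a threshold (0.3, the default listed under Model Specifications). The function name, the toy scores, and the 0.5 s frame duration are assumptions made for the example; the actual `--predict-spans` logic may differ.

```python
def scores_to_spans(scores, frame_seconds, threshold=0.3):
    """Group consecutive frames with score >= threshold into (start, end) spans in seconds."""
    spans = []
    start = None
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i  # a new span opens at this frame
        elif score < threshold and start is not None:
            spans.append((start * frame_seconds, i * frame_seconds))
            start = None  # the span closed just before this frame
    if start is not None:
        # The last span runs to the end of the audio.
        spans.append((start * frame_seconds, len(scores) * frame_seconds))
    return spans


# Toy scores, one per 0.5 s frame (hypothetical values).
scores = [0.1, 0.4, 0.6, 0.2, 0.1, 0.5, 0.7, 0.9]
print(scores_to_spans(scores, frame_seconds=0.5))  # [(0.5, 1.5), (2.5, 4.0)]
```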
|
|
| ### Manual Anchors |
| Specify exact time spans to focus on (positive anchors) or ignore (negative anchors): |
| ```bash |
| # Focus on specific time ranges |
| python onnx_inference.py \ |
| --audio input.wav \ |
| --text "person speaking" \ |
| --anchor + 4.5 7.0 \ |
| --anchor + 12.0 15.5 \ |
| --output separated.wav |
| |
| # Ignore specific time ranges |
| python onnx_inference.py \ |
| --audio input.wav \ |
| --text "background music" \ |
| --anchor - 0.0 3.0 \ |
| --output separated.wav |
| ``` |
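
Anchors are given in seconds, while the model works on latent frames. Assuming frames follow the 48 kHz sample rate and 1536-sample hop length listed under Model Specifications (31.25 frames per second), a span can be converted as sketched below. `span_to_frames` and its rounding choice (floor the start, ceil the end) are illustrative assumptions, not the script's confirmed behavior.

```python
import math

SAMPLE_RATE = 48_000   # Hz, from Model Specifications
HOP_LENGTH = 1_536     # samples per latent frame, from Model Specifications
FRAMES_PER_SECOND = SAMPLE_RATE / HOP_LENGTH  # 31.25


def span_to_frames(start_s, end_s):
    """Convert a time span in seconds to a (first, last) latent frame range that covers it."""
    first = math.floor(start_s * FRAMES_PER_SECOND)
    last = math.ceil(end_s * FRAMES_PER_SECOND)
    return first, last


# The positive anchors from the first command above.
anchors = [("+", 4.5, 7.0), ("+", 12.0, 15.5)]
for sign, start_s, end_s in anchors:
    print(sign, span_to_frames(start_s, end_s))
# + (140, 219)
# + (375, 485)
```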
|
|
| ### CLAP Reranking |
| Generate multiple candidates and select the best using CLAP audio-text similarity: |
| ```bash |
| python onnx_inference.py \ |
| --audio input.wav \ |
| --text "person speaking" \ |
| --rerank \ |
| --num-candidates 4 \ |
| --output separated.wav |
| ``` |
|
|
| Reranking generates multiple separation candidates with different random seeds, scores each one with CLAP audio-text similarity, and keeps the candidate that best matches the text description. This can improve quality, but inference time grows with the number of candidates (roughly 4x with the default of 4). |
|
|
| Options: |
| - `--rerank` - Enable reranking mode |
| - `--num-candidates N` - Number of candidates (default: 4) |
| - `--rerank-seed SEED` - Random seed for reproducibility |
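
The selection step can be sketched as a cosine-similarity argmax between the text embedding and each candidate's audio embedding. The toy 3-dimensional vectors below stand in for the 512-dimensional CLAP embeddings, and `rerank` is a hypothetical helper; the script's own scoring code may be organized differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rerank(text_embedding, candidate_embeddings):
    """Return (index of best candidate, list of similarity scores)."""
    scores = [cosine_similarity(text_embedding, c) for c in candidate_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy embeddings: one text prompt and four separation candidates.
text = [1.0, 0.0, 0.0]
candidates = [
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.5, 0.5, 0.0],
    [-1.0, 0.0, 0.0],
]
best, scores = rerank(text, candidates)
print("best candidate:", best)  # best candidate: 1
```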
|
|
| ### Visual Prompting with SAM3 Mask |
| ```bash |
| # First generate a mask with SAM3 (see generate_sam3_mask.py) |
| python onnx_inference.py \ |
| --video input.mp4 \ |
| --mask object_mask.mp4 \ |
| --text "" \ |
| --output isolated.wav \ |
| --output-video visualization.mp4 |
| ``` |
|
|
| ### Using a Custom Model Directory |
| ```bash |
| python onnx_inference.py \ |
| --video input.mp4 \ |
| --text "woman speaking" \ |
| --model-dir ./my_onnx_models \ |
| --output separated.wav |
| ``` |
|
|
| ## Model Specifications |
|
|
| - **Audio Sample Rate**: 48kHz |
| - **Audio Hop Length**: 1536 samples |
| - **Vision Input Size**: 336×336 pixels |
| - **Text Encoder**: T5-base (768-dim) |
| - **Vision Encoder**: PE-Core-L14-336 (1024-dim) |
| - **ODE Solver**: Midpoint method (configurable steps, default 16) |
| - **PEAFrame**: Audio-text similarity model for span detection |
|   - Uses ModernBERT tokenizer |
|   - Processes audio in ~3.3s chunks with 50% overlap |
|   - Default threshold: 0.3 |
| - **CLAP**: Audio-text similarity model for candidate reranking |
|   - Audio encoder: HTSAT-tiny |
|   - Text encoder: RoBERTa-base |
|   - Embedding dimension: 512 |
|   - Default candidates: 4 |
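
For readers unfamiliar with the midpoint method named in the ODE Solver entry, here is a minimal sketch of the textbook rule on a toy problem. In the real pipeline the `velocity` function would presumably be a call to `dit_single_step.onnx`; that wiring, the function names, and the integration interval from 0 to 1 are assumptions. Note that the textbook rule evaluates the velocity twice per step, so 16 steps would mean 32 evaluations.

```python
import math


def midpoint_solve(velocity, x0, num_steps=16, t0=0.0, t1=1.0):
    """Integrate dx/dt = velocity(t, x) from t0 to t1 with the explicit midpoint rule."""
    x = list(x0)
    h = (t1 - t0) / num_steps
    for step in range(num_steps):
        t = t0 + step * h
        k1 = velocity(t, x)                                   # slope at the start of the step
        x_mid = [xi + 0.5 * h * ki for xi, ki in zip(x, k1)]  # half step forward
        k2 = velocity(t + 0.5 * h, x_mid)                     # slope at the midpoint
        x = [xi + h * ki for xi, ki in zip(x, k2)]            # full step using midpoint slope
    return x


# Toy check: dx/dt = x with x(0) = 1 has the exact solution x(1) = e.
result = midpoint_solve(lambda t, x: list(x), [1.0])
print(result[0], "vs exact", math.e)
```

Raising `num_steps` reduces the integration error at the cost of proportionally more model evaluations.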
|
|
| ## Exporting Models |
|
|
| Export scripts are in the `onnx_export/` directory. |
|
|
| ### Export All Models |
| ```bash |
| python -m onnx_export.export_all --output_dir ./onnx_models |
| ``` |
|
|
| ### Export Individual Components |
| ```bash |
| # DiT Transformer (supports FP16 for 50% size reduction) |
| python -m onnx_export.export_dit --output-dir ./onnx_models --model-id facebook/sam-audio-small |
| python -m onnx_export.export_dit --output-dir ./onnx_models --model-id facebook/sam-audio-large --fp16 --device cuda |
| |
| # DACVAE (encoder + decoder) |
| python -m onnx_export.export_dacvae --output-dir ./onnx_models --model-id facebook/sam-audio-small |
| |
| # T5 Text Encoder |
| python -m onnx_export.export_t5 --output-dir ./onnx_models --model-id facebook/sam-audio-small |
| |
| # Vision Encoder |
| python -m onnx_export.export_vision --model facebook/sam-audio-small --output ./onnx_models |
| |
| # PEAFrame Span Predictor |
| python -m onnx_export.export_peaframe --output-dir ./onnx_models --verify |
| |
| # CLAP Reranking (audio + text encoders) |
| python -m onnx_export.export_clap --output-dir ./onnx_models --verify |
| ``` |
|
|
| ### FP16 Quantization (for large models) |
|
|
| For the large model (sam-audio-large), use `--fp16 --device cuda` during DiT export to reduce size by 50%: |
|
|
| ```bash |
| # Export DiT in FP16 (11.7GB → 5.9GB) |
| python -m onnx_export.export_dit \ |
| --output-dir ./onnx_models_large_fp16 \ |
| --model-id facebook/sam-audio-large \ |
| --fp16 \ |
| --device cuda |
| ``` |
|
|
| The inference script automatically detects FP16 models and handles input conversion. |
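
One way such detection could work is to read the input type strings that ONNX Runtime reports for a session and cast inputs accordingly. The sketch below operates on those strings only, so it runs without a model file; both helpers are hypothetical and not taken from `onnx_inference.py`.

```python
def is_fp16_model(input_types):
    """True if any reported input type is a float16 tensor."""
    return any(t == "tensor(float16)" for t in input_types)


def numpy_dtype_for(ort_type):
    """Map an ONNX Runtime type string to the matching NumPy dtype name."""
    mapping = {
        "tensor(float16)": "float16",
        "tensor(float)": "float32",
        "tensor(int64)": "int64",
        "tensor(bool)": "bool",
    }
    return mapping[ort_type]


# With onnxruntime installed, the strings would come from a session, e.g.:
#   types = [arg.type for arg in session.get_inputs()]
types = ["tensor(float16)", "tensor(int64)"]  # example values
print(is_fp16_model(types))                   # True
print([numpy_dtype_for(t) for t in types])    # ['float16', 'int64']
```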
|
|
| ## Export Scripts Reference |
|
|
| | Script | Description | |
| |--------|-------------| |
| | `export_all.py` | Export all components at once | |
| | `export_dit.py` | DiT transformer with FP16 support | |
| | `export_dacvae.py` | DACVAE encoder and decoder | |
| | `export_t5.py` | T5 text encoder | |
| | `export_vision.py` | Vision encoder (CLIP-based) | |
| | `export_peaframe.py` | PEAFrame span predictor + tokenizer | |
| | `export_clap.py` | CLAP audio + text encoders for reranking | |
| | `standalone_config.py` | Config classes for standalone export | |
|
|
| ## License |
|
|
| SAM-Audio is released under the [CC-BY-NC 4.0 license](https://creativecommons.org/licenses/by-nc/4.0/). See the [original repository](https://huggingface.co/facebook/sam-audio-small) for full terms. |
|
|
| ## Acknowledgments |
|
|
| Original model by [Meta AI Research](https://github.com/facebookresearch/sam-audio). |
|
|
|
|