	---
	license: other
	base_model: facebook/sam-audio-small
	tags:
	- onnx
	- audio
	- sam-audio
	- source-separation
	- audio-visual
	---

	# SAM-Audio ONNX (Small)

	ONNX-converted models for [SAM-Audio](https://github.com/facebookresearch/sam-audio) (facebook/sam-audio-small), Meta's Segment Anything model for audio source separation.

	## Model Files

	| File | Description | Size |
	|------|-------------|------|
	| `dacvae_encoder.onnx` | Audio encoder (48kHz → latent) | ~110 MB |
	| `dacvae_decoder.onnx` | Audio decoder (latent → 48kHz) | ~320 MB |
	| `t5_encoder.onnx` | Text encoder (T5-base) | ~440 MB |
	| `dit_single_step.onnx` | DiT denoiser (single ODE step) | ~2 GB |
	| `vision_encoder.onnx` | Vision encoder (CLIP-based) | ~1.2 GB |
	| `peaframe.onnx` | PEAFrame span predictor (audio-text similarity) | ~5.8 GB |
	| `tokenizer/` | SentencePiece tokenizer files (T5) | - |
	| `peaframe_tokenizer/` | ModernBERT tokenizer files (PEAFrame) | - |
	| `peaframe_config.json` | PEAFrame scaling parameters | - |
	| `clap_audio_encoder.onnx` | CLAP audio encoder (HTSAT-tiny) | ~118 MB |
	| `clap_text_encoder.onnx` | CLAP text encoder (RoBERTa-base) | ~481 MB |
	| `clap_tokenizer/` | RoBERTa tokenizer files (CLAP) | - |
	| `clap_config.json` | CLAP audio preprocessing parameters | - |

	## Installation

	```bash
	pip install onnxruntime sentencepiece torchaudio torchvision torchcodec soundfile transformers
	# For CUDA support, install onnxruntime-gpu in place of onnxruntime:
	pip install onnxruntime-gpu
	```

	## Usage Examples

	### Audio-Only Separation
	```bash
	python onnx_inference.py \
	--audio input.wav \
	--text "a person speaking" \
	--output separated.wav
	```

	### Video-Guided Separation
	```bash
	python onnx_inference.py \
	--video input.mp4 \
	--text "the sound of typing" \
	--output separated.wav
	```

	### Automatic Span Prediction
	Use PEAFrame to automatically detect time spans matching your text description:
	```bash
	python onnx_inference.py \
	--audio input.wav \
	--text "horn" \
	--predict-spans \
	--output separated.wav
	```

	This suits long recordings in which the target sound appears only intermittently: PEAFrame detects when the sound occurs, and separation focuses on those segments.
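
One way to picture span prediction is as thresholding per-chunk similarity scores and merging neighbouring hits into time spans. The sketch below is a minimal illustration, not code from `onnx_inference.py`: it takes the chunk length (~3.3 s), 50% overlap, and 0.3 threshold from the Model Specifications section, while the function name `scores_to_spans`, the merge rule, and the example scores are all assumptions.

```python
from typing import List, Tuple

CHUNK_SECONDS = 3.3   # approximate chunk length from Model Specifications
OVERLAP = 0.5         # 50% overlap between consecutive chunks
THRESHOLD = 0.3       # default detection threshold


def scores_to_spans(scores: List[float],
                    chunk_seconds: float = CHUNK_SECONDS,
                    overlap: float = OVERLAP,
                    threshold: float = THRESHOLD) -> List[Tuple[float, float]]:
    """Merge chunks scoring at or above the threshold into (start, end) spans in seconds."""
    hop = chunk_seconds * (1.0 - overlap)  # time between consecutive chunk starts
    spans: List[Tuple[float, float]] = []
    for i, score in enumerate(scores):
        if score < threshold:
            continue
        start, end = i * hop, i * hop + chunk_seconds
        if spans and start <= spans[-1][1]:
            # Overlaps or touches the previous span: extend it
            spans[-1] = (spans[-1][0], end)
        else:
            spans.append((start, end))
    return spans


# Hypothetical per-chunk similarity scores (not real PEAFrame outputs)
example_scores = [0.05, 0.42, 0.51, 0.10, 0.08, 0.02, 0.35]
spans = scores_to_spans(example_scores)
```

With these made-up scores, chunks 1 and 2 merge into one span and chunk 6 forms a second, which is the kind of intermittent pattern the `--predict-spans` option targets.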

	### Manual Anchors
	Specify exact time spans to focus on (positive anchors) or ignore (negative anchors):
	```bash
	# Focus on specific time ranges
	python onnx_inference.py \
	--audio input.wav \
	--text "person speaking" \
	--anchor + 4.5 7.0 \
	--anchor + 12.0 15.5 \
	--output separated.wav

	# Ignore specific time ranges
	python onnx_inference.py \
	--audio input.wav \
	--text "background music" \
	--anchor - 0.0 3.0 \
	--output separated.wav
	```
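
Anchors are given in seconds, whereas the model operates on latent frames. Using the 48 kHz sample rate and 1536-sample hop length listed under Model Specifications, one second corresponds to 48000 / 1536 = 31.25 frames. The following sketch shows that conversion; the function name `anchor_to_frames` and the floor/ceil rounding convention are assumptions, not taken from the inference script.

```python
import math
from typing import Tuple

SAMPLE_RATE = 48_000   # Hz, from Model Specifications
HOP_LENGTH = 1_536     # samples per latent frame, from Model Specifications


def anchor_to_frames(start_s: float, end_s: float) -> Tuple[int, int]:
    """Map a time span in seconds to a half-open [first, last) latent-frame range."""
    frames_per_second = SAMPLE_RATE / HOP_LENGTH  # 31.25
    first = math.floor(start_s * frames_per_second)  # round the start down
    last = math.ceil(end_s * frames_per_second)      # round the end up
    return first, last


# The first positive anchor from the example above: 4.5 s to 7.0 s
frame_range = anchor_to_frames(4.5, 7.0)
```

Rounding outward keeps the whole requested span covered at the cost of up to one extra frame (32 ms) on each side.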

	### CLAP Reranking
	Generate multiple candidates and select the best using CLAP audio-text similarity:
	```bash
	python onnx_inference.py \
	--audio input.wav \
	--text "person speaking" \
	--rerank \
	--num-candidates 4 \
	--output separated.wav
	```

	Reranking generates several separation candidates with different random seeds, scores each against the text description with CLAP audio-text similarity, and keeps the best match. This can improve quality, but inference time grows roughly in proportion to the number of candidates (about 4x with the default of 4).

	Options:
	- `--rerank` - Enable reranking mode
	- `--num-candidates N` - Number of candidates (default: 4)
	- `--rerank-seed SEED` - Random seed for reproducibility
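
The selection step can be sketched as a cosine-similarity comparison between each candidate's audio embedding and the text embedding. This is a minimal illustration under assumptions: the README only states that CLAP scores audio-text similarity, so the use of cosine similarity, the function name `pick_best_candidate`, and the random stand-in embeddings are all hypothetical.

```python
import numpy as np


def pick_best_candidate(audio_embeddings: np.ndarray, text_embedding: np.ndarray) -> int:
    """Return the index of the candidate whose embedding is most similar to the text."""
    # L2-normalize so the dot product equals cosine similarity
    audio = audio_embeddings / np.linalg.norm(audio_embeddings, axis=1, keepdims=True)
    text = text_embedding / np.linalg.norm(text_embedding)
    similarities = audio @ text  # shape: (num_candidates,)
    return int(np.argmax(similarities))


# Stand-in data: 4 candidates with 512-dim embeddings (random, not real CLAP outputs)
rng = np.random.default_rng(0)
text_emb = rng.standard_normal(512)
candidate_embs = rng.standard_normal((4, 512))
# Make candidate 2 deliberately close to the text embedding
candidate_embs[2] = text_emb + 0.1 * rng.standard_normal(512)
best = pick_best_candidate(candidate_embs, text_emb)
```

In a real run the rows of `candidate_embs` would come from the CLAP audio encoder applied to each separated candidate, and `text_emb` from the CLAP text encoder.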

	### Visual Prompting with SAM3 Mask
	```bash
	# First generate a mask with SAM3 (see generate_sam3_mask.py)
	python onnx_inference.py \
	--video input.mp4 \
	--mask object_mask.mp4 \
	--text "" \
	--output isolated.wav \
	--output-video visualization.mp4
	```

	### Using a Custom Model Directory
	```bash
	python onnx_inference.py \
	--video input.mp4 \
	--text "woman speaking" \
	--model-dir ./my_onnx_models \
	--output separated.wav
	```

	## Model Specifications

	- Audio Sample Rate: 48kHz
	- Audio Hop Length: 1536 samples
	- Vision Input Size: 336×336 pixels
	- Text Encoder: T5-base (768-dim)
	- Vision Encoder: PE-Core-L14-336 (1024-dim)
	- ODE Solver: Midpoint method (configurable steps, default 16)
	- PEAFrame: Audio-text similarity model for span detection
	  - Uses ModernBERT tokenizer
	  - Processes audio in ~3.3s chunks with 50% overlap
	  - Default threshold: 0.3
	- CLAP: Audio-text similarity model for candidate reranking
	  - Audio encoder: HTSAT-tiny
	  - Text encoder: RoBERTa-base
	  - Embedding dimension: 512
	  - Default candidates: 4
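
For readers unfamiliar with the midpoint method named above, here is a minimal sketch of the generic scheme on a scalar toy problem. It is not the inference script's solver: the function name `midpoint_solve` and the toy velocity field are assumptions, and in the real pipeline the velocity would come from the DiT denoiser acting on latent tensors.

```python
from typing import Callable


def midpoint_solve(velocity: Callable[[float, float], float],
                   x0: float, t0: float = 0.0, t1: float = 1.0, steps: int = 16) -> float:
    """Integrate dx/dt = velocity(t, x) from t0 to t1 with the explicit midpoint method."""
    h = (t1 - t0) / steps
    x, t = x0, t0
    for _ in range(steps):
        k1 = velocity(t, x)                       # slope at the start of the step
        x_mid = x + 0.5 * h * k1                  # half step to the midpoint
        x = x + h * velocity(t + 0.5 * h, x_mid)  # full step using the midpoint slope
        t += h
    return x


# Toy problem: dx/dt = x with x(0) = 1, whose exact solution at t = 1 is e
approx_e = midpoint_solve(lambda t, x: x, x0=1.0)
```

Each midpoint step evaluates the velocity function twice, so in this generic form 16 steps cost 32 evaluations; fewer steps trade accuracy for speed.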

	## Exporting Models

	Export scripts are in the `onnx_export/` directory.

	### Export All Models
	```bash
	python -m onnx_export.export_all --output_dir ./onnx_models
	```

	### Export Individual Components
	```bash
	# DiT Transformer (supports FP16 for 50% size reduction)
	python -m onnx_export.export_dit --output-dir ./onnx_models --model-id facebook/sam-audio-small
	python -m onnx_export.export_dit --output-dir ./onnx_models --model-id facebook/sam-audio-large --fp16 --device cuda

	# DACVAE (encoder + decoder)
	python -m onnx_export.export_dacvae --output-dir ./onnx_models --model-id facebook/sam-audio-small

	# T5 Text Encoder
	python -m onnx_export.export_t5 --output-dir ./onnx_models --model-id facebook/sam-audio-small

	# Vision Encoder
	python -m onnx_export.export_vision --model facebook/sam-audio-small --output ./onnx_models

	# PEAFrame Span Predictor
	python -m onnx_export.export_peaframe --output-dir ./onnx_models --verify

	# CLAP Reranking (audio + text encoders)
	python -m onnx_export.export_clap --output-dir ./onnx_models --verify
	```

	### FP16 Export (for large models)

	For the large model (sam-audio-large), use `--fp16 --device cuda` during DiT export to reduce size by 50%:

	```bash
	# Export DiT in FP16 (11.7GB → 5.9GB)
	python -m onnx_export.export_dit \
	--output-dir ./onnx_models_large_fp16 \
	--model-id facebook/sam-audio-large \
	--fp16 \
	--device cuda
	```

	The inference script automatically detects FP16 models and handles input conversion.
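
As an illustration of what such input handling might look like, the sketch below casts floating-point arrays to whatever precision a model input declares. In ONNX Runtime, `InferenceSession.get_inputs()` returns entries whose `.type` is a string such as `tensor(float16)` or `tensor(float)`; the helper `cast_for_input` and the example arrays are hypothetical and not taken from `onnx_inference.py`.

```python
import numpy as np

# Map ONNX Runtime type strings to NumPy dtypes (only the float types needed here)
ONNX_TO_NUMPY = {
    "tensor(float16)": np.float16,
    "tensor(float)": np.float32,
}


def cast_for_input(array: np.ndarray, onnx_type: str) -> np.ndarray:
    """Cast a floating-point array to the dtype an ONNX input declares; leave others unchanged."""
    target = ONNX_TO_NUMPY.get(onnx_type)
    if target is None or not np.issubdtype(array.dtype, np.floating):
        return array  # e.g. int64 token ids stay as they are
    return array.astype(target, copy=False)


# Stand-in tensors (shapes are arbitrary, chosen only for the example)
latents = np.zeros((1, 8, 4), dtype=np.float32)
fp16_latents = cast_for_input(latents, "tensor(float16)")
token_ids = cast_for_input(np.array([[1, 2, 3]], dtype=np.int64), "tensor(int64)")
```

With a live session, the second argument would be read from the model itself, for example `cast_for_input(x, session.get_inputs()[0].type)`.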

	## Export Scripts Reference

	| Script | Description |
	|--------|-------------|
	| `export_all.py` | Export all components at once |
	| `export_dit.py` | DiT transformer with FP16 support |
	| `export_dacvae.py` | DACVAE encoder and decoder |
	| `export_t5.py` | T5 text encoder |
	| `export_vision.py` | Vision encoder (CLIP-based) |
	| `export_peaframe.py` | PEAFrame span predictor + tokenizer |
	| `export_clap.py` | CLAP audio + text encoders for reranking |
	| `standalone_config.py` | Config classes for standalone export |

	## License

	SAM-Audio is released under the [CC-BY-NC 4.0 license](https://creativecommons.org/licenses/by-nc/4.0/). See the [original model repository](https://huggingface.co/facebook/sam-audio-small) for the full terms.

	## Acknowledgments

	Original model by [Meta AI Research](https://github.com/facebookresearch/sam-audio).