Model Overview

Description:

A2SB uses a UNet architecture to inpaint audio spectrograms. It can fill in missing frequency bands above 4 kHz (bandwidth extension) or fill in short temporal gaps (currently gaps shorter than 1 second). This model is for non-commercial use only.
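As a rough illustration of the two inpainting modes, the sketch below works out which part of a spectrogram would be treated as missing. It is not taken from the A2SB code: the STFT window size (N_FFT), the hop length (HOP), the uncentered framing, and both function names are assumptions made for this example; only the 44.1 kHz sample rate and the 4 kHz cutoff come from this card.

```python
import math

SAMPLE_RATE = 44_100  # stated in this model card
N_FFT = 2048          # hypothetical STFT window size (assumption)
HOP = 512             # hypothetical STFT hop length (assumption)


def bandwidth_extension_bins(cutoff_hz, sample_rate=SAMPLE_RATE, n_fft=N_FFT):
    """Return (first_masked_bin, num_bins).

    Bins from first_masked_bin upward have a center frequency at or
    above cutoff_hz and would be filled in by bandwidth extension.
    """
    num_bins = n_fft // 2 + 1
    hz_per_bin = sample_rate / n_fft
    first_masked = min(num_bins, math.ceil(cutoff_hz / hz_per_bin))
    return first_masked, num_bins


def gap_frames(start_s, end_s, sample_rate=SAMPLE_RATE, hop=HOP):
    """Return the half-open [first, last) frame range covering a time gap.

    Assumes frame k starts at sample k * hop (no centering or padding).
    """
    if not 0 <= start_s < end_s:
        raise ValueError("need 0 <= start_s < end_s")
    first = math.floor(start_s * sample_rate / hop)
    last = math.ceil(end_s * sample_rate / hop)
    return first, last


print(bandwidth_extension_bins(4000))  # (186, 1025)
print(gap_frames(1.0, 1.5))            # (86, 130)
```

With these assumed settings, a 4 kHz cutoff leaves the lowest 186 of 1025 frequency bins untouched, and a half-second gap spans 44 frames.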

License/Terms of Use:

The model is provided under the NVIDIA OneWay NonCommercial License.

The code is provided under the NVIDIA Source Code License - Non Commercial. Some components are adapted from other sources: the training code is adapted from I2SB, which is under the NVIDIA Source Code License - Non Commercial, and the model architecture is adapted from Improved Diffusion, which is under the MIT License.

Deployment Geography:

Global

Use Case:

Research purposes pertaining to audio enhancement and generative modeling, as well as for general creative use such as bandwidth extension and inpainting short segments of missing audio.

Release Date:

GitHub, 06/27/2025, via github.com/NVIDIA/diffusion-audio-restoration

Reference(s):

Model Architecture:

Architecture Type: CNN with interleaved Self-Attention Layers

Network Architecture: UNet

Input:

Input Type(s): Audio

Input Format(s): WAV/MP3/FLAC

Input Parameters: One-Dimensional (1D)

Other Properties Related to Input: All audio is assumed to be single-channel, 44.1 kHz. For editing, also provide either a frequency cutoff for bandwidth extension (the model regenerates content above this frequency) or start/end timestamps for segment inpainting.
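Because the model assumes single-channel 44.1 kHz audio, it can help to inspect a file before passing it in. The following is a minimal sketch using only Python's standard wave module, so it covers WAV input only; describe_wav and make_silent_wav are hypothetical helpers written for this example and are not part of the released code.

```python
import io
import wave

EXPECTED_RATE = 44_100   # from this model card
EXPECTED_CHANNELS = 1    # from this model card


def describe_wav(source):
    """Read a WAV header and report whether it matches the expected input.

    `source` may be a file path or a binary file-like object.
    """
    with wave.open(source, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        seconds = wav.getnframes() / rate
    return {
        "channels": channels,
        "sample_rate": rate,
        "seconds": seconds,
        "needs_conversion": channels != EXPECTED_CHANNELS or rate != EXPECTED_RATE,
    }


def make_silent_wav(channels, rate, seconds):
    """Build an in-memory 16-bit PCM WAV of silence, for demonstration only."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * channels * int(rate * seconds))
    buffer.seek(0)
    return buffer


# A stereo 48 kHz file would need downmixing and resampling first.
print(describe_wav(make_silent_wav(2, 48_000, 1))["needs_conversion"])  # True
# A mono 44.1 kHz file already matches the expected input.
print(describe_wav(make_silent_wav(1, 44_100, 2))["needs_conversion"])  # False
```

Files flagged with needs_conversion would have to be downmixed to mono and resampled to 44.1 kHz with an audio tool of your choice; MP3 and FLAC inputs need a separate decoder.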

Output:

Output Type(s): Audio

Output Format(s): WAV

Output Parameters: One-Dimensional (1D)

Other Properties Related to Output: Single-channel 44.1 kHz output file. Maximum audio output length is 1 hour.

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration:

Runtime Engine(s):

PyTorch 2.2.2 (CUDA 12.1, cuDNN 8)

Supported Hardware Microarchitecture Compatibility:

NVIDIA Ampere
NVIDIA Blackwell
NVIDIA Jetson
NVIDIA Hopper
NVIDIA Lovelace
NVIDIA Pascal
NVIDIA Turing
NVIDIA Volta

Supported Operating System(s): Linux

Model Versions:

Training and Evaluation Datasets:

Training Datasets:

The Properties column below shows each dataset's total duration before license, quality, and sampling rate filtering. Our model training code ingests only raw audio samples; none of the additional labels provided in the datasets listed below are used for training.

Dataset Name	Collection Method	Labeling Method	Properties
FMA	Human	N/A	5257.0 hrs
Medley-solos-DB	Human	N/A	17.8 hrs
MUSAN	Human	N/A	42.6 hrs
Musical Instrument	Human	N/A	16.2 hrs
MusicNet	Human	N/A	34.5 hrs
Slakh	Hybrid	N/A	118.3 hrs
FreeSound	Human	N/A	4576.6 hrs
FSD50K	Human	N/A	75.6 hrs
GTZAN	Human	N/A	8.3 hrs
NSynth	Human	N/A	340.0 hrs
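
To put the training table in perspective, the short sketch below simply adds up the listed durations. The numbers are copied from the Properties column, so they describe the data before license, quality, and sampling rate filtering; the amount actually used for training is not stated here and is not estimated.

```python
# Durations in hours, in the same row order as the training table above.
TRAINING_HOURS = [5257.0, 17.8, 42.6, 16.2, 34.5, 118.3, 4576.6, 75.6, 8.3, 340.0]

total_hours = sum(TRAINING_HOURS)

# Share contributed by the two largest sources (FMA and FreeSound).
two_largest = sorted(TRAINING_HOURS, reverse=True)[:2]
share = sum(two_largest) / total_hours

print(f"total before filtering: {total_hours:.1f} hrs")  # 10486.9 hrs
print(f"two largest sources: {share:.1%} of the total")
```

FMA and FreeSound together account for the large majority of the pre-filtering audio, with the remaining eight datasets contributing the rest.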

Evaluation Datasets:

Dataset Name	Collection Method	Labeling Method	Properties
AAM: Artificial Audio Multitracks Dataset	Automated	N/A	4 hrs
Maestro	Human	N/A	199.2 hrs
MTD	Human	N/A	0.9 hrs
CC-Mixter	Human	N/A	3.2 hrs

Inference:

Engine: PyTorch

Test Hardware:

NVIDIA Ampere

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report security vulnerabilities or NVIDIA AI Concerns here.