Audio Super-Resolution with Latent Bridge Models

Anonymous Authors · Under Submission

Abstract

Audio super-resolution (SR), i.e., upsampling a low-resolution (LR) waveform to its high-resolution (HR) version, has recently been explored with diffusion and bridge models. However, previous methods often suffer from sub-optimal upsampling quality due to their uninformative generation prior. Towards high-quality audio super-resolution, we present a new system with latent bridge models (LBMs): we compress the audio waveform into a continuous latent space and design an LBM to enable a latent-to-latent generation process that naturally matches the LR-to-HR upsampling process, thereby fully exploiting the instructive prior information contained in the LR waveform. To further improve training results despite the limited availability of HR samples, we introduce frequency-aware LBMs, which take the prior and target frequencies as model input, enabling LBMs to explicitly learn an any-to-any upsampling process at the training stage. Furthermore, we design cascaded LBMs and present two prior augmentation strategies, making the first attempt to unlock audio upsampling beyond 48 kHz and enabling a seamless cascaded SR process that provides higher flexibility for audio post-production. Comprehensive experiments on the VCTK, ESC-50, and Song-Describer benchmark datasets and two internal test sets demonstrate that we achieve state-of-the-art objective and perceptual quality for any-to-48 kHz SR across speech, audio, and music signals, and set the first record for any-to-192 kHz audio SR.
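To make the any-to-any setting concrete, here is a minimal sketch of how an (LR prior, HR target) pair could be simulated from one full-band recording by band-limiting it to the Nyquist frequency of a chosen prior and target sampling rate. This is an illustrative assumption, not the authors' data pipeline: the function names `lowpass_fft` and `make_training_pair`, the ideal brick-wall FFT filter, and the choice to keep both signals on the original sample grid are all hypothetical.

```python
import numpy as np


def lowpass_fft(wave, sample_rate, cutoff_hz):
    """Ideal brick-wall low-pass: zero every FFT bin above cutoff_hz."""
    spectrum = np.fft.rfft(wave)
    freqs = np.fft.rfftfreq(len(wave), d=1.0 / sample_rate)
    spectrum[freqs > cutoff_hz] = 0.0
    # Pass n explicitly so odd-length inputs keep their length.
    return np.fft.irfft(spectrum, n=len(wave))


def make_training_pair(wave, sample_rate, prior_sr, target_sr):
    """Return (lr_prior, hr_target), both on the original sample grid.

    Each signal only keeps content below the Nyquist frequency of its
    nominal rate (prior_sr / 2 and target_sr / 2, respectively).
    """
    if not (0 < prior_sr < target_sr <= sample_rate):
        raise ValueError("expected 0 < prior_sr < target_sr <= sample_rate")
    lr_prior = lowpass_fft(wave, sample_rate, prior_sr / 2)
    hr_target = lowpass_fft(wave, sample_rate, target_sr / 2)
    return lr_prior, hr_target


if __name__ == "__main__":
    sr = 48_000
    t = np.arange(sr) / sr  # one second of audio
    # A 1 kHz tone plus a quieter 10 kHz tone.
    wave = np.sin(2 * np.pi * 1_000 * t) + 0.5 * np.sin(2 * np.pi * 10_000 * t)
    lr, hr = make_training_pair(wave, sr, prior_sr=8_000, target_sr=48_000)
    # The 8 kHz prior keeps only content below 4 kHz, so the 10 kHz tone is gone.
    print("LR peak:", np.abs(lr).max(), "HR peak:", np.abs(hr).max())
```

In a frequency-aware setup of the kind the abstract describes, `prior_sr` and `target_sr` would be passed to the model alongside the two signals; drawing them at random per example is one plausible way to cover many upsampling ratios, again as an assumption rather than a description of the actual training recipe.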

🎧 For the best experience: Please use high-quality headphones to enjoy the full fidelity of the audio demos.
🔍 To compare all spectrograms of one audio sample at once: Try zooming out of the page with "Ctrl" + "-" in your browser.

ESC-50, 8kHz input

Low


Ours


AudioSR

Low


Ours


AudioSR

Low


Ours


AudioSR

ESC-50, 16kHz input

Low


Ours


AudioSR

Low (Volume may be high)


Ours (Volume may be high)


AudioSR (Volume may be high)

Low


Ours


AudioSR

Song-Describer-Dataset, 8kHz input

Low


Ours


AudioSR

Low


Ours


AudioSR

Low


Ours


AudioSR

Song-Describer-Dataset, 16kHz input

Low


Ours


AudioSR

Low


Ours


AudioSR

Low


Ours


AudioSR

VCTK, 8kHz input

Audio 1

Ground Truth


Input


Nu-Wave2


NVSR


Frepainter


Flowhigh


AudioSR


Ours

Audio 2

Ground Truth


Input


Nu-Wave2


NVSR


Frepainter


Flowhigh


AudioSR


Ours

AudioLDM2 on AudioCaps, 16kHz input

Low


Ours


AudioSR

Low


Ours


AudioSR

Low


Ours


AudioSR

QA-MDT on MusicCaps, 16kHz input

Low


Ours


AudioSR

Low


Ours


AudioSR

Low


Ours


AudioSR

MaskGCT on LibriSpeech, 24kHz input

Low


Ours


AudioSR

Low


Ours


AudioSR

Low


Ours


AudioSR

Comparison with A2SB's Demo Page, 8kHz input

Audio 1

Input

AudioSR

A2SB

AUDIT

CQTDiff

Ours

Audio 2

Input

AudioSR

A2SB

AUDIT

CQTDiff

Ours