CLAPSep: Leveraging Contrastive Pre-trained Model for Multi-Modal Query-Conditioned Target Sound Extraction

This is the demonstration page for the paper “CLAPSep: Leveraging Contrastive Pre-trained Model for Multi-Modal Query-Conditioned Target Sound Extraction”. It presents audio samples generated by the proposed method and by several baseline methods.

Links: arXiv, GitHub, Hugging Face Spaces

Abstract

Universal sound separation (USS) aims to extract arbitrary types of sounds from real-world recordings. This can be achieved by language-queried target sound extraction (TSE), which typically consists of two components: a query network that converts user queries into conditional embeddings, and a separation network that extracts the target sound accordingly. Existing methods commonly train models from scratch. As a consequence, substantial data and computational resources are required to make the randomly initialized model comprehend sound events and perform separation accordingly. In this paper, we propose to integrate pre-trained models into TSE models to address the above issue. To be specific, we tailor and adapt the powerful contrastive language-audio pre-trained model (CLAP) for USS, denoted as CLAPSep. CLAPSep also accepts flexible user inputs, taking both positive and negative user prompts of uni- and/or multi-modalities for target sound extraction. These key features of CLAPSep can not only enhance the extraction performance but also improve the versatility of its application. We provide extensive experiments on 5 diverse datasets to demonstrate the superior performance and zero- and few-shot generalizability of our proposed CLAPSep with fast training convergence, surpassing previous methods by a significant margin.
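As a rough illustration of how positive and negative queries might be turned into a single conditional embedding for the separation network, here is a minimal sketch. It assumes that each query has already been encoded into a fixed-size vector by a query network, and that the two vectors are concatenated, with a missing query replaced by zeros. The function name `build_condition` and the concatenation rule are assumptions made for illustration; this page does not specify how CLAPSep combines its queries.

```python
from typing import List, Optional, Sequence


def build_condition(
    pos: Optional[Sequence[float]],
    neg: Optional[Sequence[float]],
    dim: int,
) -> List[float]:
    """Combine positive and negative query embeddings into one condition vector.

    Hypothetical helper: concatenates the two embeddings and substitutes a
    zero vector for whichever query is absent. This is an assumed combination
    rule, not one confirmed by the page.
    """
    if pos is None and neg is None:
        raise ValueError("at least one of the positive or negative query is required")
    for name, emb in (("positive", pos), ("negative", neg)):
        if emb is not None and len(emb) != dim:
            raise ValueError(f"{name} embedding must have length {dim}")

    zeros = [0.0] * dim
    pos_part = list(pos) if pos is not None else zeros
    neg_part = list(neg) if neg is not None else zeros
    # The result always has length 2 * dim, regardless of which queries exist.
    return pos_part + neg_part
```

Under this assumption the separation network always receives a vector of the same size, whether the user supplies only a positive query, only a negative query, or both.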

Demos

We provide a comprehensive demonstration of our proposed method across various scenarios: 1) Language-queried target sound extraction for synthetic mixtures on AudioCaps, comparing against the strong baseline model, AudioSep; 2) Audio-queried target sound extraction for synthetic mixtures on ESC50, where we benchmark our method against the audio-queried baseline model, USS; and 3) Language-queried TSE for real-world recordings from Freesound (DCASE 2024 Task 9, evaluation set), where we compare our model with AudioSep to showcase its effectiveness in handling real-world data.

Note that the negative queries are provided only to CLAPSep, as neither AudioSep nor USS supports negative queries.
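For readers unfamiliar with how synthetic mixtures of the kind used in scenarios 1 and 2 are typically built, the sketch below mixes a target signal with an interfering signal at a chosen signal-to-noise ratio. The function name `mix_at_snr` and the SNR-based scaling are assumptions made for illustration; the page does not state the mixing procedure or the SNR values used for these demos.

```python
import math
from typing import List, Sequence


def mix_at_snr(
    target: Sequence[float],
    interference: Sequence[float],
    snr_db: float,
) -> List[float]:
    """Return target + scaled interference so the mixture has the given SNR.

    Hypothetical helper: both signals are truncated to the shorter length,
    and the interference is rescaled so that
    10 * log10(target_power / interference_power) == snr_db.
    """
    n = min(len(target), len(interference))
    if n == 0:
        raise ValueError("both signals must be non-empty")
    t = list(target[:n])
    i = list(interference[:n])

    # Average power of each signal over the shared length.
    p_t = sum(x * x for x in t) / n
    p_i = sum(x * x for x in i) / n
    if p_t == 0.0 or p_i == 0.0:
        raise ValueError("signals must have non-zero power")

    # Gain applied to the interference to reach the requested SNR.
    gain = math.sqrt(p_t / (p_i * 10 ** (snr_db / 10)))
    return [a + gain * b for a, b in zip(t, i)]
```

In this setup the "Unprocessed" audio corresponds to the returned mixture and the "Target" audio to the clean `target` signal before mixing.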

Language-queried TSE for synthetic mixtures on AudioCaps

Audio columns: Unprocessed, Target, Ours, AudioSep
Positive: A woman singing then choking followed by birds chirping
Negative: Music is playing as a person whistles
Positive: A rooster clucking followed by a dog whimpering then a man talking and a dog barking
Negative: A loud thunder cracking
Positive: An engine booms and hums with constant rattling
Negative: Food sizzling while cooking
Positive: A woman speaks quietly, and man answers much louder, then she speaks again
Negative: A child yelling as a young boy talks during several slaps on a hard surface
Positive: Water is running, gurgling and splashing, and a quiet thump occurs
Negative: A dog barking and growling while plastic rattles and clanks against a hard surface
Positive: A loud burping followed by a laughing from young girls
Negative: A man speaking as a vehicle horn honks in the background and another man talks in the distance
Positive: A sewing machine operating during several metal clacks
Negative: A telephone ringing
Positive: A woman speaks followed by groaning and grunting
Negative: A herd of sheep baaing
Positive: A man speaks followed by a toilet flush
Negative: A person whistling
Positive: A man talking followed by wood sawing then paper shuffling
Negative: Several birds chirp with some hissing

Audio-queried TSE for synthetic mixtures on ESC50

Audio columns: Unprocessed, Target, Ours, USS
Six examples, each conditioned on a positive and a negative audio query (audio clips).

Language-queried TSE for real-world recordings from Freesound (DCASE 2024 Task 9, real evaluation set)

Audio columns: Unprocessed, Ours, AudioSep
Positive: the wind chimes are making a crisp and sweet sound.
Positive: the siren is alarming continuously.
Positive: in the forest, the birds are chirping incessantly.
Negative: a car is passing by a noisy road.
Positive: a car is passing by a noisy road.
Negative: in the forest, the birds are chirping incessantly.
Positive: thunder is raging and rumbling from afar.
Negative: the rain is falling to the surface.
Positive: the rain is falling to the surface.
Negative: thunder is raging and rumbling from afar.
Positive: a dog is barking in the distance.
Negative: the waves are beating against the shore.
Positive: the waves are beating against the shore.
Negative: a dog is barking in the distance.
Positive: a truck is driving down the road, making noise.
Negative: an alarm is ringing constantly.
Positive: an alarm is ringing constantly.
Negative: a truck is driving down the road, making noise.

Citation

@article{ma2024clapsep,
  title={CLAPSep: Leveraging Contrastive Pre-trained Models for Multi-Modal Query-Conditioned Target Sound Extraction},
  author={Ma, Hao and Peng, Zhiyuan and Li, Xu and Shao, Mingjie and Wu, Xixin and Liu, Ju},
  journal={arXiv preprint arXiv:2402.17455},
  year={2024}
}