This is the demonstration page for the paper “CLAPSep: Leveraging Contrastive Pre-trained Model for Multi-Modal Query-Conditioned Target Sound Extraction”. It presents samples generated by the proposed method and by baseline methods.
Universal sound separation (USS) aims to extract arbitrary types of sounds from real-world recordings. This can be achieved by language-queried target sound extraction (TSE), which typically consists of two components: a query network that converts user queries into conditional embeddings, and a separation network that extracts the target sound accordingly. Existing methods commonly train models from scratch. As a consequence, substantial data and computational resources are required to make the randomly initialized model comprehend sound events and perform separation accordingly. In this paper, we propose to integrate pre-trained models into TSE models to address the above issue. To be specific, we tailor and adapt the powerful contrastive language-audio pre-trained model (CLAP) for USS, denoted as CLAPSep. CLAPSep also accepts flexible user inputs, taking both positive and negative user prompts of uni- and/or multi-modalities for target sound extraction. These key features of CLAPSep can not only enhance the extraction performance but also improve the versatility of its application. We provide extensive experiments on 5 diverse datasets to demonstrate the superior performance and zero- and few-shot generalizability of our proposed CLAPSep with fast training convergence, surpassing previous methods by a significant margin.
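To make the query side concrete, the sketch below shows one way positive and negative prompts of either modality could be turned into a single conditional embedding. It is an illustration only, not the paper's implementation. The function names, the averaging of text and audio embeddings, the zero vector for a missing prompt, and the concatenation of the positive and negative parts are all assumptions made for this example. In practice the embeddings would come from the CLAP text and audio encoders; here they are plain lists of floats.

```python
import math
from typing import List, Optional, Sequence

Embedding = Sequence[float]


def l2_normalize(vec: Embedding) -> List[float]:
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_modalities(
    text_emb: Optional[Embedding],
    audio_emb: Optional[Embedding],
    dim: int,
) -> List[float]:
    """Combine the text and/or audio embedding of one prompt (hypothetical rule).

    Assumption: the available embeddings are normalized and averaged;
    a prompt with no embedding at all becomes a zero vector.
    """
    given = [l2_normalize(e) for e in (text_emb, audio_emb) if e is not None]
    if not given:
        return [0.0] * dim
    return [sum(vals) / len(given) for vals in zip(*given)]


def build_condition(
    pos_text: Optional[Embedding] = None,
    pos_audio: Optional[Embedding] = None,
    neg_text: Optional[Embedding] = None,
    neg_audio: Optional[Embedding] = None,
    dim: int = 4,
) -> List[float]:
    """Return [positive part | negative part] as one conditional embedding.

    Assumption: the two parts are concatenated, so the result has length 2 * dim.
    """
    positive = fuse_modalities(pos_text, pos_audio, dim)
    negative = fuse_modalities(neg_text, neg_audio, dim)
    return positive + negative
```

With only a positive text prompt, the negative half of the condition is all zeros; supplying a negative prompt fills that half, so the separation network can be told both what to keep and what to suppress.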
We provide a comprehensive demonstration of our proposed method across various scenarios: 1) Language-queried target sound extraction for synthetic mixtures on AudioCaps, comparing against the strong baseline model, AudioSep; 2) Audio-queried target sound extraction for synthetic mixtures on ESC50, where we benchmark our method against the audio-queried baseline model, USS; and 3) Language-queried TSE for real-world recordings from Freesound (DCASE 2024 Task 9, evaluation set), where we compare our model with AudioSep to showcase its effectiveness in handling real-world data.
Note that negative queries are provided only to CLAPSep, as neither AudioSep nor USS supports them.
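The synthetic mixtures in scenarios 1 and 2 combine a target clip with an interfering clip. This page does not state the mixing recipe, so the sketch below only shows a common way to build such a mixture at a chosen signal-to-noise ratio. The function names and the SNR parameter are assumptions for illustration, and waveforms are represented as equal-length lists of samples.

```python
import math
from typing import List, Sequence, Tuple


def power(signal: Sequence[float]) -> float:
    """Mean squared amplitude of a waveform."""
    return sum(s * s for s in signal) / len(signal)


def mix_at_snr(
    target: Sequence[float],
    interference: Sequence[float],
    snr_db: float,
) -> Tuple[List[float], List[float]]:
    """Mix two equal-length waveforms at a given target-to-interference SNR.

    The interference is rescaled by a gain g chosen so that
    10 * log10(power(target) / power(g * interference)) == snr_db.
    Returns (mixture, scaled_interference).
    """
    if len(target) != len(interference):
        raise ValueError("waveforms must have the same length")
    p_target = power(target)
    p_interf = power(interference)
    if p_interf == 0:
        raise ValueError("interference is silent; SNR is undefined")
    gain = math.sqrt(p_target / (p_interf * 10 ** (snr_db / 10)))
    scaled = [gain * s for s in interference]
    mixture = [t + n for t, n in zip(target, scaled)]
    return mixture, scaled
```

In this picture, the "Unprocessed" clip corresponds to the mixture and the "Target" clip to the unscaled target waveform.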
Each sample below has four audio clips: Unprocessed, Target, Ours, and AudioSep.

Positive query | Negative query |
---|---|
A woman singing then choking followed by birds chirping | Music is playing as a person whistles |
A rooster clucking followed by a dog whimpering then a man talking and a dog barking | A loud thunder cracking |
An engine booms and hums with constant rattling | Food sizzling while cooking |
A woman speaks quietly, and man answers much louder, then she speaks again | A child yelling as a young boy talks during several slaps on a hard surface |
Water is running, gurgling and splashing, and a quiet thump occurs | A dog barking and growling while plastic rattles and clanks against a hard surface |
A loud burping followed by a laughing from young girls | A man speaking as a vehicle horn honks in the background and another man talks in the distance |
A sewing machine operating during several metal clacks | A telephone ringing |
A woman speaks followed by groaning and grunting | A herd of sheep baaing |
A man speaks followed by a toilet flush | A person whistling |
A man talking followed by wood sawing then paper shuffling | Several birds chirp with some hissing |
Six audio-queried samples, each with a positive and a negative audio query and four audio clips: Unprocessed, Target, Ours, and USS.
Each sample below has three audio clips: Unprocessed, Ours, and AudioSep.

Positive query | Negative query |
---|---|
the wind chimes are making a crisp and sweet sound. | |
the siren is alarming continuously. | |
in the forest, the birds are chirping incessantly. | a car is passing by a noisy road. |
a car is passing by a noisy road. | in the forest, the birds are chirping incessantly. |
thunder is raging and rumbling from afar. | the rain is falling to the surface. |
the rain is falling to the surface. | thunder is raging and rumbling from afar. |
a dog is barking in the distance. | the waves are beating against the shore. |
the waves are beating against the shore. | a dog is barking in the distance. |
a truck is driving down the road, making noise. | an alarm is ringing constantly. |
an alarm is ringing constantly. | a truck is driving down the road, making noise. |
@article{ma2024clapsep,
title={CLAPSep: Leveraging Contrastive Pre-trained Models for Multi-Modal Query-Conditioned Target Sound Extraction},
author={Ma, Hao and Peng, Zhiyuan and Li, Xu and Shao, Mingjie and Wu, Xixin and Liu, Ju},
journal={arXiv preprint arXiv:2402.17455},
year={2024}
}