GenerativeTSE_demo

Enhancing Intelligibility for Generative Target Speech Extraction via Joint Optimization with Target Speaker ASR

This is the demonstration page for the paper “Enhancing Intelligibility for Generative Target Speech Extraction via Joint Optimization with Target Speaker ASR”. It presents samples generated by the proposed method alongside samples from baseline methods.

arXiv

Main figure

Abstract

Target speech extraction (TSE) isolates the speech of a specific speaker from a multi-talker overlapped speech mixture. Most existing TSE models are based on discriminative methods, which typically predict a time-frequency spectrogram mask for the target speech. However, imperfections in the mask often lead to over-suppression of target speech or under-suppression of non-target speech, degrading perceptual quality. Generative methods, on the other hand, re-synthesize target speech based on the mixture and target speaker cues, achieving superior perceptual quality. However, these methods often neglect speech intelligibility, causing changes to or loss of semantic content in the re-synthesized speech. Inspired by the Whisper model’s success in target speaker ASR, we propose a generative target speech extraction framework based on the pre-trained Whisper model, integrating semantic modeling with flow-based acoustic modeling for both high intelligibility and perceptual quality. Results on multiple evaluation benchmarks show that the proposed method outperforms existing generative and discriminative baselines.
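
The masking failure mode mentioned in the abstract can be illustrated with a toy example. The sketch below is not the paper's method or any baseline's implementation: the spectrogram values, the function names, and the assumption that magnitudes add in the mixture are all hypothetical simplifications chosen for illustration.

```python
# Toy illustration of mask-based (discriminative) extraction.
# All values and names are hypothetical; none come from the paper.

def apply_mask(mixture, mask):
    """Multiply a magnitude spectrogram by a mask, bin by bin."""
    return [[m * w for m, w in zip(mix_row, mask_row)]
            for mix_row, mask_row in zip(mixture, mask)]

def ideal_ratio_mask(target, interferer):
    """Oracle mask: the share of each time-frequency bin owned by the target."""
    return [[t / (t + i) if (t + i) > 0 else 0.0
             for t, i in zip(t_row, i_row)]
            for t_row, i_row in zip(target, interferer)]

# 2 frequency bins x 3 time frames of made-up magnitudes.
target = [[4.0, 0.0, 2.0], [1.0, 3.0, 0.0]]
interferer = [[0.0, 2.0, 2.0], [3.0, 1.0, 5.0]]

# Simplifying assumption: magnitudes add in the mixture.
mixture = [[t + i for t, i in zip(t_row, i_row)]
           for t_row, i_row in zip(target, interferer)]

# With the oracle mask, the target is recovered exactly in this toy setup.
oracle_mask = ideal_ratio_mask(target, interferer)
oracle_estimate = apply_mask(mixture, oracle_mask)

# An imperfect mask: one bin too low, one bin too high.
imperfect_mask = [row[:] for row in oracle_mask]
imperfect_mask[0][0] = 0.5  # over-suppression: target energy is removed
imperfect_mask[1][2] = 0.5  # under-suppression: interferer energy leaks through
imperfect_estimate = apply_mask(mixture, imperfect_mask)

print("target:            ", target)
print("oracle estimate:   ", oracle_estimate)
print("imperfect estimate:", imperfect_estimate)
```

In the imperfect estimate, the first bin loses half of the target's energy and the last bin contains energy where the target is silent, which are the two kinds of error the abstract attributes to mask imperfections.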

Demos

We provide demonstrations on speech samples from Libri2Mix and WSJ0-2mix.

Speech samples from Libri2Mix

Unprocessed Enrollment Ours Generative Discriminative Baseline Ground Truth
Text: THIS WAS SO SWEET A LADY SIR AND IN SOME MANNER I DO THINK SHE DIED
Text: LENGTH OF SERVICE FOURTEEN YEARS THREE MONTHS AND FIVE DAYS
Text: IN EIGHTEEN SIXTY TWO A LAW WAS ENACTED WITH THE PURPOSE OF SUPPRESSING PLURAL MARRIAGE AND AS HAD BEEN PREDICTED IN THE NATIONAL SENATE PRIOR TO ITS PASSAGE IT LAY FOR MANY YEARS A DEAD LETTER
Text: WE HAVE ALWAYS THOUGHT THAT IT WAS SOMETIMES A COURAGEOUS ACT AND AT LEAST A SIMPLE AND USEFUL DEED WORTHY OF THE SYMPATHETIC ATTENTION WHICH DUTY ACCEPTED AND FULFILLED MERITS
Text: HIS HOUSEKEEPER HAD THE MANAGEMENT OF EVERYTHING SHE NEVER ALLOWED HIM TO BE IN NEED OF ANYTHING AND SHE GAVE NO ACCOUNT OF HIS MONEY WHICH SHE KEPT ALTOGETHER BECAUSE HE NEVER ASKED HER TO RENDER ANY ACCOUNTS
Text: AND HE LEANED AGAINST THE WALL LOST IN REVERIE

Speech samples from WSJ0-2mix

Unprocessed Enrollment Ours Generative Discriminative Baseline Ground Truth
Text: When the federal pension insurer stepped in this fund had just seven thousand seven hundred dollars in it to meet two hundred thirty million dollars in obligations
Text: Most European traders were reportedly staying out of action until the trade figures are released
Text: R. L. I. Corporation a Peoria Illinois based insurance holding company will begin trading Friday on the Big Board under the symbol R. L. I.
Text: In certain cases ,COMMA the cards are given free to subscribers .PERIOD
Text: Accepted bids ranged from six point two percent to six point two two five percent
Text: [tongue_click] [loud_breath] No one at the State Department wants to let spies in