This is the demonstration page for the paper “Enhancing Intelligibility for Generative Target Speech Extraction via Joint Optimization with Target Speaker ASR”. It presents samples generated by the proposed method and by several baseline methods.
Target speech extraction (TSE) isolates the speech of a specific speaker from a multi-talker overlapped speech mixture. Most existing TSE models are based on discriminative methods, which typically predict a time-frequency spectrogram mask for the target speech. However, imperfections in the mask often lead to over-suppression of the target speech or under-suppression of non-target speech, degrading perceptual quality. Generative methods, on the other hand, re-synthesize the target speech from the mixture and target speaker cues, achieving superior perceptual quality. However, these methods often neglect speech intelligibility, so semantic content can be changed or lost in the re-synthesized speech. Inspired by the Whisper model’s success in target speaker ASR, we propose a generative target speech extraction framework based on the pre-trained Whisper model that integrates semantic modeling with flow-based acoustic modeling to achieve both high intelligibility and high perceptual quality. Results on multiple evaluation benchmarks show that the proposed method outperforms existing generative and discriminative baselines.
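To make the masking idea behind discriminative methods concrete, the following is a minimal sketch and not the method of the paper or of any baseline shown here. It assumes two pure tones as stand-ins for the two speakers, a hand-rolled Hann-windowed STFT, and an oracle ratio mask computed from the known sources. The helper names `stft_mag` and `ratio_mask` are hypothetical. Perturbing the mask illustrates how mask errors translate into suppression errors in the estimate.

```python
import numpy as np

def stft_mag(x, frame=256, hop=128):
    """Magnitude spectrogram from a Hann-windowed framed FFT (frames x bins)."""
    window = np.hanning(frame)
    n_frames = 1 + (len(x) - frame) // hop
    frames = np.stack(
        [x[i * hop : i * hop + frame] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))

def ratio_mask(target_mag, interferer_mag, eps=1e-8):
    """Oracle ratio mask: share of each time-frequency bin owed to the target."""
    return target_mag / (target_mag + interferer_mag + eps)

# Toy "speakers" (assumption): two tones at different frequencies.
sr = 8000
t = np.arange(sr) / sr
target = np.sin(2 * np.pi * 440 * t)
interferer = np.sin(2 * np.pi * 1200 * t)
mixture = target + interferer

tgt_mag = stft_mag(target)
int_mag = stft_mag(interferer)
mix_mag = stft_mag(mixture)

# Masking: scale each mixture bin by the estimated target share.
mask = ratio_mask(tgt_mag, int_mag)
estimate = mask * mix_mag

# An imperfect mask: the oracle mask plus noise, clipped back to [0, 1].
rng = np.random.default_rng(0)
noisy_mask = np.clip(mask + 0.3 * rng.standard_normal(mask.shape), 0.0, 1.0)
noisy_estimate = noisy_mask * mix_mag

# Mean squared error of each estimate against the true target magnitude.
err_oracle = np.mean((estimate - tgt_mag) ** 2)
err_noisy = np.mean((noisy_estimate - tgt_mag) ** 2)
print(f"oracle mask error: {err_oracle:.4f}")
print(f"noisy mask error:  {err_noisy:.4f}")
```

In this toy setting the perturbed mask both attenuates bins that belong to the target and lets through bins that belong to the interferer, which is the over-/under-suppression effect described above. A real discriminative TSE model would predict the mask with a neural network conditioned on the enrollment utterance rather than compute it from the known sources.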
We provide demonstrations on speech samples from Libri2Mix and WSJ0-2mix.
Unprocessed | Enrollment | Ours Generative | Discriminative Baseline | Ground Truth |
---|---|---|---|---|
Text: THIS WAS SO SWEET A LADY SIR AND IN SOME MANNER I DO THINK SHE DIED | ||||
Text: LENGTH OF SERVICE FOURTEEN YEARS THREE MONTHS AND FIVE DAYS | ||||
Text: IN EIGHTEEN SIXTY TWO A LAW WAS ENACTED WITH THE PURPOSE OF SUPPRESSING PLURAL MARRIAGE AND AS HAD BEEN PREDICTED IN THE NATIONAL SENATE PRIOR TO ITS PASSAGE IT LAY FOR MANY YEARS A DEAD LETTER | ||||
Text: WE HAVE ALWAYS THOUGHT THAT IT WAS SOMETIMES A COURAGEOUS ACT AND AT LEAST A SIMPLE AND USEFUL DEED WORTHY OF THE SYMPATHETIC ATTENTION WHICH DUTY ACCEPTED AND FULFILLED MERITS | ||||
Text: HIS HOUSEKEEPER HAD THE MANAGEMENT OF EVERYTHING SHE NEVER ALLOWED HIM TO BE IN NEED OF ANYTHING AND SHE GAVE NO ACCOUNT OF HIS MONEY WHICH SHE KEPT ALTOGETHER BECAUSE HE NEVER ASKED HER TO RENDER ANY ACCOUNTS | ||||
Text: AND HE LEANED AGAINST THE WALL LOST IN REVERIE | ||||
Unprocessed | Enrollment | Ours Generative | Discriminative Baseline | Ground Truth |
---|---|---|---|---|
Text: When the federal pension insurer stepped in this fund had just seven thousand seven hundred dollars in it to meet two hundred thirty million dollars in obligations | ||||
Text: Most European traders were reportedly staying out of action until the trade figures are released | ||||
Text: R\. L\. I\. Corporation a Peoria Illinois based insurance holding company will begin trading Friday on the Big Board under the symbol R\. L\. I\. | ||||
Text: In certain cases \,COMMA the cards are given free to subscribers \.PERIOD | ||||
Text: Accepted bids ranged from six point two percent to six point two two five percent | ||||
Text: [tongue_click] [loud_breath] No one at the State Department wants to let spies in | ||||