In light of the widespread deployment of Automatic Speech Recognition (ASR) systems, their security has received more attention than ever before, primarily due to the susceptibility of the underlying Deep Neural Networks to adversarial examples. Previous studies have shown that surreptitiously crafted adversarial perturbations can manipulate speech recognition systems into producing malicious commands. These attack methods mostly add noise perturbations under $\ell_p$ norm constraints, inevitably leaving behind artifacts of manual modification. Recent research has alleviated this limitation by manipulating style vectors to synthesize adversarial examples from Text-to-Speech (TTS) audio. However, style modifications driven solely by the optimization objective significantly reduce the controllability and editability of audio styles. In this paper, we propose an attack on ASR systems based on user-customized style transfer. We first test the effect of a Style Transfer Attack (STA), which applies style transfer and adversarial perturbation in sequential order. Then, as an improvement, we propose an iterative Style Code Attack (SCA) that maintains audio quality. Experimental results show that our method meets the need for user-customized styles and achieves an attack success rate of 82%, while preserving sound naturalness according to our user study.
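A minimal sketch of the iterative style-code optimization the abstract describes, assuming a differentiable TTS synthesizer that exposes a style vector and a differentiable ASR loss; `tts_synthesize` and `asr_ctc_loss` are hypothetical placeholders, not the authors' implementation.

```python
# Hypothetical sketch of an iterative style-code attack (SCA): optimize a TTS
# style vector so the synthesized audio is transcribed as an attacker-chosen
# command, while a distance term keeps the style close to the user's choice.
import torch

def style_code_attack(style_code, target_text, tts_synthesize, asr_ctc_loss,
                      steps=500, lr=1e-2, style_weight=0.1):
    delta = torch.zeros_like(style_code, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        audio = tts_synthesize(style_code + delta)        # differentiable TTS
        loss = asr_ctc_loss(audio, target_text)           # push ASR toward target
        loss = loss + style_weight * delta.norm()          # stay near the chosen style
        opt.zero_grad()
        loss.backward()
        opt.step()
    return style_code + delta.detach()
```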
https://arxiv.org/abs/2405.09470
Self-supervised learning has shown great success in Speech Recognition. However, it has been observed that fine-tuning all layers of the learned model leads to lower performance compared to resetting top layers. This phenomenon is attributed to the "autoencoder" behavior: top layers contain information closer to the input and are less suitable for tasks that require linguistic information, such as Speech Recognition. To better our understanding of this behavior, we propose to study the evolution of high-level information within the model during pretraining. We focus on the HuBERT model, which exhibits a less pronounced "autoencoder" behavior. By experimentally exploring various factors that may have an impact, we aim to improve the training procedure and enhance the top layers of HuBERT for high-level tasks. Furthermore, our experiments demonstrate that these improvements in the training procedure result in faster convergence and competitive performance on downstream tasks.
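The "resetting top layers" recipe that the abstract contrasts with full fine-tuning can be illustrated as follows; this is a hedged sketch assuming the Hugging Face `transformers` HubertModel layout, not the authors' exact procedure.

```python
# Re-initialize the last K transformer blocks of a pre-trained HuBERT encoder
# before fine-tuning on a downstream (e.g., ASR) task.
import torch.nn as nn
from transformers import HubertModel

def reset_top_layers(model: HubertModel, num_layers: int = 3) -> HubertModel:
    def reinit(module):
        if isinstance(module, nn.Linear):
            nn.init.normal_(module.weight, std=0.02)
            if module.bias is not None:
                nn.init.zeros_(module.bias)
        elif isinstance(module, nn.LayerNorm):
            nn.init.ones_(module.weight)
            nn.init.zeros_(module.bias)

    for layer in model.encoder.layers[-num_layers:]:   # top K encoder blocks
        layer.apply(reinit)
    return model

model = reset_top_layers(HubertModel.from_pretrained("facebook/hubert-base-ls960"))
```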
https://arxiv.org/abs/2405.08402
Large language models (LLMs) have shown incredible proficiency in performing tasks that require semantic understanding of natural language instructions. Recently, many works have further expanded this capability to perceive multimodal audio and text inputs, but their capabilities are often limited to specific fine-tuned tasks such as automatic speech recognition and translation. We therefore develop SpeechVerse, a robust multi-task training and curriculum learning framework that combines pre-trained speech and text foundation models via a small set of learnable parameters, while keeping the pre-trained models frozen during training. The models are instruction finetuned using continuous latent representations extracted from the speech foundation model to achieve optimal zero-shot performance on a diverse range of speech processing tasks using natural language instructions. We perform extensive benchmarking that includes comparing our model performance against traditional baselines across several datasets and tasks. Furthermore, we evaluate the model's capability for generalized instruction following by testing on out-of-domain datasets, novel prompts, and unseen tasks. Our empirical experiments reveal that our multi-task SpeechVerse model is even superior to conventional task-specific baselines on 9 out of the 11 tasks.
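A hedged sketch of the general idea of combining frozen speech and text foundation models through a small set of learnable parameters: a projector maps (and downsamples) frozen speech-encoder states into the LLM's embedding space. Module names and dimensions are illustrative, not the SpeechVerse implementation.

```python
# Only this projector is trained; the speech encoder and the LLM stay frozen.
# Its outputs are prepended to the instruction token embeddings of the LLM.
import torch.nn as nn

class SpeechToLLMProjector(nn.Module):
    def __init__(self, speech_dim=1024, llm_dim=4096, downsample=4):
        super().__init__()
        self.downsample = downsample
        self.proj = nn.Sequential(
            nn.Linear(speech_dim * downsample, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, speech_feats):                    # (B, T, speech_dim)
        b, t, d = speech_feats.shape
        t = t - t % self.downsample                     # trim to a multiple of the stride
        x = speech_feats[:, :t].reshape(b, t // self.downsample, d * self.downsample)
        return self.proj(x)                             # (B, T/downsample, llm_dim)
```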
https://arxiv.org/abs/2405.08295
This study presents a novel methodology utilizing a pre-trained speech recognition model for processing respiratory sound data. By incorporating medical record information, we introduce an innovative multi-modal deep-learning architecture, named Rene, which addresses the challenges of poor interpretability and underperformance in real-time clinical diagnostic response observed in previous respiratory disease-focused models. The proposed Rene architecture demonstrated significant improvements of 10.24%, 16.15%, 15.29%, and 18.90% respectively, compared to the baseline across four tasks related to respiratory event detection and audio record classification on the SPRSound database. In patient disease prediction tests on the ICBHI database, the architecture exhibited improvements of 23% in the mean of average score and harmonic score compared to the baseline. Furthermore, we developed a real-time respiratory sound discrimination system based on the Rene architecture, featuring a dual-thread design and compressed model parameters for simultaneous microphone recording and real-time dynamic decoding. Employing state-of-the-art Edge AI technology, this system enables rapid and accurate responses for respiratory sound auscultation, facilitating deployment on wearable clinical detection devices to capture incremental data, which can be synergistically evolved with large-scale models deployed on cloud servers for downstream tasks.
https://arxiv.org/abs/2405.07442
The application of Automatic Speech Recognition (ASR) technology in soccer offers numerous opportunities for sports analytics. Specifically, extracting audio commentaries with ASR provides valuable insights into the events of the game, and opens the door to several downstream applications such as automatic highlight generation. This paper presents SoccerNet-Echoes, an augmentation of the SoccerNet dataset with automatically generated transcriptions of audio commentaries from soccer game broadcasts, enhancing video content with rich layers of textual information derived from the game audio using ASR. These textual commentaries, generated using the Whisper model and translated with Google Translate, extend the usefulness of the SoccerNet dataset in diverse applications such as enhanced action spotting, automatic caption generation, and game summarization. By incorporating textual data alongside visual and auditory content, SoccerNet-Echoes aims to serve as a comprehensive resource for the development of algorithms specialized in capturing the dynamics of soccer games. We detail the methods involved in the curation of this dataset and the integration of ASR. We also highlight the implications of a multimodal approach in sports analytics, and how the enriched dataset can support diverse applications, thus broadening the scope of research and development in the field of sports analytics.
https://arxiv.org/abs/2405.07354
Automatic speech recognition (ASR) systems, increasingly prevalent in education, healthcare, employment, and mobile technology, face significant challenges in inclusivity, particularly for the 80 million-strong global community of people who stutter. These systems often fail to accurately interpret speech patterns deviating from typical fluency, leading to critical usability issues and misinterpretations. This study evaluates six leading ASRs, analyzing their performance on both a real-world dataset of speech samples from individuals who stutter and a synthetic dataset derived from the widely-used LibriSpeech benchmark. The synthetic dataset, uniquely designed to incorporate various stuttering events, enables an in-depth analysis of each ASR's handling of disfluent speech. Our comprehensive assessment includes metrics such as word error rate (WER), character error rate (CER), and semantic accuracy of the transcripts. The results reveal a consistent and statistically significant accuracy bias across all ASRs against disfluent speech, manifesting in significant syntactical and semantic inaccuracies in transcriptions. These findings highlight a critical gap in current ASR technologies, underscoring the need for effective bias mitigation strategies. Addressing this bias is imperative not only to improve the technology's usability for people who stutter but also to ensure their equitable and inclusive participation in the rapidly evolving digital landscape.
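For reference, the reported word and character error rates can be computed with the `jiwer` package; the disfluent example below is made up for illustration.

```python
# Word error rate (WER) and character error rate (CER) between a reference
# transcript and a hypothesis containing stuttering-like repetitions.
import jiwer

reference = "i would like to book a flight to boston"
hypothesis = "i would like to to book a f flight to boston"

print(f"WER: {jiwer.wer(reference, hypothesis):.3f}")
print(f"CER: {jiwer.cer(reference, hypothesis):.3f}")
```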
https://arxiv.org/abs/2405.06150
Recent developments in large speech foundation models like Whisper have led to their widespread use in many automatic speech recognition (ASR) applications. These systems incorporate `special tokens' in their vocabulary, such as $\texttt{<endoftext>}$, to guide their language generation process. However, we demonstrate that these tokens can be exploited by adversarial attacks to manipulate the model's behavior. We propose a simple yet effective method to learn a universal acoustic realization of Whisper's $\texttt{<endoftext>}$ token, which, when prepended to any speech signal, encourages the model to ignore the speech and only transcribe the special token, effectively `muting' the model. Our experiments demonstrate that the same, universal 0.64-second adversarial audio segment can successfully mute a target Whisper ASR model for over 97\% of speech samples. Moreover, we find that this universal adversarial audio segment often transfers to new datasets and tasks. Overall this work demonstrates the vulnerability of Whisper models to `muting' adversarial attacks, where such attacks can pose both risks and potential benefits in real-world settings: for example the attack can be used to bypass speech moderation systems, or conversely the attack can also be used to protect private speech data.
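A hedged sketch of how such a universal acoustic realization might be learned: a single short waveform prefix is optimized over many utterances so that the model assigns high likelihood to the special-token-only output. `asr_nll_of_target` is a placeholder for that objective, not Whisper's API.

```python
# Learn one universal audio prefix (0.64 s at 16 kHz = 10240 samples) that,
# prepended to any utterance, drives the ASR model toward the target output.
import torch

def learn_universal_prefix(batches, asr_nll_of_target, prefix_len=10240,
                           steps=1000, lr=1e-3, eps=0.02):
    prefix = torch.zeros(prefix_len, requires_grad=True)
    opt = torch.optim.Adam([prefix], lr=lr)
    for _ in range(steps):
        for audio in batches:                                   # (B, T) waveforms
            x = torch.cat([prefix.expand(audio.size(0), -1), audio], dim=1)
            loss = asr_nll_of_target(x)                         # e.g. -log p(special token only)
            opt.zero_grad()
            loss.backward()
            opt.step()
            with torch.no_grad():
                prefix.clamp_(-eps, eps)                        # keep the prefix quiet
    return prefix.detach()
```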
https://arxiv.org/abs/2405.06134
This paper presents our system submission for the In-Car Multi-Channel Automatic Speech Recognition (ICMC-ASR) Challenge, which focuses on speaker diarization and speech recognition in complex multi-speaker scenarios. To address these challenges, we develop end-to-end speaker diarization models that notably decrease the diarization error rate (DER) by 49.58\% compared to the official baseline on the development set. For speech recognition, we utilize self-supervised learning representations to train end-to-end ASR models. By integrating these models, we achieve a character error rate (CER) of 16.93\% on the track 1 evaluation set, and a concatenated minimum permutation character error rate (cpCER) of 25.88\% on the track 2 evaluation set.
https://arxiv.org/abs/2405.05498
Self-Supervised Learning (SSL) has proven to be useful in various speech tasks. However, these methods are generally very demanding in terms of data, memory, and computational resources. BERT-based Speech pre-Training with Random-projection Quantizer (BEST-RQ) is an SSL method that has shown great performance on Automatic Speech Recognition (ASR) while being simpler than other SSL methods, such as wav2vec 2.0. Despite BEST-RQ's great performance, details are lacking in the original paper, such as the amount of GPU/TPU hours used in pre-training, and there is no official easy-to-use open-source implementation. Furthermore, BEST-RQ has not been evaluated on downstream tasks other than ASR and speech translation. In this work, we describe a re-implementation of a random-projection quantizer and perform a preliminary study comparing it to wav2vec 2.0 on four downstream tasks. We discuss the details and differences of our implementation. We show that a random-projection quantizer can achieve downstream performance similar to wav2vec 2.0 while reducing training time by more than a factor of two.
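A minimal sketch of a random-projection quantizer of the kind BEST-RQ uses: features are projected with a frozen random matrix and mapped to the nearest entry of a frozen random codebook, and the resulting indices serve as pre-training targets. Dimensions are illustrative.

```python
# Both the projection matrix and the codebook are random and never trained;
# only the speech encoder learns to predict the resulting target indices.
import torch

class RandomProjectionQuantizer(torch.nn.Module):
    def __init__(self, input_dim=80, codebook_size=8192, code_dim=16):
        super().__init__()
        self.register_buffer("projection", torch.randn(input_dim, code_dim))
        codebook = torch.nn.functional.normalize(torch.randn(codebook_size, code_dim), dim=-1)
        self.register_buffer("codebook", codebook)

    @torch.no_grad()
    def forward(self, feats):                                  # (B, T, input_dim)
        z = torch.nn.functional.normalize(feats @ self.projection, dim=-1)
        dists = torch.cdist(z, self.codebook.unsqueeze(0).expand(z.size(0), -1, -1))
        return dists.argmin(dim=-1)                            # (B, T) target indices
```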
https://arxiv.org/abs/2405.04296
Large general-purpose transformer models have recently become the mainstay in the realm of speech analysis. In particular, Whisper achieves state-of-the-art results in relevant tasks such as speech recognition, translation, language identification, and voice activity detection. However, Whisper models are not designed to be used in real-time conditions, and this limitation makes them unsuitable for a wide range of practical applications. In this paper, we introduce Whispy, a system intended to bring live capabilities to the Whisper pretrained models. As a result of a number of architectural optimisations, Whispy is able to consume live audio streams and generate high-level, coherent voice transcriptions, while still maintaining a low computational cost. We evaluate the performance of our system on a large repository of publicly available speech datasets, investigating how the transcription mechanism introduced by Whispy impacts the Whisper output. Experimental results show how Whispy excels in robustness, promptness, and accuracy.
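The streaming idea can be approximated with the off-the-shelf `openai-whisper` package by buffering live audio into chunks and keeping a short tail as context; the chunking policy below is illustrative, not Whispy's architecture.

```python
# Buffer incoming audio blocks, transcribe each chunk with the offline Whisper
# model, and carry over a short overlap so words cut at chunk boundaries can be
# recovered in the next pass.
import numpy as np
import whisper

model = whisper.load_model("base")
SAMPLE_RATE, CHUNK_S, OVERLAP_S = 16000, 5.0, 1.0

def transcribe_stream(frames):                     # iterable of float32 numpy blocks
    buffer = np.zeros(0, dtype=np.float32)
    for block in frames:
        buffer = np.concatenate([buffer, block])
        if len(buffer) >= int(CHUNK_S * SAMPLE_RATE):
            yield model.transcribe(buffer, fp16=False)["text"]
            buffer = buffer[-int(OVERLAP_S * SAMPLE_RATE):]   # keep overlap as context
```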
https://arxiv.org/abs/2405.03484
Despite notable advancements in automatic speech recognition (ASR), performance tends to degrade when faced with adverse conditions. Generative error correction (GER) leverages the exceptional text comprehension capabilities of large language models (LLM), delivering impressive performance in ASR error correction, where N-best hypotheses provide valuable information for transcription prediction. However, GER encounters challenges such as fixed N-best hypotheses, insufficient utilization of acoustic information, and limited specificity to multi-accent scenarios. In this paper, we explore the application of GER in multi-accent scenarios. Accents represent deviations from standard pronunciation norms, and the multi-task learning framework for simultaneous ASR and accent recognition (AR) has effectively addressed the multi-accent scenarios, making it a prominent solution. In this work, we propose a unified ASR-AR GER model, named MMGER, leveraging multi-modal correction, and multi-granularity correction. Multi-task ASR-AR learning is employed to provide dynamic 1-best hypotheses and accent embeddings. Multi-modal correction accomplishes fine-grained frame-level correction by force-aligning the acoustic features of speech with the corresponding character-level 1-best hypothesis sequence. Multi-granularity correction supplements the global linguistic information by incorporating regular 1-best hypotheses atop fine-grained multi-modal correction to achieve coarse-grained utterance-level correction. MMGER effectively mitigates the limitations of GER and tailors LLM-based ASR error correction for the multi-accent scenarios. Experiments conducted on the multi-accent Mandarin KeSpeech dataset demonstrate the efficacy of MMGER, achieving a 26.72% relative improvement in AR accuracy and a 27.55% relative reduction in ASR character error rate, compared to a well-established standard baseline.
https://arxiv.org/abs/2405.03152
As interest in large language models (LLMs) grows, the importance of accuracy in automatic speech recognition (ASR) has become more pronounced. This is particularly true for lectures that include specialized terminology, where the success rate of traditional ASR models tends to be low, posing a challenging problem. A method to improve ASR performance for specialized terminology using the word frequency difference approach has been proposed. Through experiments and data analysis, we investigate whether this proposal effectively addresses the issue. Additionally, we introduce the power law as the theoretical foundation for the relative frequency
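One plausible reading of the word-frequency-difference approach, sketched under the assumption that specialized terms are those whose relative frequency in lecture/domain text far exceeds their frequency in general text (the ranked frequencies themselves typically follow a Zipf-like power law):

```python
# Score each word by the ratio of its relative frequency in domain text to its
# relative frequency in general text; high-ratio words are candidate
# specialized terms (e.g., for biasing or post-correction). Illustrative only.
from collections import Counter

def specialized_terms(domain_tokens, general_tokens, top_k=50):
    domain, general = Counter(domain_tokens), Counter(general_tokens)
    d_total, g_total = sum(domain.values()), sum(general.values())
    score = {w: (domain[w] / d_total) / (general[w] / g_total + 1e-9)
             for w in domain}
    return sorted(score, key=score.get, reverse=True)[:top_k]
```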
https://arxiv.org/abs/2405.02995
This paper introduces Mixat: a dataset of Emirati speech code-mixed with English. Mixat was developed to address the shortcomings of current speech recognition resources when applied to Emirati speech, and in particular, to bilingual Emirati speakers who often mix and switch between their local dialect and English. The data set consists of 15 hours of speech derived from two public podcasts featuring native Emirati speakers, one of which is in the form of conversations between the host and a guest. Therefore, the collection contains examples of Emirati-English code-switching in both formal and natural conversational contexts. In this paper, we describe the process of data collection and annotation, and describe some of the features and statistics of the resulting data set. In addition, we evaluate the performance of pre-trained Arabic and multi-lingual ASR systems on our dataset, demonstrating the shortcomings of existing models on this low-resource dialectal Arabic, and the additional challenge of recognizing code-switching in ASR. The dataset will be made publicly available for research use.
https://arxiv.org/abs/2405.02578
Large Language Models have demonstrated unparalleled effectiveness in various NLP tasks, and integrating LLMs with automatic speech recognition is becoming a mainstream paradigm. Building upon this momentum, our research delves into an in-depth examination of this paradigm on a large open-source Chinese dataset. Specifically, our research aims to evaluate the impact of various configurations of speech encoders, LLMs, and projector modules in the context of the speech foundation encoder-LLM ASR paradigm. Furthermore, we introduce a three-stage training approach, expressly developed to enhance the model's ability to align auditory and textual information. The implementation of this approach, alongside the strategic integration of ASR components, enabled us to achieve the SOTA performance on the AISHELL-1, TestNet, and TestMeeting test sets. Our analysis presents an empirical foundation for future research in LLM-based ASR systems and offers insights into optimizing performance using Chinese datasets. We will publicly release all scripts used for data preparation, training, inference, and scoring, as well as pretrained models and training logs to promote reproducible research.
https://arxiv.org/abs/2405.02132
This paper explores the use of Hybrid CTC/Attention encoder-decoder models trained with Intermediate CTC (InterCTC) for Irish (Gaelic) low-resource speech recognition (ASR) and dialect identification (DID). Results are compared to the current best performing models trained for ASR (TDNN-HMM) and DID (ECAPA-TDNN). An optimal InterCTC setting is initially established using a Conformer encoder. This setting is then used to train a model with an E-branchformer encoder and the performance of both architectures are compared. A multi-task fine-tuning approach is adopted for language model (LM) shallow fusion. The experiments yielded an improvement in DID accuracy of 10.8% relative to a baseline ECAPA-TDNN, and WER performance approaching the TDNN-HMM model. This multi-task approach emerges as a promising strategy for Irish low-resource ASR and DID.
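For context, LM shallow fusion amounts to interpolating the ASR decoder's next-token log-probabilities with those of an external language model at each beam-search step; a minimal sketch (the weight is a tunable hyperparameter, not a value from the paper):

```python
# Combine ASR and LM next-token scores for shallow fusion during beam search.
import torch

def shallow_fusion_step(asr_logits, lm_logits, lm_weight=0.3):
    # asr_logits, lm_logits: (batch, vocab) scores for the next token
    asr_logp = torch.log_softmax(asr_logits, dim=-1)
    lm_logp = torch.log_softmax(lm_logits, dim=-1)
    return asr_logp + lm_weight * lm_logp          # fused score used to rank hypotheses
```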
https://arxiv.org/abs/2405.01293
Membership Inference (MI) poses a substantial privacy threat to the training data of Automatic Speech Recognition (ASR) systems, while also offering an opportunity to audit these models with regard to user data. This paper explores the effectiveness of loss-based features in combination with Gaussian and adversarial perturbations to perform MI in ASR models. To the best of our knowledge, this approach has not yet been investigated. We compare our proposed features with commonly used error-based features and find that the proposed features greatly enhance performance for sample-level MI. For speaker-level MI, these features improve results, though by a smaller margin, as error-based features already obtained a high performance for this task. Our findings emphasise the importance of considering different feature sets and levels of access to target models for effective MI in ASR systems, providing valuable insights for auditing such models.
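A hedged sketch of loss-based membership features combined with Gaussian perturbations: the target model's loss is measured on the original and perturbed audio, and simple statistics of those losses feed a membership classifier. `asr_loss` is a placeholder for access to the target model's per-utterance loss.

```python
# Build a small feature vector per utterance: base loss, mean/std of the loss
# under Gaussian perturbations, and the loss shift. Members tend to have low
# loss and low sensitivity to perturbation.
import torch

def mi_features(audio, transcript, asr_loss, n_noise=8, sigma=0.005):
    base = asr_loss(audio, transcript).item()
    noisy = torch.tensor([
        asr_loss(audio + sigma * torch.randn_like(audio), transcript).item()
        for _ in range(n_noise)
    ])
    return [base, noisy.mean().item(), noisy.std().item(), noisy.mean().item() - base]
```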
https://arxiv.org/abs/2405.01207
Recent transformer-based ASR models have achieved word-error rates (WER) below 4%, surpassing human annotator accuracy, yet they demand extensive server resources, contributing to significant carbon footprints. The traditional server-based architecture of ASR also presents privacy concerns, alongside reliability and latency issues due to network dependencies. In contrast, on-device (edge) ASR enhances privacy, boosts performance, and promotes sustainability by effectively balancing energy use and accuracy for specific applications. This study examines the effects of quantization, memory demands, and energy consumption on the performance of various ASR model inference on the NVIDIA Jetson Orin Nano. By analyzing WER and transcription speed across models using FP32, FP16, and INT8 quantization on clean and noisy datasets, we highlight the crucial trade-offs between accuracy, speed, quantization, energy efficiency, and memory needs. We found that changing precision from FP32 to FP16 halves the energy consumption for audio transcription across different models, with minimal performance degradation. A larger model size and parameter count neither guarantee better resilience to noise nor predict the energy consumption for a given transcription load. These and several other findings offer novel insights for optimizing ASR systems within energy- and memory-limited environments, crucial for the development of efficient on-device ASR solutions. The code and input data needed to reproduce the results in this article are open-sourced and available at [this https URL].
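A small sketch of the precision comparison, assuming a PyTorch ASR model: casting to FP16 with `model.half()` and timing inference. Energy readings on the Jetson would come from the board's power rails (e.g., via tegrastats) and are outside this snippet.

```python
# Run one forward pass in FP32 or FP16 and return the output and wall-clock time.
import time
import torch

def timed_transcription(model, inputs, use_fp16=True, device="cuda"):
    model = model.to(device).eval()
    if use_fp16:
        model = model.half()
        inputs = inputs.half()
    inputs = inputs.to(device)
    torch.cuda.synchronize()
    start = time.time()
    with torch.no_grad():
        out = model(inputs)
    torch.cuda.synchronize()
    return out, time.time() - start
```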
https://arxiv.org/abs/2405.01004
Whisper is a multitask and multilingual speech model covering 99 languages. It yields commendable automatic speech recognition (ASR) results in a subset of its covered languages, but the model still underperforms on a non-negligible number of under-represented languages, a problem exacerbated in smaller model versions. In this work, we examine its limitations, demonstrating the presence of speaker-related (gender, age) and model-related (resourcefulness and model size) bias. Despite that, we show that only model-related biases are amplified by quantization, affecting low-resource languages and smaller models the most. Searching for a better compression approach, we propose DistilWhisper, an approach that bridges the performance gap in ASR for these languages while retaining the advantages of multitask and multilingual capabilities. Our approach involves two key strategies: lightweight modular ASR fine-tuning of whisper-small using language-specific experts, and knowledge distillation from whisper-large-v2. This dual approach allows us to effectively boost ASR performance while keeping the robustness inherited from the multitask and multilingual pre-training. Results demonstrate that our approach is more effective than standard fine-tuning or LoRA adapters, boosting performance in the targeted languages for both in- and out-of-domain test sets, while introducing only a negligible parameter overhead at inference.
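The distillation component can be sketched as a standard temperature-scaled KL term between student and teacher token distributions, combined with cross-entropy on the labels; the weights and temperature below are illustrative, not the paper's settings.

```python
# Knowledge distillation loss: cross-entropy on labels plus a KL term pulling
# the student's token distribution toward the (frozen) teacher's.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, T=2.0):
    # student_logits, teacher_logits: (B, T_dec, V); labels: (B, T_dec)
    ce = F.cross_entropy(student_logits.transpose(1, 2), labels, ignore_index=-100)
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * T * T
    return alpha * ce + (1 - alpha) * kd
```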
https://arxiv.org/abs/2405.00966
Encoder-decoder foundation models have displayed state-of-the-art performance on a range of autoregressive sequence tasks. This paper proposes a simple, lightweight, and inference-efficient modification to such systems that controls their behaviour according to a specific attribute of interest. Specifically, we show that a small proxy network can be used to find a sample-by-sample perturbation of the encoder output of a frozen foundation model that triggers the decoder to generate improved decodings. This work explores a specific realization of this framework focused on improving the COMET performance of Flan-T5 on Machine Translation and the WER of Whisper foundation models on Speech Recognition. Results display consistent improvements in performance evaluated through COMET and WER respectively. Furthermore, experiments also show that the proxies are robust to the exact nature of the data used to train them and can extend to other domains.
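A hedged sketch of such a proxy: a small bottleneck network predicts a residual perturbation of the frozen encoder's output states, and only the proxy's parameters are trained against the metric-driven objective. Dimensions and scaling are illustrative.

```python
# The foundation model's encoder and decoder stay frozen; only this proxy is
# trained. At inference it adds a small, sample-specific perturbation to the
# encoder states before they are fed to the decoder.
import torch.nn as nn

class EncoderOutputProxy(nn.Module):
    def __init__(self, hidden_dim=768, bottleneck=64, scale=0.1):
        super().__init__()
        self.scale = scale
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, bottleneck),
            nn.ReLU(),
            nn.Linear(bottleneck, hidden_dim),
        )

    def forward(self, encoder_states):                 # (B, T, hidden_dim), frozen
        return encoder_states + self.scale * self.net(encoder_states)
```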
https://arxiv.org/abs/2405.01601
Speech emotion recognition (SER) has garnered increasing attention due to its wide range of applications in various fields, including human-machine interaction, virtual assistants, and mental health assistance. However, existing SER methods often overlook the information gap between the pre-training speech recognition task and the downstream SER task, resulting in sub-optimal performance. Moreover, current methods require much time for fine-tuning on each specific speech dataset, such as IEMOCAP, which limits their effectiveness in real-world scenarios with large-scale noisy data. To address these issues, we propose an active learning (AL)-based fine-tuning framework for SER, called \textsc{After}, that leverages task adaptation pre-training (TAPT) and AL methods to enhance performance and efficiency. Specifically, we first use TAPT to minimize the information gap between the pre-training speech recognition task and the downstream speech emotion recognition task. Then, AL methods are employed to iteratively select a subset of the most informative and diverse samples for fine-tuning, thereby reducing time consumption. Experiments demonstrate that our proposed method \textsc{After}, using only 20\% of samples, improves accuracy by 8.45\% and reduces time consumption by 79\%. The additional extension of \textsc{After} and ablation studies further confirm its effectiveness and applicability to various real-world scenarios. Our source code is available on Github for reproducibility. (this https URL).
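A minimal sketch of the active-learning selection step, assuming predictive entropy is used as the informativeness score on the unlabeled pool (the diversity criterion is omitted here):

```python
# Pick the k most uncertain unlabeled utterances for the next fine-tuning round.
import torch

def select_informative(probs, k):
    # probs: (N, num_emotions) predicted class probabilities on the unlabeled pool
    entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=-1)
    return torch.topk(entropy, k).indices
```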
https://arxiv.org/abs/2405.00307