In light of the widespread deployment of Automatic Speech Recognition (ASR) systems, their security has received more attention than ever before, primarily because of the susceptibility of Deep Neural Networks to adversarial examples. Previous studies have shown that surreptitiously crafted adversarial perturbations can manipulate speech recognition systems into producing malicious commands. These attack methods mostly add noise perturbations under $\ell_p$-norm constraints, inevitably leaving behind artifacts of manual modification. Recent research has alleviated this limitation by manipulating style vectors to synthesize adversarial examples from Text-to-Speech (TTS) audio. However, style modifications driven purely by the optimization objective significantly reduce the controllability and editability of the audio style. In this paper, we propose an attack on ASR systems based on user-customized style transfer. We first evaluate the Style Transfer Attack (STA), which chains style transfer and adversarial attack sequentially. Then, as an improvement, we propose an iterative Style Code Attack (SCA) to maintain audio quality. Experimental results show that our method meets the need for user-customized styles, achieves an attack success rate of 82%, and preserves sound naturalness according to our user study.
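A minimal sketch of the iterative style-perturbation idea, assuming a hypothetical `tts_decoder` (style code to waveform) and `asr_loss_fn` (waveform plus target command to loss); this is a generic projected-gradient loop, not the authors' exact SCA procedure:

```python
import torch

def style_code_attack(style_code, tts_decoder, asr_loss_fn, target_text,
                      steps=100, lr=1e-3, eps=0.05):
    """Iteratively nudge a TTS style code so the synthesized audio is transcribed
    as `target_text`, while staying close to the user-chosen style.
    `tts_decoder` and `asr_loss_fn` are hypothetical stand-ins."""
    original = style_code.detach().clone()
    delta = torch.zeros_like(original, requires_grad=True)
    optimizer = torch.optim.Adam([delta], lr=lr)

    for _ in range(steps):
        audio = tts_decoder(original + delta)       # synthesize with perturbed style
        loss = asr_loss_fn(audio, target_text)      # e.g. CTC loss against the malicious command
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        with torch.no_grad():                       # keep the style edit small to preserve audio quality
            delta.clamp_(-eps, eps)
    return (original + delta).detach()
```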
https://arxiv.org/abs/2405.09470
In this work, we present Score MUsic Graph (SMUG)-Explain, a framework for generating and visualizing explanations of graph neural networks applied to arbitrary prediction tasks on musical scores. Our system allows the user to visualize the contribution of input notes (and note features) to the network output, directly in the context of the musical score. We provide an interactive interface based on the music notation engraving library Verovio. We showcase the usage of SMUG-Explain on the task of cadence detection in classical music. All code is publicly available.
https://arxiv.org/abs/2405.09241
We propose a new graph convolutional block, called MusGConv, specifically designed for the efficient processing of musical score data and motivated by general perceptual principles. It focuses on two fundamental dimensions of music, pitch and rhythm, and considers both relative and absolute representations of these components. We evaluate our approach on four different musical understanding problems: monophonic voice separation, harmonic analysis, cadence detection, and composer identification which, in abstract terms, translate to different graph learning problems, namely, node classification, link prediction, and graph classification. Our experiments demonstrate that MusGConv improves the performance on three of the aforementioned tasks while being conceptually very simple and efficient. We interpret this as evidence that it is beneficial to include perception-informed processing of fundamental musical concepts when developing graph network applications on musical score data.
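As a rough illustration of combining absolute and relative note representations in a graph convolution, a toy message-passing layer is sketched below in plain PyTorch; the actual MusGConv formulation may differ:

```python
import torch
import torch.nn as nn

class RelativeGraphConv(nn.Module):
    """Toy graph convolution that combines a neighbour's absolute features with
    relative (difference) features, loosely inspired by MusGConv."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.msg_mlp = nn.Sequential(nn.Linear(2 * in_dim, out_dim), nn.ReLU())
        self.update = nn.Linear(in_dim + out_dim, out_dim)

    def forward(self, x, edge_index):
        # x: (num_notes, in_dim), edge_index: (2, num_edges) of (src, dst) pairs
        src, dst = edge_index
        rel = x[src] - x[dst]                    # relative representation (e.g. pitch interval, onset gap)
        msg = self.msg_mlp(torch.cat([x[src], rel], dim=-1))
        agg = torch.zeros(x.size(0), msg.size(-1), device=x.device)
        agg.index_add_(0, dst, msg)              # sum-aggregate messages per destination note
        return torch.relu(self.update(torch.cat([x, agg], dim=-1)))
```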
https://arxiv.org/abs/2405.09224
It remains a challenge to effectively control the emotion rendering in text-to-speech (TTS) synthesis. Prior studies have primarily focused on learning a global prosodic representation at the utterance level, which strongly correlates with linguistic prosody. Our goal is to construct a hierarchical emotion distribution (ED) that effectively encapsulates intensity variations of emotions at various levels of granularity, encompassing phonemes, words, and utterances. During TTS training, the hierarchical ED is extracted from the ground-truth audio and guides the predictor to establish a connection between emotional and linguistic prosody. At run-time inference, the TTS model generates emotional speech and, at the same time, provides quantitative control of emotion over the speech constituents. Both objective and subjective evaluations validate the effectiveness of the proposed framework in terms of emotion prediction and control.
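A toy illustration of the hierarchical structure, assuming per-phoneme emotion intensities are already available and higher levels are obtained by averaging (the paper's extraction is more involved):

```python
import numpy as np

def hierarchical_ed(phoneme_intensities, word_boundaries):
    """Toy hierarchical emotion distribution: per-phoneme intensities (one row per
    phoneme, one column per emotion) aggregated to word and utterance level by
    averaging. `word_boundaries` holds (start, end) phoneme indices per word."""
    phoneme_ed = np.asarray(phoneme_intensities)
    word_ed = np.stack([phoneme_ed[s:e].mean(axis=0) for s, e in word_boundaries])
    utterance_ed = phoneme_ed.mean(axis=0)
    return phoneme_ed, word_ed, utterance_ed
```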
https://arxiv.org/abs/2405.09171
Current speaker diarization systems rely on an external voice activity detection (VAD) model prior to speaker embedding extraction on the detected speech segments. In this paper, we establish that the attention system of a speaker embedding extractor acts as a weakly supervised internal VAD model and performs on par with or better than comparable supervised VAD systems. Subsequently, speaker diarization can be performed efficiently by extracting the VAD logits and the corresponding speaker embedding simultaneously, alleviating the need for, and the computational overhead of, an external VAD model. We provide an extensive analysis of the behavior of the frame-level attention system in current speaker verification models and propose a novel speaker diarization pipeline using ECAPA2 speaker embeddings for both VAD and embedding extraction. The proposed strategy achieves state-of-the-art performance on the AMI, VoxConverse and DIHARD III diarization benchmarks.
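A sketch of how frame-level attention scores could be reused as an internal VAD, assuming the speaker embedding extractor exposes per-frame attention normalised to [0, 1]; the threshold and minimum-duration values are illustrative:

```python
import numpy as np

def attention_to_segments(attn_weights, frame_shift=0.01, threshold=0.5, min_dur=0.2):
    """Turn per-frame attention scores into speech segments (start, end) in seconds."""
    active = np.asarray(attn_weights) >= threshold
    segments, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i
        elif not a and start is not None:
            segments.append((start * frame_shift, i * frame_shift))
            start = None
    if start is not None:
        segments.append((start * frame_shift, len(active) * frame_shift))
    # discard very short detections, which are usually spurious
    return [(s, e) for s, e in segments if e - s >= min_dur]
```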
https://arxiv.org/abs/2405.09142
In this article, we explore the potential of using latent diffusion models, a family of powerful generative models, for the task of reconstructing naturalistic music from electroencephalogram (EEG) recordings. Unlike simpler music with limited timbres, such as MIDI-generated tunes or monophonic pieces, the focus here is on intricate music featuring a diverse array of instruments, voices, and effects, rich in harmonics and timbre. This study represents an initial foray into achieving general, high-quality music reconstruction using non-invasive EEG data, employing an end-to-end training approach directly on raw data without the need for manual pre-processing and channel selection. We train our models on the public NMED-T dataset and perform a quantitative evaluation with proposed neural embedding-based metrics. We additionally perform song classification based on the generated tracks. Our work contributes to the ongoing research in neural decoding and brain-computer interfaces, offering insights into the feasibility of using EEG data for complex auditory information reconstruction.
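A sketch of a neural embedding-based metric, assuming a hypothetical `embed` function that maps audio to a fixed-length vector; the paper's metrics may be defined differently:

```python
import numpy as np

def embedding_similarity(gen_audio, ref_audio, embed):
    """Cosine similarity between embeddings of the reconstructed and reference tracks.
    `embed` is a hypothetical pretrained audio encoder returning a 1-D vector."""
    a, b = embed(gen_audio), embed(ref_audio)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```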
https://arxiv.org/abs/2405.09062
Binaural Audio Telepresence (BAT) aims to encode the acoustic scene at the far end into binaural signals for the user at the near end. BAT encompasses an immense range of applications that vary between two extreme modes: Immersive BAT (I-BAT) and Enhanced BAT (E-BAT). With I-BAT, our goal is to preserve the full ambience as if we were at the far end, while with E-BAT, our goal is to enhance the far-end conversation with significantly improved speech quality and intelligibility. To this end, this paper presents a tunable BAT system that varies between these two AT modes with a desired application-specific balance. Microphone signals are converted into binaural signals with a prescribed ambience factor. A novel Spatial COherence REpresentation (SCORE) is proposed as an input feature for model training so that the network remains robust to different array setups. Experimental results demonstrate the superior performance of the proposed BAT, even when the array configurations were not included in the training phase.
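A sketch of a spatial-coherence style feature between two microphone channels, computed from STFT cross-spectra with scipy; the actual SCORE representation is likely richer:

```python
import numpy as np
from scipy.signal import stft

def spatial_coherence(x1, x2, fs=16000, nperseg=512):
    """Complex spatial coherence between two microphone signals,
    Gamma(f) = E[X1 X2*] / sqrt(E[|X1|^2] E[|X2|^2]), averaged over frames."""
    _, _, X1 = stft(x1, fs=fs, nperseg=nperseg)
    _, _, X2 = stft(x2, fs=fs, nperseg=nperseg)
    cross = np.mean(X1 * np.conj(X2), axis=-1)
    p1 = np.mean(np.abs(X1) ** 2, axis=-1)
    p2 = np.mean(np.abs(X2) ** 2, axis=-1)
    return cross / np.sqrt(p1 * p2 + 1e-12)
```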
https://arxiv.org/abs/2405.08742
The rise of advanced large language models such as GPT-4, GPT-4o, and the Claude family has made fake audio detection increasingly challenging. Traditional fine-tuning methods struggle to keep pace with the evolving landscape of synthetic speech, necessitating continual learning approaches that can adapt to new audio while retaining the ability to detect older types. Continual learning, which acts as an effective tool for detecting newly emerged deepfake audio while maintaining performance on older types, lacks a well-constructed and user-friendly evaluation framework. To address this gap, we introduce EVDA, a benchmark for evaluating continual learning methods in deepfake audio detection. EVDA includes classic datasets from the Anti-Spoofing Voice series, the Chinese fake audio detection series, and newly generated deepfake audio from models like GPT-4 and GPT-4o. It supports various continual learning techniques, such as Elastic Weight Consolidation (EWC), Learning without Forgetting (LwF), and recent methods like Regularized Adaptive Weight Modification (RAWM) and Radian Weight Modification (RWM). Additionally, EVDA facilitates the development of robust algorithms by providing an open interface for integrating new continual learning methods.
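For reference, a minimal sketch of the Elastic Weight Consolidation penalty that such benchmarks typically exercise (not EVDA's own code):

```python
import torch

def ewc_penalty(model, old_params, fisher, lam=1.0):
    """Elastic Weight Consolidation: penalise drift from parameters learned on
    previous deepfake types, weighted by their estimated Fisher importance."""
    loss = 0.0
    for name, p in model.named_parameters():
        if name in old_params:
            loss = loss + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return lam / 2.0 * loss
```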
https://arxiv.org/abs/2405.08596
Neural audio coding has emerged as a vivid research direction by promising good audio quality at very low bitrates unachievable by classical coding techniques. Here, end-to-end trainable autoencoder-like models represent the state of the art, where a discrete representation in the bottleneck of the autoencoder has to be learned that allows for efficient transmission of the input audio signal. This discrete representation is typically generated by applying a quantizer to the output of the neural encoder. In almost all state-of-the-art neural audio coding approaches, this quantizer is realized as a Vector Quantizer (VQ) and a lot of effort has been spent to alleviate drawbacks of this quantization technique when used together with a neural audio coder. In this paper, we propose simple alternatives to VQ, which are based on projected Scalar Quantization (SQ). These quantization techniques do not need any additional losses, scheduling parameters or codebook storage thereby simplifying the training of neural audio codecs. Furthermore, we propose a new causal network architecture for neural speech coding that shows good performance at very low computational complexity.
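A minimal sketch of projected scalar quantization with a straight-through gradient, assuming the bottleneck is projected to a few bounded dimensions that are each rounded to a uniform grid; the paper's exact projection and bit allocation may differ:

```python
import torch
import torch.nn as nn

class ProjectedScalarQuantizer(nn.Module):
    """Project the bottleneck to a few dimensions, squash to [-1, 1], and round
    each dimension to a uniform grid of roughly `levels` steps (straight-through)."""
    def __init__(self, in_dim, code_dim=8, levels=16):
        super().__init__()
        self.down = nn.Linear(in_dim, code_dim)
        self.up = nn.Linear(code_dim, in_dim)
        self.levels = levels

    def forward(self, z):
        h = torch.tanh(self.down(z))                 # bounded low-dimensional projection
        scale = (self.levels - 1) / 2.0
        q = torch.round(h * scale) / scale           # uniform scalar quantization
        q = h + (q - h).detach()                     # straight-through gradient estimator
        return self.up(q)
```

No codebook, commitment loss, or scheduling is needed, which is the simplification the abstract highlights over vector quantization.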
https://arxiv.org/abs/2405.08417
With the rapid advancement of generative AI, multimodal deepfakes, which manipulate both audio and visual modalities, have drawn increasing public concern. Currently, deepfake detection has emerged as a crucial strategy in countering these growing threats. However, as a key factor in training and validating deepfake detectors, most existing deepfake datasets primarily focus on the visual modality, while the few that are multimodal employ outdated techniques and limit their audio content to a single language, thereby failing to represent the cutting-edge advances and globalization trends in current deepfake technology. To address this gap, we propose a novel, multilingual, and multimodal deepfake dataset: PolyGlotFake. It includes content in seven languages, created using a variety of cutting-edge and popular Text-to-Speech, voice cloning, and lip-sync technologies. We conduct comprehensive experiments using state-of-the-art detection methods on the PolyGlotFake dataset. These experiments demonstrate the dataset's significant challenges and its practical value in advancing research into multimodal deepfake detection.
https://arxiv.org/abs/2405.08838
Respiratory disease, the third leading cause of death globally, is considered a high-priority ailment requiring significant research on identification and treatment. Stethoscope-recorded lung sounds and artificial intelligence-powered devices have been used to identify lung disorders and aid specialists in making accurate diagnoses. In this study, the audio-spectrogram vision transformer (AS-ViT), a new approach for identifying abnormal respiration sounds, was developed. The sounds of the lungs are converted into visual representations called spectrograms using the short-time Fourier transform (STFT). These images are then analyzed by a vision transformer to identify different types of respiratory sounds. The classification was carried out using the ICBHI 2017 database, which includes various types of lung sounds with different frequencies, noise levels, and backgrounds. Evaluated on three metrics, the proposed AS-ViT method achieved 79.1% unweighted average recall and a 59.8% overall score for the 60:40 split, and 86.4% and 69.3% respectively for the 80:20 split, surpassing previous state-of-the-art results for respiratory sound detection.
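A sketch of the spectrogram front end using librosa; parameter values are illustrative and the vision transformer itself is omitted:

```python
import numpy as np
import librosa

def lung_sound_spectrogram(path, sr=4000, n_fft=256, hop_length=64):
    """Load a lung-sound recording and convert it to a log-magnitude STFT
    spectrogram, i.e. the image-like input a vision transformer would consume."""
    y, _ = librosa.load(path, sr=sr)
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))
    S_db = librosa.amplitude_to_db(S, ref=np.max)   # dB-scaled magnitude spectrogram
    # normalise to [0, 1] so it can be treated as a single-channel image
    return (S_db - S_db.min()) / (S_db.max() - S_db.min() + 1e-8)
```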
https://arxiv.org/abs/2405.08342
Semantic communications have been utilized to execute numerous intelligent tasks by transmitting task-related semantic information instead of bits. In this article, we propose a semantic-aware speech-to-text transmission system, named SAC-ST, for single-user multiple-input multiple-output (MIMO) and multi-user MIMO communication scenarios. Specifically, a semantic communication system serving the speech-to-text task at the receiver is first designed, which compresses the semantic information and generates low-dimensional semantic features by leveraging a transformer module. In addition, a novel semantic-aware network is proposed to facilitate transmission with high semantic fidelity by identifying the critical semantic information and guaranteeing that it is recovered accurately. Furthermore, we extend SAC-ST with a neural-network-enabled channel estimation network to mitigate the dependence on accurate channel state information and validate the feasibility of SAC-ST in practical communication environments. Simulation results show that the proposed SAC-ST outperforms the communication framework without the semantic-aware network for speech-to-text transmission over MIMO channels in terms of speech-to-text metrics, especially in the low signal-to-noise-ratio regime. Moreover, SAC-ST with the developed channel estimation network is comparable to SAC-ST with perfect channel state information.
https://arxiv.org/abs/2405.08096
Understanding degraded speech is demanding, requiring increased listening effort (LE). Evaluating processed and unprocessed speech with respect to LE can objectively indicate if speech enhancement systems benefit listeners. However, existing methods for measuring LE are complex and not widely applicable. In this study, we propose a simple method to evaluate speech intelligibility and LE simultaneously without additional strain on subjects or operators. We assess this method using results from two independent studies in Norway and Denmark, testing 76 (50+26) subjects across 9 (6+3) processing conditions. Despite differences in evaluation setups, subject recruitment, and processing systems, trends are strikingly similar, demonstrating the proposed method's robustness and ease of implementation into existing practices.
https://arxiv.org/abs/2405.07641
The application of Automatic Speech Recognition (ASR) technology in soccer offers numerous opportunities for sports analytics. Specifically, extracting audio commentaries with ASR provides valuable insights into the events of the game, and opens the door to several downstream applications such as automatic highlight generation. This paper presents SoccerNet-Echoes, an augmentation of the SoccerNet dataset with automatically generated transcriptions of audio commentaries from soccer game broadcasts, enhancing video content with rich layers of textual information derived from the game audio using ASR. These textual commentaries, generated using the Whisper model and translated with Google Translate, extend the usefulness of the SoccerNet dataset in diverse applications such as enhanced action spotting, automatic caption generation, and game summarization. By incorporating textual data alongside visual and auditory content, SoccerNet-Echoes aims to serve as a comprehensive resource for the development of algorithms specialized in capturing the dynamics of soccer games. We detail the methods involved in the curation of this dataset and the integration of ASR. We also highlight the implications of a multimodal approach in sports analytics, and how the enriched dataset can support diverse applications, thus broadening the scope of research and development in the field of sports analytics.
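A sketch of the transcription step with the open-source openai-whisper package; the model size, segmentation, and the translation step used for the actual dataset may differ:

```python
import whisper  # pip install openai-whisper

def transcribe_commentary(audio_path, model_size="base"):
    """Transcribe a broadcast audio segment with Whisper and return timed segments."""
    model = whisper.load_model(model_size)
    result = model.transcribe(audio_path)
    # each segment carries start/end times, useful for aligning text with game events
    return [(seg["start"], seg["end"], seg["text"]) for seg in result["segments"]]
```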
https://arxiv.org/abs/2405.07354
Electromyography-to-Speech (ETS) conversion has demonstrated its potential for silent speech interfaces by generating audible speech from Electromyography (EMG) signals during silent articulations. ETS models usually consist of an EMG encoder which converts EMG signals to acoustic speech features, and a vocoder which then synthesises the speech signals. Due to an inadequate amount of available data and noisy signals, the synthesised speech often exhibits a low level of naturalness. In this work, we propose Diff-ETS, an ETS model which uses a score-based diffusion probabilistic model to enhance the naturalness of synthesised speech. The diffusion model is applied to improve the quality of the acoustic features predicted by an EMG encoder. In our experiments, we evaluated fine-tuning the diffusion model on predictions of a pre-trained EMG encoder, and training both models in an end-to-end fashion. We compared Diff-ETS with a baseline ETS model without diffusion using objective metrics and a listening test. The results indicated the proposed Diff-ETS significantly improved speech naturalness over the baseline.
https://arxiv.org/abs/2405.08021
Neural networks and deep learning are often deployed to generate music as autonomously as possible, with as little involvement as possible from the human musician. Implementations that aid, or serve as tools for, music practitioners are sparse. This paper proposes the integration of generative stacked autoencoder structures for rhythm generation within a conventional melodic step-sequencer. It further aims to make this implementation accessible to the average electronic music practitioner. Several model architectures have been trained and tested for their creative potential. While the current implementations display limitations, they represent viable creative solutions for music practitioners.
https://arxiv.org/abs/2405.07034
The Chinese numerical string corpus serves as a valuable resource for speaker verification, particularly in financial transactions. Research indicates that in short-speech scenarios, text-dependent speaker verification (TD-SV) consistently outperforms text-independent speaker verification (TI-SV). However, TD-SV potentially includes the validation of text information, which can be negatively impacted by reading rhythms and pauses. To address this problem, we propose an end-to-end speaker verification system that enhances TD-SV by decoupling speaker and text information. Our system consists of a text embedding extractor, a speaker embedding extractor and a fusion module. In the text embedding extractor, we employ an enhanced Transformer and introduce a triple loss comprising a text classification loss, a connectionist temporal classification (CTC) loss and a decoder loss; in the speaker embedding extractor, we create a multi-scale pooling method by combining sliding window attentive statistics pooling (SWASP) with attentive statistics pooling (ASP). To mitigate the scarcity of data, we have recorded a publicly available Chinese numerical corpus named SHALCAS22A (hereinafter called SHAL), which can be accessed on Open-SLR. Moreover, we employ data augmentation techniques using Tacotron2 and HiFi-GAN. Our method achieves equal error rate (EER) performance improvements of 49.2% on Hi-Mia and 75.0% on SHAL, respectively.
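A sketch of attentive statistics pooling (ASP) in PyTorch; the SWASP variant would apply the same pooling over sliding windows before combining scales:

```python
import torch
import torch.nn as nn

class AttentiveStatsPooling(nn.Module):
    """Attentive statistics pooling: attention-weighted mean and std over frames."""
    def __init__(self, feat_dim, hidden=128):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, 1), nn.Tanh(), nn.Conv1d(hidden, feat_dim, 1))

    def forward(self, x):                    # x: (batch, feat_dim, frames)
        w = torch.softmax(self.attn(x), dim=-1)
        mean = torch.sum(w * x, dim=-1)
        var = torch.sum(w * x * x, dim=-1) - mean ** 2
        std = torch.sqrt(var.clamp(min=1e-8))
        return torch.cat([mean, std], dim=-1)   # (batch, 2 * feat_dim) utterance embedding
```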
https://arxiv.org/abs/2405.07029
Extracting direct-path spatial features is crucial for sound source localization in adverse acoustic environments. This paper proposes IPDnet, a neural network that estimates the direct-path inter-channel phase difference (DP-IPD) of sound sources from microphone array signals. The estimated DP-IPD can be easily translated to source locations based on the known microphone array geometry. First, a full-band and narrow-band fusion network is proposed for DP-IPD estimation, in which alternating narrow-band and full-band layers are responsible for estimating the rough DP-IPD information within one frequency band and for capturing the frequency correlations of DP-IPD, respectively. Second, a new multi-track DP-IPD learning target is proposed for the localization of a flexible number of sound sources. Third, IPDnet is extended to handle variable microphone arrays: once trained, it is able to process arbitrary microphone arrays with different numbers of channels and array topologies. Experiments on multiple-moving-speaker localization are conducted on both simulated and real-world data, showing that the proposed full-band and narrow-band fusion network and the multi-track DP-IPD learning target together achieve excellent sound source localization performance. Moreover, the proposed variable-array model generalizes well to unseen microphone arrays.
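A sketch of how a DP-IPD learning target can be derived from geometry for one microphone pair, assuming free-field propagation; the frequency grid and positions are illustrative:

```python
import numpy as np

def direct_path_ipd(src_pos, mic1_pos, mic2_pos, freqs, c=343.0):
    """Direct-path inter-channel phase difference (DP-IPD) for one mic pair:
    the phase of exp(-j 2*pi*f*tau), where tau is the direct-path TDOA."""
    tau = (np.linalg.norm(src_pos - mic1_pos) - np.linalg.norm(src_pos - mic2_pos)) / c
    return np.angle(np.exp(-2j * np.pi * np.asarray(freqs) * tau))

# illustrative usage: a source 2 m in front of a 10 cm two-microphone array
freqs = np.linspace(0, 8000, 257)
ipd = direct_path_ipd(np.array([2.0, 0.5, 0.0]),
                      np.array([0.0, -0.05, 0.0]),
                      np.array([0.0, 0.05, 0.0]), freqs)
```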
https://arxiv.org/abs/2405.07021
In binaural audio synthesis, aligning head-related impulse responses (HRIRs) in time has been an important pre-processing step, enabling accurate spatial interpolation and efficient data compression. The maximum-correlation time delay between spatially nearby HRIRs has previously been used to obtain accurate and smooth alignment by solving a matrix equation in which the solution has the minimum Euclidean distance to the measured time delays. However, the Euclidean criterion can lead to an over-smoothed solution in practice. In this paper, we resolve the over-smoothing issue by formulating the task as an integer linear programming problem equivalent to minimising an $L^1$-norm. Moreover, we incorporate 1) the cross-correlation of inter-aural HRIRs, and 2) HRIRs with their minimum-phase responses, to obtain more reference measurements for the optimisation. We show that the proposed method yields more accurate alignments than the Euclidean-based method by comparing the spectral reconstruction loss of time-aligned HRIRs under a spherical harmonics representation on seven HRIR sets measured on human and dummy heads. The extra correlation features and the $L^1$-norm are also beneficial in extremely noisy conditions. In addition, this method can be applied to phase unwrapping of head-related transfer functions, where the unwrapped phase could serve as a compact feature for downstream tasks.
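The $L^1$ objective admits a standard linear-programming reformulation (minimise the sum of slack variables $t$ subject to $-t \le Ax - b \le t$); below is a generic sketch with scipy, noting that the paper solves an integer variant with its own constraint matrix:

```python
import numpy as np
from scipy.optimize import linprog

def l1_fit(A, b):
    """Solve min_x ||A x - b||_1 via the standard LP reformulation:
    minimise sum(t) subject to A x - b <= t and -(A x - b) <= t.
    (The paper solves an integer variant; this is the continuous relaxation.)"""
    m, n = A.shape
    c = np.concatenate([np.zeros(n), np.ones(m)])           # objective: sum of slacks t
    A_ub = np.block([[A, -np.eye(m)], [-A, -np.eye(m)]])     # the two one-sided constraints
    b_ub = np.concatenate([b, -b])
    bounds = [(None, None)] * n + [(0, None)] * m            # x free, t non-negative
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:n]
```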
https://arxiv.org/abs/2405.06804
This study explores the application of recurrent neural networks to recognize emotions conveyed in music, aiming to enhance music recommendation systems and support therapeutic interventions by tailoring music to fit listeners' emotional states. We utilize Russell's Emotion Quadrant to categorize music into four distinct emotional regions and develop models capable of accurately predicting these categories. Our approach involves extracting a comprehensive set of audio features using Librosa and applying various recurrent neural network architectures, including standard RNNs, Bidirectional RNNs, and Long Short-Term Memory (LSTM) networks. Initial experiments are conducted using a dataset of 900 audio clips, labeled according to the emotional quadrants. We compare the performance of our neural network models against a set of baseline classifiers and analyze their effectiveness in capturing the temporal dynamics inherent in musical expression. The results indicate that simpler RNN architectures may perform comparably to, or even better than, more complex models, particularly on smaller datasets. We also apply these experiments to larger datasets: one augmented from our original dataset and one drawn from other sources. This research not only enhances our understanding of the emotional impact of music but also demonstrates the potential of neural networks in creating more personalized and emotionally resonant music recommendation and therapy systems.
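A sketch of the feature extraction with Librosa and a small LSTM classifier over the four Russell quadrants; the feature choices and hyperparameters are illustrative:

```python
import librosa
import numpy as np
import torch
import torch.nn as nn

def extract_features(path, sr=22050, n_mfcc=20):
    """Frame-level MFCC + chroma features, returned as a (frames, dims) sequence."""
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)
    return np.concatenate([mfcc, chroma], axis=0).T      # (frames, n_mfcc + 12)

class QuadrantLSTM(nn.Module):
    """LSTM over the feature sequence, classifying into the four Russell quadrants."""
    def __init__(self, in_dim=32, hidden=64, n_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):                 # x: (batch, frames, in_dim)
        _, (h, _) = self.lstm(x)
        return self.head(h[-1])           # logits for the 4 emotion quadrants
```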
https://arxiv.org/abs/2405.06747