This paper investigates the development and evaluation of machine translation models from Cantonese to English, proposing a novel approach to low-resource language translation. The main objectives of the study are to develop a model that can effectively translate Cantonese to English and to evaluate it against state-of-the-art commercial models. To achieve this, a new parallel corpus was created by combining different corpora available online, with preprocessing and cleaning. In addition, a monolingual Cantonese dataset was created through web scraping to support synthetic parallel corpus generation. Following data collection, several approaches were used, including model fine-tuning, back-translation, and model switch. Translation quality was evaluated with multiple metrics, including lexicon-based metrics (SacreBLEU and hLEPOR) and embedding-space metrics (COMET and BERTscore). Based on the automatic metrics, the best model was selected and compared against the two best commercial translators using the human evaluation framework HOPES. The best model proposed in this investigation (NLLB-mBART) with the model switch mechanism achieves automatic evaluation scores comparable to, and in some cases better than, state-of-the-art commercial systems (Bing and Baidu Translators), with a SacreBLEU score of 16.8 on our test set. Furthermore, an open-source web application has been developed that lets users translate between Cantonese and English and compare the different trained models from this investigation. CANTONMT is available at this https URL
https://arxiv.org/abs/2405.08172
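As a rough illustration of the lexicon-based metrics mentioned above, a simplified corpus-level BLEU (the family of measures SacreBLEU standardizes) can be sketched in a few lines. This toy version assumes whitespace-tokenized input and omits SacreBLEU's tokenization, smoothing, and signature machinery:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # Multiset of all n-grams in a token sequence.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale: geometric mean of modified
    n-gram precisions (n = 1..max_n) times a brevity penalty."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            # Clipped matches: intersection of hypothesis/reference n-gram counts.
            matches[n - 1] += sum((ngrams(h, n) & ngrams(r, n)).values())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if min(matches) == 0:
        return 0.0  # any zero precision makes the geometric mean zero
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_prec)
```

A perfect hypothesis scores 100; any missing 4-gram overlap drives the score toward zero, which is why short test sentences often need smoothing in practice.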
Massively multilingual neural machine translation (MMNMT) has been proven to enhance the translation quality of low-resource languages. In this paper, we empirically investigate the robustness of Indonesian-Chinese translation in the face of various kinds of naturally occurring noise. To assess this, we create a robustness evaluation benchmark dataset for Indonesian-Chinese translation. This dataset is automatically translated into Chinese using four NLLB-200 models of different sizes. We conduct both automatic and human evaluations. Our in-depth analysis reveals the correlations between translation error types and the types of noise present, how these correlations change across different model sizes, and the relationships between automatic and human evaluation indicators. The dataset is publicly available at this https URL.
https://arxiv.org/abs/2405.07673
Non-autoregressive (NAR) language models are known for their low latency in neural machine translation (NMT). However, a performance gap exists between NAR and autoregressive models due to the large decoding space and difficulty in capturing dependency between target words accurately. Compounding this, preparing appropriate training data for NAR models is a non-trivial task, often exacerbating exposure bias. To address these challenges, we apply reinforcement learning (RL) to Levenshtein Transformer, a representative edit-based NAR model, demonstrating that RL with self-generated data can enhance the performance of edit-based NAR models. We explore two RL approaches: stepwise reward maximization and episodic reward maximization. We discuss the respective pros and cons of these two approaches and empirically verify them. Moreover, we experimentally investigate the impact of temperature setting on performance, confirming the importance of proper temperature setting for NAR models' training.
https://arxiv.org/abs/2405.01280
Neural Machine Translation (NMT) is the task of translating a text from one language to another with the use of a trained neural network. Several existing works aim at incorporating external information into NMT models to improve or control predicted translations (e.g. sentiment, politeness, gender). In this work, we propose to improve translation quality by adding another external source of information: the automatically recognized emotion in the voice. This work is motivated by the assumption that each emotion is associated with a specific lexicon that can overlap between emotions. Our proposed method follows a two-stage procedure. At first, we select a state-of-the-art Speech Emotion Recognition (SER) model to predict dimensional emotion values from all input audio in the dataset. Then, we use these predicted emotions as source tokens added at the beginning of input texts to train our NMT model. We show that integrating emotion information, especially arousal, into NMT systems leads to better translations.
https://arxiv.org/abs/2404.17968
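The tagging scheme described above, with predicted emotions prepended as source tokens, can be sketched as follows. The token names and bucket thresholds are illustrative (the paper's exact token vocabulary is not specified here), and dimensional emotion values are assumed normalized to [0, 1]:

```python
def add_emotion_tokens(text, arousal, valence):
    """Prepend discrete emotion tokens, derived from dimensional SER
    predictions, to the source text before NMT training/inference.
    Bucket names and thresholds are hypothetical placeholders."""
    def bucket(value, name):
        level = "low" if value < 0.33 else "mid" if value < 0.66 else "high"
        return f"<{name}_{level}>"
    return f"{bucket(arousal, 'arousal')} {bucket(valence, 'valence')} {text}"
```

The NMT model then learns to condition its output on these tokens exactly as it conditions on ordinary source words.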
We show that Claude 3 Opus, a large language model (LLM) released by Anthropic in March 2024, exhibits stronger machine translation competence than other LLMs. Though we find evidence of data contamination with Claude on FLORES-200, we curate new benchmarks that corroborate the effectiveness of Claude for low-resource machine translation into English. We find that Claude has remarkable \textit{resource efficiency} -- the degree to which the quality of the translation model depends on a language pair's resource level. Finally, we show that advancements in LLM translation can be compressed into traditional neural machine translation (NMT) models. Using Claude to generate synthetic data, we demonstrate that knowledge distillation advances the state-of-the-art in Yoruba-English translation, meeting or surpassing strong baselines like NLLB-54B and Google Translate.
https://arxiv.org/abs/2404.13813
In the evolving landscape of Neural Machine Translation (NMT), the pretrain-then-finetune paradigm has yielded impressive results. However, the persistent challenge of Catastrophic Forgetting (CF) remains a hurdle. While previous work has introduced Continual Learning (CL) methods to address CF, these approaches grapple with the delicate balance between avoiding forgetting and maintaining system extensibility. To address this, we propose a CL method, named $\textbf{F-MALLOC}$ ($\textbf{F}$eed-forward $\textbf{M}$emory $\textbf{ALLOC}$ation). F-MALLOC is inspired by recent insights highlighting that feed-forward layers emulate neural memories and encapsulate crucial translation knowledge. It decomposes feed-forward layers into discrete memory cells and allocates these memories to different tasks. By learning to allocate and safeguard these memories, our method effectively alleviates CF while ensuring robust extendability. Besides, we propose a comprehensive assessment protocol for multi-stage CL of NMT systems. Experiments conducted following this new protocol showcase the superior performance of F-MALLOC, evidenced by higher BLEU scores and almost zero forgetting.
https://arxiv.org/abs/2404.04846
Parameter-efficient fine-tuning (PEFT) methods are increasingly vital in adapting large-scale pre-trained language models for diverse tasks, offering a balance between adaptability and computational efficiency. They are important in Low-Resource Language (LRL) Neural Machine Translation (NMT) for enhancing translation accuracy with minimal resources. However, their practical effectiveness varies significantly across different languages. We conducted comprehensive empirical experiments with varying LRL domains and sizes to evaluate the performance of 8 PEFT methods, with 15 architectures in total, using the SacreBLEU score. We showed that 6 PEFT architectures outperform the baseline on both in-domain and out-of-domain tests and that the Houlsby+Inversion adapter has the best performance overall, proving the effectiveness of PEFT methods.
https://arxiv.org/abs/2404.04212
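A minimal sketch of the bottleneck-adapter idea behind Houlsby-style PEFT (down-project, nonlinearity, up-project, residual), in plain NumPy rather than any particular PEFT library. Zero-initializing the up-projection so the adapter starts as an identity function is common practice, assumed here rather than taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, bottleneck = 16, 4  # the bottleneck is much smaller than d_model

def adapter(hidden, w_down, w_up):
    # Houlsby-style bottleneck: down-project, ReLU, up-project, plus residual.
    return hidden + np.maximum(hidden @ w_down, 0.0) @ w_up

h = rng.standard_normal(d_model)
w_down = rng.standard_normal((d_model, bottleneck)) * 0.01
w_up = np.zeros((bottleneck, d_model))  # zero init: adapter starts as the identity
out = adapter(h, w_down, w_up)
```

Only the small `w_down`/`w_up` matrices are trained; the frozen base model supplies `hidden`, which is why PEFT needs a fraction of the memory of full fine-tuning.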
While multilingual machine translation (MNMT) systems hold substantial promise, they also have security vulnerabilities. Our research highlights that MNMT systems can be susceptible to a particularly devious style of backdoor attack, whereby an attacker injects poisoned data into a low-resource language pair to cause malicious translations in other languages, including high-resource languages. Our experimental results reveal that injecting less than 0.01% poisoned data into a low-resource language pair can achieve an average 20% attack success rate in attacking high-resource language pairs. This type of attack is of particular concern, given the larger attack surface of languages inherent to low-resource settings. Our aim is to bring attention to these vulnerabilities within MNMT systems with the hope of encouraging the community to address security concerns in machine translation, especially in the context of low-resource languages.
https://arxiv.org/abs/2404.02393
Morphological modeling in neural machine translation (NMT) is a promising approach to achieving open-vocabulary machine translation for morphologically-rich languages. However, existing methods such as sub-word tokenization and character-based models are limited to the surface forms of the words. In this work, we propose a framework-solution for modeling complex morphology in low-resource settings. A two-tier transformer architecture is chosen to encode morphological information at the inputs. At the target-side output, a multi-task multi-label training scheme coupled with a beam search-based decoder are found to improve machine translation performance. An attention augmentation scheme to the transformer model is proposed in a generic form to allow integration of pre-trained language models and also facilitate modeling of word order relationships between the source and target languages. Several data augmentation techniques are evaluated and shown to increase translation performance in low-resource settings. We evaluate our proposed solution on Kinyarwanda - English translation using public-domain parallel text. Our final models achieve competitive performance in relation to large multi-lingual models. We hope that our results will motivate more use of explicit morphological information and the proposed model and data augmentations in low-resource NMT.
https://arxiv.org/abs/2404.02392
This paper addresses the ethical challenges of Artificial Intelligence in Neural Machine Translation (NMT) systems, emphasizing the imperative for developers to ensure fairness and cultural sensitivity. We investigate the ethical competence of AI models in NMT, examining the ethical considerations at each stage of NMT development, including data handling, privacy, data ownership, and consent. We identify and address ethical issues through empirical studies: employing Transformer models for Luganda-English translation, enhancing efficiency with sentence mini-batching, and complementary studies that refine data labeling techniques and fine-tune BERT and Longformer models for analyzing Luganda and English social media content. Our second approach is a literature review drawing on databases such as Google Scholar and platforms like GitHub. Additionally, the paper probes the distribution of responsibility between AI systems and humans, underscoring the essential role of human oversight in upholding NMT ethical standards. Incorporating a biblical perspective, we discuss the societal impact of NMT and the broader ethical responsibilities of developers, positing them as stewards accountable for the societal repercussions of their creations.
https://arxiv.org/abs/2404.01070
This study explores the distinctions between neural machine translation (NMT) and human translation (HT) through the lens of translation relations. It benchmarks HT to assess the translation techniques produced by an NMT system and aims to address three key research questions: the differences in overall translation relations between NMT and HT, how each utilizes non-literal translation techniques, and the variations in factors influencing their use of specific non-literal techniques. The research employs two parallel corpora, each spanning nine genres with the same source texts with one translated by NMT and the other by humans. Translation relations in these corpora are manually annotated on aligned pairs, enabling a comparative analysis that draws on linguistic insights, including semantic and syntactic nuances such as hypernyms and alterations in part-of-speech tagging. The results indicate that NMT relies on literal translation significantly more than HT across genres. While NMT performs comparably to HT in employing syntactic non-literal translation techniques, it falls behind in semantic-level performance.
https://arxiv.org/abs/2404.08661
Text remains a relevant form of information representation. Text documents are created either on digital-native platforms or through the conversion of other media such as images and speech. While digital-native text is invariably obtained through physical or virtual keyboards, technologies such as OCR and speech recognition are used to transform images and speech signals into text content. All these mechanisms of text generation also introduce errors into the captured text. This project aims at analyzing the different kinds of errors that occur in text documents. The work employs two advanced deep neural network-based language models, namely BART and MarianMT, to rectify the anomalies present in the text. Transfer learning of these models on an available dataset is performed to fine-tune their capacity for error correction. A comparative study is conducted to investigate the effectiveness of these models in handling each of the defined error categories. It is observed that while both models can reduce the number of erroneous sentences by more than 20%, BART handles spelling errors (24.6%) far better than grammatical errors (8.8%).
https://arxiv.org/abs/2403.16655
Recently, there has been an increased interest in the practical problem of learning multiple dense scene understanding tasks from partially annotated data, where each training sample is only labeled for a subset of the tasks. Missing task labels during training lead to low-quality, noisy predictions, as can be observed in state-of-the-art methods. To tackle this issue, we reformulate the partially-labeled multi-task dense prediction as a pixel-level denoising problem, and propose a novel multi-task denoising diffusion framework coined as DiffusionMTL. It designs a joint diffusion and denoising paradigm to model a potential noisy distribution in the task prediction or feature maps and generate rectified outputs for different tasks. To exploit multi-task consistency in denoising, we further introduce a Multi-Task Conditioning strategy, which can implicitly utilize the complementary nature of the tasks to help learn the unlabeled tasks, leading to an improvement in the denoising performance of the different tasks. Extensive quantitative and qualitative experiments demonstrate that the proposed multi-task denoising diffusion model can significantly improve multi-task prediction maps, and outperform the state-of-the-art methods on three challenging multi-task benchmarks, under two different partial-labeling evaluation settings. The code is available at this https URL.
https://arxiv.org/abs/2403.15389
The traditional Automatic Video Dubbing (AVD) pipeline consists of three key modules, namely Automatic Speech Recognition (ASR), Neural Machine Translation (NMT), and Text-to-Speech (TTS). Within AVD pipelines, isometric-NMT algorithms are employed to regulate the length of the synthesized output text, guaranteeing that video and audio remain synchronized after the dubbing process. Previous approaches have focused on aligning the number of characters and words in the source and target texts of machine translation models. Our approach instead aims to align the number of phonemes, as they are closely associated with speech duration. In this paper, we present the development of an isometric NMT system using Reinforcement Learning (RL), with a focus on optimizing the alignment of phoneme counts in source and target sentence pairs. To evaluate our models, we propose the Phoneme Count Compliance (PCC) score, a measure of length compliance. Our approach demonstrates a substantial improvement of approximately 36% in the PCC score compared to state-of-the-art models when applied to English-Hindi language pairs. Moreover, we propose a student-teacher architecture within our RL framework to maintain a trade-off between phoneme count and translation quality.
https://arxiv.org/abs/2403.15469
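A length-compliance measure of the kind described above can be sketched as follows. Note this is an illustrative stand-in, not the paper's PCC formula (which is not reproduced here): it scores the fraction of sentence pairs whose phoneme counts agree within a relative tolerance:

```python
def phoneme_count_compliance(src_counts, tgt_counts, tol=0.2):
    """Illustrative length-compliance measure: the fraction of sentence
    pairs whose source/target phoneme counts agree within a relative
    tolerance. The paper's exact PCC definition may differ."""
    ok = sum(abs(s - t) <= tol * s for s, t in zip(src_counts, tgt_counts))
    return ok / len(src_counts)
```

Phoneme counts correlate with speech duration more tightly than character or word counts, which is the motivation for optimizing them in dubbing pipelines.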
Neural Machine Translation (NMT) for low-resource languages is still a challenging task for NLP researchers. In this work, we apply a standard data augmentation methodology, back-translation, to a new translation direction: Cantonese-to-English. We present the models we fine-tuned, including OpusMT, NLLB, and mBART, using the limited amount of real data and the synthetic data we generated via back-translation. We carried out automatic evaluation using a range of different metrics, both lexical-based and embedding-based. Furthermore, we create a user-friendly interface for the models included in this \textsc{CantonMT} research project and make it available to facilitate Cantonese-to-English MT research. Researchers can add more models to this platform via our open-source \textsc{CantonMT} toolkit \url{this https URL}.
https://arxiv.org/abs/2403.11346
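The back-translation augmentation described above can be sketched in a few lines. `reverse_translate` is a hypothetical callable standing in for a trained English-to-Cantonese model; the paper's actual models and data are not reproduced here:

```python
def back_translate(monolingual_targets, reverse_translate):
    """Build synthetic parallel data for source->target training:
    run a reverse (target->source) model over monolingual target-side
    text and pair each synthetic source with its real, clean target.
    `reverse_translate` stands in for e.g. a trained EN->YUE model."""
    return [(reverse_translate(t), t) for t in monolingual_targets]
```

Because the target side of each synthetic pair is real human text, the forward model trained on this data still learns to produce fluent output even when the synthetic sources are noisy.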
While Transformer-based neural machine translation (NMT) is very effective in high-resource settings, many languages lack the necessary large parallel corpora to benefit from it. In the context of low-resource (LR) MT between two closely-related languages, a natural intuition is to seek benefits from structural "shortcuts", such as copying subwords from the source to the target, given that such language pairs often share a considerable number of identical words, cognates, and borrowings. We test Pointer-Generator Networks for this purpose for six language pairs over a variety of resource ranges, and find weak improvements for most settings. However, analysis shows that the model does not show greater improvements for closely-related vs. more distant language pairs, or for lower resource ranges, and that the models do not exhibit the expected usage of the mechanism for shared subwords. Our discussion of the reasons for this behaviour highlights several general challenges for LR NMT, such as modern tokenization strategies, noisy real-world conditions, and linguistic complexities. We call for better scrutiny of linguistically motivated improvements to NMT given the blackbox nature of Transformer models, as well as for a focus on the above problems in the field.
https://arxiv.org/abs/2403.10963
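The copy mechanism at the heart of Pointer-Generator Networks can be sketched as the standard mixture of a generation distribution and a copy distribution; this is the general pointer-generator form, not the paper's specific implementation:

```python
import numpy as np

def pointer_generator(p_gen, vocab_dist, attention, src_ids, vocab_size):
    """Final output distribution of a pointer-generator step:
    P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum of attention over
    source positions i where the source token x_i equals w."""
    copy_dist = np.zeros(vocab_size)
    np.add.at(copy_dist, src_ids, attention)  # scatter-add; handles repeated ids
    return p_gen * vocab_dist + (1.0 - p_gen) * copy_dist
```

For closely related language pairs, mass placed on the copy term lets the model emit shared subwords directly from the source, which is the "shortcut" the abstract tests.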
This research paper presents an innovative multi-task learning framework that allows concurrent depth estimation and semantic segmentation using a single camera. The proposed approach is based on a shared encoder-decoder architecture, which integrates various techniques to improve the accuracy of the depth estimation and semantic segmentation tasks without compromising computational efficiency. Additionally, the paper incorporates an adversarial training component, employing a Wasserstein GAN framework with a critic network, to refine the model's predictions. The framework is thoroughly evaluated on two datasets - the outdoor Cityscapes dataset and the indoor NYU Depth V2 dataset - and it outperforms existing state-of-the-art methods in both segmentation and depth estimation tasks. We also conducted ablation studies to analyze the contributions of different components, including pre-training strategies, the inclusion of critics, the use of logarithmic depth scaling, and advanced image augmentations, to provide a better understanding of the proposed framework. The accompanying source code is accessible at \url{this https URL}.
https://arxiv.org/abs/2403.10662
Active learning (AL) techniques reduce labeling costs for training neural machine translation (NMT) models by selecting smaller representative subsets from unlabeled data for annotation. Diversity sampling techniques select heterogeneous instances, while uncertainty sampling methods select instances with the highest model uncertainty. Both approaches have limitations - diversity methods may extract varied but trivial examples, while uncertainty sampling can yield repetitive, uninformative instances. To bridge this gap, we propose HUDS, a hybrid AL strategy for domain adaptation in NMT that combines uncertainty and diversity for sentence selection. HUDS computes uncertainty scores for unlabeled sentences and subsequently stratifies them. It then clusters sentence embeddings within each stratum using k-MEANS and computes diversity scores by distance to the centroid. A weighted hybrid score that combines uncertainty and diversity is then used to select the top instances for annotation in each AL iteration. Experiments on multi-domain German-English datasets demonstrate that HUDS outperforms other strong AL baselines. We analyze the sentence selection behavior of HUDS and show that it prioritizes diverse instances with high model uncertainty for annotation in early AL iterations.
https://arxiv.org/abs/2403.09259
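The hybrid scoring step can be sketched roughly as follows. This is a simplified sketch of the HUDS idea, omitting the uncertainty-stratification step for brevity and using a tiny hand-rolled k-means; the paper's actual scoring details may differ:

```python
import numpy as np

def hybrid_scores(uncertainty, embeddings, k=2, weight=0.5, iters=10, seed=0):
    """Weighted hybrid of uncertainty and diversity: cluster sentence
    embeddings with a small k-means, use each point's distance to its
    centroid as the diversity score, min-max normalize both signals,
    and combine them with a weighted sum (stratification omitted)."""
    emb = np.asarray(embeddings, dtype=float)
    unc = np.asarray(uncertainty, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = emb[rng.choice(len(emb), k, replace=False)].copy()
    for _ in range(iters):
        dists = np.linalg.norm(emb[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):  # guard against empty clusters
                centroids[c] = emb[labels == c].mean(axis=0)
    diversity = np.linalg.norm(emb - centroids[labels], axis=1)

    def minmax(x):
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)

    return weight * minmax(unc) + (1 - weight) * minmax(diversity)
```

Sentences scoring highest under the weighted sum, both uncertain to the model and far from their cluster centroid, are the ones sent for annotation in each AL iteration.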
NLP in the age of monolithic large language models is approaching its limits in terms of the size and amount of information that can be handled. The trend is toward modularization, a necessary step toward designing smaller sub-networks and components with specialized functionality. In this paper, we present the MAMMOTH toolkit: a framework designed for training massively multilingual modular machine translation systems at scale, initially derived from OpenNMT-py and then adapted to ensure efficient training across computation clusters. We showcase its efficiency across clusters of A100 and V100 NVIDIA GPUs, and discuss our design philosophy and plans for future work. The toolkit is publicly available online.
https://arxiv.org/abs/2403.07544
Foundation models are usually pre-trained on large-scale datasets and then adapted to downstream tasks through tuning. However, the large-scale pre-training datasets, often inaccessible or too expensive to handle, can contain label noise that may adversely affect the generalization of the model and pose unexpected risks. This paper stands out as the first work to comprehensively understand and analyze the nature of noise in pre-training datasets and then effectively mitigate its impacts on downstream tasks. Specifically, through extensive experiments of fully-supervised and image-text contrastive pre-training on synthetic noisy ImageNet-1K, YFCC15M, and CC12M datasets, we demonstrate that, while slight noise in pre-training can benefit in-domain (ID) performance, where the training and testing data share a similar distribution, it always deteriorates out-of-domain (OOD) performance, where training and testing distributions are significantly different. These observations are agnostic to scales of pre-training datasets, pre-training noise types, model architectures, pre-training objectives, downstream tuning methods, and downstream applications. We empirically ascertain that the reason behind this is that the pre-training noise shapes the feature space differently. We then propose a tuning method (NMTune) to affine the feature space to mitigate the malignant effect of noise and improve generalization, which is applicable in both parameter-efficient and black-box tuning manners. We additionally conduct extensive experiments on popular vision and language models, including APIs, which are supervised and self-supervised pre-trained on realistic noisy data for evaluation. Our analysis and results demonstrate the importance of this novel and fundamental research direction, which we term as Noisy Model Learning.
https://arxiv.org/abs/2403.06869