Encoder models trained for the embedding of sentences or short documents have proven useful for tasks such as semantic search and topic modeling. In this paper, we present a version of the SwissBERT encoder model that we specifically fine-tuned for this purpose. SwissBERT contains language adapters for the four national languages of Switzerland -- German, French, Italian, and Romansh -- and has been pre-trained on a large number of news articles in those languages. Using contrastive learning based on a subset of these articles, we trained a fine-tuned version, which we call SentenceSwissBERT. Multilingual experiments on document retrieval and text classification in a Switzerland-specific setting show that SentenceSwissBERT surpasses the accuracy of the original SwissBERT model and of a comparable baseline. The model is openly available for research use.
https://arxiv.org/abs/2405.07513
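The contrastive fine-tuning step can be illustrated with a minimal NumPy sketch of an InfoNCE-style objective over paired sentence embeddings. The abstract does not specify the exact loss, so the temperature value and the in-batch-negatives setup below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def info_nce_loss(queries, positives, temperature=0.05):
    """InfoNCE over a batch of (query, positive) embedding pairs:
    each query should score its own positive higher than all others."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = q @ p.T / temperature          # (B, B) scaled cosine similarities
    labels = np.arange(len(q))              # matching pair sits on the diagonal
    # cross-entropy with the diagonal entry as the target class
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[labels, labels].mean()
```

In training, gradients of this loss would pull paired article embeddings together and push unpaired ones apart; here the batch itself supplies the negatives.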
The Chinese numerical string corpus serves as a valuable resource for speaker verification, particularly in financial transactions. Research indicates that in short-speech scenarios, text-dependent speaker verification (TD-SV) consistently outperforms text-independent speaker verification (TI-SV). However, TD-SV potentially includes the validation of text information, which can be negatively impacted by reading rhythms and pauses. To address this problem, we propose an end-to-end speaker verification system that enhances TD-SV by decoupling speaker and text information. Our system consists of a text embedding extractor, a speaker embedding extractor, and a fusion module. In the text embedding extractor, we employ an enhanced Transformer and introduce a triple loss comprising a text classification loss, a connectionist temporal classification (CTC) loss, and a decoder loss; in the speaker embedding extractor, we create a multi-scale pooling method by combining sliding window attentive statistics pooling (SWASP) with attentive statistics pooling (ASP). To mitigate the scarcity of data, we have recorded a publicly available Chinese numerical corpus named SHALCAS22A (hereinafter SHAL), which can be accessed on Open-SLR. Moreover, we employ data augmentation techniques using Tacotron2 and HiFi-GAN. Our method achieves equal error rate (EER) performance improvements of 49.2% on Hi-Mia and 75.0% on SHAL, respectively.
https://arxiv.org/abs/2405.07029
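The pooling side of the speaker branch can be sketched as follows. `attentive_stats_pooling` follows the standard ASP formulation (attention-weighted mean and standard deviation, concatenated), while `sliding_window_asp` is only a guess at how SWASP applies ASP inside overlapping windows, since the abstract gives no architectural details; the window and hop sizes are arbitrary.

```python
import numpy as np

def attentive_stats_pooling(frames, scores):
    """Attentive statistics pooling: attention-weighted mean and std over
    time frames, concatenated into one utterance-level vector.
    frames: (T, D) frame-level features; scores: (T,) attention logits."""
    w = np.exp(scores - scores.max())
    w = (w / w.sum())[:, None]                    # (T, 1) attention weights
    mean = (w * frames).sum(axis=0)               # weighted mean, (D,)
    var = (w * (frames - mean) ** 2).sum(axis=0)  # weighted variance, (D,)
    return np.concatenate([mean, np.sqrt(var + 1e-9)])

def sliding_window_asp(frames, scores, win=50, hop=25):
    """Hypothetical SWASP-style multi-scale variant: apply ASP inside
    overlapping windows and average the per-window vectors."""
    outs = [attentive_stats_pooling(frames[s:s + win], scores[s:s + win])
            for s in range(0, max(1, len(frames) - win + 1), hop)]
    return np.mean(outs, axis=0)
```

Combining the global and windowed vectors would give the multi-scale speaker embedding the abstract alludes to.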
In this paper, we introduce SaudiBERT, a monodialect Arabic language model pretrained exclusively on Saudi dialectal text. To demonstrate the model's effectiveness, we compared SaudiBERT with six different multidialect Arabic language models across 11 evaluation datasets, which are divided into two groups: sentiment analysis and text classification. SaudiBERT achieved average F1-scores of 86.15\% and 87.86\% in these groups respectively, significantly outperforming all other comparative models. Additionally, we present two novel Saudi dialectal corpora: the Saudi Tweets Mega Corpus (STMC), which contains over 141 million tweets in Saudi dialect, and the Saudi Forums Corpus (SFC), which includes 15.2 GB of text collected from five Saudi online forums. Both corpora are used in pretraining the proposed model, and they are the largest Saudi dialectal corpora reported in the literature to date. The results confirm the effectiveness of SaudiBERT in understanding and analyzing Arabic text expressed in Saudi dialect, achieving state-of-the-art results in most tasks and surpassing the other language models included in the study. The SaudiBERT model is publicly available at \url{this https URL}.
https://arxiv.org/abs/2405.06239
An all-too-present bottleneck in text classification model development is the need to annotate training data, and this need is multiplied for multilingual classifiers. Fortunately, contemporary machine translation models are both easily accessible and offer dependable translation quality, making it possible to translate labeled training data from one language into another. Here, we explore the effects of using machine translation to fine-tune a multilingual model for a classification task across multiple languages. We also investigate the benefits of a novel technique, originally proposed in the field of image captioning, that accounts for potential negative effects of tuning models on translated data. We show that translated data are of sufficient quality to tune multilingual classifiers and that this novel loss technique offers some improvement over models tuned without it.
https://arxiv.org/abs/2405.05478
Recent studies have shown that leveraging off-the-shelf or fine-tuned retrievers capable of retrieving high-quality in-context examples significantly improves in-context learning in English. However, adapting these methods to other languages, especially low-resource ones, is challenging due to the scarcity of available cross-lingual retrievers and annotated data. In this paper, we introduce XAMPLER (Cross-Lingual Example Retrieval), a method tailored to tackle the challenge of cross-lingual in-context learning using only annotated English data. XAMPLER first trains a retriever with positive/negative English samples, which are constructed based on the predictions of a multilingual large language model for in-context learning. The trained retriever is then employed directly to retrieve English examples as few-shot examples for in-context learning in target languages. Experiments on SIB200, a massively multilingual text classification benchmark covering 176 languages, demonstrate that XAMPLER substantially improves in-context learning performance across languages. Our code is available at this https URL.
https://arxiv.org/abs/2405.05116
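Once the retriever is trained, the retrieval step itself reduces to a nearest-neighbour search over the pool of English candidates. A minimal sketch, with plain cosine similarity standing in for the paper's trained retriever:

```python
import numpy as np

def retrieve_few_shot(query_emb, example_embs, examples, k=3):
    """Return the k English examples whose embeddings are most similar to
    the target-language query embedding (cosine similarity)."""
    q = query_emb / np.linalg.norm(query_emb)
    e = example_embs / np.linalg.norm(example_embs, axis=1, keepdims=True)
    top = np.argsort(e @ q)[::-1][:k]  # indices sorted by descending similarity
    return [examples[i] for i in top]
```

The returned examples would then be concatenated into the few-shot prompt for the multilingual LLM.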
In recent years, significant progress has been made in the field of learning from positive and unlabeled examples (PU learning), particularly in the context of advancing image and text classification tasks. However, applying PU learning to semi-supervised disease classification remains a formidable challenge, primarily due to the limited availability of labeled medical images. In the realm of medical image-aided diagnosis algorithms, numerous theoretical and practical obstacles persist. The research on PU learning for medical image-assisted diagnosis holds substantial importance, as it aims to reduce the time spent by professional experts in classifying images. Unlike natural images, medical images are typically accompanied by a scarcity of annotated data, while an abundance of unlabeled cases exists. Addressing these challenges, this paper introduces a novel generative model inspired by Hölder divergence, specifically designed for semi-supervised disease classification using positive and unlabeled medical image data. In this paper, we present a comprehensive formulation of the problem and establish its theoretical feasibility through rigorous mathematical analysis. To evaluate the effectiveness of our proposed approach, we conduct extensive experiments on five benchmark datasets commonly used in PU medical learning: BreastMNIST, PneumoniaMNIST, BloodMNIST, OCTMNIST, and AMD. The experimental results clearly demonstrate the superiority of our method over existing approaches based on KL divergence. Notably, our approach achieves state-of-the-art performance on all five disease classification benchmarks. By addressing the limitations imposed by limited labeled data and harnessing the untapped potential of unlabeled medical images, our novel generative model presents a promising direction for enhancing semi-supervised disease classification in the field of medical image analysis.
https://arxiv.org/abs/2405.04295
Recently, with the advancement of deep learning, several applications of text classification have advanced significantly. However, this improvement comes at a cost, because deep learning is vulnerable to adversarial examples; this weakness indicates that deep learning is not very robust. Fortunately, the input of a text classifier is discrete, which can protect the classifier from state-of-the-art attacks. Nonetheless, previous works have devised black-box attacks that successfully manipulate the discrete values of the input to find adversarial examples. Therefore, instead of changing the discrete values, we transform the input into its embedding vector, which contains real values, in order to perform state-of-the-art white-box attacks. We then convert the perturbed embedding vector back into text and call the result an adversarial example. In summary, we create a framework that measures the robustness of a text classifier by using the gradients of the classifier.
https://arxiv.org/abs/2405.03789
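The core idea — perturb in continuous embedding space, then project back to discrete tokens — can be sketched with a toy bag-of-embeddings logistic classifier whose gradient is computed analytically. The function name, the classifier, and the nearest-neighbour projection are hypothetical stand-ins; the paper's framework would use a real text classifier's gradients in the same role.

```python
import numpy as np

def attack_tokens(token_ids, vocab_emb, w, eps=0.5):
    """FGSM-style attack in embedding space: nudge token embeddings along
    the sign of the loss gradient, then map each perturbed vector back to
    its nearest vocabulary token to recover discrete adversarial text."""
    x = vocab_emb[token_ids]                      # (T, D) input embeddings
    score = w @ x.mean(axis=0)                    # toy classifier logit
    p = 1.0 / (1.0 + np.exp(-score))              # P(class = 1)
    grad = (p - 1.0) * w / len(token_ids)         # dLoss/dx_t for label 1
    x_adv = x + eps * np.sign(grad)               # white-box perturbation
    # project back to discrete text: nearest neighbour in embedding space
    d = ((x_adv[:, None, :] - vocab_emb[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)                       # adversarial token ids
```

A small `eps` leaves the text unchanged (a robust point), while a larger `eps` flips tokens; sweeping `eps` gives a crude robustness measure in the spirit of the framework.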
Few-shot and zero-shot text classification aim to recognize samples from novel classes with limited labeled samples or no labeled samples at all. While prevailing methods have shown promising performance by transferring knowledge from seen classes to unseen classes, they are still limited by two issues: (1) inherent dissimilarities among classes make the transfer of features learned from seen classes to unseen classes both difficult and inefficient; and (2) rare labeled novel samples usually cannot provide enough supervision signals to enable the model to adjust from the source distribution to the target distribution, especially in complicated scenarios. To alleviate these issues, we propose a simple and effective strategy for few-shot and zero-shot text classification. We aim to liberate the model from the confines of seen classes, thereby enabling it to predict unseen categories without the necessity of training on seen classes. Specifically, to mine more related unseen-category knowledge, we utilize a large pre-trained language model to generate pseudo novel samples and select the most representative ones as category anchors. After that, we convert the multi-class classification task into a binary classification task and use the similarities of query-anchor pairs for prediction, fully leveraging the limited supervision signals. Extensive experiments on six widely used public datasets show that our proposed method significantly outperforms other strong baselines in few-shot and zero-shot tasks, even without using any seen-class samples.
https://arxiv.org/abs/2405.03565
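The query-anchor prediction step reduces to similarity scoring against per-class anchors. A minimal sketch under stated assumptions: anchor selection from LLM-generated pseudo samples is omitted, and cosine similarity is an assumed choice of pair score.

```python
import numpy as np

def predict_by_anchors(query_emb, anchor_embs, anchor_labels):
    """Binary reformulation of multi-class prediction: score each
    (query, anchor) pair by cosine similarity and return the label
    of the best-matching category anchor."""
    q = query_emb / np.linalg.norm(query_emb)
    a = anchor_embs / np.linalg.norm(anchor_embs, axis=1, keepdims=True)
    return anchor_labels[int(np.argmax(a @ q))]
```

Because prediction only consults anchors for the *unseen* categories, no training on seen classes is required, matching the abstract's zero-shot setting.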
Large Language Models (LLMs) have revolutionized natural language processing, but their robustness against adversarial attacks remains a critical concern. We present a novel white-box style attack approach that exposes vulnerabilities in leading open-source LLMs, including Llama, OPT, and T5. We assess the impact of model size, structure, and fine-tuning strategies on their resistance to adversarial perturbations. Our comprehensive evaluation across five diverse text classification tasks establishes a new benchmark for LLM robustness. The findings of this study have far-reaching implications for the reliable deployment of LLMs in real-world applications and contribute to the advancement of trustworthy AI systems.
https://arxiv.org/abs/2405.02764
Current natural language processing (NLP) research tends to focus on only one or, less frequently, two dimensions at a time - e.g., performance, privacy, fairness, or efficiency - which may lead to suboptimal conclusions and often overlooks the broader goal of achieving trustworthy NLP. Work on adapter modules (Houlsby et al., 2019; Hu et al., 2021) focuses on improving performance and efficiency, with no investigation of unintended consequences on other aspects such as fairness. To address this gap, we conduct experiments on three text classification datasets by either (1) finetuning all parameters or (2) using adapter modules. Regarding performance and efficiency, we confirm prior findings that the accuracy of adapter-enhanced models is roughly on par with that of fully finetuned models, while training time is substantially reduced. Regarding fairness, we show that adapter modules result in mixed fairness across sensitive groups. Further investigation reveals that, when the standard finetuned model exhibits limited bias, adapter modules typically do not introduce extra bias. On the other hand, when the finetuned model exhibits increased bias, the impact of adapter modules on bias becomes more unpredictable, introducing the risk of significantly magnifying these biases for certain groups. Our findings highlight the need for case-by-case evaluation rather than a one-size-fits-all judgment.
https://arxiv.org/abs/2405.02010
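A Houlsby-style adapter is a small bottleneck inserted into an otherwise frozen model, which is why it trains so much faster than full fine-tuning. A minimal NumPy sketch (bias terms, layer norm, and the second adapter per Transformer block are omitted for brevity):

```python
import numpy as np

def adapter(h, W_down, W_up):
    """Bottleneck adapter: down-project to a small rank r, apply a ReLU,
    up-project back, and add a residual connection. Only W_down and W_up
    are trained; the frozen backbone's features h pass through unchanged."""
    z = np.maximum(0.0, h @ W_down)   # (B, r) bottleneck activation
    return h + z @ W_up               # residual keeps the frozen features
```

With hidden size d and bottleneck r, the adapter adds only about 2*d*r parameters per insertion point; initializing `W_up` near zero makes the module start as an identity map, which is the usual stabilizing trick.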
Reliability of AI systems is a fundamental concern for the successful deployment and widespread adoption of AI technologies. Unfortunately, the escalating complexity and heterogeneity of AI hardware systems make them inevitably and increasingly susceptible to hardware faults (e.g., bit flips) that can potentially corrupt model parameters. Given this challenge, this paper aims to answer a critical question: how likely is a parameter corruption to result in an incorrect model output? To systematically answer this question, we propose a novel quantitative metric, the Parameter Vulnerability Factor (PVF), inspired by the architectural vulnerability factor (AVF) from the computer architecture community, aiming to standardize the quantification of AI model resilience/vulnerability against parameter corruptions. We define a model parameter's PVF as the probability that a corruption in that particular parameter will result in an incorrect output. As with AVF, this statistical quantity can be derived from statistically extensive and meaningful fault injection (FI) experiments. In this paper, we present several use cases applying PVF to three types of tasks/models during inference: recommendation (DLRM), vision classification (CNN), and text classification (BERT). PVF can provide pivotal insights to AI hardware designers in balancing the tradeoff between fault protection and performance/efficiency, such as mapping vulnerable AI parameter components to well-protected hardware modules. The PVF metric is applicable to any AI model and has the potential to help unify and standardize AI vulnerability/resilience evaluation practice.
https://arxiv.org/abs/2405.01741
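The PVF definition lends itself to a small Monte-Carlo sketch: flip random bits of one parameter of a toy linear classifier and count how often the argmax output changes. The toy model, uniform bit choice, and trial count are illustrative assumptions; the paper's fault-injection campaigns target real models.

```python
import numpy as np

def flip_bit(value, bit):
    """Flip one bit of a float32 parameter (simulated hardware fault)."""
    as_int = np.array(value, dtype=np.float32).view(np.uint32)
    return (as_int ^ np.uint32(1 << bit)).view(np.float32)

def estimate_pvf(w, x, param_idx, trials=200, seed=0):
    """Monte-Carlo PVF estimate for one weight of a toy linear classifier:
    the fraction of random single-bit flips in that weight that change
    the model's argmax output on input x."""
    rng = np.random.default_rng(seed)
    baseline = np.argmax(w @ x)
    changed = 0
    for _ in range(trials):
        w_faulty = w.copy().astype(np.float32)
        flat = w_faulty.reshape(-1)             # view into the faulty copy
        flat[param_idx] = flip_bit(flat[param_idx], rng.integers(0, 32))
        changed += int(np.argmax(w_faulty @ x) != baseline)
    return changed / trials
```

A weight that the input never exercises has PVF 0; sign-bit and high-exponent flips dominate the nonzero cases, which is the kind of insight the metric is meant to surface for hardware protection decisions.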
Large Language Models (LLMs) have swiftly emerged as vital resources for different applications in the biomedical and healthcare domains; however, these models encounter issues such as generating inaccurate information or hallucinations. Retrieval-augmented generation provides a way for these models to update knowledge and enhance their performance. In contrast to previous retrieval-augmented LMs, which utilize specialized cross-attention mechanisms to help the LLM encode retrieved text, BiomedRAG adopts a simpler approach by directly inputting the retrieved chunk-based documents into the LLM. This straightforward design is easily applicable to existing retrieval and language models, effectively bypassing noise information in retrieved documents, particularly in noise-intensive tasks. Moreover, we demonstrate the potential for utilizing the LLM to supervise the retrieval model in the biomedical domain, enabling it to retrieve documents that assist the LM in improving its predictions. Our experiments reveal that, with the tuned scorer, \textsc{BiomedRAG} attains superior performance across 5 biomedical NLP tasks, encompassing information extraction (triple extraction, relation extraction), text classification, link prediction, and question answering, leveraging over 9 datasets. For instance, in the triple extraction task, \textsc{BiomedRAG} outperforms other triple extraction systems with micro-F1 scores of 81.42 and 88.83 on the GIT and ChemProt corpora, respectively.
https://arxiv.org/abs/2405.00465
For green AI, it is crucial to measure and reduce the carbon footprint emitted during the training of large language models. In NLP, pre-training Transformer models requires significant computational resources. This pre-training involves using a large amount of text data to gain prior knowledge for performing downstream tasks. Thus, it is important to select the correct domain-specific data from this vast corpus to achieve optimum results aligned with our domain-specific tasks. While training on large unsupervised data is expensive, it can be optimized by performing a data selection step before pre-training. Selecting important data reduces the space overhead and the substantial amount of time required to pre-train the model while keeping accuracy constant. We investigate existing selection strategies and propose our own domain-adaptive data selection method, TextGram, which effectively selects essential data from large corpora. We compare and evaluate the results of fine-tuned models on a text classification task with and without data selection, and show that the proposed strategy works better than other selection methods.
https://arxiv.org/abs/2404.18228
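One simple form of domain-adaptive data selection scores each candidate pre-training document by its n-gram overlap with a small in-domain sample. The abstract does not detail TextGram's exact scoring, so the function below is an illustrative stand-in, not the paper's method.

```python
from collections import Counter

def ngrams(text, n=2):
    """Multiset of word n-grams in a lowercased, whitespace-split text."""
    toks = text.lower().split()
    return Counter(zip(*[toks[i:] for i in range(n)]))

def select_for_domain(candidates, domain_texts, k=2, n=2):
    """Rank candidate documents by normalized n-gram overlap with an
    in-domain sample and keep the top k for pre-training."""
    domain = Counter()
    for t in domain_texts:
        domain.update(ngrams(t, n))
    def score(doc):
        grams = ngrams(doc, n)
        overlap = sum(min(c, domain[g]) for g, c in grams.items())
        return overlap / max(1, sum(grams.values()))
    return sorted(candidates, key=score, reverse=True)[:k]
```

Pre-training only on the selected subset is what yields the time and carbon savings the abstract describes, at the cost of this one cheap selection pass.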
The availability of text or topic classification datasets in the low-resource Marathi language is limited, and the existing datasets typically consist of fewer than 4 target labels, with some achieving nearly perfect accuracy. In this work, we introduce L3Cube-MahaNews, a Marathi text classification corpus that focuses on news headlines and articles. This corpus stands out as the largest supervised Marathi corpus, containing over 1.05 lakh (105,000) records classified into a diverse range of 12 categories. To accommodate different document lengths, MahaNews comprises three supervised datasets specifically designed for short texts, long documents, and medium-length paragraphs. The consistent labeling across these datasets facilitates document-length-based analysis. We provide detailed data statistics and baseline results on these datasets using state-of-the-art pre-trained BERT models, and conduct a comparative analysis between monolingual and multilingual BERT models, including MahaBERT, IndicBERT, and MuRIL. The monolingual MahaBERT model outperforms all others on every dataset. These resources also serve as Marathi topic classification datasets or models and are publicly available at this https URL.
https://arxiv.org/abs/2404.18216
We propose a novel, lightweight supervised dictionary learning framework for text classification based on data compression and representation. This two-phase algorithm initially employs the Lempel-Ziv-Welch (LZW) algorithm to construct a dictionary from text datasets, focusing on the conceptual significance of dictionary elements. Subsequently, the dictionary is refined using label data, optimizing dictionary atoms to enhance their discriminative power based on mutual information and class distribution. This process generates discriminative numerical representations, facilitating the training of simple classifiers such as SVMs and neural networks. We evaluate our algorithm's information-theoretic performance using information bottleneck principles and introduce the information plane area rank (IPAR) as a novel metric to quantify this performance. Tested on six benchmark text datasets, our algorithm competes closely with top models, especially in limited-vocabulary contexts: it deviates from the top-performing models by only ~2\% on limited-vocabulary datasets while using just 10\% of their parameters. However, it falls short on diverse-vocabulary datasets, likely due to the LZW algorithm's constraints with low-repetition data. This contrast highlights its efficiency and limitations across different dataset types.
https://arxiv.org/abs/2405.01584
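Phase one of the framework is standard LZW dictionary construction, sketched below together with a greedy longest-match encoder whose code counts could serve as a document's numeric representation. The label-driven atom refinement of phase two (mutual information, class distribution) is omitted, and `max_size` is an assumed cap.

```python
def lzw_dictionary(text, max_size=4096):
    """Build an LZW dictionary from a training text: repeatedly extend the
    longest known prefix by one character and store the new phrase."""
    dictionary = {chr(i): i for i in range(256)}  # seed with single chars
    w = ""
    for ch in text:
        wc = w + ch
        if wc in dictionary:
            w = wc                                # keep growing the match
        else:
            if len(dictionary) < max_size:
                dictionary[wc] = len(dictionary)  # new atom
            w = ch
    return dictionary

def encode(text, dictionary):
    """Greedy longest-match encoding of a document into atom codes; a
    histogram of these codes is a simple numeric feature vector."""
    codes, w = [], ""
    for ch in text:
        if w + ch in dictionary:
            w += ch
        else:
            codes.append(dictionary[w])
            w = ch
    if w:
        codes.append(dictionary[w])
    return codes
```

Repetitive text yields long atoms and short code sequences, which is also why the abstract expects the method to struggle on low-repetition, diverse-vocabulary data.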
One of the prime problems of computer science and machine learning is to extract information efficiently from large-scale, heterogeneous data. Text data, with its syntax, semantics, and even hidden information content, occupies an exceptional place among the data types of concern. Processing text data requires embedding, a method of translating the content of the text into numeric vectors. A correct embedding algorithm is the starting point for obtaining the full information content of the text data. In this work, a new embedding method based on the graph structure of meaningful sentences is proposed. The algorithm is designed to construct an embedding vector that captures syntactic and semantic elements as well as the hidden content of the text data. The success of the proposed embedding method is tested on classification problems. Among the wide range of application areas, text classification is the best laboratory for embedding methods: the classification power of a method can be tested using dimensionality reduction without any further processing, and the method can be compared with different embedding algorithms and machine learning methods. The proposed method is tested on real-world datasets against eight well-known and successful embedding algorithms, and shows significantly better classification for binary and multiclass datasets compared to these algorithms.
https://arxiv.org/abs/2404.18942
Unsupervised cross-lingual transfer involves transferring knowledge between languages without explicit supervision. Although numerous studies have sought to improve performance in such tasks by focusing on cross-lingual knowledge, particularly lexical and syntactic knowledge, current approaches are limited in that they incorporate only syntactic or lexical information. Since each type of information offers unique advantages and no previous attempt has combined both, we explore the potential of this approach. In this paper, we present a novel framework called Lexicon-Syntax Enhanced Multilingual BERT that combines both lexical and syntactic knowledge. Specifically, we use Multilingual BERT (mBERT) as the base model and employ two techniques to enhance its learning capabilities: a code-switching technique implicitly teaches the model lexical alignment information, while a syntax-based graph attention network helps the model encode syntactic structure. To integrate both types of knowledge, we input code-switched sequences into both the syntactic module and the mBERT base model simultaneously. Our extensive experimental results demonstrate that this framework consistently outperforms all baselines for zero-shot cross-lingual transfer, with gains of 1.0-3.7 points on text classification, named entity recognition (NER), and semantic parsing tasks. Keywords: cross-lingual transfer, lexicon, syntax, code-switching, graph attention network
https://arxiv.org/abs/2404.16627
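The code-switching technique can be sketched as lexicon-based token substitution: randomly replace source-language words with their translations so the model sees aligned words in shared contexts. The lexicon contents, replacement ratio, and sampling scheme below are illustrative assumptions.

```python
import random

def code_switch(tokens, lexicon, ratio=0.3, seed=0):
    """Replace a fraction of source-language tokens with translations drawn
    from a bilingual lexicon, implicitly teaching lexical alignments."""
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        if tok in lexicon and rng.random() < ratio:
            out.append(rng.choice(lexicon[tok]))  # switch to the other language
        else:
            out.append(tok)                       # keep the original token
    return out
```

In the paper's pipeline, such code-switched sequences would be fed to both the mBERT backbone and the syntactic graph-attention module.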
This paper addresses the important societal challenge of water quality analysis. As one of the key factors in the economic and social development of society, the provision of water and the assurance of its quality have always remained among the top priorities of public authorities. To ensure water quality, different methods for monitoring and assessing water networks, such as offline and online surveys, are used. However, these surveys have several limitations, such as a limited number of participants and low frequency due to the labor involved in conducting them. In this paper, we propose a Natural Language Processing (NLP) framework to automatically collect and analyze water-related posts from social media for data-driven decisions. The proposed framework is composed of two components: (i) text classification and (ii) topic modeling. For text classification, we propose a merit-fusion-based framework incorporating several Large Language Models (LLMs), in which different weight selection and optimization methods are employed to assign weights to the LLMs. For topic modeling, we employed the BERTopic library to discover hidden topic patterns in water-related tweets. We also analyzed relevant tweets originating from different regions and countries to explore global, regional, and country-specific issues and water-related concerns. Finally, we collected and manually annotated a large-scale dataset, which is expected to facilitate future research on the topic.
https://arxiv.org/abs/2404.14977
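One simple merit-fusion scheme weights each LLM's class probabilities by its validation accuracy before averaging; the abstract also mentions optimization-based weight selection, which this sketch omits. The accuracy-proportional weights are an assumed instantiation.

```python
import numpy as np

def merit_fusion(probs_per_model, val_accuracies):
    """Fuse per-model class-probability vectors with weights proportional
    to each model's validation merit, then return the argmax class."""
    w = np.asarray(val_accuracies, dtype=float)
    w = w / w.sum()                               # normalize merits to weights
    fused = sum(wi * np.asarray(p) for wi, p in zip(w, probs_per_model))
    return int(np.argmax(fused))
```

Giving a stronger model a larger say lets it overrule weaker models on contested tweets, while still letting a confident minority shift borderline decisions.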
In this work, we propose a novel tree-based explanation technique, PEACH (Pretrained-embedding Explanation Across Contextual and Hierarchical Structure), which explains how text-based documents are classified, using any pretrained contextual embeddings, in a tree-based, human-interpretable manner. Note that PEACH can adopt any contextual embeddings of PLMs as a training input for the decision tree. Using the proposed PEACH, we perform a comprehensive analysis of several contextual embeddings on nine different NLP text classification benchmarks. This analysis demonstrates the flexibility of the model across several PLM contextual embeddings, attribute selections, scaling, and clustering methods. Furthermore, we show the utility of the explanations by visualizing the feature selection and important trends of text classification via human-interpretable word-cloud-based trees, which clearly identify model mistakes and assist in dataset debugging. Besides interpretability, PEACH matches or outperforms the underlying pretrained models.
https://arxiv.org/abs/2404.13645
Machine learning models have made incredible progress, but they still struggle when applied to examples from unseen domains. This study focuses on a specific problem of domain generalization, where a model is trained on one source domain and tested on multiple target domains that are unseen during training. We propose IMO (Invariant features Masks for Out-of-distribution text classification) to achieve OOD generalization by learning invariant features. During training, IMO learns sparse mask layers to remove features that are irrelevant for prediction, so that the remaining features stay invariant. Additionally, IMO has a token-level attention module that focuses on tokens useful for prediction. Our comprehensive experiments show that IMO substantially outperforms strong baselines across various evaluation metrics and settings.
https://arxiv.org/abs/2404.13504