Self-supervised learning has shown great success in Speech Recognition. However, it has been observed that fine-tuning all layers of the learned model leads to lower performance compared to resetting the top layers. This phenomenon is attributed to the "autoencoder" behavior: the top layers contain information closer to the input and are less suitable for tasks that require linguistic information, such as Speech Recognition. To better our understanding of this behavior, we propose to study the evolution of high-level information within the model during pretraining. We focus on the HuBERT model, which exhibits a less pronounced "autoencoder" behavior. By experimentally exploring various factors that may have an impact, we aim to improve the training procedure and enhance the top layers of HuBERT for high-level tasks. Furthermore, our experiments demonstrate that these improvements in the training procedure result in faster convergence and competitive performance on downstream tasks.
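To make the layer-probing idea concrete, here is a minimal sketch that compares each HuBERT transformer layer's output to the features fed into the encoder. The public facebook/hubert-base-ls960 checkpoint, random audio as a stand-in for real speech, and cosine similarity as the probe are assumptions of this illustration, not details taken from the abstract.

```python
# A minimal layer-probing sketch (assumptions noted above).
import torch
from transformers import HubertModel

model = HubertModel.from_pretrained("facebook/hubert-base-ls960")
model.eval()

waveform = torch.randn(1, 16000)  # 1 second of 16 kHz audio; replace with real speech

with torch.no_grad():
    out = model(waveform, output_hidden_states=True)

# hidden_states[0] is the encoder input derived from the CNN feature extractor;
# hidden_states[1:] are the outputs of the 12 transformer layers.
reference = out.hidden_states[0].squeeze(0)  # (frames, dim)
for i, layer in enumerate(out.hidden_states[1:], 1):
    sim = torch.nn.functional.cosine_similarity(
        layer.squeeze(0), reference, dim=-1
    ).mean()
    print(f"layer {i:2d}: mean cosine similarity to input features = {sim:.3f}")
```

A layer whose representations stay close to the reference features is the kind of "autoencoder-like" top layer the paper describes as less useful for linguistic tasks.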
https://arxiv.org/abs/2405.08402
This paper investigates the development and evaluation of machine translation models from Cantonese to English, where we propose a novel approach to tackle low-resource language translation. The main objectives of the study are to develop a model that can effectively translate Cantonese to English and to evaluate it against state-of-the-art commercial models. To achieve this, a new parallel corpus has been created by combining different available corpora online with preprocessing and cleaning. In addition, a monolingual Cantonese dataset has been created through web scraping to aid synthetic parallel corpus generation. Following the data collection process, several approaches, including fine-tuning models, back-translation, and model switch, have been used. The translation quality of the models has been evaluated with multiple quality metrics, including lexicon-based metrics (SacreBLEU and hLEPOR) and embedding-space metrics (COMET and BERTscore). Based on the automatic metrics, the best model is selected and compared against the two best commercial translators using the human evaluation framework HOPES. The best model proposed in this investigation (NLLB-mBART) with model switch mechanisms reaches comparable and even better automatic evaluation scores than state-of-the-art commercial models (Bing and Baidu Translators), with a SacreBLEU score of 16.8 on our test set. Furthermore, an open-source web application has been developed to allow users to translate between Cantonese and English, with the different trained models made available for effective comparison between the models from this investigation. CANTONMT is available at this https URL
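As one concrete piece of such a pipeline, the sketch below shows how a monolingual Cantonese corpus could feed synthetic parallel data generation. The facebook/nllb-200-distilled-600M checkpoint and the FLORES-200 language codes yue_Hant/eng_Latn are illustrative assumptions, since the abstract does not name the exact model used for this step.

```python
# A hedged sketch of synthetic parallel data generation from monolingual Cantonese.
from transformers import pipeline

yue_to_en = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="yue_Hant",
    tgt_lang="eng_Latn",
)

# Monolingual Cantonese sentences (e.g., from a web-scraped corpus).
monolingual_yue = ["今日天氣好好。", "我哋去食飯啦。"]

# Translate to English to form synthetic (Cantonese, English) training pairs.
synthetic_pairs = [
    (src, yue_to_en(src, max_length=128)[0]["translation_text"])
    for src in monolingual_yue
]
for yue, en in synthetic_pairs:
    print(yue, "->", en)
```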
https://arxiv.org/abs/2405.08172
Adaptive Risk Control (ARC) is an online calibration strategy based on set prediction that offers worst-case deterministic long-term risk control, as well as statistical marginal coverage guarantees. ARC adjusts the size of the prediction set by varying a single scalar threshold based on feedback from past decisions. In this work, we introduce Localized Adaptive Risk Control (L-ARC), an online calibration scheme that targets statistical localized risk guarantees ranging from conditional risk to marginal risk, while preserving the worst-case performance of ARC. L-ARC updates a threshold function within a reproducing kernel Hilbert space (RKHS), with the kernel determining the level of localization of the statistical risk guarantee. The theoretical results highlight a trade-off between localization of the statistical risk and convergence speed to the long-term risk target. Thanks to localization, L-ARC is demonstrated via experiments to produce prediction sets with risk guarantees across different data subpopulations, significantly improving the fairness of the calibrated model for tasks such as image segmentation and beam selection in wireless networks.
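The update of a threshold function within an RKHS can be pictured with a toy online loop. The specific update rule, RBF kernel, step size, and miscoverage loss below are illustrative guesses in the spirit of adaptive calibration, not the paper's actual scheme.

```python
# An illustrative online kernel-threshold update in the spirit of L-ARC
# (all specifics are placeholders; see the paper for the real algorithm).
import numpy as np

def rbf(x, c, gamma=1.0):
    return np.exp(-gamma * np.linalg.norm(x - c) ** 2)

alpha = 0.1              # target long-term risk level
eta = 0.05               # step size
centers, coefs = [], []  # kernel expansion of the threshold function g(.)

def g(x):
    return sum(w * rbf(x, c) for w, c in zip(coefs, centers))

rng = np.random.default_rng(0)
for t in range(1000):
    x = rng.normal(size=2)           # feature of the t-th example
    score = rng.uniform()            # conformity score of the true label
    covered = score <= g(x)          # set covers the truth iff score is below threshold
    loss = 0.0 if covered else 1.0   # miscoverage loss
    # Gradient-style update: raise g near x after a miss, lower it after a cover,
    # so the localized long-run loss is driven toward alpha.
    centers.append(x)
    coefs.append(eta * (loss - alpha))
```

The kernel bandwidth (gamma) plays the role the abstract describes: a wide kernel recovers a single global threshold (marginal risk), while a narrow kernel localizes the guarantee toward conditional risk.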
https://arxiv.org/abs/2405.07976
Decoding language information from brain signals represents a vital research area within brain-computer interfaces, particularly in the context of deciphering semantic information from the fMRI signal. However, many existing efforts concentrate on decoding small vocabulary sets, leaving space for the exploration of open-vocabulary continuous text decoding. In this paper, we introduce a novel method, the Brain Prompt GPT (BP-GPT). By using the brain representation extracted from the fMRI as a prompt, our method can utilize GPT-2 to decode fMRI signals into stimulus text. Further, we introduce a text-to-text baseline and align the fMRI prompt to the text prompt. By introducing the text-to-text baseline, our BP-GPT can extract a more robust brain prompt and promote the decoding of the pre-trained LLM. We evaluate our BP-GPT on the open-source auditory semantic decoding dataset and achieve a significant improvement of up to 4.61% on METEOR and 2.43% on BERTScore across all the subjects compared to the state-of-the-art method. The experimental results demonstrate that using brain representation as a prompt to further drive an LLM for auditory neural decoding is feasible and effective.
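A rough sketch of the prompting mechanism: an fMRI feature vector is projected to a few soft-prompt embeddings that are prepended to GPT-2's token embeddings. The linear projection, dimensions, prompt length, and teacher-forced loss here are illustrative assumptions, not the paper's exact architecture.

```python
# A minimal "brain prompt" sketch for GPT-2 (dimensions are illustrative).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
d_model = gpt2.config.n_embd           # 768
k, fmri_dim = 8, 1024                  # k soft-prompt tokens; fmri_dim is a placeholder

proj = torch.nn.Linear(fmri_dim, k * d_model)  # would be trained jointly

fmri = torch.randn(1, fmri_dim)                # one fMRI feature vector
prompt = proj(fmri).view(1, k, d_model)        # "brain prompt" soft embeddings

# Teacher-forced decoding of the stimulus text conditioned on the brain prompt.
text_ids = tok(" the man spoke softly", return_tensors="pt").input_ids
text_emb = gpt2.transformer.wte(text_ids)
inputs = torch.cat([prompt, text_emb], dim=1)
labels = torch.cat([torch.full((1, k), -100), text_ids], dim=1)  # ignore prompt positions
loss = gpt2(inputs_embeds=inputs, labels=labels).loss
```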
https://arxiv.org/abs/2405.07840
Encoder models trained for the embedding of sentences or short documents have proven useful for tasks such as semantic search and topic modeling. In this paper, we present a version of the SwissBERT encoder model that we specifically fine-tuned for this purpose. SwissBERT contains language adapters for the four national languages of Switzerland -- German, French, Italian, and Romansh -- and has been pre-trained on a large number of news articles in those languages. Using contrastive learning based on a subset of these articles, we trained a fine-tuned version, which we call SentenceSwissBERT. Multilingual experiments on document retrieval and text classification in a Switzerland-specific setting show that SentenceSwissBERT surpasses the accuracy of the original SwissBERT model and of a comparable baseline. The model is openly available for research use.
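Contrastive fine-tuning of this kind is often implemented with an in-batch InfoNCE objective over pooled sentence embeddings; the sketch below assumes that formulation (with an illustrative temperature), since the abstract does not spell out the exact loss.

```python
# A hedged sketch of an in-batch contrastive (InfoNCE) objective.
import torch
import torch.nn.functional as F

def info_nce(anchor_emb, positive_emb, temperature=0.05):
    """anchor_emb, positive_emb: (batch, dim) mean-pooled sentence embeddings."""
    a = F.normalize(anchor_emb, dim=-1)
    p = F.normalize(positive_emb, dim=-1)
    logits = a @ p.T / temperature      # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0))   # the i-th anchor matches the i-th positive
    return F.cross_entropy(logits, targets)

# Placeholder embeddings; in practice these come from SwissBERT with the
# appropriate language adapter active for each news article.
loss = info_nce(torch.randn(32, 768), torch.randn(32, 768))
```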
https://arxiv.org/abs/2405.07513
The quantitative analysis of political ideological positions is a difficult task. In the past, various literature focused on parliamentary voting data of politicians, party manifestos, and parliamentary speech to estimate political disagreement and polarization in various political systems. However, previous methods of quantitative political analysis suffered from a common challenge: the amount of data available for analysis. Previous methods also frequently focused on a more general analysis of politics, such as the overall polarization of a parliament or party-wide ideological positions. In this paper, we present a method to analyze the ideological positions of individual parliamentary representatives by leveraging the latent knowledge of LLMs. The method allows us to evaluate the stance of politicians on an axis of our choice, letting us flexibly measure their stance with regard to a topic or controversy of our choice. We achieve this by using a fine-tuned BERT classifier to extract the opinion-based sentences from the speeches of representatives and projecting the average BERT embeddings for each representative onto a pair of reference seeds. These reference seeds are either manually chosen representatives known to have opposing views on a particular topic, or generated sentences created using OpenAI's GPT-4 model. We created the sentences by prompting GPT-4 to generate a speech that would come from a politician defending a particular position.
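The projection step lends itself to a compact sketch. The embedding model, example sentences, and seed speeches below are placeholders, and a sentence-transformers encoder stands in for the paper's averaged BERT embeddings.

```python
# A minimal sketch of projecting representatives onto a seed-defined axis.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def mean_embedding(sentences):
    return model.encode(sentences).mean(axis=0)

# Opinion sentences per representative (after the opinion-classifier filter).
rep_sentences = {"rep_a": ["We must cut taxes now."],
                 "rep_b": ["Public spending protects the vulnerable."]}

# Reference seeds: known opposing representatives, or GPT-4-generated speeches
# defending each side of the chosen controversy.
seed_left = mean_embedding(["Speech defending position A..."])
seed_right = mean_embedding(["Speech defending position B..."])
axis = seed_right - seed_left

for rep, sents in rep_sentences.items():
    v = mean_embedding(sents) - seed_left
    position = float(v @ axis) / float(axis @ axis)  # 0 ~ seed_left, 1 ~ seed_right
    print(rep, round(position, 3))
```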
https://arxiv.org/abs/2405.07320
As the conversation around using geoengineering to combat climate change intensifies, it is imperative to engage the public and deeply understand their perspectives on geoengineering research, development, and potential deployment. Through a comprehensive data-driven investigation, this paper explores the types of news that captivate public interest in geoengineering. We delved into 30,773 English-language news articles from the BBC and the New York Times, combined with Google Trends data spanning 2018 to 2022, to explore how public interest in geoengineering fluctuates in response to news coverage of broader climate issues. Using BERT-based topic modeling, sentiment analysis, and time-series regression models, we found that positive sentiment in energy-related news serves as a good predictor of heightened public interest in geoengineering, a trend that persists over time. Our findings suggest that public engagement with geoengineering and climate action is not uniform, with some topics being more potent in shaping interest over time, such as climate news related to energy, disasters, and politics. Understanding these patterns is crucial for scientists, policymakers, and educators aiming to craft effective strategies for engaging with the public and fostering dialogue around emerging climate technologies.
https://arxiv.org/abs/2405.07010
Event relation extraction (ERE) is a critical and fundamental challenge for natural language processing. Existing work mainly focuses on directly modeling the entire document, which cannot effectively handle long-range dependencies and information redundancy. To address these issues, we propose a cluster-aware compression method for improving event relation extraction (TacoERE), which explores a compression-then-extraction paradigm. Specifically, we first introduce document clustering for modeling event dependencies. It splits the document into intra- and inter-clusters, where intra-clusters aim to enhance the relations within the same cluster, while inter-clusters attempt to model the related events at arbitrary distances. Secondly, we utilize cluster summarization to simplify and highlight important text content of clusters for mitigating information redundancy and event distance. We have conducted extensive experiments on both pre-trained language models, such as RoBERTa, and large language models, such as ChatGPT and GPT-4, on three ERE datasets, i.e., MAVEN-ERE, EventStoryLine and HiEve. Experimental results demonstrate that TacoERE is an effective method for ERE.
https://arxiv.org/abs/2405.06890
Patient hand-off and triage are two fundamental problems in health care. Often doctors must painstakingly summarize complex findings to communicate efficiently with specialists and quickly decide which patients have the most urgent cases. In pursuit of these challenges, we present (1) a model with state-of-the-art radiology report summarization performance using (2) a novel method for augmenting medical data, and (3) an analysis of the model's limitations and radiology knowledge gain. We also provide a data processing pipeline for future models developed on the MIMIC-CXR dataset. Our best performing model was a fine-tuned BERT-to-BERT encoder-decoder with a 58.75/100 ROUGE-L F1, which outperformed specialized checkpoints with more sophisticated attention mechanisms. We investigate these aspects in this work.
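The BERT-to-BERT setup corresponds directly to Hugging Face's EncoderDecoderModel. The sketch below, with generic bert-base checkpoints and an invented findings/impression pair, illustrates the training signal, not the paper's exact configuration.

```python
# A minimal BERT-to-BERT summarization sketch via EncoderDecoderModel.
from transformers import BertTokenizer, EncoderDecoderModel

tok = BertTokenizer.from_pretrained("bert-base-uncased")
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased"
)
model.config.decoder_start_token_id = tok.cls_token_id
model.config.pad_token_id = tok.pad_token_id

# Invented example: a findings section summarized into an impression.
findings = "Heart size is normal. No focal consolidation, pleural effusion, or pneumothorax."
impression = "No acute cardiopulmonary abnormality."

inputs = tok(findings, return_tensors="pt")
labels = tok(impression, return_tensors="pt").input_ids
loss = model(input_ids=inputs.input_ids,
             attention_mask=inputs.attention_mask,
             labels=labels).loss  # fine-tune by minimizing this summarization loss
```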
https://arxiv.org/abs/2405.06802
In today's digital landscape, where cyber attacks have become the norm, the detection of cyber attacks and threats is critically imperative across diverse domains. Our research presents a new empirical framework for cyber threat modeling, adept at parsing and categorizing cyber-related information from news articles, enhancing real-time vigilance for market stakeholders. At the core of this framework is a fine-tuned BERT model, which we call CANAL - Cyber Activity News Alerting Language Model, tailored for cyber categorization using a novel silver labeling approach powered by Random Forest. We benchmark CANAL against larger, costlier LLMs, including GPT-4, LLaMA, and Zephyr, highlighting their zero to few-shot learning in cyber news classification. CANAL demonstrates superior performance by outperforming all other LLM counterparts in both accuracy and cost-effectiveness. Furthermore, we introduce the Cyber Signal Discovery module, a strategic component designed to efficiently detect emerging cyber signals from news articles. Collectively, CANAL and Cyber Signal Discovery module equip our framework to provide a robust and cost-effective solution for businesses that require agile responses to cyber intelligence.
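A silver-labeling step powered by a Random Forest could look like the following sketch. The TF-IDF features, seed examples, and confidence threshold are illustrative assumptions, as the abstract does not detail the feature pipeline.

```python
# A hedged sketch of Random Forest silver labeling for BERT fine-tuning data.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

seed_texts = ["Ransomware crippled the hospital network.",
              "Quarterly earnings beat analyst estimates."]
seed_labels = [1, 0]  # 1 = cyber-related, 0 = not (tiny seed set for illustration)
unlabeled = ["A phishing campaign targeted bank employees.",
             "The team won the championship final."]

vec = TfidfVectorizer()
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(vec.fit_transform(seed_texts), seed_labels)

# Keep only confident predictions as silver labels for BERT fine-tuning.
proba = clf.predict_proba(vec.transform(unlabeled))
silver = [(t, int(p[1] > 0.5)) for t, p in zip(unlabeled, proba) if max(p) > 0.9]
```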
https://arxiv.org/abs/2405.06772
Large language models (LLMs) have emerged and presented their general problem-solving capabilities with one model. However, the model size has increased dramatically, to billions of parameters, to enable such broad problem-solving capabilities. In addition, due to the dominance of matrix-matrix and matrix-vector multiplications in LLMs, the compute-to-model-size ratio is significantly lower than that of CNNs. This shift pushes LLMs from a computation-bound regime to a memory-bound regime. Therefore, optimizing the memory footprint and traffic is an important optimization direction for LLMs today. Model compression methods such as quantization and parameter pruning have been actively explored for achieving memory footprint and traffic optimization. However, the accuracy-efficiency trade-off of rank pruning for LLMs is not yet well understood. Therefore, we characterize the accuracy-efficiency trade-off of a low-rank decomposition method, specifically Tucker decomposition, on recent language models, including an open-source LLM, Llama 2. We formalize the low-rank decomposition design space and show that it is enormous (e.g., O(2^37) for Llama2-7B). To navigate such a vast design space, we perform thorough case studies of accuracy-efficiency trade-offs using six widely used LLM benchmarks on BERT and Llama 2 models. Our results show that we can achieve a 9% model size reduction with minimal accuracy drops, ranging from 4 to 10 percentage points depending on the difficulty of the benchmark, without any retraining to recover accuracy after decomposition. The results show that low-rank decomposition can be a promising direction for LLM-based applications that require real-time service at scale (e.g., AI agent assist and real-time coding assistants), where latency is as important as model accuracy.
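As a small illustration of Tucker decomposition applied to a weight matrix, the sketch below uses the TensorLy library. The random matrix, 3-way folding, and ranks are one arbitrary point in the design space the paper explores, not a recommended setting.

```python
# A minimal Tucker-decomposition sketch with TensorLy (illustrative shapes/ranks).
import numpy as np
import tensorly as tl
from tensorly.decomposition import tucker

tl.set_backend("numpy")

W = np.random.randn(768, 768)           # placeholder for a real projection weight
T = tl.tensor(W.reshape(12, 64, 768))   # fold into a 3-way tensor (one design choice)

core, factors = tucker(T, rank=[6, 32, 384])
W_approx = tl.tucker_to_tensor((core, factors)).reshape(768, 768)

orig = W.size
comp = core.size + sum(f.size for f in factors)
print(f"parameters: {orig} -> {comp} ({100 * (1 - comp / orig):.1f}% fewer)")
print("relative error:", np.linalg.norm(W - W_approx) / np.linalg.norm(W))
```

Each choice of folding and per-mode ranks is one point in the decomposition design space; enumerating such choices across all layers is what makes the space grow to O(2^37) configurations for Llama2-7B.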
https://arxiv.org/abs/2405.06626
The correlation dimension of natural language is measured by applying the Grassberger-Procaccia algorithm to high-dimensional sequences produced by a large-scale language model. This method, previously studied only in a Euclidean space, is reformulated in a statistical manifold via the Fisher-Rao distance. Language exhibits multifractality, with global self-similarity and a universal dimension of around 6.5, which is smaller than those of simple discrete random sequences and larger than that of a Barabási-Albert process. Long memory is the key to producing self-similarity. Our method is applicable to any probabilistic model of real-world discrete sequences, and we show an application to music data.
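The Grassberger-Procaccia estimate itself is short to write down. The sketch below computes the correlation integral C(r) and its log-log slope in plain Euclidean space on random points; the paper's contribution of replacing the Euclidean metric with the Fisher-Rao distance between next-token distributions is not reproduced here.

```python
# A minimal Grassberger-Procaccia sketch in Euclidean space.
import numpy as np

def correlation_dimension(X, radii):
    """Estimate the correlation dimension as the log-log slope of C(r)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    pair_d = d[np.triu_indices(len(X), k=1)]            # distinct pairs only
    C = np.array([(pair_d < r).mean() for r in radii])  # correlation integral C(r)
    slope, _ = np.polyfit(np.log(radii), np.log(C), 1)
    return slope

X = np.random.randn(1000, 3)            # stand-in for an LM-derived point sequence
radii = np.logspace(-0.5, 0.5, 10)
print("estimated correlation dimension:", correlation_dimension(X, radii))
# For i.i.d. Gaussian points the estimate approaches the embedding dimension;
# the paper reports ~6.5 for language, between simple random sequences and a
# Barabási-Albert process.
```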
https://arxiv.org/abs/2405.06321
In this paper, we introduce SaudiBERT, a monodialect Arabic language model pretrained exclusively on Saudi dialectal text. To demonstrate the model's effectiveness, we compared SaudiBERT with six different multidialect Arabic language models across 11 evaluation datasets, which are divided into two groups: sentiment analysis and text classification. SaudiBERT achieved average F1-scores of 86.15% and 87.86% in these groups respectively, significantly outperforming all other comparative models. Additionally, we present two novel Saudi dialectal corpora: the Saudi Tweets Mega Corpus (STMC), which contains over 141 million tweets in Saudi dialect, and the Saudi Forums Corpus (SFC), which includes 15.2 GB of text collected from five Saudi online forums. Both corpora are used in pretraining the proposed model, and they are the largest Saudi dialectal corpora ever reported in the literature. The results confirm the effectiveness of SaudiBERT in understanding and analyzing Arabic text expressed in Saudi dialect, achieving state-of-the-art results in most tasks and surpassing the other language models included in the study. The SaudiBERT model is publicly available at this https URL.
https://arxiv.org/abs/2405.06239
Substance use disorders (SUDs) are a growing concern globally, necessitating enhanced understanding of the problem and its trends through data-driven research. Social media are unique and important sources of information about SUDs, particularly since the data in such sources are often generated by people with lived experiences. In this paper, we introduce Reddit-Impacts, a challenging Named Entity Recognition (NER) dataset curated from subreddits dedicated to discussions on prescription and illicit opioids, as well as medications for opioid use disorder. The dataset specifically concentrates on the lesser-studied, yet critically important, aspects of substance use: its clinical and social impacts. We collected data from chosen subreddits using the publicly available Application Programming Interface for Reddit. We manually annotated text spans representing clinical and social impacts reported by people who also reported personal nonmedical use of substances including, but not limited to, opioids, stimulants, and benzodiazepines. Our objective is to create a resource that can enable the development of systems that automatically detect the clinical and social impacts of substance use from text-based social media data. The successful development of such systems may enable us to better understand how nonmedical use of substances affects individual health and societal dynamics, aiding the development of effective public health strategies. In addition to creating the annotated dataset, we applied several machine learning models to establish baseline performances. Specifically, for automatic NER of clinical and social impacts, we experimented with transformer models such as BERT and RoBERTa, with the few-shot learning model DANN by leveraging the full training dataset, and with GPT-3.5 using one-shot learning. The dataset has been made available through the 2024 SMM4H shared tasks.
https://arxiv.org/abs/2405.06145
Current computational approaches for analysing or generating code-mixed sentences do not explicitly model the "naturalness" or "acceptability" of code-mixed sentences, but rely on training corpora to reflect the distribution of acceptable code-mixed sentences. Modelling human judgement of the acceptability of code-mixed text can help in distinguishing natural code-mixed text and enable quality-controlled generation of code-mixed text. To this end, we construct Cline, a dataset containing human acceptability judgements for English-Hindi (en-hi) code-mixed text. Cline is the largest of its kind with 16,642 sentences, consisting of samples drawn from two sources: synthetically generated code-mixed text and samples collected from online social media. Our analysis establishes that popular code-mixing metrics such as CMI, Number of Switch Points, and Burstiness, which are used to filter, curate, and compare code-mixed corpora, have low correlation with human acceptability judgements, underlining the necessity of our dataset. Experiments using Cline demonstrate that simple Multilayer Perceptron (MLP) models trained solely on code-mixing metrics are outperformed by fine-tuned pre-trained Multilingual Large Language Models (MLLMs). Specifically, XLM-RoBERTa and Bernice outperform IndicBERT across different configurations in challenging data settings. Comparison with ChatGPT's zero- and few-shot capabilities shows that MLLMs fine-tuned on larger data outperform ChatGPT, providing scope for improvement in code-mixed tasks. Zero-shot transfer from English-Hindi to English-Telugu acceptability judgements using our model checkpoints proves superior to random baselines, enabling application to other code-mixed language pairs and providing further avenues of research. We publicly release our human-annotated dataset, trained checkpoints, code-mixed corpus, and code for data generation and model training.
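For reference, the CMI metric mentioned above has a standard closed form, CMI = 100 * (1 - max_i(w_i) / (n - u)), where w_i are the per-language token counts, n is the utterance length, and u counts language-independent tokens. A minimal sketch, with illustrative language tags:

```python
# A minimal Code-Mixing Index (CMI) sketch; tokens come pre-tagged with language
# IDs, and "univ" marks language-independent tokens such as punctuation.
def cmi(lang_tags):
    """CMI = 100 * (1 - max_i(w_i) / (n - u)); 0 for monolingual utterances."""
    n = len(lang_tags)
    u = sum(1 for t in lang_tags if t == "univ")
    if n == u:
        return 0.0
    counts = {}
    for t in lang_tags:
        if t != "univ":
            counts[t] = counts.get(t, 0) + 1
    return 100.0 * (1 - max(counts.values()) / (n - u))

# "yeh movie really acchi thi !" -> hi en en hi hi univ
print(cmi(["hi", "en", "en", "hi", "hi", "univ"]))  # 40.0
```

Metrics of this form capture only the mixing ratio, which is one way to see why they correlate poorly with human acceptability judgements.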
https://arxiv.org/abs/2405.05572
Large language models (LLMs) are becoming bigger to boost performance. However, little is known about how explainability is affected by this trend. This work explores LIME explanations for DeBERTaV3 models of four different sizes on natural language inference (NLI) and zero-shot classification (ZSC) tasks. We evaluate the explanations based on their faithfulness to the models' internal decision processes and their plausibility, i.e. their agreement with human explanations. The key finding is that increased model size does not correlate with plausibility despite improved model performance, suggesting a misalignment between the LIME explanations and the models' internal processes as model size increases. Our results further suggest limitations regarding faithfulness metrics in NLI contexts.
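LIME explanations of this kind can be produced with the lime package. In the sketch below, the base microsoft/deberta-v3-small checkpoint with a randomly initialized 3-way head stands in for the fine-tuned NLI models actually studied, and the label names are assumptions.

```python
# A minimal LIME sketch for a DeBERTaV3 sequence classifier (illustrative checkpoint).
import torch
from lime.lime_text import LimeTextExplainer
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "microsoft/deberta-v3-small"  # a fine-tuned NLI head is assumed in practice
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=3)

def predict_proba(texts):
    enc = tok(list(texts), return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        logits = model(**enc).logits
    return torch.softmax(logits, dim=-1).numpy()

explainer = LimeTextExplainer(class_names=["entailment", "neutral", "contradiction"])
exp = explainer.explain_instance(
    "A man is sleeping. [SEP] A person is awake.", predict_proba, num_features=6
)
print(exp.as_list())  # token weights; compared against human rationales for plausibility
```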
https://arxiv.org/abs/2405.05348
Recent social media posts on the cholera outbreak in Hammanskraal have highlighted the diverse range of emotions people experienced in response to such an event. The extent of people's opinions varies greatly depending on their level of knowledge and information about the disease. The documented research about cholera lacks investigations into the classification of emotions. This study aims to examine the emotions expressed in social media posts about cholera. A dataset of 23,000 posts was extracted and pre-processed. The Python Natural Language Toolkit (NLTK) sentiment analyzer library was applied to determine the emotional significance of each text. Additionally, Machine Learning (ML) models were applied for emotion classification, including Long Short-Term Memory (LSTM), logistic regression, decision trees, and the Bidirectional Encoder Representations from Transformers (BERT) model. The results of this study demonstrated that LSTM achieved the highest accuracy of 75%. Emotion classification presents a promising tool for gaining a deeper understanding of the impact of cholera on society. The findings of this study might contribute to the development of effective interventions in public health strategies.
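The NLTK sentiment step likely corresponds to the VADER analyzer, the toolkit's standard sentiment interface; the sketch below assumes VADER, an invented example post, and a simple compound-score cutoff.

```python
# A minimal sketch of NLTK (VADER) sentiment scoring on a social media post.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

post = "The water situation in Hammanskraal is terrifying, families are scared."
scores = sia.polarity_scores(post)  # {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}
label = "negative" if scores["compound"] <= -0.05 else "neutral/positive"
print(scores, label)
```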
https://arxiv.org/abs/2405.04897
An important issue impacting healthcare is a lack of available experts. Machine learning (ML) models could resolve this by aiding in diagnosing patients. However, creating datasets large enough to train these models is expensive. We evaluated large language models (LLMs) for data creation. Using Autism Spectrum Disorders (ASD), we prompted ChatGPT and GPT-Premium to generate 4,200 synthetic observations to augment existing medical data. Our goal is to label behaviors corresponding to autism criteria and improve model accuracy with synthetic training data. We used a BERT classifier pre-trained on biomedical literature to assess differences in performance between models. A random sample (N=140) from the LLM-generated data was evaluated by a clinician and found to contain 83% correct example-label pairs. Augmenting data increased recall by 13% but decreased precision by 16%, correlating with higher quality and lower accuracy across pairs. Future work will analyze how different synthetic data traits affect ML outcomes.
https://arxiv.org/abs/2405.06695
Adversarial attacks in Natural Language Processing apply perturbations at the character or token level. Token-level attacks, which have gained prominence for their use of gradient-based methods, are susceptible to altering sentence semantics, leading to invalid adversarial examples. While character-level attacks easily maintain semantics, they have received less attention, as they cannot easily adopt popular gradient-based methods and are thought to be easy to defend against. Challenging these beliefs, we introduce Charmer, an efficient query-based adversarial attack capable of achieving a high attack success rate (ASR) while generating highly similar adversarial examples. Our method successfully targets both small (BERT) and large (Llama 2) models. Specifically, on BERT with SST-2, Charmer improves the ASR by 4.84 percentage points and the USE similarity by 8 percentage points over the previous state of the art. Our implementation is available at this https URL.
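Query-based character-level attacks of this general family can be pictured as a greedy search over single-character edits, each scored by querying the victim. The toy sketch below is illustrative only and much cruder than Charmer's actual algorithm; the scoring function is invented.

```python
# An illustrative greedy character-substitution attack (not the paper's method).
import string

def attack(sentence, loss_fn, max_changes=2):
    """Greedily apply the single-character edit that most increases the victim's loss."""
    best = sentence
    for _ in range(max_changes):
        candidates = [
            best[:i] + c + best[i + 1:]
            for i in range(len(best))
            for c in string.ascii_lowercase + " "
        ]
        scored = max(candidates, key=loss_fn)  # one victim query per candidate
        if loss_fn(scored) <= loss_fn(best):
            break                              # no edit helps; stop early
        best = scored
    return best

# Toy victim: "loss" is higher the fewer times the word "good" appears intact.
toy_loss = lambda s: -s.count("good")
print(attack("the movie was good", toy_loss))
```

Because each step changes at most one character, the adversarial example stays visually and semantically close to the original, which is the property the paper quantifies with USE similarity.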
https://arxiv.org/abs/2405.04346
In the task of talking face generation, the objective is to generate a face video with lips synchronized to the corresponding audio while preserving visual details and identity information. Current methods face the challenge of learning accurate lip synchronization while avoiding detrimental effects on visual quality, as well as robustly evaluating such synchronization. To tackle these problems, we propose utilizing an audio-visual speech representation expert (AV-HuBERT) for calculating lip synchronization loss during training. Moreover, leveraging AV-HuBERT's features, we introduce three novel lip synchronization evaluation metrics, aiming to provide a comprehensive assessment of lip synchronization performance. Experimental results, along with a detailed ablation study, demonstrate the effectiveness of our approach and the utility of the proposed evaluation metrics.
https://arxiv.org/abs/2405.04327