Using large language models (LLMs) for educational applications like dialogue-based teaching is a hot topic. Effective teaching, however, requires teachers to adapt the difficulty of content and explanations to the education level of their students. Even the best LLMs today struggle to do this well. If we want to improve LLMs on this adaptation task, we need to be able to measure adaptation success reliably. However, current Static metrics for text difficulty, like the Flesch-Kincaid Reading Ease score, are known to be crude and brittle. We therefore introduce and evaluate a new set of Prompt-based metrics for text difficulty. Grounded in a user study, these Prompt-based metrics are prompts given to LLMs as input; they leverage LLMs' general language-understanding capabilities to capture more abstract and complex features than Static metrics do. Regression experiments show that adding our Prompt-based metrics significantly improves text difficulty classification over Static metrics alone. Our results demonstrate the promise of using LLMs to evaluate how well text is adapted to different education levels.
https://arxiv.org/abs/2405.09482
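A minimal sketch of how such a feature pipeline could look, pairing one Static metric with one Prompt-based metric as classifier inputs. The `textstat` call is a standard readability metric; the model name, rating prompt, and single-feature setup are illustrative assumptions rather than the paper's actual metric set:

```python
import textstat
from openai import OpenAI
from sklearn.linear_model import LogisticRegression

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def static_features(text: str) -> list[float]:
    # One classic Static metric; the paper uses a broader set.
    return [textstat.flesch_reading_ease(text)]

def prompt_based_features(text: str) -> list[float]:
    # Hypothetical Prompt-based metric: ask an LLM for a 1-5 difficulty rating.
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{"role": "user", "content":
                   "On a scale of 1-5, how conceptually difficult is this "
                   "text for a middle-school student? Reply with a single "
                   "number only.\n\n" + text}],
        temperature=0,
    ).choices[0].message.content
    return [float(reply.strip())]  # assumes the model complies with the format

def fit_difficulty_classifier(texts, levels):
    # levels: education-level labels (e.g., elementary/high-school/college)
    X = [static_features(t) + prompt_based_features(t) for t in texts]
    return LogisticRegression(max_iter=1000).fit(X, levels)
```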
This paper presents a comprehensive analysis of explainable fact-checking through a series of experiments, focusing on the ability of large language models to verify public health claims and provide explanations or justifications for their veracity assessments. We examine the effectiveness of zero/few-shot prompting and parameter-efficient fine-tuning across various open and closed-source models, assessing their performance on both the isolated and the joint tasks of veracity prediction and explanation generation. Importantly, we employ a dual evaluation approach that combines previously established automatic metrics with a novel set of criteria applied through human evaluation. Our automatic evaluation indicates that, in the zero-shot scenario, GPT-4 is the standout performer, but in few-shot and parameter-efficient fine-tuning settings, open-source models demonstrate their capacity not only to bridge the performance gap but, in some instances, to surpass GPT-4. Human evaluation reveals yet more nuance and indicates potential problems with the gold explanations.
https://arxiv.org/abs/2405.09454
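As a rough illustration of the zero-shot setting, a single prompt can request both a veracity label and a justification. The label set and wording below are assumptions loosely modeled on public health fact-checking data, not the paper's exact prompt:

```python
from openai import OpenAI

client = OpenAI()

ZERO_SHOT_TEMPLATE = (
    "Claim: {claim}\n"
    "Is this public health claim true, false, a mixture, or unproven? "
    "Answer with one label on the first line, then justify your verdict "
    "in two or three sentences."
)

def verify_claim(claim: str, model: str = "gpt-4") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": ZERO_SHOT_TEMPLATE.format(claim=claim)}],
        temperature=0,  # keep outputs stable across evaluation runs
    )
    return response.choices[0].message.content
```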
Recently, large language models (LLMs) have outperformed human experts in predicting the results of neuroscience experiments (Luo et al., 2024). What is the basis for this performance? One possibility is that statistical patterns in that specific scientific literature, as opposed to emergent reasoning abilities arising from broader training, underlie LLMs' performance. To evaluate this possibility, we trained a relatively small 124M-parameter GPT-2 model on next-word prediction over 1.3 billion tokens of domain-specific text. Despite being orders of magnitude smaller than LLMs trained on trillions of tokens, the small models achieved expert-level performance in predicting neuroscience results. This held both when the models were trained from scratch with a tokenizer trained specifically on neuroscience text and when the neuroscience literature was used to fine-tune a pretrained GPT-2. Our results indicate that expert-level performance may be attained by even small LLMs through domain-specific, auto-regressive training approaches.
https://arxiv.org/abs/2405.09395
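The domain-specific training recipe can be approximated with the Hugging Face stack. A hedged sketch of next-word-prediction fine-tuning of the 124M GPT-2 follows; "neuro_corpus.txt" is a placeholder for the 1.3B-token neuroscience corpus, and the hyperparameters are illustrative:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tok = AutoTokenizer.from_pretrained("gpt2")  # the 124M-parameter model
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

ds = load_dataset("text", data_files="neuro_corpus.txt")["train"]
ds = ds.map(lambda ex: tok(ex["text"], truncation=True, max_length=1024),
            batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-neuro", num_train_epochs=1),
    train_dataset=ds,
    # mlm=False gives the causal (next-word prediction) objective
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```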
Background: Rapid advancements in natural language processing have led to the development of large language models with the potential to revolutionize mental health care. These models have shown promise in assisting clinicians and providing support to individuals experiencing various psychological challenges. Objective: This study aims to compare the performance of two large language models, GPT-4 and Chat-GPT, in responding to a set of 18 psychological prompts, to assess their potential applicability in mental health care settings. Methods: A blind methodology was employed, with a clinical psychologist evaluating the models' responses without knowledge of their origins. The prompts encompassed a diverse range of mental health topics, including depression, anxiety, and trauma, to ensure a comprehensive assessment. Results: The results demonstrated a significant difference in performance between the two models (p < 0.05). GPT-4 achieved an average rating of 8.29 out of 10, while Chat-GPT received an average rating of 6.52. The clinical psychologist's evaluation suggested that GPT-4 was more effective at generating clinically relevant and empathetic responses, thereby providing better support and guidance to potential users. Conclusions: This study contributes to the growing body of literature on the applicability of large language models in mental health care settings. The findings underscore the importance of continued research and development in the field to optimize these models for clinical use. Further investigation is necessary to understand the specific factors underlying the performance differences between the two models and to explore their generalizability across various populations and mental health conditions.
https://arxiv.org/abs/2405.09300
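For intuition, the significance test behind a claim like this can be sketched as a two-sample t-test over the blinded per-prompt ratings; this is a hypothetical helper, not the study's analysis code:

```python
from scipy import stats

def compare_models(ratings_a, ratings_b, alpha=0.05):
    """Two-sample t-test on blinded 1-10 ratings, one per prompt per model.

    Returns the t statistic, the p-value, and whether the difference is
    significant at the given alpha (significance requires p < alpha).
    """
    t, p = stats.ttest_ind(ratings_a, ratings_b)
    return t, p, p < alpha
```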
Despite the recent ubiquity of large language models and their high zero-shot prompted performance across a wide range of tasks, it is still not known how well they perform on tasks that require processing potentially idiomatic language. In particular, how well do such models perform in comparison to encoder-only models fine-tuned specifically for idiomaticity tasks? In this work, we attempt to answer this question by looking at the performance of a range of LLMs (both local and software-as-a-service models) on three idiomaticity datasets: SemEval 2022 Task 2a, FLUTE, and MAGPIE. Overall, we find that whilst these models do give competitive performance, they do not match the results of fine-tuned task-specific models, even at the largest scales (e.g., GPT-4). Nevertheless, we do see consistent performance improvements with model scale. Additionally, we investigate prompting approaches to improve performance, and discuss the practicalities of using LLMs for these tasks.
https://arxiv.org/abs/2405.09279
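A sketch of the zero-shot prompting setup for the binary idiomatic/literal decision, in the spirit of SemEval 2022 Task 2a; the prompt wording and model choice are assumptions:

```python
from openai import OpenAI

client = OpenAI()

def classify_idiomaticity(sentence: str, expression: str) -> str:
    prompt = (
        f'In the sentence below, is the expression "{expression}" used '
        f"idiomatically or literally? Answer with one word.\n\n{sentence}"
    )
    out = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return out.choices[0].message.content.strip().lower()

print(classify_idiomaticity("He finally kicked the bucket.", "kicked the bucket"))
```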
The problems of hallucination and omission, long-standing issues in machine translation (MT), are more pronounced when a large language model (LLM) is used for MT, because the LLM itself is susceptible to these phenomena. In this work, we mitigate the problem in an LLM-based MT model by guiding it towards better word alignment. We first study the correlation between word alignment and the phenomena of hallucination and omission in MT. We then propose utilizing word alignment as a preference signal to optimize the LLM-based MT model: preference data are constructed by selecting chosen and rejected translations from the outputs of multiple MT tools, and direct preference optimization is used to steer the LLM-based model towards the preference signal. Given the absence of evaluators specifically designed for hallucination and omission in MT, we further propose selecting hard instances and utilizing GPT-4 to directly evaluate how well the models mitigate these issues. We verify the soundness of these evaluation methods experimentally, and extensive results demonstrate the effectiveness of word-alignment-based preference optimization in mitigating hallucination and omission.
https://arxiv.org/abs/2405.09223
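The preference-optimization step can be made concrete with the standard DPO objective. The sketch below is a generic implementation over per-sequence log-probabilities, where the chosen translation is assumed to be the candidate with the best word-alignment score among the MT tools' outputs and the rejected one the worst:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit rewards are the beta-scaled log-ratios against the frozen
    # reference model; the loss pushes chosen above rejected.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```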
Transformer-based long context generative models are powering emerging AI applications such as hour-long video understanding and project-level coding agents. However, deploying long context transformers (e.g., 100K to 10M tokens) is prohibitively expensive compared to short context (e.g., 4K tokens) model variants, and reducing the cost of long-context transformers has become a pressing research and engineering challenge since 2024. This work describes a concurrent programming framework for quantitatively analyzing the efficiency challenges of serving multiple long-context requests under a limited GPU high-bandwidth memory (HBM) budget. We give a detailed analysis of how all additional computational costs, compared to 4K context, trace back to a single source: the large size of the KV cache. Using a 34B GPT-3.5 level model with 50K context on A100 NVLink as an example, we describe how its large KV cache causes four types of deployment challenges: (1) Prefilling long inputs takes much longer compute time and GPU memory than short inputs. (2) After prefilling, the large KV cache residing on the GPU HBM significantly restricts the number of concurrent users being served. (3) During decoding, repeatedly reading the KV cache from HBM to the SMs largely increases latency. (4) When KV cache memory overflows, swapping it from HBM to DDR causes significant context-switching latency. We use this framework to analyze existing works and identify possibilities of combining them to build end-to-end systems. Overall, this work offers a foundational framework for analyzing long context transformer deployment and identifies directions towards making 1M-context inference as cheap as 4K.
https://arxiv.org/abs/2405.08944
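The KV-cache arithmetic behind these challenges is easy to reproduce. The sketch below uses illustrative layer/head counts for a ~34B dense model (actual configurations vary) and shows why a single 50K-token request alone nearly fills an 80GB A100:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    # K and V each store [seq_len, n_kv_heads, head_dim] per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

# Assumed config: 48 layers, 56 KV heads of dim 128, fp16 (2 bytes).
gb = kv_cache_bytes(48, 56, 128, 50_000) / 1e9
print(f"{gb:.1f} GB of fp16 KV cache for one 50K-token request")  # ~68.8 GB
```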
We used a dictionary of biomedical terminology, extracted from sources such as DrugBank, MedDRA, MedlinePlus, and TCMGeneDIT, to tag more than 8 million Instagram posts made between 2010 and early 2016 by users who had mentioned an epilepsy-relevant drug at least once. A random sample of 1,771 posts with 2,947 term matches was evaluated by human annotators to identify false positives, and OpenAI's GPT series models were compared against human annotation. Frequent terms with a high false-positive rate were removed from the dictionary: analysis of the estimated false-positive rates of the annotated terms revealed 8 ambiguous terms (plus synonyms) used in Instagram posts, which were removed from the original dictionary. To study the effect of removing those terms, we constructed knowledge networks using the refined and the original dictionaries and performed an eigenvector-centrality analysis on both networks. We show that the refined dictionary leads to a significantly different ranking of important terms, as measured by their eigenvector centrality in the knowledge networks, and that the most important terms obtained after refinement are of greater medical relevance. In addition, we show that OpenAI's GPT series models fare worse than human annotators on this task.
https://arxiv.org/abs/2405.08784
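The centrality comparison can be reproduced with networkx. The sketch below uses invented co-occurrence edges purely for illustration ("hope" stands in for an ambiguous term removed during refinement):

```python
import networkx as nx

def centrality_ranking(edges):
    """Rank terms by eigenvector centrality in a term co-occurrence network."""
    g = nx.Graph()
    g.add_weighted_edges_from(edges)  # (term_a, term_b, co-occurrence count)
    cent = nx.eigenvector_centrality(g, weight="weight", max_iter=1000)
    return sorted(cent, key=cent.get, reverse=True)

original = [("keppra", "seizure", 12), ("seizure", "hope", 30), ("hope", "love", 25)]
refined = [(a, b, w) for a, b, w in original if "hope" not in (a, b)]
print(centrality_ranking(original), centrality_ranking(refined))
```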
Indian folk paintings present a rich mosaic of symbols, colors, textures, and stories, making them an invaluable repository of cultural legacy. This paper presents a novel approach to classifying these paintings into distinct art forms and tagging them with their unique salient features. A custom dataset named FolkTalent, comprising 2279 digital images of paintings across 12 different forms, has been prepared using websites that are direct outlets of Indian folk paintings. Tags covering a wide range of attributes, such as color, theme, artistic style, and patterns, are generated using GPT-4 and verified by an expert for each painting. Classification is performed by applying a Random Forest ensemble to fine-tuned Convolutional Neural Network (CNN) models, achieving an accuracy of 91.83%. Tagging is accomplished via fine-tuned CNN backbones with a custom classifier attached on top to perform multi-label image classification. The generated tags offer deeper insight into each painting, enabling an enhanced search experience based on theme and visual attributes. The proposed hybrid model sets a new benchmark in folk painting classification and tagging, contributing significantly to cataloging India's folk-art heritage.
https://arxiv.org/abs/2405.08776
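A hedged sketch of the hybrid design, with CNN features feeding a Random Forest; the ResNet-50 backbone, its ImageNet weights, and the forest size stand in for the paper's fine-tuned models:

```python
import torch
import torchvision.models as models
from torchvision import transforms
from sklearn.ensemble import RandomForestClassifier
from PIL import Image

backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()  # expose the 2048-d penultimate features
backbone.eval()

preprocess = transforms.Compose([transforms.Resize((224, 224)),
                                 transforms.ToTensor()])

@torch.no_grad()
def features(path: str) -> list[float]:
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    return backbone(x).squeeze(0).tolist()

def train_classifier(image_paths, labels):
    # labels: one of the 12 folk-art forms per painting
    X = [features(p) for p in image_paths]
    return RandomForestClassifier(n_estimators=300).fit(X, labels)
```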
We present Hunyuan-DiT, a text-to-image diffusion transformer with fine-grained understanding of both English and Chinese. To construct Hunyuan-DiT, we carefully design the transformer structure, text encoder, and positional encoding. We also build from scratch a whole data pipeline to update and evaluate data for iterative model optimization. For fine-grained language understanding, we train a Multimodal Large Language Model to refine the captions of the images. Finally, Hunyuan-DiT can perform multi-turn multimodal dialogue with users, generating and refining images according to the context. Through our holistic human evaluation protocol with more than 50 professional human evaluators, Hunyuan-DiT sets a new state-of-the-art in Chinese-to-image generation compared with other open-source models. Code and pretrained models are publicly available at this http URL
https://arxiv.org/abs/2405.08748
Since the release of ChatGPT and GPT-4, large language models (LLMs) and multimodal large language models (MLLMs) have garnered significant attention due to their powerful and general capabilities in understanding, reasoning, and generation, thereby offering new paradigms for the integration of artificial intelligence with medicine. This survey comprehensively overviews the development background and principles of LLMs and MLLMs, and explores their application scenarios, challenges, and future directions in medicine. Specifically, the survey begins by focusing on the paradigm shift, tracing the evolution from traditional models to LLMs and MLLMs and summarizing the model structures to provide detailed foundational knowledge. It then details, in a clear progression, the entire process of constructing, evaluating, and using LLMs and MLLMs. Following this, to emphasize the significant value of LLMs and MLLMs in healthcare, we survey and summarize six promising applications in healthcare. Finally, the survey discusses the challenges faced by medical LLMs and MLLMs and proposes a feasible approach and direction for the subsequent integration of artificial intelligence with medicine. This survey thus aims to provide researchers with a valuable and comprehensive reference guide covering the background, principles, and clinical applications of LLMs and MLLMs.
https://arxiv.org/abs/2405.08603
The rise of advanced large language models such as GPT-4, GPT-4o, and the Claude family has made fake audio detection increasingly challenging. Traditional fine-tuning methods struggle to keep pace with the evolving landscape of synthetic speech, necessitating continual learning approaches that can adapt to new audio while retaining the ability to detect older types. Continual learning is thus an effective tool for detecting newly emerged deepfake audio while maintaining performance on older types, but it lacks a well-constructed and user-friendly evaluation framework. To address this gap, we introduce EVDA, a benchmark for evaluating continual learning methods in deepfake audio detection. EVDA includes classic datasets from the Anti-Spoofing Voice series and the Chinese fake audio detection series, as well as newly generated deepfake audio from models such as GPT-4 and GPT-4o. It supports various continual learning techniques, including Elastic Weight Consolidation (EWC), Learning without Forgetting (LwF), and recent methods such as Regularized Adaptive Weight Modification (RAWM) and Radian Weight Modification (RWM). Additionally, EVDA facilitates the development of robust algorithms by providing an open interface for integrating new continual learning methods.
https://arxiv.org/abs/2405.08596
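One supported technique, Elastic Weight Consolidation, reduces to a quadratic penalty anchoring the parameters that mattered for earlier detection tasks. A minimal sketch, assuming fisher and old_params were precomputed after the previous task:

```python
import torch

def ewc_penalty(model: torch.nn.Module, fisher: dict, old_params: dict,
                lam: float = 1.0) -> torch.Tensor:
    # fisher[name]: diagonal Fisher estimate; old_params[name]: parameter
    # values after the previous deepfake-detection task.
    loss = torch.tensor(0.0)
    for name, p in model.named_parameters():
        if name in fisher:
            loss = loss + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return lam / 2 * loss

# total_loss = task_loss + ewc_penalty(model, fisher, old_params, lam=100.0)
```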
Conversation requires a substantial amount of coordination between dialogue participants, from managing turn taking to negotiating mutual understanding. Part of this coordination effort surfaces as the reuse of linguistic behaviour across speakers, a process often referred to as alignment. While the presence of linguistic alignment is well documented in the literature, several questions remain open, including the extent to which patterns of reuse across speakers have an impact on the emergence of labelling conventions for novel referents. In this study, we put forward a methodology for automatically detecting shared lemmatised constructions -- expressions with a common lexical core used by both speakers within a dialogue -- and apply it to a referential communication corpus where participants aim to identify novel objects for which no established labels exist. Our analyses uncover the usage patterns of shared constructions in interaction and reveal that features such as their frequency and the amount of different constructions used for a referent are associated with the degree of object labelling convergence the participants exhibit after social interaction. More generally, the present study shows that automatically detected shared constructions offer a useful level of analysis to investigate the dynamics of reference negotiation in dialogue.
https://arxiv.org/abs/2405.08546
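A rough sketch of shared-construction detection using lemmatised n-grams; spaCy lemmatization and bigram matching are implementation assumptions, as the paper's exact matching procedure may differ:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def lemma_ngrams(utterance: str, n: int = 2) -> set:
    lemmas = [t.lemma_.lower() for t in nlp(utterance) if t.is_alpha]
    return {tuple(lemmas[i:i + n]) for i in range(len(lemmas) - n + 1)}

def shared_constructions(speaker_a_turns, speaker_b_turns, n: int = 2) -> set:
    # Lemmatised n-grams produced by both speakers within one dialogue.
    a = set().union(*(lemma_ngrams(u, n) for u in speaker_a_turns))
    b = set().union(*(lemma_ngrams(u, n) for u in speaker_b_turns))
    return a & b
```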
The SemEval task on Argument Reasoning in Civil Procedure is challenging in that it requires understanding legal concepts and inferring complex arguments. Currently, most large language models (LLMs) excelling in the legal realm are designed principally for classification tasks, so their reasoning rationale is subject to contention. The approach we advocate uses a powerful teacher LLM (ChatGPT) to extend the training dataset with explanations and to generate synthetic data. The resulting data are then leveraged to fine-tune a small student LLM. Contrary to previous work, our explanations are not derived directly from the teacher's internal knowledge; instead, they are grounded in authentic human analyses and therefore deliver a superior reasoning signal. Additionally, a new 'mutation' method generates artificial data instances inspired by existing ones. We publicly release the explanations as an extension of the original dataset, along with the synthetic dataset and the prompts that were used to generate both. Our system ranked 15th in the SemEval competition. It outperforms its own teacher and can produce explanations aligned with the original human analyses, as verified by legal experts.
https://arxiv.org/abs/2405.08502
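The 'mutation' step might be sketched as prompting the teacher to produce a fresh instance inspired by an existing one; the prompt below is an invented stand-in for the released prompts:

```python
from openai import OpenAI

client = OpenAI()

MUTATION_PROMPT = (
    "Below is a legal argument-reasoning example (question, candidate "
    "answer, and analysis). Write a new, different example that tests the "
    "same legal concept.\n\n{example}"
)

def mutate(example: str, model: str = "gpt-3.5-turbo") -> str:
    out = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": MUTATION_PROMPT.format(example=example)}],
        temperature=1.0,  # encourage variation across mutations
    )
    return out.choices[0].message.content
```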
This paper investigates the application of GPT-3.5 for Grammatical Error Correction (GEC) in multiple languages in several settings: zero-shot GEC, fine-tuning for GEC, and using GPT-3.5 to re-rank correction hypotheses generated by other GEC models. In the zero-shot setting, we conduct automatic evaluations of the corrections proposed by GPT-3.5 using several methods: estimating grammaticality with language models (LMs), the Scribendi test, and comparing the semantic embeddings of sentences. GPT-3.5 has a known tendency to over-correct erroneous sentences and propose alternative corrections. For several languages, such as Czech, German, Russian, Spanish, and Ukrainian, GPT-3.5 substantially alters the source sentences, including their semantics, which presents significant challenges for evaluation with reference-based metrics. For English, GPT-3.5 demonstrates high recall, generates fluent corrections, and generally preserves sentence semantics. However, human evaluation for both English and Russian reveals that, despite its strong error-detection capabilities, GPT-3.5 struggles with several error types, including punctuation mistakes, tense errors, syntactic dependencies between words, and lexical compatibility at the sentence level.
https://arxiv.org/abs/2405.08469
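The semantic-embedding check can be sketched with sentence embeddings, where low source-correction similarity flags rewrites that alter meaning rather than fix grammar; the embedding model choice is an assumption:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

def semantic_preservation(source: str, correction: str) -> float:
    # Cosine similarity between the source and corrected sentences.
    emb = model.encode([source, correction], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

print(semantic_preservation("She go to school yesterday.",
                            "She went to school yesterday."))
```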
Stickers are increasingly used on social media to express sentiment and intent; when people find typing troublesome, they often send a sticker instead. Despite the significant impact of stickers on sentiment analysis and intent recognition, little research has been conducted on them. To address this gap, we propose a new task: Multimodal chat Sentiment Analysis and Intent Recognition involving Stickers (MSAIRS). Additionally, we introduce a novel multimodal dataset containing Chinese chat records and stickers excerpted from several mainstream social media platforms. Our dataset includes paired data with the same text but different stickers, as well as stickers consisting of the same images with different texts, allowing us to better understand the impact of stickers on chat sentiment and intent. We also propose an effective multimodal joint model, MMSAIR, for our task; validation on our dataset shows that the visual information in stickers matters. Our dataset and code will be made publicly available.
https://arxiv.org/abs/2405.08427
Integrated Speech and Large Language Models (SLMs) that can follow speech instructions and generate relevant text responses have gained popularity lately. However, the safety and robustness of these models remain largely unclear. In this work, we investigate the potential vulnerabilities of such instruction-following speech-language models to adversarial attacks and jailbreaking. Specifically, we design algorithms that can generate adversarial examples to jailbreak SLMs in both white-box and black-box attack settings, without human involvement. Additionally, we propose countermeasures to thwart such jailbreaking attacks. Our models, trained on dialog data with speech instructions, achieve state-of-the-art performance on the spoken question-answering task, scoring over 80% on both safety and helpfulness metrics. Despite safety guardrails, our jailbreaking experiments demonstrate the vulnerability of SLMs to adversarial perturbations and transfer attacks, with average attack success rates of 90% and 10%, respectively, when evaluated on a dataset of carefully designed harmful questions spanning 12 different toxic categories. However, we demonstrate that our proposed countermeasures reduce attack success significantly.
https://arxiv.org/abs/2405.08317
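For intuition about the white-box setting, a classic one-step FGSM perturbation applied to a raw waveform is sketched below; this is a generic illustration of the attack family, not the paper's jailbreaking algorithm:

```python
import torch

def fgsm_audio(model, waveform: torch.Tensor, target_loss_fn,
               epsilon: float = 1e-3) -> torch.Tensor:
    # target_loss_fn scores how far the model output is from the
    # attacker's desired (harmful) response; we descend on it.
    waveform = waveform.clone().detach().requires_grad_(True)
    loss = target_loss_fn(model(waveform))
    loss.backward()
    adversarial = waveform - epsilon * waveform.grad.sign()
    return adversarial.detach().clamp(-1.0, 1.0)  # keep a valid waveform
```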
ChatGPT is a conversational agent built on a large language model. Trained on a significant portion of human output, ChatGPT can mimic people to a degree. As such, we need to consider what social identities ChatGPT simulates (or can be designed to simulate). In this study, we explored the case of identity simulation through Japanese first-person pronouns, which are tightly connected to social identities in intersectional ways, i.e., intersectional pronouns. We conducted a controlled online experiment where people from two regions in Japan (Kanto and Kinki) witnessed interactions with ChatGPT using ten sets of first-person pronouns. We discovered that pronouns alone can evoke perceptions of social identities in ChatGPT at the intersections of gender, age, region, and formality, with caveats. This work highlights the importance of pronoun use for social identity simulation, provides a language-based methodology for culturally-sensitive persona development, and advances the potential of intersectional identities in intelligent agents.
https://arxiv.org/abs/2405.08238
Despite significant technological advancements, the process of programming robots for adaptive assembly remains labor-intensive, demanding expertise in multiple domains and often resulting in task-specific, inflexible code. This work explores the potential of Large Language Models (LLMs), like ChatGPT, to automate this process, leveraging their ability to understand natural language instructions, generalize examples to new tasks, and write code. In this paper, we suggest how these abilities can be harnessed and applied to real-world challenges in the manufacturing industry. We present a novel system that uses ChatGPT to automate the process of programming robots for adaptive assembly by decomposing complex tasks into simpler subtasks, generating robot control code, executing the code in a simulated workcell, and debugging syntax and control errors, such as collisions. We outline the architecture of this system and strategies for task decomposition and code generation. Finally, we demonstrate how our system can autonomously program robots for various assembly tasks in a real-world project.
https://arxiv.org/abs/2405.08216
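The decompose-generate-execute-debug loop might be organized as follows; the prompts, model choice, and the run_in_simulator callback (assumed to return an error string such as a collision report, or None on success) all stand in for the paper's system:

```python
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    out = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}])
    return out.choices[0].message.content

def program_assembly(task: str, run_in_simulator) -> str:
    subtasks = ask(f"Decompose this assembly task into simple steps:\n{task}")
    code = ask(f"Write robot control code for these steps:\n{subtasks}")
    for _ in range(3):  # bounded debugging retries
        error = run_in_simulator(code)
        if error is None:
            return code
        code = ask(f"This robot code failed with: {error}\nFix it:\n{code}")
    return code
```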
We introduce Many-Shot Regurgitation (MSR) prompting, a new black-box membership inference attack framework for examining verbatim content reproduction in large language models (LLMs). MSR prompting involves dividing the input text into multiple segments and creating a single prompt that includes a series of faux conversation rounds between a user and a language model to elicit verbatim regurgitation. We apply MSR prompting to diverse text sources, including Wikipedia articles and open educational resources (OER) textbooks, which provide high-quality, factual content and are continuously updated over time. For each source, we curate two dataset types: one that LLMs were likely exposed to during training ($D_{\rm pre}$) and another consisting of documents published after the models' training cutoff dates ($D_{\rm post}$). To quantify the occurrence of verbatim matches, we employ the Longest Common Substring algorithm and count the frequency of matches at different length thresholds. We then use statistical measures such as Cliff's delta, Kolmogorov-Smirnov (KS) distance, and Kruskal-Wallis H test to determine whether the distribution of verbatim matches differs significantly between $D_{\rm pre}$ and $D_{\rm post}$. Our findings reveal a striking difference in the distribution of verbatim matches between $D_{\rm pre}$ and $D_{\rm post}$, with the frequency of verbatim reproduction being significantly higher when LLMs (e.g. GPT models and LLaMAs) are prompted with text from datasets they were likely trained on. For instance, when using GPT-3.5 on Wikipedia articles, we observe a substantial effect size (Cliff's delta $= -0.984$) and a large KS distance ($0.875$) between the distributions of $D_{\rm pre}$ and $D_{\rm post}$. Our results provide compelling evidence that LLMs are more prone to reproducing verbatim content when the input text is likely sourced from their training data.
https://arxiv.org/abs/2405.08134
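Both core measurements are straightforward to sketch: a dynamic-programming longest common substring and a two-sample KS test over the match-length distributions of $D_{\rm pre}$ versus $D_{\rm post}$:

```python
from scipy.stats import ks_2samp

def longest_common_substring(a: str, b: str) -> int:
    # Length of the longest verbatim substring shared by a and b,
    # via O(len(a) * len(b)) dynamic programming with a rolling row.
    best, prev = 0, [0] * (len(b) + 1)
    for ca in a:
        cur = [0] * (len(b) + 1)
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def distribution_shift(matches_pre, matches_post):
    # matches_pre / matches_post: LCS lengths between model outputs and
    # the source documents in D_pre and D_post, respectively.
    return ks_2samp(matches_pre, matches_post)  # KS statistic and p-value
```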