This study evaluates the performance of Recurrent Neural Networks (RNNs) and Transformers in replicating cross-language structural priming: a key indicator of abstract grammatical representations in human language processing. Focusing on Chinese-English priming, which involves two typologically distinct languages, we examine how these models handle the robust phenomenon of structural priming, where exposure to a particular sentence structure increases the likelihood of selecting a similar structure subsequently. Additionally, we utilize large language models (LLMs) to measure the cross-lingual structural priming effect. Our findings indicate that Transformers outperform RNNs in generating primed sentence structures, challenging the conventional belief that human sentence processing primarily involves recurrent and immediate processing and suggesting a role for cue-based retrieval mechanisms. Overall, this work contributes to our understanding of how computational models may reflect human cognitive processes in multilingual contexts.
本研究评估了循环神经网络(RNN)和Transformer在复现跨语言结构启动方面的性能:结构启动是人类语言处理中抽象语法表示的关键指标。我们聚焦于汉语-英语启动,这涉及两种类型学上差异显著的语言,并考察这些模型如何处理结构启动这一稳健现象,即接触某种特定句法结构会提高随后选择类似结构的概率。此外,我们还利用大型语言模型(LLM)测量跨语言结构启动效应。我们的研究结果表明,Transformer在生成受启动影响的句法结构方面优于RNN,这挑战了人类句子加工主要依赖循环和即时加工的传统观念,并提示基于线索的检索机制可能发挥作用。总体而言,这项工作有助于我们理解计算模型如何在多语言环境中反映人类认知过程。
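As a rough illustration of how a priming effect can be quantified, the sketch below computes the difference in mean target log-probability after structure-congruent versus structure-incongruent primes. All names are hypothetical; `toy_score` is a fake stand-in for a real language model's log-probabilities.

```python
def priming_effect(score, primes_congruent, primes_incongruent, target):
    """Structural priming effect: difference in mean target log-probability
    after structure-congruent vs. structure-incongruent primes."""
    lp_con = sum(score(p, target) for p in primes_congruent) / len(primes_congruent)
    lp_inc = sum(score(p, target) for p in primes_incongruent) / len(primes_incongruent)
    return lp_con - lp_inc  # > 0 indicates a priming effect

def toy_score(prime, target):
    """Fake log-probability standing in for a real LM: higher when prime and
    target share the same dative structure (presence/absence of " to ")."""
    same_structure = (" to " in prime) == (" to " in target)
    return -4.0 if same_structure else -6.0

effect = priming_effect(
    toy_score,
    primes_congruent=["the boy gave the girl a book"],       # double-object prime
    primes_incongruent=["the boy gave a book to the girl"],  # prepositional prime
    target="the teacher gave the student a pen",             # double-object target
)
```

With a real model, `score` would return the model's conditional log-probability of the target given the prime as context.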
https://arxiv.org/abs/2405.09508
Using large language models (LLMs) for educational applications like dialogue-based teaching is a hot topic. Effective teaching, however, requires teachers to adapt the difficulty of content and explanations to the education level of their students. Even the best LLMs today struggle to do this well. If we want to improve LLMs on this adaptation task, we need to be able to measure adaptation success reliably. However, current Static metrics for text difficulty, like the Flesch-Kincaid Reading Ease score, are known to be crude and brittle. We therefore introduce and evaluate a new set of Prompt-based metrics for text difficulty. Based on a user study, we create Prompt-based metrics as inputs for LLMs. They leverage LLMs' general language understanding capabilities to capture more abstract and complex features than Static metrics. Regression experiments show that adding our Prompt-based metrics significantly improves text difficulty classification over Static metrics alone. Our results demonstrate the promise of using LLMs to evaluate text adaptation to different education levels.
使用大型语言模型(LLM)进行对话式教学等教育应用是一个热门话题。然而,有效的教学要求教师根据学生的教育水平调整内容和讲解的难度。即便是当今最好的LLM也难以做好这一点。如果我们希望在这项适应任务上改进LLM,就需要能够可靠地度量适应是否成功。然而,目前衡量文本难度的静态指标(如Flesch-Kincaid阅读难度分数)已被公认是粗糙且脆弱的。因此,我们引入并评估了一组新的基于提示的文本难度指标。基于一项用户研究,我们构建了作为LLM输入的基于提示的指标。它们利用LLM的通用语言理解能力,捕捉比静态指标更抽象、更复杂的特征。回归实验表明,在仅使用静态指标的基础上加入我们的基于提示的指标,可以显著提升文本难度分类效果。我们的结果证明了使用LLM评估文本对不同教育水平的适应性的潜力。
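For reference, the Flesch Reading Ease formula the abstract calls crude can be sketched as follows. The `count_syllables` heuristic is a naive assumption for illustration; production implementations use proper syllabification.

```python
import re

def count_syllables(word):
    """Naive syllable estimate: count runs of vowel letters."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch Reading Ease:
    206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words).
    Higher scores mean easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * len(words) / len(sentences) - 84.6 * syllables / len(words)

simple = flesch_reading_ease("The cat sat. The dog ran. We all had fun.")
complex_ = flesch_reading_ease(
    "Institutional prioritisation necessitates comprehensive "
    "reconsideration of pedagogical methodologies."
)
```

The brittleness the abstract mentions is visible here: the score only sees sentence length and syllable counts, never meaning.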
https://arxiv.org/abs/2405.09482
Recently, large language models (LLMs) have outperformed human experts in predicting the results of neuroscience experiments (Luo et al., 2024). What is the basis for this performance? One possibility is that statistical patterns in that specific scientific literature, as opposed to emergent reasoning abilities arising from broader training, underlie LLMs' performance. To evaluate this possibility, we trained a relatively small 124M-parameter GPT-2 model, via next-word prediction, on 1.3 billion tokens of domain-specific knowledge. Despite being orders of magnitude smaller than larger LLMs trained on trillions of tokens, small models achieved expert-level performance in predicting neuroscience results. Small models trained on the neuroscience literature succeeded when they were trained from scratch using a tokenizer specifically trained on neuroscience text or when the neuroscience literature was used to finetune a pretrained GPT-2. Our results indicate that expert-level performance may be attained by even small LLMs through domain-specific, auto-regressive training approaches.
近来,大型语言模型(LLM)在预测神经科学实验结果方面已超越人类专家(Luo等人,2024)。这种表现的基础是什么?一种可能性是,LLM的表现源于该特定科学文献中的统计模式,而非更广泛训练所产生的涌现推理能力。为了评估这种可能性,我们在13亿个领域特定token上以下一词预测的方式训练了一个规模相对较小的1.24亿参数GPT-2模型。尽管其规模比在数万亿token上训练的大型LLM小几个数量级,小模型在预测神经科学结果方面仍达到了专家水平。无论是使用专门在神经科学文本上训练的tokenizer从头训练,还是用神经科学文献微调预训练的GPT-2,在神经科学文献上训练的小模型都取得了成功。我们的结果表明,即使是小型LLM,也可能通过领域特定的自回归训练方法达到专家级性能。
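The auto-regressive next-word-prediction objective can be illustrated, far below GPT-2 scale, with a toy bigram model trained by counting over a miniature "domain corpus". All names and data here are invented for illustration; the point is only that domain-specific training lowers perplexity on in-domain text.

```python
import math
from collections import Counter, defaultdict

def train_bigram(corpus_tokens):
    """'Next-word prediction' in miniature: bigram counts with add-one smoothing."""
    counts = defaultdict(Counter)
    vocab = set(corpus_tokens)
    for prev, nxt in zip(corpus_tokens, corpus_tokens[1:]):
        counts[prev][nxt] += 1

    def logprob(prev, nxt):
        total = sum(counts[prev].values())
        return math.log((counts[prev][nxt] + 1) / (total + len(vocab) + 1))

    return logprob

def perplexity(logprob, tokens):
    """Exponentiated average negative log-probability over the token sequence."""
    lps = [logprob(p, n) for p, n in zip(tokens, tokens[1:])]
    return math.exp(-sum(lps) / len(lps))

domain = "the neuron fires the neuron spikes the synapse adapts the neuron fires".split()
lp = train_bigram(domain)
ppl_in = perplexity(lp, "the neuron fires".split())           # in-domain text
ppl_out = perplexity(lp, "the stock market rallies".split())  # out-of-domain text
```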
https://arxiv.org/abs/2405.09395
Recent advances in large language models (LLMs) have led to their extensive global deployment, and ensuring their safety calls for comprehensive and multilingual toxicity evaluations. However, existing toxicity benchmarks are overwhelmingly focused on English, posing serious risks to deploying LLMs in other languages. We address this by introducing PolygloToxicityPrompts (PTP), the first large-scale multilingual toxicity evaluation benchmark of 425K naturally occurring prompts spanning 17 languages. We overcome the scarcity of naturally occurring toxicity in web-text and ensure coverage across languages with varying resources by automatically scraping over 100M web-text documents. Using PTP, we investigate research questions to study the impact of model size, prompt language, and instruction and preference-tuning methods on toxicity by benchmarking over 60 LLMs. Notably, we find that toxicity increases as language resources decrease or model size increases. Although instruction- and preference-tuning reduce toxicity, the choice of preference-tuning method does not have any significant impact. Our findings shed light on crucial shortcomings of LLM safeguarding and highlight areas for future research.
近年来,大型语言模型(LLM)的进步使其在全球范围内得到广泛部署,而确保其安全性需要全面且多语言的毒性评估。然而,现有的毒性基准绝大多数聚焦于英语,给在其他语言中部署LLM带来了严重风险。为此,我们引入了PolygloToxicityPrompts(PTP),这是首个大规模多语言毒性评估基准,包含跨17种语言的42.5万条自然出现的提示。我们通过自动爬取超过1亿份网页文本文档,克服了网络文本中自然出现毒性内容稀缺的问题,并确保覆盖资源水平各异的语言。利用PTP,我们对60多个LLM进行了基准测试,研究了模型规模、提示语言以及指令与偏好调优方法对毒性的影响。值得注意的是,我们发现随着语言资源减少或模型规模增大,毒性会增加。尽管指令调优和偏好调优都能降低毒性,但偏好调优方法的选择并没有显著影响。我们的研究结果揭示了LLM安全防护的关键不足,并指出了未来研究的方向。
https://arxiv.org/abs/2405.09373
Existing debiasing methods inevitably make unreasonable or undesired predictions, as they are designed and evaluated to achieve parity across different social groups but leave aside individual facts, resulting in modified existing knowledge. In this paper, we first establish a new bias mitigation benchmark, BiasKE, leveraging existing and additionally constructed datasets, which systematically assesses debiasing performance by complementary metrics on fairness, specificity, and generalization. Meanwhile, we propose a novel debiasing method, Fairness Stamp (FAST), which enables editable fairness through fine-grained calibration on individual biased knowledge. Comprehensive experiments demonstrate that FAST surpasses state-of-the-art baselines with remarkable debiasing performance while not hampering overall model capability for knowledge preservation, highlighting the prospect of fine-grained debiasing strategies for editable fairness in LLMs.
现有的去偏方法不可避免地会做出不合理或不符合预期的预测,因为它们的设计与评估目标是在不同社会群体之间实现均等,却把个体事实搁置一旁,导致已有知识被修改。在本文中,我们首先利用现有数据集和额外构建的数据集,建立了一个新的偏见缓解基准BiasKE,通过公平性、特异性和泛化性这几个互补指标来系统地评估去偏性能。同时,我们提出了一种新的去偏方法,称为公平性印章(FAST),它通过对个体偏见知识进行细粒度校准来实现可编辑的公平性。全面的实验证明,FAST以显著的去偏性能超越了最先进的基线,同时不损害模型在知识保留方面的整体能力,凸显了细粒度去偏策略在LLM可编辑公平性方面的前景。
https://arxiv.org/abs/2405.09341
Despite the recent ubiquity of large language models and their high zero-shot prompted performance across a wide range of tasks, it is still not known how well they perform on tasks which require processing of potentially idiomatic language. In particular, how well do such models perform in comparison to encoder-only models fine-tuned specifically for idiomaticity tasks? In this work, we attempt to answer this question by looking at the performance of a range of LLMs (both local and software-as-a-service models) on three idiomaticity datasets: SemEval 2022 Task 2a, FLUTE, and MAGPIE. Overall, we find that whilst these models do give competitive performance, they do not match the results of fine-tuned task-specific models, even at the largest scales (e.g. for GPT-4). Nevertheless, we do see consistent performance improvements across model scale. Additionally, we investigate prompting approaches to improve performance, and discuss the practicalities of using LLMs for these tasks.
尽管大型语言模型近来无处不在,并且在广泛任务上具有很高的零样本提示性能,但它们在需要处理潜在习语性语言的任务上的表现如何仍不清楚。特别是,与专门针对习语性任务微调的仅编码器模型相比,这些模型表现如何?在这项工作中,我们通过考察一系列LLM(包括本地模型和软件即服务模型)在三个习语性数据集(SemEval 2022 Task 2a、FLUTE和MAGPIE)上的表现来尝试回答这个问题。总体而言,我们发现尽管这些模型确实具有竞争力,但即使在最大规模下(例如GPT-4),它们也达不到针对特定任务微调的模型的结果。不过,我们确实观察到性能随模型规模扩大而持续提升。此外,我们研究了改进性能的提示方法,并讨论了将LLM用于这些任务的实用性问题。
https://arxiv.org/abs/2405.09279
The problem of hallucination and omission, a long-standing problem in machine translation (MT), is more pronounced when a large language model (LLM) is used in MT because an LLM itself is susceptible to these phenomena. In this work, we mitigate the problem in an LLM-based MT model by guiding it to better word alignment. We first study the correlation between word alignment and the phenomena of hallucination and omission in MT. Then we propose to utilize word alignment as preference to optimize the LLM-based MT model. The preference data are constructed by selecting chosen and rejected translations from multiple MT tools. Subsequently, direct preference optimization is used to optimize the LLM-based model towards the preference signal. Given the absence of evaluators specifically designed for hallucination and omission in MT, we further propose selecting hard instances and utilizing GPT-4 to directly evaluate the performance of the models in mitigating these issues. We verify the rationality of these designed evaluation methods by experiments, followed by extensive results demonstrating the effectiveness of word alignment-based preference optimization to mitigate hallucination and omission.
幻觉和遗漏是机器翻译(MT)中长期存在的问题,当使用大型语言模型(LLM)进行机器翻译时,该问题更为突出,因为LLM本身就容易受到这些现象的影响。在本文中,我们通过引导基于LLM的MT模型实现更好的词对齐来缓解这个问题。我们首先研究了词对齐与MT中幻觉和遗漏现象之间的相关性。然后,我们提出将词对齐作为偏好来优化基于LLM的MT模型。偏好数据是通过从多个MT工具的译文中挑选被选中和被拒绝的翻译构建的。随后,我们使用直接偏好优化使基于LLM的模型朝偏好信号的方向优化。鉴于MT中缺乏专门针对幻觉和遗漏设计的评估工具,我们进一步提出选择困难实例并利用GPT-4直接评估模型缓解这些问题的表现。我们通过实验验证了这些评估方法设计的合理性,随后的大量结果表明,基于词对齐的偏好优化能够有效缓解幻觉和遗漏。
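The direct preference optimization step mentioned above minimizes, for each chosen/rejected pair, a log-sigmoid loss over the policy-versus-reference log-probability margin. A minimal sketch follows; β and the toy log-probabilities are illustrative assumptions, not the paper's settings.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair:
    -log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l)))."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# No preference yet: the policy matches the reference model, so the margin is zero.
weak = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# The policy now favours the chosen translation and disfavours the rejected one.
strong = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```

Training pushes the policy to widen this margin on pairs where the chosen translation is better aligned, which drives the loss down.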
https://arxiv.org/abs/2405.09223
The current safeguard mechanisms for large language models (LLMs) are indeed susceptible to jailbreak attacks, making them inherently fragile. Even the process of fine-tuning on apparently benign data for downstream tasks can jeopardize safety. One potential solution is to conduct safety fine-tuning subsequent to downstream fine-tuning. However, there's a risk of catastrophic forgetting during safety fine-tuning, where LLMs may regain safety measures but lose the task-specific knowledge acquired during downstream fine-tuning. In this paper, we introduce a safety realignment framework through subspace-oriented model fusion (SOMF), aiming to combine the safeguard capabilities of the initially aligned model and the current fine-tuned model into a realigned model. Our approach begins by disentangling all task vectors from the weights of each fine-tuned model. We then identify safety-related regions within these vectors by subspace masking techniques. Finally, we explore the fusion of the initial safely aligned LLM with all task vectors based on the identified safety subspace. We validate that our safety realignment framework satisfies the safety requirements of a single fine-tuned model as well as multiple models during their fusion. Our findings confirm that SOMF preserves safety without notably compromising performance on downstream tasks, including instruction following in Chinese, English, and Hindi, as well as problem-solving capabilities in Code and Math.
目前大型语言模型(LLM)的安全保障机制确实容易受到越狱攻击,这使得它们本质上十分脆弱。即使是在看似无害的数据上为下游任务进行微调,也可能危及安全性。一个潜在的解决方案是在下游微调之后再进行安全微调。然而,安全微调过程中存在灾难性遗忘的风险:LLM可能重新获得安全措施,却丢失在下游微调中习得的任务特定知识。在本文中,我们引入了一个基于面向子空间的模型融合(SOMF)的安全重对齐框架,旨在将最初对齐模型的安全保障能力与当前微调模型的能力结合为一个重新对齐的模型。我们的方法首先从每个微调模型的权重中解耦出所有任务向量,然后使用子空间掩码技术在这些向量中识别与安全相关的区域,最后基于识别出的安全子空间探索最初安全对齐的LLM与所有任务向量的融合。我们验证了该安全重对齐框架既能满足单个微调模型的安全要求,也能满足多个模型融合时的安全要求。我们的研究结果证实,SOMF在保持安全性的同时,不会显著牺牲下游任务的性能,包括中文、英语和印地语的指令跟随,以及代码和数学方面的问题求解能力。
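The task-vector arithmetic underlying this kind of model fusion can be sketched roughly as below. The dict-of-scalars weights and the boolean `safety_mask` are simplifying assumptions standing in for real parameter tensors and the paper's actual subspace-identification procedure.

```python
def task_vector(finetuned, base):
    """Task vector: parameter-wise delta between a fine-tuned and base model."""
    return {k: finetuned[k] - base[k] for k in base}

def fuse_with_safety_mask(aligned, vectors, safety_mask):
    """Add task vectors onto the safely aligned weights, skipping coordinates
    the mask flags as safety-critical so alignment there is preserved."""
    fused = dict(aligned)
    for vec in vectors:
        for k, delta in vec.items():
            if not safety_mask.get(k, False):  # leave safety-critical coords alone
                fused[k] += delta
    return fused

base = {"w1": 0.0, "w2": 0.0}
aligned = {"w1": 0.5, "w2": -0.3}    # w2 assumed safety-critical here
finetuned = {"w1": 1.0, "w2": 2.0}
vec = task_vector(finetuned, base)   # {"w1": 1.0, "w2": 2.0}
fused = fuse_with_safety_mask(aligned, [vec], {"w2": True})
```

In the fused result, the downstream update to `w1` survives while the safety-critical `w2` keeps its aligned value.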
https://arxiv.org/abs/2405.09055
Today's analog/mixed-signal (AMS) integrated circuit (IC) designs demand substantial manual intervention. The advent of multimodal large language models (MLLMs) has unveiled significant potential across various fields, suggesting their applicability in streamlining large-scale AMS IC design as well. A bottleneck in employing MLLMs for automatic AMS circuit generation is the absence of a comprehensive dataset delineating the schematic-netlist relationship. We therefore design an automatic technique for converting schematics into netlists, and create dataset AMSNet, encompassing transistor-level schematics and corresponding SPICE format netlists. With a growing size, AMSNet can significantly facilitate exploration of MLLM applications in AMS circuit design. We have made an initial set of netlists public, and will make both our netlist generation tool and the full dataset available upon publishing of this paper.
当今的模拟/混合信号(AMS)集成电路(IC)设计需要大量的人工干预。多模态大型语言模型(MLLM)的出现在各个领域展现了巨大潜力,表明其同样适用于简化大规模AMS IC设计。将MLLM用于自动AMS电路生成的一个瓶颈是缺乏一个描述电路图与网表对应关系的全面数据集。因此,我们设计了一种将电路图自动转换为网表的技术,并创建了数据集AMSNet,其中包含晶体管级电路图及相应的SPICE格式网表。随着规模不断扩大,AMSNet可以显著促进MLLM在AMS电路设计中应用的探索。我们已经公开了一批初始网表,并将在本文发表后公开我们的网表生成工具和完整数据集。
https://arxiv.org/abs/2405.09045
We propose a new paradigm for machine translation that is particularly useful for no-resource languages (those without any publicly available bilingual or monolingual corpora): \acronym (LLM-Assisted Rule Based Machine Translation). Using the \acronym paradigm, we design the first language education/revitalization-oriented machine translator for Owens Valley Paiute (OVP), a critically endangered Indigenous American language for which there is virtually no publicly available data. We present a detailed evaluation of the translator's components: a rule-based sentence builder, an OVP to English translator, and an English to OVP translator. We also discuss the potential of the paradigm, its limitations, and the many avenues for future research that it opens up.
我们提出了一种对无资源语言(即没有任何公开可用的双语或单语语料库的语言)特别有用的机器翻译新范式:\acronym(LLM辅助的基于规则的机器翻译,LLM-Assisted Rule Based Machine Translation)。利用\acronym范式,我们为欧文斯谷派尤特语(OVP)设计了首个面向语言教育/语言复兴的机器翻译器。OVP是一种极度濒危的美洲原住民语言,几乎没有任何公开可用的数据。我们对翻译器的各组成部分进行了详细评估:基于规则的句子构建器、OVP到英语翻译器和英语到OVP翻译器。我们还讨论了该范式的潜力与局限性,以及它所开启的诸多未来研究方向。
https://arxiv.org/abs/2405.08997
"How does the person in the bounding box feel?" Achieving human-level recognition of the apparent emotion of a person in real world situations remains an unsolved task in computer vision. Facial expressions are not enough: body pose, contextual knowledge, and commonsense reasoning all contribute to how humans perform this emotional theory of mind task. In this paper, we examine two major approaches enabled by recent large vision language models: 1) image captioning followed by a language-only LLM, and 2) vision language models, under zero-shot and fine-tuned setups. We evaluate the methods on the Emotions in Context (EMOTIC) dataset and demonstrate that a vision language model, fine-tuned even on a small dataset, can significantly outperform traditional baselines. The results of this work aim to help robots and agents perform emotionally sensitive decision-making and interaction in the future.
"边界框中的这个人感觉如何?"在现实世界情境中对人的表观情绪实现人类水平的识别,仍是计算机视觉中尚未解决的任务。仅靠面部表情是不够的:身体姿态、上下文知识和常识推理都参与了人类完成这项情绪心理理论任务的方式。在本文中,我们考察了由近期大型视觉语言模型带来的两类主要方法:1)先生成图像描述,再交由纯语言LLM处理;2)视觉语言模型,分别在零样本和微调设置下。我们在Emotions in Context(EMOTIC)数据集上评估了这些方法,并证明即使仅在小数据集上微调,视觉语言模型也能显著优于传统基线。本工作的结果旨在帮助机器人和智能体在未来进行情感敏感的决策与交互。
https://arxiv.org/abs/2405.08992
What can contemporary machine learning (ML) models do? Given the proliferation of ML models in society, answering this question matters to a variety of stakeholders, both public and private. The evaluation of models' capabilities is rapidly emerging as a key subfield of modern ML, buoyed by regulatory attention and government grants. Despite this, the notion of an ML model possessing a capability has not been interrogated: what are we saying when we say that a model is able to do something? And what sorts of evidence bear upon this question? In this paper, we aim to answer these questions, using the capabilities of large language models (LLMs) as a running example. Drawing on the large philosophical literature on abilities, we develop an account of ML models' capabilities which can be usefully applied to the nascent science of model evaluation. Our core proposal is a conditional analysis of model abilities (CAMA): crudely, a machine learning model has a capability to X just when it would reliably succeed at doing X if it 'tried'. The main contribution of the paper is making this proposal precise in the context of ML, resulting in an operationalisation of CAMA applicable to LLMs. We then put CAMA to work, showing that it can help make sense of various features of ML model evaluation practice, as well as suggest procedures for performing fair inter-model comparisons.
当代机器学习(ML)模型能做什么?鉴于ML模型在社会中的普及,回答这个问题对公共和私营部门的各类利益相关者都很重要。在监管关注和政府资助的推动下,对模型能力的评估正迅速成为现代ML的一个关键子领域。尽管如此,"ML模型具备某种能力"这一概念尚未得到审视:当我们说一个模型能够做某件事时,我们究竟在说什么?哪些证据与这个问题相关?在本文中,我们旨在回答这些问题,并以大型语言模型(LLM)的能力作为贯穿全文的例子。借鉴关于能力的大量哲学文献,我们为ML模型的能力建立了一种可有效应用于新兴的模型评估科学的论述。我们的核心提议是模型能力的条件分析(CAMA):粗略地说,当且仅当一个机器学习模型在"尝试"做X时能够可靠地成功,它才具备做X的能力。本文的主要贡献是在ML背景下将这一提议精确化,从而得到适用于LLM的CAMA操作化方案。随后我们运用CAMA,展示它有助于理解ML模型评估实践的各种特征,并提出进行公平的模型间比较的程序。
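One naive way to operationalise CAMA's "would reliably succeed if it tried" reading (a hypothetical sketch, not the paper's formalisation) is to take the model's best elicitation condition as its "trying" and ask whether success there clears a reliability threshold:

```python
def has_capability(attempts, reliability=0.8):
    """Crude CAMA reading: the model 'has' the capability if, under its best
    elicitation condition (our stand-in for 'trying'), it succeeds at least
    `reliability` of the time."""
    best = max(attempts.values(), key=lambda results: sum(results) / len(results))
    return sum(best) / len(best) >= reliability

# Hypothetical per-condition success records (1 = success) for some task X.
attempts = {
    "plain prompt":     [1, 0, 0, 1],
    "few-shot prompt":  [1, 1, 1, 0],
    "chain-of-thought": [1, 1, 1, 1],
}
```

This also hints at why CAMA matters for fair inter-model comparison: scoring every model only under a "plain prompt" would confuse failing to try with lacking the capability.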
https://arxiv.org/abs/2405.08989
Current video summarization methods primarily depend on supervised computer vision techniques, which demand time-consuming manual annotations. Further, the annotations are always subjective, which makes this task more challenging. To address these issues, we analyzed the feasibility of transforming video summarization into a text summarization task and leveraging Large Language Models (LLMs) to boost video summarization. This paper proposes a novel self-supervised framework for video summarization guided by LLMs. Our method begins by generating captions for video frames, which are then synthesized into text summaries by LLMs. Subsequently, we measure the semantic distance between the frame captions and the text summary. Notably, we propose a novel loss function that optimizes our model according to the diversity of the video. Finally, the summarized video can be generated by selecting the frames whose captions are similar to the text summary. Our model achieves competitive results against other state-of-the-art methods and paves a novel pathway in video summarization.
目前的视频摘要方法主要依赖有监督的计算机视觉技术,这需要耗时的人工标注。此外,标注往往是主观的,使这项任务更具挑战性。为了解决这些问题,我们分析了将视频摘要转化为文本摘要任务的可行性,并利用大型语言模型(LLM)来提升视频摘要效果。本文提出了一种由LLM指导的新型自监督视频摘要框架。我们的方法首先为视频帧生成描述,然后由LLM将其综合为文本摘要。接下来,我们测量帧描述与文本摘要之间的语义距离。值得注意的是,我们提出了一个新颖的损失函数,根据视频的多样性来优化模型。最后,通过选择描述与文本摘要相似的帧来生成摘要视频。我们的模型取得了与其他最先进方法相当的结果,为视频摘要开辟了一条新路径。
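The final frame-selection step can be sketched as below, with bag-of-words cosine similarity standing in (as an assumption) for whatever semantic distance the framework actually uses over caption and summary embeddings:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_frames(captions, summary, k=2):
    """Keep the k frames whose captions are closest to the text summary."""
    sv = Counter(summary.lower().split())
    scored = [(cosine(Counter(c.lower().split()), sv), i) for i, c in enumerate(captions)]
    return sorted(i for _, i in sorted(scored, reverse=True)[:k])

captions = ["a dog runs on the beach", "a man types at a desk", "the dog swims in the sea"]
summary = "a dog plays at the beach and in the sea"
```

Here the desk frame is dropped because its caption is far from the summary, mirroring how frames irrelevant to the LLM-generated summary are excluded.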
https://arxiv.org/abs/2405.08890
Autonomous tuning of particle accelerators is an active and challenging field of research with the goal of enabling novel accelerator technologies for cutting-edge, high-impact applications, such as physics discovery, cancer research and material sciences. A key challenge with autonomous accelerator tuning remains that the most capable algorithms require an expert in optimisation, machine learning or a similar field to implement the algorithm for every new tuning task. In this work, we propose the use of large language models (LLMs) to tune particle accelerators. We demonstrate on a proof-of-principle example the ability of LLMs to successfully and autonomously tune a particle accelerator subsystem based on nothing more than a natural language prompt from the operator, and compare the performance of our LLM-based solution to state-of-the-art optimisation algorithms, such as Bayesian optimisation (BO) and reinforcement learning-trained optimisation (RLO). In doing so, we also show how LLMs can perform numerical optimisation of a highly non-linear real-world objective function. Ultimately, this work represents yet another complex task that LLMs are capable of solving and promises to help accelerate the deployment of autonomous tuning algorithms to the day-to-day operations of particle accelerators.
粒子加速器的自主调节是一个活跃且富有挑战的研究领域,其目标是使新型加速器技术服务于物理发现、癌症研究和材料科学等前沿高影响应用。自主加速器调节的一个关键挑战在于,能力最强的算法需要优化、机器学习或相关领域的专家为每个新的调节任务实现算法。在这项工作中,我们提出使用大型语言模型(LLM)来调节粒子加速器。我们通过一个原理验证示例展示了LLM仅凭操作员的自然语言提示即可成功且自主地调节粒子加速器子系统的能力,并将我们基于LLM的方案与贝叶斯优化(BO)和强化学习训练的优化(RLO)等最先进的优化算法进行了性能比较。在此过程中,我们还展示了LLM如何对高度非线性的现实世界目标函数执行数值优化。最终,这项工作展示了LLM能够解决的又一项复杂任务,并有望加速自主调节算法在粒子加速器日常运行中的部署。
https://arxiv.org/abs/2405.08888
Current datasets for long-form video understanding often fall short of providing genuine long-form comprehension challenges, as many tasks derived from these datasets can be successfully tackled by analyzing just one or a few random frames from a video. To address this issue, we present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding. This paper details our innovative approach for creating a question-answer dataset, utilizing advanced LLMs with human-in-the-loop and building upon human-generated raw data. Our comprehensive dataset comprises 305,000 multiple-choice questions (MCQs), covering various visual and multimodal aspects, including temporal comprehension, understanding human-object interactions, and reasoning about events or actions within a scene. Additionally, we evaluate recent video-centric LLMs, both open-source and proprietary, on the test split of our dataset. The findings reveal that even state-of-the-art video-centric LLMs significantly lag behind human performance in these tasks, highlighting the complexity and challenge inherent in video understanding. The dataset is available at this https URL
目前用于长视频理解的数据集往往无法提供真正的长视频理解挑战,因为源自这些数据集的许多任务只需分析视频中的一帧或几帧随机帧就能成功解决。为解决这个问题,我们提出了一个专为真实长视频理解设计的新数据集和基准CinePile。本文详细介绍了我们创建问答数据集的创新方法:利用先进的LLM并结合人在回路,在人工生成的原始数据基础上构建。我们的数据集共包含305,000道多项选择题(MCQ),涵盖多种视觉和多模态方面,包括时序理解、对人与物体交互的理解,以及对场景中事件或动作的推理。此外,我们在数据集的测试集上评估了近期的开源和专有视频中心LLM。结果显示,即使是最先进的视频中心LLM在这些任务上也显著落后于人类表现,凸显了视频理解固有的复杂性与挑战性。数据集可在 this https URL 获取。
https://arxiv.org/abs/2405.08813
This paper explores the potential of large language models (LLMs) to make the Aeronautical Regulations of Colombia (RAC) more accessible. Given the complexity and extensive technicality of the RAC, this study introduces a novel approach to simplifying these regulations for broader understanding. By developing the first-ever RAC database, which contains 24,478 expertly labeled question-and-answer pairs, and fine-tuning LLMs specifically for RAC applications, the paper outlines the methodology for dataset assembly, expert-led annotation, and model training. Utilizing the Gemma1.1 2b model along with advanced techniques like Unsloth for efficient VRAM usage and flash attention mechanisms, the research aims to expedite training processes. This initiative establishes a foundation to enhance the comprehensibility and accessibility of RAC, potentially benefiting novices and reducing dependence on expert consultations for navigating the aviation industry's regulatory landscape. You can visit the dataset (this https URL) and the model (this https URL) here.
本文探讨了大型语言模型(LLM)在提高哥伦比亚航空条例(RAC)可及性方面的潜力。鉴于RAC的复杂性和高度的技术性,本研究引入了一种简化这些条例以便更广泛理解的新方法。通过开发首个RAC数据库(包含24,478个由专家标注的问答对),并专门针对RAC应用微调LLM,本文概述了数据集组装、专家主导标注和模型训练的方法。该研究利用Gemma1.1 2b模型以及Unsloth(用于高效使用显存)和flash attention机制等先进技术来加速训练过程。这项工作为提高RAC的可理解性和可及性奠定了基础,有望惠及新手,并减少在应对航空业监管环境时对专家咨询的依赖。你可以在此访问数据集(this https URL)和模型(this https URL)。
https://arxiv.org/abs/2405.08792
The Prostate Imaging Reporting and Data System (PI-RADS) is pivotal in the diagnosis of clinically significant prostate cancer through MRI imaging. Current deep learning-based PI-RADS scoring methods often lack the incorporation of essential PI-RADS clinical guidelines (PICG) utilized by radiologists, potentially compromising scoring accuracy. This paper introduces a novel approach that adapts a multi-modal large language model (MLLM) to incorporate PICG into PI-RADS scoring without additional annotations and network parameters. We present a two-stage fine-tuning process aimed at adapting MLLMs originally trained on natural images to the MRI data domain while effectively integrating the PICG. In the first stage, we develop a domain adapter layer specifically tailored for processing 3D MRI image inputs and design the MLLM instructions to differentiate MRI modalities effectively. In the second stage, we translate PICG into guiding instructions for the model to generate PICG-guided image features. Through feature distillation, we align scoring network features with the PICG-guided image feature, enabling the scoring network to effectively incorporate the PICG information. We develop our model on a public dataset and evaluate it in a real-world challenging in-house dataset. Experimental results demonstrate that our approach improves the performance of current scoring networks.
前列腺影像报告和数据系统(PI-RADS)在通过MRI成像诊断临床显著前列腺癌方面具有关键作用。当前基于深度学习的PI-RADS评分方法往往没有纳入放射科医生所使用的关键PI-RADS临床指南(PICG),可能损害评分准确性。本文介绍了一种新方法,通过改造多模态大语言模型(MLLM),在不增加额外标注和网络参数的情况下将PICG融入PI-RADS评分。我们提出了一个两阶段微调过程,旨在将最初在自然图像上训练的MLLM适配到MRI数据领域,同时有效整合PICG。在第一阶段,我们开发了一个专门用于处理3D MRI图像输入的领域适配层,并设计MLLM指令以有效区分MRI模态。在第二阶段,我们将PICG转化为指导指令,引导模型生成PICG引导的图像特征。通过特征蒸馏,我们将评分网络特征与PICG引导的图像特征对齐,使评分网络能够有效吸收PICG信息。我们在公共数据集上开发模型,并在一个具有挑战性的真实内部数据集上进行评估。实验结果表明,我们的方法提升了现有评分网络的性能。
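Feature distillation of the kind described aligns student features to teacher features, typically by descending an MSE objective. The single hand-derived gradient step below is a generic sketch under that assumption, not the paper's actual training recipe; the feature vectors are illustrative.

```python
def mse(a, b):
    """Mean squared error between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def distillation_step(student_feat, teacher_feat, lr=0.1):
    """One gradient-descent step on MSE(student, teacher); the gradient of
    the i-th coordinate is 2 * (s_i - t_i) / n."""
    n = len(student_feat)
    return [s - lr * 2.0 * (s - t) / n for s, t in zip(student_feat, teacher_feat)]

student = [0.0, 1.0, -1.0]  # scoring-network features (illustrative)
teacher = [1.0, 1.0, 0.0]   # PICG-guided image features (illustrative)
updated = distillation_step(student, teacher)
```

Each step pulls the scoring network's features toward the PICG-guided features, which is how the guideline information reaches the scoring network without extra labels.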
https://arxiv.org/abs/2405.08786
Humans often express their communicative intents indirectly or non-literally, which requires their interlocutors -- human or AI -- to understand beyond the literal meaning of words. While most existing work has focused on discriminative evaluations, we present a new approach to generatively evaluate large language models' (LLMs') intention understanding by examining their responses to non-literal utterances. Ideally, an LLM should respond in line with the true intention of a non-literal utterance, not its literal interpretation. Our findings show that LLMs struggle to generate pragmatically relevant responses to non-literal language, achieving only 50-55% accuracy on average. While explicitly providing oracle intentions significantly improves performance (e.g., 75% for Mistral-Instruct), this still indicates challenges in leveraging given intentions to produce appropriate responses. Using chain-of-thought to make models spell out intentions yields much smaller gains (60% for Mistral-Instruct). These findings suggest that LLMs are not yet effective pragmatic interlocutors, highlighting the need for better approaches for modeling intentions and utilizing them for pragmatic generation.
人类经常以间接或非字面的方式表达交流意图,这要求对话者(无论人类还是AI)理解字面意义之外的含义。现有工作大多聚焦于判别式评估,而我们提出了一种新的生成式方法,通过考察大型语言模型(LLM)对非字面话语的回应来评估其意图理解能力。理想情况下,LLM应当根据非字面话语的真实意图而非其字面解释作出回应。我们的研究结果表明,LLM难以对非字面语言生成语用上相关的回应,平均准确率仅为50-55%。虽然显式提供真实意图能显著提升性能(例如,Mistral-Instruct达到75%),但这仍表明在利用给定意图生成恰当回应方面存在挑战。使用思维链让模型明确写出意图带来的提升要小得多(Mistral-Instruct为60%)。这些发现表明,LLM尚不是有效的语用对话者,凸显了需要更好的方法来建模意图并将其用于语用生成。
https://arxiv.org/abs/2405.08760
With the proliferation of edge devices, there is a significant increase in attack surface on these devices. The decentralized deployment of threat intelligence on edge devices, coupled with adaptive machine learning techniques such as the in-context learning feature of large language models (LLMs), represents a promising paradigm for enhancing cybersecurity on low-powered edge devices. This approach involves the deployment of lightweight machine learning models directly onto edge devices to analyze local data streams, such as network traffic and system logs, in real-time. Additionally, distributing computational tasks to an edge server reduces latency and improves responsiveness while also enhancing privacy by processing sensitive data locally. LLM servers can enable these edge servers to autonomously adapt to evolving threats and attack patterns, continuously updating their models to improve detection accuracy and reduce false positives. Furthermore, collaborative learning mechanisms facilitate peer-to-peer secure and trustworthy knowledge sharing among edge devices, enhancing the collective intelligence of the network and enabling dynamic threat mitigation measures such as device quarantine in response to detected anomalies. The scalability and flexibility of this approach make it well-suited for diverse and evolving network environments, as edge devices only send suspicious information such as network traffic and system log changes, offering a resilient and efficient solution to combat emerging cyber threats at the network edge. Thus, our proposed framework can improve edge computing security by providing better security in cyber threat detection and mitigation by isolating the edge devices from the network.
随着边缘设备的普及,这些设备上的攻击面显著增加。在边缘设备上去中心化地部署威胁情报,并结合大型语言模型(LLM)的上下文学习等自适应机器学习技术,是增强低功耗边缘设备网络安全的一种有前景的范式。这种方法将轻量级机器学习模型直接部署到边缘设备上,实时分析网络流量和系统日志等本地数据流。此外,将计算任务分配给边缘服务器可以降低延迟、提高响应速度,同时通过在本地处理敏感数据来增强隐私保护。LLM服务器可以使这些边缘服务器自主适应不断演变的威胁和攻击模式,持续更新其模型以提高检测准确率并减少误报。此外,协作学习机制促进了边缘设备之间点对点的安全可信知识共享,增强了网络的集体智能,并支持针对检测到的异常采取设备隔离等动态威胁缓解措施。由于边缘设备只发送网络流量和系统日志变化等可疑信息,该方法的可扩展性和灵活性使其非常适合多样化且不断变化的网络环境,为对抗网络边缘新兴的网络威胁提供了一个弹性且高效的解决方案。因此,我们提出的框架通过将边缘设备与网络隔离,在网络威胁检测与缓解方面提供更好的安全性,从而提升边缘计算的安全性。
https://arxiv.org/abs/2405.08755
The field of chemistry and Artificial Intelligence (AI) intersection is an area of active research that aims to accelerate scientific discovery. The integration of large language models (LLMs) with scientific modalities has shown significant promise in this endeavour. However, challenges persist in effectively addressing training efficacy and the out-of-distribution problem, particularly as existing approaches rely on larger models and datasets. In this context, we focus on machine language-molecule translation and deploy a novel training approach called contrastive preference optimisation, which avoids generating translations that are merely adequate but not perfect. To ensure generalisability and mitigate memorisation effects, we conduct experiments using only 10\% of the data. Our results demonstrate that our models achieve up to a 32\% improvement compared to counterpart models. We also introduce a scalable fine-grained evaluation methodology that accommodates responsibility.
化学与人工智能(AI)的交叉是一个旨在加速科学发现的活跃研究领域。将大型语言模型(LLM)与科学模态相结合在这一方向上展现出巨大前景。然而,在有效解决训练效果和分布外问题方面仍然存在挑战,尤其是现有方法依赖于更大的模型和数据集。在此背景下,我们聚焦于机器语言-分子翻译,并采用一种称为对比偏好优化的新训练方法,它避免生成仅仅合格而非完美的翻译。为了确保泛化能力并减轻记忆效应,我们仅使用10%的数据进行实验。结果表明,与对应模型相比,我们的模型实现了高达32%的改进。我们还引入了一种兼顾责任性的可扩展细粒度评估方法。
https://arxiv.org/abs/2405.08619