This study evaluates the performance of Recurrent Neural Networks (RNNs) and Transformers in replicating cross-language structural priming, a key indicator of abstract grammatical representations in human language processing. Focusing on Chinese-English priming, which involves two typologically distinct languages, we examine how these models handle the robust phenomenon of structural priming, where exposure to a particular sentence structure increases the likelihood of selecting a similar structure subsequently. Additionally, we utilize large language models (LLMs) to measure the cross-lingual structural priming effect. Our findings indicate that Transformers outperform RNNs in generating primed sentence structures, challenging the conventional belief that human sentence processing primarily involves recurrent and immediate processing and suggesting a role for cue-based retrieval mechanisms. Overall, this work contributes to our understanding of how computational models may reflect human cognitive processes in multilingual contexts.
https://arxiv.org/abs/2405.09508
Using large language models (LLMs) for educational applications like dialogue-based teaching is a hot topic. Effective teaching, however, requires teachers to adapt the difficulty of content and explanations to the education level of their students. Even the best LLMs today struggle to do this well. If we want to improve LLMs on this adaptation task, we need to be able to measure adaptation success reliably. However, current Static metrics for text difficulty, like the Flesch-Kincaid Reading Ease score, are known to be crude and brittle. We therefore introduce and evaluate a new set of Prompt-based metrics for text difficulty. Based on a user study, we create Prompt-based metrics as inputs for LLMs. They leverage LLMs' general language understanding capabilities to capture more abstract and complex features than Static metrics. Regression experiments show that adding our Prompt-based metrics significantly improves text difficulty classification over Static metrics alone. Our results demonstrate the promise of using LLMs to evaluate text adaptation to different education levels.
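As a point of reference for the Static metrics this abstract critiques, the Flesch Reading Ease score is a simple closed-form function of word, sentence, and syllable counts. A minimal sketch follows; the vowel-group syllable counter is a crude approximation, not part of the original metric definition:

```python
import re

def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: 206.835 - 1.015*(words/sentence) - 84.6*(syllables/word)."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)

    def syllables(word: str) -> int:
        # Rough heuristic: count runs of vowels as syllables, minimum one.
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    n_words = max(1, len(words))
    n_syll = sum(syllables(w) for w in words)
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (n_syll / n_words)
```

Higher scores mean easier text, which illustrates why such surface statistics are crude: the score reacts only to word and sentence length, not to conceptual difficulty.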
https://arxiv.org/abs/2405.09482
This paper presents a comprehensive analysis of explainable fact-checking through a series of experiments, focusing on the ability of large language models to verify public health claims and provide explanations or justifications for their veracity assessments. We examine the effectiveness of zero/few-shot prompting and parameter-efficient fine-tuning across various open and closed-source models, examining their performance in both isolated and joint tasks of veracity prediction and explanation generation. Importantly, we employ a dual evaluation approach comprising previously established automatic metrics and a novel set of criteria through human evaluation. Our automatic evaluation indicates that, within the zero-shot scenario, GPT-4 emerges as the standout performer, but in few-shot and parameter-efficient fine-tuning contexts, open-source models demonstrate their capacity to not only bridge the performance gap but, in some instances, surpass GPT-4. Human evaluation reveals yet more nuance as well as indicating potential problems with the gold explanations.
https://arxiv.org/abs/2405.09454
Modern democracies face a critical issue of declining citizen participation in decision-making. Online discussion forums are an important avenue for enhancing citizen participation. This thesis proposal 1) identifies the challenges involved in facilitating large-scale online discussions with Natural Language Processing (NLP), 2) suggests solutions to these challenges by incorporating hybrid human-AI technologies, and 3) investigates what these technologies can reveal about individual perspectives in online discussions. We propose a three-layered hierarchy for representing perspectives that can be obtained by a mixture of human intelligence and large language models. We illustrate how these representations can draw insights into the diversity of perspectives and allow us to investigate interactions in online discussions.
https://arxiv.org/abs/2405.09439
3D content creation plays a vital role in various applications, such as gaming, robotics simulation, and virtual reality. However, the process is labor-intensive and time-consuming, requiring skilled designers to invest considerable effort in creating a single 3D asset. To address this challenge, text-to-3D generation technologies have emerged as a promising solution for automating 3D creation. Leveraging the success of large vision language models, these techniques aim to generate 3D content based on textual descriptions. Despite recent advancements in this area, existing solutions still face significant limitations in terms of generation quality and efficiency. In this survey, we conduct an in-depth investigation of the latest text-to-3D creation methods. We provide a comprehensive background on text-to-3D creation, including discussions on datasets employed in training and evaluation metrics used to assess the quality of generated 3D models. Then, we delve into the various 3D representations that serve as the foundation for the 3D generation process. Furthermore, we present a thorough comparison of the rapidly growing literature on generative pipelines, categorizing them into feedforward generators, optimization-based generation, and view reconstruction approaches. By examining the strengths and weaknesses of these methods, we aim to shed light on their respective capabilities and limitations. Lastly, we point out several promising avenues for future research. With this survey, we hope to inspire researchers further to explore the potential of open-vocabulary text-conditioned 3D content creation.
https://arxiv.org/abs/2405.09431
Recently, large language models (LLMs) have outperformed human experts in predicting the results of neuroscience experiments (Luo et al., 2024). What is the basis for this performance? One possibility is that statistical patterns in that specific scientific literature, as opposed to emergent reasoning abilities arising from broader training, underlie LLMs' performance. To evaluate this possibility, we trained a relatively small 124M-parameter GPT-2 model on next-word prediction over 1.3 billion tokens of domain-specific knowledge. Despite being orders of magnitude smaller than larger LLMs trained on trillions of tokens, small models achieved expert-level performance in predicting neuroscience results. Small models trained on the neuroscience literature succeeded both when trained from scratch using a tokenizer specifically trained on neuroscience text and when the neuroscience literature was used to fine-tune a pretrained GPT-2. Our results indicate that expert-level performance may be attained by even small LLMs through domain-specific, auto-regressive training approaches.
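The training objective here is plain autoregressive next-token prediction on domain text. Stripped of the GPT-2 architecture, that objective can be illustrated with a toy count-based bigram model; this is a deliberate simplification for intuition, not the paper's setup:

```python
from collections import Counter, defaultdict

def train_bigram_lm(corpus_tokens):
    """Toy autoregressive LM: next-token distributions from bigram counts."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus_tokens, corpus_tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_token_prob(counts, prev, nxt):
    """P(nxt | prev) under the count-based model (0.0 for unseen contexts)."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0
```

A model trained only on domain text concentrates probability mass on domain-typical continuations, which is the statistical-pattern hypothesis the paper tests at GPT-2 scale.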
https://arxiv.org/abs/2405.09395
Recent advances in large language models (LLMs) have led to their extensive global deployment, and ensuring their safety calls for comprehensive and multilingual toxicity evaluations. However, existing toxicity benchmarks are overwhelmingly focused on English, posing serious risks to deploying LLMs in other languages. We address this by introducing PolygloToxicityPrompts (PTP), the first large-scale multilingual toxicity evaluation benchmark of 425K naturally occurring prompts spanning 17 languages. We overcome the scarcity of naturally occurring toxicity in web-text and ensure coverage across languages with varying resources by automatically scraping over 100M web-text documents. Using PTP, we investigate research questions to study the impact of model size, prompt language, and instruction and preference-tuning methods on toxicity by benchmarking over 60 LLMs. Notably, we find that toxicity increases as language resources decrease or model size increases. Although instruction- and preference-tuning reduce toxicity, the choice of preference-tuning method does not have any significant impact. Our findings shed light on crucial shortcomings of LLM safeguarding and highlight areas for future research.
https://arxiv.org/abs/2405.09373
Existing debiasing methods inevitably make unreasonable or undesired predictions, as they are designed and evaluated to achieve parity across different social groups while leaving individual facts aside, resulting in modified existing knowledge. In this paper, we first establish a new bias mitigation benchmark, BiasKE, leveraging existing and additional constructed datasets, which systematically assesses debiasing performance via complementary metrics on fairness, specificity, and generalization. Meanwhile, we propose a novel debiasing method, Fairness Stamp (FAST), which enables editable fairness through fine-grained calibration on individual biased knowledge. Comprehensive experiments demonstrate that FAST surpasses state-of-the-art baselines with remarkable debiasing performance while not hampering overall model capability for knowledge preservation, highlighting the prospect of fine-grained debiasing strategies for editable fairness in LLMs.
https://arxiv.org/abs/2405.09341
Although language models (LMs) have boosted the performance of Question Answering, they still need plenty of data. Data annotation, in contrast, is a time-consuming process. This especially applies to Question Answering, where possibly large documents have to be parsed and annotated with questions and their corresponding answers. Furthermore, Question Answering models often only work well for the domain they were trained on. Since annotation is costly, we argue that domain-agnostic knowledge from LMs, such as linguistic understanding, is sufficient to create a well-curated dataset. With this motivation, we show that using large language models can improve Question Answering performance on various datasets in the few-shot setting compared to state-of-the-art approaches. For this, we perform data generation leveraging the Prompting framework, suggesting that language models contain valuable task-agnostic knowledge that can be used beyond the common pre-training/fine-tuning scheme. As a result, we consistently outperform previous approaches on few-shot Question Answering.
https://arxiv.org/abs/2405.09335
Background: Rapid advancements in natural language processing have led to the development of large language models with the potential to revolutionize mental health care. These models have shown promise in assisting clinicians and providing support to individuals experiencing various psychological challenges. Objective: This study aims to compare the performance of two large language models, GPT-4 and Chat-GPT, in responding to a set of 18 psychological prompts, to assess their potential applicability in mental health care settings. Methods: A blind methodology was employed, with a clinical psychologist evaluating the models' responses without knowledge of their origins. The prompts encompassed a diverse range of mental health topics, including depression, anxiety, and trauma, to ensure a comprehensive assessment. Results: The results demonstrated a significant difference in performance between the two models (p > 0.05). GPT-4 achieved an average rating of 8.29 out of 10, while Chat-GPT received an average rating of 6.52. The clinical psychologist's evaluation suggested that GPT-4 was more effective at generating clinically relevant and empathetic responses, thereby providing better support and guidance to potential users. Conclusions: This study contributes to the growing body of literature on the applicability of large language models in mental health care settings. The findings underscore the importance of continued research and development in the field to optimize these models for clinical use. Further investigation is necessary to understand the specific factors underlying the performance differences between the two models and to explore their generalizability across various populations and mental health conditions.
https://arxiv.org/abs/2405.09300
Markedness in natural language is often associated with non-literal meanings in discourse. Differential Object Marking (DOM) in Korean is one instance of this phenomenon, where post-positional markers are selected based on both the semantic features of the noun phrases and the discourse features that are orthogonal to the semantic features. Previous work has shown that distributional models of language recover certain semantic features of words -- do these models capture implied discourse-level meanings as well? We evaluate whether a set of large language models are capable of associating discourse meanings with different object markings in Korean. Results suggest that the discourse meanings of a grammatical marker can be more challenging to encode than those of a discourse marker.
https://arxiv.org/abs/2405.09293
Despite the recent ubiquity of large language models and their high zero-shot prompted performance across a wide range of tasks, it is still not known how well they perform on tasks which require processing of potentially idiomatic language. In particular, how well do such models perform in comparison to encoder-only models fine-tuned specifically for idiomaticity tasks? In this work, we attempt to answer this question by looking at the performance of a range of LLMs (both local and software-as-a-service models) on three idiomaticity datasets: SemEval 2022 Task 2a, FLUTE, and MAGPIE. Overall, we find that whilst these models do give competitive performance, they do not match the results of fine-tuned task-specific models, even at the largest scales (e.g. for GPT-4). Nevertheless, we do see consistent performance improvements across model scale. Additionally, we investigate prompting approaches to improve performance, and discuss the practicalities of using LLMs for these tasks.
https://arxiv.org/abs/2405.09279
This paper will present textual corpora for Serbian (and Serbo-Croatian), usable for training large language models and publicly available at one of several notable online repositories. Each corpus will be classified using multiple methods and its characteristics will be detailed. Additionally, the paper will introduce three new corpora: a new umbrella web corpus of Serbo-Croatian, a new high-quality corpus based on the doctoral dissertations stored within the National Repository of Doctoral Dissertations from all universities in Serbia, and a parallel corpus of abstract translations from the same source. The uniqueness of both old and new corpora will be assessed via frequency-based stylometric methods, and the results will be briefly discussed.
https://arxiv.org/abs/2405.09250
The problem of hallucination and omission, a long-standing problem in machine translation (MT), is more pronounced when a large language model (LLM) is used in MT because an LLM itself is susceptible to these phenomena. In this work, we mitigate the problem in an LLM-based MT model by guiding it to better word alignment. We first study the correlation between word alignment and the phenomena of hallucination and omission in MT. Then we propose to utilize word alignment as preference to optimize the LLM-based MT model. The preference data are constructed by selecting chosen and rejected translations from multiple MT tools. Subsequently, direct preference optimization is used to optimize the LLM-based model towards the preference signal. Given the absence of evaluators specifically designed for hallucination and omission in MT, we further propose selecting hard instances and utilizing GPT-4 to directly evaluate the performance of the models in mitigating these issues. We verify the rationality of these designed evaluation methods by experiments, followed by extensive results demonstrating the effectiveness of word alignment-based preference optimization to mitigate hallucination and omission.
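Direct preference optimization, as used here, pushes the model toward chosen translations and away from rejected ones relative to a reference model. A minimal sketch of the standard DPO loss for a single preference pair follows; the log-probability inputs and β value are illustrative, not taken from the paper:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair:
    -log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l)))."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy prefers the chosen translation more strongly than the reference does, the margin is positive and the loss drops below log 2; here the "chosen" and "rejected" sides would come from the word-alignment-based selection over multiple MT tools.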
https://arxiv.org/abs/2405.09223
In this paper, we present the findings of our Project ALPINE which stands for ``Autoregressive Learning for Planning In NEtworks." Project ALPINE initiates a theoretical investigation into the development of planning capabilities in Transformer-based language models through their autoregressive learning mechanisms, aiming to identify any potential limitations in their planning abilities. We abstract planning as a network path-finding task where the objective is to generate a valid path from a specified source node to a designated target node. In terms of expressiveness, we show that the Transformer is capable of executing path-finding by embedding the adjacency and reachability matrices within its weights. Our theoretical analysis of the gradient-based learning dynamic of the Transformer reveals that the Transformer is capable of learning both the adjacency matrix and a limited form of the reachability matrix. These theoretical insights are then validated through experiments, which demonstrate that the Transformer indeed learns the adjacency matrix and an incomplete reachability matrix, which aligns with the predictions made in our theoretical analysis. Additionally, when applying our methodology to a real-world planning benchmark, called Blocksworld, our observations remain consistent. Our theoretical and empirical analyses further unveil a potential limitation of Transformer in path-finding: it cannot identify reachability relationships through transitivity, and thus would fail when path concatenation is needed to generate a path. In summary, our findings shed new light on how the internal mechanisms of autoregressive learning enable planning in networks. This study may contribute to our understanding of the general planning capabilities in other related domains.
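The limitation identified above is that the Transformer learns the adjacency matrix and only a partial reachability matrix, failing on reachability that requires transitivity. Computing full reachability explicitly is the transitive closure of the graph, sketched here in a Floyd-Warshall style on a 0/1 adjacency matrix:

```python
def reachability(adj):
    """Transitive closure of a 0/1 adjacency matrix:
    reach[i][j] is truthy iff some path leads from i to j."""
    n = len(adj)
    reach = [row[:] for row in adj]
    for k in range(n):            # allow k as an intermediate node
        for i in range(n):
            for j in range(n):
                reach[i][j] = reach[i][j] or (reach[i][k] and reach[k][j])
    return reach
```

The entry reach[0][2] for the chain 0→1→2 exists only via the intermediate node, exactly the path-concatenation case where the paper finds the Transformer fails.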
https://arxiv.org/abs/2405.09220
We introduce Xmodel-VLM, a cutting-edge multimodal vision language model. It is designed for efficient deployment on consumer GPU servers. Our work directly confronts a pivotal industry issue by grappling with the prohibitive service costs that hinder the broad adoption of large-scale multimodal systems. Through rigorous training, we have developed a 1B-scale language model from the ground up, employing the LLaVA paradigm for modal alignment. The result, which we call Xmodel-VLM, is a lightweight yet powerful multimodal vision language model. Extensive testing across numerous classic multimodal benchmarks has revealed that despite its smaller size and faster execution, Xmodel-VLM delivers performance comparable to that of larger models. Our model checkpoints and code are publicly available on GitHub at this https URL.
https://arxiv.org/abs/2405.09215
Language models (LMs) as conversational assistants recently became popular tools that help people accomplish a variety of tasks. These typically result from adapting LMs pretrained on general domain text sequences through further instruction-tuning and possibly preference optimisation methods. The evaluation of such LMs would ideally be performed using human judgement, however, this is not scalable. On the other hand, automatic evaluation featuring auxiliary LMs as judges and/or knowledge-based tasks is scalable but struggles with assessing conversational ability and adherence to instructions. To help accelerate the development of LMs as conversational assistants, we propose a novel automatic evaluation task: HumanRankEval (HRE). It consists of a large-scale, diverse and high-quality set of questions, each with several answers authored and scored by humans. To perform evaluation, HRE ranks these answers based on their log-likelihood under the LM's distribution, and subsequently calculates their correlation with the corresponding human rankings. We support HRE's efficacy by investigating how efficiently it separates pretrained and instruction-tuned LMs of various sizes. We show that HRE correlates well with human judgements and is particularly responsive to model changes following instruction-tuning.
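The evaluation step described (rank candidate answers by log-likelihood under the LM, then correlate with the human ranking) can be sketched as follows; the log-likelihood values are hypothetical placeholders, and the rank correlation is a tie-free Spearman, which may differ from the paper's exact choice:

```python
def spearman_rho(x, y):
    """Spearman rank correlation between two score lists (assumes no ties)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i], reverse=True)
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Rank answers to one question by model log-likelihood vs. human scores.
model_loglik = [-12.3, -20.1, -15.7]   # hypothetical per-answer log-likelihoods
human_scores = [9, 2, 5]               # hypothetical human ratings
rho = spearman_rho(model_loglik, human_scores)
```

A rho of 1.0 means the LM's likelihood ordering matches the human ordering exactly; averaging such correlations over many questions gives an HRE-style score.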
https://arxiv.org/abs/2405.09186
Internal Language Model (LM)-based methods use permutation language modeling (PLM) to address the error-correction problem caused by conditional independence in external LM-based methods. However, the random permutations introduced by human interference cause fit oscillations in model training, and the Iterative Refinement (IR) operation used to improve multimodal information decoupling also introduces additional overhead. To address these issues, this paper proposes the Hierarchical Attention autoregressive Model with Adaptive Permutation (HAAP) to enhance the location-context-image interaction capability, improving autoregressive generalization with an internal LM. First, we propose Implicit Permutation Neurons (IPN) to generate adaptive attention masks that dynamically exploit token dependencies. The adaptive masks increase the diversity of training data and prevent model dependency on a specific order, reducing the training overhead of PLM while avoiding fit oscillations. Second, we develop a Cross-modal Hierarchical Attention mechanism (CHA) to couple context and image features. This processing establishes rich positional semantic dependencies between context and image while avoiding IR. Extensive experimental results show the proposed HAAP achieves state-of-the-art (SOTA) performance in terms of accuracy, complexity, and latency on several datasets.
https://arxiv.org/abs/2405.09125
The current safeguard mechanisms for large language models (LLMs) are indeed susceptible to jailbreak attacks, making them inherently fragile. Even the process of fine-tuning on apparently benign data for downstream tasks can jeopardize safety. One potential solution is to conduct safety fine-tuning subsequent to downstream fine-tuning. However, there's a risk of catastrophic forgetting during safety fine-tuning, where LLMs may regain safety measures but lose the task-specific knowledge acquired during downstream fine-tuning. In this paper, we introduce a safety realignment framework through subspace-oriented model fusion (SOMF), aiming to combine the safeguard capabilities of initially aligned model and the current fine-tuned model into a realigned model. Our approach begins by disentangling all task vectors from the weights of each fine-tuned model. We then identify safety-related regions within these vectors by subspace masking techniques. Finally, we explore the fusion of the initial safely aligned LLM with all task vectors based on the identified safety subspace. We validate that our safety realignment framework satisfies the safety requirements of a single fine-tuned model as well as multiple models during their fusion. Our findings confirm that SOMF preserves safety without notably compromising performance on downstream tasks, including instruction following in Chinese, English, and Hindi, as well as problem-solving capabilities in Code and Math.
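A toy illustration of the task-vector idea behind SOMF, treating model weights as flat lists: deltas from fine-tuning are added back onto the safely aligned base everywhere except the dimensions a safety mask protects. The mask and weight values are hypothetical, and this flattens the paper's subspace masking into a per-dimension mask for illustration:

```python
def task_vector(finetuned, base):
    """Task vector: elementwise difference between fine-tuned and base weights."""
    return [f - b for f, b in zip(finetuned, base)]

def fuse_with_safety_mask(base, task_vectors, safety_mask):
    """Add task-vector deltas only outside the safety-critical dimensions;
    where the mask is True, keep the safely aligned base weights untouched."""
    fused = base[:]
    for tv in task_vectors:
        for i, delta in enumerate(tv):
            if not safety_mask[i]:
                fused[i] += delta
    return fused
```

The fused model keeps task-specific updates in unmasked dimensions while the masked (safety-related) dimensions stay identical to the initially aligned model, which is the trade-off the framework targets.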
https://arxiv.org/abs/2405.09055
Today's analog/mixed-signal (AMS) integrated circuit (IC) designs demand substantial manual intervention. The advent of multimodal large language models (MLLMs) has unveiled significant potential across various fields, suggesting their applicability in streamlining large-scale AMS IC design as well. A bottleneck in employing MLLMs for automatic AMS circuit generation is the absence of a comprehensive dataset delineating the schematic-netlist relationship. We therefore design an automatic technique for converting schematics into netlists, and create dataset AMSNet, encompassing transistor-level schematics and corresponding SPICE format netlists. With a growing size, AMSNet can significantly facilitate exploration of MLLM applications in AMS circuit design. We have made an initial set of netlists public, and will make both our netlist generation tool and the full dataset available upon publishing of this paper.
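The target of the schematic-to-netlist mapping can be illustrated with a toy emitter that turns a component list into SPICE element lines; the component tuples here are a hypothetical intermediate representation, whereas the paper's actual technique starts from transistor-level schematics and is far more involved:

```python
def to_spice_netlist(components):
    """Emit a minimal SPICE netlist from (name, node1, node2, value) tuples,
    e.g. ("R1", "in", "0", "1k") -> "R1 in 0 1k"."""
    lines = ["* auto-generated netlist"]
    for name, node1, node2, value in components:
        lines.append(f"{name} {node1} {node2} {value}")
    lines.append(".end")
    return "\n".join(lines)
```

Pairing schematic images with netlists in this textual format is what makes a dataset like AMSNet usable for multimodal LLM training.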
https://arxiv.org/abs/2405.09045