The necessity for interpretability in natural language processing (NLP) has risen alongside the growing prominence of large language models. Among the myriad tasks within NLP, text generation stands out as a primary objective of autoregressive models. The NLP community has begun to take a keen interest in gaining a deeper understanding of text generation, leading to the development of model-agnostic explainable artificial intelligence (xAI) methods tailored to this task. The design and evaluation of explainability methods are non-trivial since they depend on many factors involved in the text generation process, e.g., the autoregressive model and its stochastic nature. This paper outlines 17 challenges categorized into three groups that arise during the development and assessment of attribution-based explainability methods. These challenges encompass issues concerning tokenization, defining explanation similarity, determining token importance and prediction change metrics, the level of human intervention required, and the creation of suitable test datasets. The paper illustrates how these challenges can be intertwined, showcasing new opportunities for the community. These include developing probabilistic word-level explainability methods and engaging humans in the explainability pipeline, from the data design to the final evaluation, to draw robust conclusions on xAI methods.
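As a concrete illustration of the tokenization and similarity challenges discussed above, here is a minimal occlusion-style, word-level attribution sketch for an autoregressive LM. The model choice and the leave-one-word-out scheme are illustrative assumptions, not a method from the paper; note how removing a word changes the tokenization of the remainder, which is exactly one of the pitfalls the paper raises.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def target_logprob(prompt: str, target: str) -> float:
    """Log-probability the model assigns to `target` after `prompt`."""
    ids = tok(prompt + target, return_tensors="pt").input_ids
    n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = model(ids).logits
    logp = torch.log_softmax(logits[0, :-1], dim=-1)
    rows = torch.arange(n_prompt - 1, ids.shape[1] - 1)
    return logp[rows, ids[0, 1:][rows]].sum().item()

def word_attributions(prompt: str, target: str):
    """Importance of each word: drop in target log-prob when it is removed.
    Removing a word re-tokenizes the remaining text -- the tokenization
    challenge in action."""
    words = prompt.split()
    base = target_logprob(prompt, target)
    return [(w, base - target_logprob(" ".join(words[:i] + words[i + 1:]), target))
            for i, w in enumerate(words)]

print(word_attributions("The capital of France is", " Paris"))
```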
https://arxiv.org/abs/2405.08468
Rerunning a metric-based evaluation should be more straightforward than rerunning a human-based evaluation, and its results should be closer to the original, especially where code and model checkpoints are made available by the original authors. However, as this report of our efforts to rerun a metric-based evaluation of a set of single-attribute and multiple-attribute controllable text generation (CTG) techniques shows, such reruns do not always produce results that match the originals, and can reveal errors in the reporting of the original work.
https://arxiv.org/abs/2405.07875
Despite the significant progress of fully-supervised video captioning, zero-shot methods remain much less explored. In this paper, we propose to take advantage of existing pre-trained large-scale vision and language models to directly generate captions with test-time adaptation. Specifically, we bridge video and text using three key models: a general video understanding model XCLIP, a general image understanding model CLIP, and a text generation model GPT-2, due to their source-code availability. The main challenge is how to make the text generation model sufficiently aware of the content in a given video so that it generates corresponding captions. To address this problem, we propose using learnable tokens as a communication medium between frozen GPT-2 and frozen XCLIP as well as frozen CLIP. Unlike the conventional approach of training such tokens on training data, we update them with pseudo-targets of the inference data under several carefully crafted loss functions, which enable the tokens to absorb video information catered to GPT-2. This procedure can be done in just a few iterations (we use 16 iterations in the experiments) and does not require ground truth data. Extensive experimental results on three widely used datasets, MSR-VTT, MSVD, and VATEX, show 4% to 20% improvements in terms of the main metric CIDEr compared to the existing state-of-the-art methods.
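A hedged sketch of the test-time token idea: a few learnable embeddings are prepended to a frozen GPT-2 and optimized for 16 steps to fit a pseudo-target caption. In the paper the pseudo-targets are derived from XCLIP/CLIP scores on the test video; here a placeholder string stands in, and all names are illustrative assumptions rather than the authors' code.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
for p in model.parameters():                     # GPT-2 stays frozen
    p.requires_grad_(False)

n_tokens, dim = 8, model.config.n_embd
soft = torch.nn.Parameter(torch.randn(1, n_tokens, dim) * 0.02)
opt = torch.optim.Adam([soft], lr=1e-2)

pseudo_target = "a man is playing a guitar on stage"   # placeholder pseudo-target
ids = tok(pseudo_target, return_tensors="pt").input_ids
emb = model.transformer.wte(ids)                 # frozen word embeddings

for step in range(16):                           # 16 iterations, as in the paper
    logits = model(inputs_embeds=torch.cat([soft, emb], dim=1)).logits
    pred = logits[:, n_tokens - 1:-1]            # positions that predict the target
    loss = torch.nn.functional.cross_entropy(
        pred.reshape(-1, pred.size(-1)), ids.reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
```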
https://arxiv.org/abs/2405.07046
While nationality is a pivotal demographic element that enhances the performance of language models, it has received far less scrutiny regarding inherent biases. This study investigates nationality bias in ChatGPT (GPT-3.5), a large language model (LLM) designed for text generation. The research covers 195 countries, 4 temperature settings, and 3 distinct prompt types, generating 4,680 discourses about nationality descriptions in Chinese and English. Automated metrics were used to analyze the nationality bias, and expert annotators alongside ChatGPT itself evaluated the perceived bias. The results show that ChatGPT's generated discourses are predominantly positive, especially compared to its predecessor, GPT-2. However, when prompted with negative inclinations, it occasionally produces negative content. Despite ChatGPT considering its generated text as neutral, it shows consistent self-awareness about nationality bias when subjected to the same pair-wise comparison annotation framework used by human annotators. In conclusion, while ChatGPT's generated texts seem friendly and positive, they reflect the inherent nationality biases in the real world. This bias may vary across different language versions of ChatGPT, indicating diverse cultural perspectives. The study highlights the subtle and pervasive nature of biases within LLMs, emphasizing the need for further scrutiny.
https://arxiv.org/abs/2405.06996
The task of medical image recognition is notably complicated by the presence of varied and multiple pathological indications, presenting a unique challenge in multi-label classification with unseen labels. This complexity underlines the need for computer-aided diagnosis methods employing multi-label zero-shot learning. Recent advancements in pre-trained vision-language models (VLMs) have showcased notable zero-shot classification abilities on medical images. However, these methods have limitations in leveraging extensive pre-trained knowledge from broader image datasets, and often depend on manual prompt construction by expert radiologists. By automating the process of prompt tuning, prompt learning techniques have emerged as an efficient way to adapt VLMs to downstream tasks. Yet, existing CoOp-based strategies fall short in producing class-specific prompts for unseen categories, limiting generalizability in fine-grained scenarios. To overcome these constraints, we introduce a novel prompt generation approach inspired by text generation in natural language processing (NLP). Our method, named Pseudo-Prompt Generating (PsPG), capitalizes on prior knowledge of multi-modal features. Featuring an RNN-based decoder, PsPG autoregressively generates class-tailored embedding vectors, i.e., pseudo-prompts. Comparative evaluations on various multi-label chest radiograph datasets affirm the superiority of our approach against leading medical vision-language and multi-label prompt learning methods. The source code is available at this https URL
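A minimal sketch of what such an RNN-based pseudo-prompt decoder could look like: conditioned on a multi-modal image feature, it autoregressively emits embedding vectors that a frozen CLIP-style text encoder would consume. The dimensions, the GRU cell, and the zero start vector are assumptions for illustration.

```python
import torch
import torch.nn as nn

class PseudoPromptDecoder(nn.Module):
    def __init__(self, feat_dim=512, prompt_dim=512, n_prompts=16):
        super().__init__()
        self.n_prompts = n_prompts
        self.init_h = nn.Linear(feat_dim, prompt_dim)    # image feature -> h0
        self.rnn = nn.GRUCell(prompt_dim, prompt_dim)
        self.out = nn.Linear(prompt_dim, prompt_dim)

    def forward(self, image_feat):                       # (B, feat_dim)
        h = torch.tanh(self.init_h(image_feat))
        x = torch.zeros_like(h)                          # start "token"
        prompts = []
        for _ in range(self.n_prompts):                  # autoregressive loop
            h = self.rnn(x, h)
            x = self.out(h)                              # next pseudo-prompt
            prompts.append(x)
        return torch.stack(prompts, dim=1)               # (B, n_prompts, D)

prompts = PseudoPromptDecoder()(torch.randn(4, 512))
print(prompts.shape)  # torch.Size([4, 16, 512])
```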
https://arxiv.org/abs/2405.06468
Ownership verification is currently the most critical and widely adopted post-hoc method to safeguard model copyright. In general, model owners exploit it to identify whether a given suspicious third-party model is stolen from them by examining whether it has particular properties 'inherited' from their released models. Currently, backdoor-based model watermarks are the primary and cutting-edge methods to implant such properties in the released models. However, backdoor-based methods have two fatal drawbacks: harmfulness and ambiguity. The former indicates that they introduce maliciously controllable misclassification behaviors (i.e., a backdoor) into the watermarked released models. The latter denotes that malicious users can easily pass the verification by finding other misclassified samples, leading to ownership ambiguity. In this paper, we argue that both limitations stem from the 'zero-bit' nature of existing watermarking schemes, where they exploit the status (i.e., misclassified) of predictions for verification. Motivated by this understanding, we design a new watermarking paradigm, Explanation as a Watermark (EaaW), that implants verification behaviors into the explanation of feature attribution instead of model predictions. Specifically, EaaW embeds a 'multi-bit' watermark into the feature attribution explanation of specific trigger samples without changing the original prediction. We correspondingly design the watermark embedding and extraction algorithms inspired by explainable artificial intelligence. In particular, our approach can be used for different tasks (e.g., image classification and text generation). Extensive experiments verify the effectiveness and harmlessness of our EaaW and its resistance to potential attacks.
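To make the 'multi-bit' idea concrete, here is a hedged sketch of the extraction side only: bits are read off the signs of a feature-attribution map for a trigger sample, where embedding would finetune the model so those signs encode the owner's watermark. Plain gradient-times-input attribution, the toy classifier, and the bit layout are simplifying assumptions, not the paper's algorithm.

```python
import torch

def extract_bits(model, trigger, label, positions):
    """Read one watermark bit per chosen input position from the sign of a
    gradient-times-input attribution map (illustrative attribution choice)."""
    x = trigger.clone().requires_grad_(True)
    model(x.unsqueeze(0))[0, label].backward()
    attribution = (x.grad * x).detach().flatten()
    return [int(attribution[i] > 0) for i in positions]

toy = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(784, 10))
trigger = torch.rand(1, 28, 28)                      # a trigger sample
print(extract_bits(toy, trigger, label=3, positions=range(0, 784, 49)))
# 16 bits; embedding would train the model so these match the watermark
```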
https://arxiv.org/abs/2405.04825
LLMs have been found to memorize training textual sequences and regurgitate those sequences verbatim at text generation time. This fact is known to be the cause of privacy and related (e.g., copyright) problems. Unlearning in LLMs then takes the form of devising new algorithms that properly deal with these side-effects of memorized data while not hurting the model's utility. We offer a fresh perspective towards this goal, namely, that each textual sequence to be forgotten should be treated differently when being unlearned based on its degree of memorization within the LLM. We contribute a new metric for measuring unlearning quality, an adversarial attack showing that SOTA algorithms lacking this perspective fail for privacy, and two new unlearning methods based on Gradient Ascent and Task Arithmetic, respectively. A comprehensive performance evaluation across an extensive suite of NLP tasks maps the solution space, identifies the best solutions at different scales of model capacity and forget-set size, and quantifies the gains of the new approaches.
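A minimal sketch of Gradient-Ascent-style unlearning, with the per-sequence perspective hedged in as a placeholder memorization weight: highly memorized text is unlearned more aggressively. The weighting scheme, learning rate, and forget set are illustrative assumptions, not the paper's metric or data.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
opt = torch.optim.SGD(model.parameters(), lr=1e-5)

forget_set = [("John Doe's SSN is 123-45-6789", 0.9),
              ("a barely memorized sentence", 0.2)]   # (text, memorization score)

for text, memo in forget_set:
    ids = tok(text, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss        # NLL of the forget sequence
    (-memo * loss).backward()                 # ascend: push its NLL up
    opt.step(); opt.zero_grad()
```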
https://arxiv.org/abs/2405.03097
Large language models (LLMs) tend to inadequately integrate input context during text generation, relying excessively on encoded prior knowledge in model parameters, potentially resulting in generated text with factual inconsistencies or contextually unfaithful content. LLMs utilize two primary knowledge sources: 1) prior (parametric) knowledge from pretraining, and 2) contextual (non-parametric) knowledge from input prompts. The study addresses the open question of how LLMs effectively balance these knowledge sources during the generation process, specifically in the context of open-domain question answering. To address this issue, we introduce a novel approach integrating contrastive decoding with adversarial irrelevant passages as negative samples to enhance robust context grounding during generation. Notably, our method operates at inference time without requiring further training. We conduct comprehensive experiments to demonstrate its applicability and effectiveness, providing empirical evidence showcasing its superiority over existing methodologies. Our code is publicly available at: this https URL.
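A hedged sketch of contrastive decoding with an irrelevant passage as the negative sample: next-token scores conditioned on the relevant passage are contrasted against scores conditioned on an irrelevant one, down-weighting tokens the model would emit from parametric knowledge regardless of context. The value of alpha, the greedy single step, and the single-negative setup are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def contrastive_next_token(question, passage, negative, alpha=1.0):
    pos = tok(passage + "\n" + question, return_tensors="pt").input_ids
    neg = tok(negative + "\n" + question, return_tensors="pt").input_ids
    with torch.no_grad():
        lp = torch.log_softmax(model(pos).logits[0, -1], dim=-1)
        ln = torch.log_softmax(model(neg).logits[0, -1], dim=-1)
    # Tokens likely under the relevant context but not under the
    # irrelevant one get the highest contrastive score.
    return tok.decode([(lp - alpha * ln).argmax().item()])

print(contrastive_next_token(
    "Q: Where was Marie Curie born? A:",
    "Marie Curie was born in Warsaw.",
    "The Amazon river is in South America."))
```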
https://arxiv.org/abs/2405.02750
Retrieval-augmented large language models (LLMs) leverage relevant content retrieved by information retrieval systems to generate correct responses, aiming to alleviate the hallucination problem. However, existing retriever-responder methods typically append relevant documents to the prompt of LLMs to perform text generation tasks without considering the interaction of fine-grained structural semantics between the retrieved documents and the LLMs. This issue is particularly important for accurate response generation as LLMs tend to "lose in the middle" when dealing with input prompts augmented with lengthy documents. In this work, we propose a new pipeline named "Reinforced Retriever-Reorder-Responder" (R$^4$) to learn document orderings for retrieval-augmented LLMs, thereby further enhancing their generation abilities while the large numbers of parameters of LLMs remain frozen. The reordering learning process is divided into two steps according to the quality of the generated responses: document order adjustment and document representation enhancement. Specifically, document order adjustment aims to organize retrieved document orderings into beginning, middle, and end positions based on graph attention learning, which maximizes the reinforced reward of response quality. Document representation enhancement further refines the representations of retrieved documents for responses of poor quality via document-level gradient adversarial learning. Extensive experiments demonstrate that our proposed pipeline achieves better factual question-answering performance on knowledge-intensive tasks compared to strong baselines across various public datasets. The source codes and trained models will be released upon paper acceptance.
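As a toy illustration of the position effect R$^4$ exploits, the sketch below places the highest-scored documents at the beginning and end of the prompt and the weakest in the middle. The scores would come from the paper's reward-trained graph-attention scorer; here they are mocked as plain numbers.

```python
def reorder_for_prompt(docs, scores):
    """Put strong documents at the prompt edges, weak ones in the middle."""
    ranked = [d for _, d in sorted(zip(scores, docs), reverse=True)]
    front, back = [], []
    for i, doc in enumerate(ranked):          # alternate best docs to the edges
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]                 # strong ... weak ... strong

docs = ["doc A", "doc B", "doc C", "doc D", "doc E"]
print(reorder_for_prompt(docs, scores=[0.9, 0.1, 0.7, 0.4, 0.8]))
# ['doc A', 'doc C', 'doc B', 'doc D', 'doc E'] -- the edges hold the best docs
```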
https://arxiv.org/abs/2405.02659
This paper introduces a framework for post-processing machine learning models so that their predictions satisfy multi-group fairness guarantees. Based on the celebrated notion of multicalibration, we introduce $(\mathbf{s},\mathcal{G},\alpha)$-GMC (Generalized Multi-Dimensional Multicalibration) for multi-dimensional mappings $\mathbf{s}$, constraint set $\mathcal{G}$, and a pre-specified threshold level $\alpha$. We propose associated algorithms to achieve this notion in general settings. This framework is then applied to diverse scenarios encompassing different fairness concerns, including false negative rate control in image segmentation, prediction set conditional uncertainty quantification in hierarchical classification, and de-biased text generation in language models. We conduct numerical studies on several datasets and tasks.
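For orientation, here is a hedged rendering of the flavor of guarantee involved, stated in the simpler "multiaccuracy" form: the prediction error must be small simultaneously against every group test $g$ in the class. Multicalibration additionally conditions on the predicted value, and GMC further replaces the scalar predictor $f$ by the multi-dimensional mapping $\mathbf{s}$; the notation below is illustrative, not the paper's exact definition.

```latex
\[
  \bigl|\, \mathbb{E}\bigl[\, g(X)\,\bigl(Y - f(X)\bigr) \,\bigr] \,\bigr|
  \;\le\; \alpha
  \qquad \text{for all } g \in \mathcal{G}.
\]
```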
https://arxiv.org/abs/2405.02225
Recent Large Language Models (LLMs) have shown the ability to generate content that is difficult or impossible to distinguish from human writing. We investigate the ability of differently-sized LLMs to replicate human writing style in short, creative texts in the domain of Showerthoughts, thoughts that may occur during mundane activities. We compare GPT-2 and GPT-Neo fine-tuned on Reddit data as well as GPT-3.5 invoked in a zero-shot manner, against human-authored texts. We measure human preference on the texts across the specific dimensions that account for the quality of creative, witty texts. Additionally, we compare the ability of humans versus fine-tuned RoBERTa classifiers to detect AI-generated texts. We conclude that human evaluators rate the generated texts slightly worse on average regarding their creative quality, but they are unable to reliably distinguish between human-written and AI-generated texts. We further provide a dataset for creative, witty text generation based on Reddit Showerthoughts posts.
https://arxiv.org/abs/2405.01660
While most research on controllable text generation has focused on steering base Language Models, the emerging instruction-tuning and prompting paradigm offers an alternate approach to controllability. We compile and release ConGenBench, a testbed of 17 different controllable generation tasks, using a subset of it to benchmark the performance of 9 different baselines and methods on Instruction-tuned Language Models. To our surprise, we find that prompting-based approaches outperform controllable text generation methods on most datasets and tasks, highlighting a need for research on controllable text generation with Instruction-tuned Language Models specifically. Prompt-based approaches match human performance on most stylistic tasks while lagging on structural tasks, foregrounding a need to study more varied constraints and more challenging stylistic tasks. To facilitate such research, we provide an algorithm that uses only a task dataset and a Large Language Model with in-context capabilities to automatically generate a constraint dataset. This method eliminates the field's dependence on pre-curated constraint datasets, hence vastly expanding the range of constraints that can be studied in the future.
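A hedged sketch of how such a constraint dataset could be generated: a few-shot prompt asks an in-context-capable LLM to score each task example for a constraint (formality here), turning a plain task dataset into a constraint dataset. The prompt wording, the constraint, and the `generate` callable are placeholder assumptions, not the paper's algorithm.

```python
# Few-shot prompt that elicits a 1-5 constraint score from the LLM.
FEW_SHOT = """Rate the formality of each sentence from 1 (casual) to 5 (formal).
Sentence: "hey what's up" Rating: 1
Sentence: "We hereby confirm receipt of your application." Rating: 5
Sentence: "{s}" Rating:"""

def build_constraint_dataset(task_sentences, generate):
    """`generate` is any callable wrapping an LLM completion endpoint."""
    return [(s, int(generate(FEW_SHOT.format(s=s)).strip()[0]))
            for s in task_sentences]
```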
https://arxiv.org/abs/2405.01490
AI foundation models are gaining traction in various applications, including medical fields like radiology. However, medical foundation models are often tested on limited tasks, leaving their generalisability and biases unexplored. We present RayDINO, a large visual encoder trained by self-supervision on 873k chest X-rays. We compare RayDINO to previous state-of-the-art models across nine radiology tasks, from classification and dense segmentation to text generation, and provide an in-depth analysis of the population, age and sex biases of our model. Our findings suggest that self-supervision enables patient-centric AI that proves useful in clinical workflows and interprets X-rays holistically. With RayDINO and small task-specific adapters, we reach state-of-the-art results and improve generalization to unseen populations while mitigating bias, illustrating the true promise of foundation models: versatility and robustness.
https://arxiv.org/abs/2405.01469
Large Language Models (LLMs) have enabled new ways to satisfy information needs. Although great strides have been made in applying them to settings like document ranking and short-form text generation, they still struggle to compose complete, accurate, and verifiable long-form reports. Reports with these qualities are necessary to satisfy the complex, nuanced, or multi-faceted information needs of users. In this perspective paper, we draw together opinions from industry and academia, and from a variety of related research areas, to present our vision for automatic report generation, and -- critically -- a flexible framework by which such reports can be evaluated. In contrast with other summarization tasks, automatic report generation starts with a detailed description of an information need, stating the necessary background, requirements, and scope of the report. Further, the generated reports should be complete, accurate, and verifiable. These qualities, which are desirable -- if not required -- in many analytic report-writing settings, require rethinking how to build and evaluate systems that exhibit these qualities. To foster new efforts in building these systems, we present an evaluation framework that draws on ideas found in various evaluations. To test completeness and accuracy, the framework uses nuggets of information, expressed as questions and answers, that need to be part of any high-quality generated report. Additionally, evaluation of citations that map claims made in the report to their source documents ensures verifiability.
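A minimal sketch of the nugget-based completeness check: each nugget is a question-answer pair, and a report is scored by the fraction of nuggets it covers. Plain substring matching stands in for the QA model a real evaluation would use; the nuggets are invented examples.

```python
def nugget_recall(report: str, nuggets: list[tuple[str, str]]) -> float:
    """Fraction of (question, answer) nuggets whose answer appears in the report."""
    hits = sum(answer.lower() in report.lower() for _, answer in nuggets)
    return hits / len(nuggets)

nuggets = [("When was the treaty signed?", "1648"),
           ("Which cities hosted the talks?", "Osnabrück")]
print(nugget_recall("The treaty was signed in 1648 ...", nuggets))  # 0.5
```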
https://arxiv.org/abs/2405.00982
Traditional language models operate autoregressively, i.e., they predict one token at a time. The rapid explosion in model sizes has resulted in high inference times. In this work, we propose DynaMo, a suite of multi-token prediction language models that reduce net inference times. Our models dynamically predict multiple tokens based on their confidence in the predicted joint probability distribution. We propose a lightweight technique to train these models, leveraging the weights of traditional autoregressive counterparts. Moreover, we propose novel ways to enhance the estimated joint probability to improve text generation quality, namely co-occurrence weighted masking and adaptive thresholding. We also propose systematic qualitative and quantitative methods to rigorously test the quality of text generated non-autoregressively. One of the models in our suite, DynaMo-7.3B-T3, achieves the same generated-text quality as the baseline (Pythia-6.9B) while achieving a 2.57$\times$ speed-up with only 5.87% and 2.67% parameter and training time overheads, respectively.
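A hedged sketch of confidence-gated multi-token decoding: the second proposed token is committed only while the estimated joint probability stays above a threshold; otherwise the step falls back to a single token. The two-distribution `step_fn`, the naive product estimate of the joint probability (which the paper refines with co-occurrence weighted masking), and the threshold are illustrative assumptions, not DynaMo's actual heads.

```python
import torch

def dynamic_decode(step_fn, ids, max_len=32, tau=0.5):
    """step_fn(ids) -> two next-token distributions (positions t+1 and t+2)."""
    while ids.shape[1] < max_len:
        d1, d2 = step_fn(ids)
        p1, t1 = d1.max(-1)
        ids = torch.cat([ids, t1.view(1, 1)], dim=1)   # always commit token 1
        p2, t2 = d2.max(-1)
        if (p1 * p2).item() >= tau:                    # joint-confidence gate
            ids = torch.cat([ids, t2.view(1, 1)], dim=1)
    return ids
```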
https://arxiv.org/abs/2405.00888
Learning never ends, and there is no age limit to personal growth. However, the educational landscape may face challenges in effectively catering to students' inclusion and diverse learning needs. These students should have access to state-of-the-art methods for lecture delivery, online resources, and technology. Yet with so many diverse learning sources, it becomes harder for students to comprehend a large amount of knowledge in a short period of time. Traditional assistive technologies and learning aids often lack the dynamic adaptability required for individualized education plans. Large Language Models (LLMs) have been used in language translation, text summarization, and content generation applications, and with the rapid growth of AI over the past years, AI-powered chatbots and virtual assistants have been developed. This research aims to bridge this gap by introducing an innovative study buddy we call 'SAMCares'. The system leverages a Large Language Model (LLM) (in our case, LLaMa-2 70B as the base model) and Retriever-Augmented Generation (RAG) to offer real-time, context-aware, and adaptive educational support. The context of the model will be limited to the knowledge base of Sam Houston State University (SHSU) course notes. The LLM component provides a chat-like environment in which each student can interact with the system to meet their unique learning requirements. For this, we will build a custom web-based GUI. At the same time, RAG enhances real-time information retrieval and text generation, in turn providing more accurate and context-specific assistance. An option to upload additional study materials is included in the web GUI in case additional knowledge support is required. The system's efficacy will be evaluated through controlled trials and iterative feedback mechanisms.
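A minimal sketch of the retrieve-then-generate loop described above: course-note chunks are ranked by TF-IDF similarity to the student's question and stuffed into the LLM prompt. The real system uses LLaMa-2 70B and a proper vector store; both are mocked here, with `llm` standing in as any completion callable and the notes as placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

notes = ["Chapter 3: recursion and stack frames ...",
         "Chapter 7: B-trees and database indexing ..."]   # placeholder notes

def answer(question, llm, k=1):
    """Retrieve the k most similar note chunks, then ask the LLM."""
    vec = TfidfVectorizer().fit(notes + [question])
    sims = cosine_similarity(vec.transform([question]), vec.transform(notes))
    context = "\n".join(notes[i] for i in sims[0].argsort()[-k:])
    return llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
```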
https://arxiv.org/abs/2405.00330
Vision-language datasets are vital for both text-to-image (T2I) and image-to-text (I2T) research. However, current datasets lack descriptions with fine-grained detail that would allow for richer associations to be learned by models. To fill the gap, we introduce Descriptions of Connected and Contrasting Images (DOCCI), a dataset with long, human-annotated English descriptions for 15k images that were taken, curated and donated by a single researcher intent on capturing key challenges such as spatial relations, counting, text rendering, world knowledge, and more. We instruct human annotators to create comprehensive descriptions for each image; these average 136 words in length and are crafted to clearly distinguish each image from those that are related or similar. Each description is highly compositional and typically encompasses multiple challenges. Through both quantitative and qualitative analyses, we demonstrate that DOCCI serves as an effective training resource for image-to-text generation -- a PaLI 5B model finetuned on DOCCI shows equal or superior results compared to highly-performant larger models like LLaVA-1.5 7B and InstructBLIP 7B. Furthermore, we show that DOCCI is a useful testbed for text-to-image generation, highlighting the limitations of current text-to-image models in capturing long descriptions and fine details.
https://arxiv.org/abs/2404.19753
Current text generation models are trained using real data which can potentially contain sensitive information, such as confidential patient information and the like. Under certain conditions, a model can be triggered to output the training data it has memorised, exposing sensitive data. To mitigate this risk we propose a safer alternative in which fragmented data, in the form of domain-specific short phrases randomly grouped together, is shared instead of full texts. Thus, text fragments that could re-identify an individual cannot be reproduced by the model in one sequence, giving significant protection against linkage attacks. We fine-tune several state-of-the-art LLMs using meaningful syntactic chunks to explore their utility. In particular, we fine-tune BERT-based models to predict two cardiovascular diagnoses. Our results demonstrate the capacity of LLMs to benefit from pre-trained knowledge and, when fine-tuned with fragmented data, deliver classification results comparable to fine-tuning with full training data.
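A minimal sketch of the fragmentation idea: chunks from the whole corpus are pooled, shuffled, and randomly regrouped, so no single training sequence reproduces a re-identifying span. Fixed-length word windows are a simplifying assumption; the paper uses meaningful syntactic chunks.

```python
import random

def fragment_corpus(docs, chunk_len=5, chunks_per_sample=10, seed=0):
    """Split docs into short chunks, pool and shuffle them, regroup randomly."""
    rng = random.Random(seed)
    pool = []
    for doc in docs:
        words = doc.split()
        pool += [" ".join(words[i:i + chunk_len])
                 for i in range(0, len(words), chunk_len)]
    rng.shuffle(pool)                       # decouple chunks from their documents
    return [" ".join(pool[i:i + chunks_per_sample])
            for i in range(0, len(pool), chunks_per_sample)]
```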
https://arxiv.org/abs/2404.19486
Large Language Models (LLMs) have significantly advanced natural language processing (NLP) tasks but also pose ethical and societal risks due to their propensity to generate harmful content. To address this, various approaches have been developed to safeguard LLMs from producing unsafe content. However, existing methods have limitations, including the need to train specific control models and to intervene proactively during text generation, which lead to quality degradation and increased computational overhead. To mitigate those limitations, we propose LLMSafeGuard, a lightweight framework to safeguard LLM text generation in real time. LLMSafeGuard integrates an external validator into the beam search algorithm during decoding, rejecting candidates that violate safety constraints while allowing valid ones to proceed. We introduce a similarity-based validation approach, simplifying constraint introduction and eliminating the need for control model training. Additionally, LLMSafeGuard employs a context-wise timing selection strategy, intervening in LLM generation only when necessary. We evaluate LLMSafeGuard on two tasks, detoxification and copyright safeguarding, and demonstrate its superior performance over SOTA baselines. For instance, LLMSafeGuard reduces the average toxic score of LLM output by 29.7% compared to the best baseline while preserving linguistic quality similar to natural output in the detoxification task. Similarly, in the copyright task, LLMSafeGuard decreases the Longest Common Subsequence (LCS) by 56.2% compared to baselines. Moreover, our context-wise timing selection strategy reduces inference time by at least 24% while maintaining effectiveness comparable to validating at every time step. LLMSafeGuard also offers tunable parameters to balance its effectiveness and efficiency.
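A hedged sketch of validator-in-the-loop beam search: at each expansion, candidates that fail an external safety check are dropped before the beam is re-ranked. The keyword validator is a stand-in for the paper's similarity-based one, the `expand` callable is a placeholder for the LLM's top-token proposals, and the timing-selection strategy is omitted.

```python
def guarded_beam_step(beams, expand, validator, width=4):
    """beams: list of (text, score); expand(text) -> [(token, logp), ...]."""
    candidates = [(text + token, score + logp)
                  for text, score in beams
                  for token, logp in expand(text)]
    safe = [c for c in candidates if validator(c[0])]   # reject violations
    return sorted(safe, key=lambda c: c[1], reverse=True)[:width]

validator = lambda text: "toxic_word" not in text       # placeholder check
```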
https://arxiv.org/abs/2404.19048
In customer service technical support, swiftly and accurately retrieving relevant past issues is critical for efficiently resolving customer inquiries. The conventional retrieval methods in retrieval-augmented generation (RAG) for large language models (LLMs) treat a large corpus of past issue tracking tickets as plain text, ignoring the crucial intra-issue structure and inter-issue relations, which limits performance. We introduce a novel customer service question-answering method that amalgamates RAG with a knowledge graph (KG). Our method constructs a KG from historical issues for use in retrieval, retaining the intra-issue structure and inter-issue relations. During the question-answering phase, our method parses consumer queries and retrieves related sub-graphs from the KG to generate answers. This integration of a KG not only improves retrieval accuracy by preserving customer service structure information but also enhances answering quality by mitigating the effects of text segmentation. Empirical assessments on our benchmark datasets, utilizing key retrieval (MRR, Recall@K, NDCG@K) and text generation (BLEU, ROUGE, METEOR) metrics, reveal that our method outperforms the baseline by 77.6% in MRR and by 0.32 in BLEU. Our method has been deployed within LinkedIn's customer service team for approximately six months and has reduced the median per-issue resolution time by 28.6%.
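A hedged sketch of the KG-retrieval step: tickets become nodes, intra-issue structure and inter-issue relations become edges, and answering retrieves the neighborhood of the best-matching ticket. Keyword overlap stands in for a real matcher, and the two-ticket graph is an invented example.

```python
import networkx as nx

kg = nx.Graph()
kg.add_node("T-101", text="login fails with SSO timeout")
kg.add_node("T-102", text="password reset email not sent")
kg.add_edge("T-101", "T-102", relation="duplicate_of")   # inter-issue relation

def retrieve_subgraph(query, graph, hops=1):
    """Return the hop-neighborhood of the ticket best matching the query."""
    words = set(query.lower().split())
    overlap = lambda n: len(words & set(graph.nodes[n]["text"].lower().split()))
    seed = max(graph.nodes, key=overlap)      # best-matching ticket
    return nx.ego_graph(graph, seed, radius=hops)

print(list(retrieve_subgraph("SSO login timeout", kg).nodes))
# ['T-101', 'T-102'] -- the related ticket rides along via the KG edge
```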
https://arxiv.org/abs/2404.17723