This paper presents a comprehensive analysis of explainable fact-checking through a series of experiments, focusing on the ability of large language models to verify public health claims and provide explanations or justifications for their veracity assessments. We examine the effectiveness of zero/few-shot prompting and parameter-efficient fine-tuning across various open and closed-source models, assessing their performance in both isolated and joint tasks of veracity prediction and explanation generation. Importantly, we employ a dual evaluation approach comprising previously established automatic metrics and a novel set of criteria assessed through human evaluation. Our automatic evaluation indicates that, within the zero-shot scenario, GPT-4 emerges as the standout performer, but in few-shot and parameter-efficient fine-tuning contexts, open-source models demonstrate their capacity to not only bridge the performance gap but, in some instances, surpass GPT-4. Human evaluation reveals further nuance and indicates potential problems with the gold explanations.
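As a concrete illustration of the zero-shot setup, the Python sketch below shows how a claim and its evidence could be turned into a joint veracity-plus-explanation query for any chat-style LLM supplied via a `generate` callable. The prompt wording, label set, and output parsing are illustrative assumptions, not the paper's own prompts.

```python
# A minimal sketch (not the paper's exact prompts) of zero-shot joint veracity
# prediction and explanation generation for a public-health claim. The label
# set and output format are illustrative assumptions; any chat-style LLM call
# can be plugged in via `generate`.
from typing import Callable

LABELS = ["true", "false", "mixture", "unproven"]  # assumed PUBHEALTH-style labels

def build_prompt(claim: str, evidence: str) -> str:
    return (
        "You are a fact-checking assistant.\n"
        f"Claim: {claim}\n"
        f"Evidence: {evidence}\n"
        f"Classify the claim as one of {LABELS} and justify your verdict.\n"
        "Answer in the format:\nVerdict: <label>\nExplanation: <2-3 sentences>"
    )

def verify(claim: str, evidence: str, generate: Callable[[str], str]) -> dict:
    reply = generate(build_prompt(claim, evidence))
    verdict = next((l for l in LABELS if l in reply.lower()), "unproven")
    explanation = reply.split("Explanation:", 1)[-1].strip()
    return {"verdict": verdict, "explanation": explanation}

# Example with a stub generator; replace with a real LLM call (e.g. GPT-4).
print(verify("Vitamin C cures the common cold.",
             "Clinical trials show no curative effect.",
             lambda p: "Verdict: false\nExplanation: Trials found no curative effect."))
```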
https://arxiv.org/abs/2405.09454
Although language models (LMs) have boosted the performance of Question Answering, they still need plenty of data. Data annotation, however, is a time-consuming process. This especially applies to Question Answering, where possibly large documents have to be parsed and annotated with questions and their corresponding answers. Furthermore, Question Answering models often only work well for the domain they were trained on. Since annotation is costly, we argue that domain-agnostic knowledge from LMs, such as linguistic understanding, is sufficient to create a well-curated dataset. With this motivation, we show that using large language models can improve Question Answering performance on various datasets in the few-shot setting compared to state-of-the-art approaches. To this end, we perform data generation leveraging the Prompting framework, suggesting that language models contain valuable task-agnostic knowledge that can be used beyond the common pre-training/fine-tuning scheme. As a result, we consistently outperform previous approaches on few-shot Question Answering.
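The data-generation step can be pictured with the following minimal Python sketch; the demonstration pair, prompt format, and parsing logic are assumptions for illustration, not the paper's exact Prompting setup.

```python
# A minimal sketch, under assumptions, of prompt-based QA data generation:
# a few-shot prompt asks an LM to emit question/answer pairs grounded in an
# unlabeled passage, which are then parsed into synthetic training examples.
def qa_generation_prompt(passage: str) -> str:
    demo = (
        "Passage: The Eiffel Tower was completed in 1889.\n"
        "Q: When was the Eiffel Tower completed?\nA: 1889\n\n"
    )
    return demo + f"Passage: {passage}\nQ:"

def parse_pairs(passage: str, completion: str) -> list[dict]:
    pairs = []
    for block in completion.strip().split("\n\n"):
        if "\nA:" in block:
            q, a = block.split("\nA:", 1)
            pairs.append({"context": passage,
                          "question": q.replace("Q:", "").strip(),
                          "answer": a.strip()})
    return pairs

# Toy example with a hand-written "completion" standing in for the LM output.
passage = "Marie Curie won Nobel Prizes in both physics and chemistry."
completion = " In which fields did Marie Curie win Nobel Prizes?\nA: physics and chemistry"
print(parse_pairs(passage, "Q:" + completion))
```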
https://arxiv.org/abs/2405.09335
Few-shot segmentation (FSS) aims to train a model that can segment objects from novel classes given only a few labeled samples. The insufficient generalization ability of models leads to unsatisfactory performance when they lack enough labeled data from the novel classes. Considering that abundant unlabeled data are available, it is promising to improve the generalization ability by exploiting these diverse data. To leverage unlabeled data, we propose a novel method, named Image to Pseudo-Episode (IPE), to generate pseudo-episodes from unlabeled data. Specifically, our method contains two modules, i.e., the pseudo-label generation module and the episode generation module. The former generates pseudo-labels from unlabeled images by the spectral clustering algorithm, and the latter generates pseudo-episodes from pseudo-labeled images by data augmentation methods. Extensive experiments on PASCAL-$5^i$ and COCO-$20^i$ demonstrate that our method achieves state-of-the-art performance for FSS.
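The two modules can be sketched roughly as follows; the pixel features, the foreground-selection heuristic, and the augmentations are simplified placeholders rather than the paper's actual design.

```python
# A rough sketch, under assumptions, of the two IPE-style modules: spectral
# clustering of pixel features yields a pseudo-mask for an unlabeled image,
# and simple augmentations of the (image, pseudo-mask) pair form a pseudo
# support/query episode.
import numpy as np
from sklearn.cluster import SpectralClustering

def pseudo_label(features: np.ndarray, n_clusters: int = 2) -> np.ndarray:
    """features: (H, W, C) pixel embeddings -> (H, W) binary pseudo-mask."""
    h, w, c = features.shape
    labels = SpectralClustering(n_clusters=n_clusters, affinity="nearest_neighbors",
                                n_neighbors=10, random_state=0).fit_predict(
        features.reshape(-1, c))
    labels = labels.reshape(h, w)
    fg = np.argmin([np.sum(labels == k) for k in range(n_clusters)])  # assume the smaller cluster is foreground
    return (labels == fg).astype(np.uint8)

def make_pseudo_episode(image: np.ndarray, mask: np.ndarray):
    """Augment one pseudo-labeled image into a 1-shot support/query pair."""
    support = (np.fliplr(image), np.fliplr(mask))     # horizontal flip
    query = (np.rot90(image, 2), np.rot90(mask, 2))   # 180-degree rotation
    return support, query

feats = np.random.rand(16, 16, 8)          # toy pixel features
img = np.random.rand(16, 16, 3)
sup, qry = make_pseudo_episode(img, pseudo_label(feats))
```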
https://arxiv.org/abs/2405.08765
Few-shot segmentation remains challenging due to the limited labeling information for unseen classes. Most previous approaches rely on extracting high-level feature maps from the frozen visual encoder to compute pixel-wise similarity as a key prior guidance for the decoder. However, such a prior representation suffers from coarse granularity and poor generalization to new classes, since these high-level feature maps have an obvious category bias. In this work, we propose to replace the visual prior representation with the visual-text alignment capacity to capture more reliable guidance and enhance the model's generalization. Specifically, we design two kinds of training-free prior information generation strategies that utilize the semantic alignment capability of the Contrastive Language-Image Pre-training model (CLIP) to locate the target class. Besides, to acquire more accurate prior guidance, we build a high-order relationship of attention maps and utilize it to refine the initial prior information. Experiments on both the PASCAL-$5^i$ and COCO-$20^i$ datasets show that our method obtains a substantial improvement and reaches new state-of-the-art performance.
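A minimal sketch of such a training-free CLIP prior is given below; how dense patch tokens are obtained from CLIP's image encoder is left abstract, and the tensors are toy placeholders.

```python
# A minimal sketch, under assumptions, of a training-free CLIP prior for
# few-shot segmentation: cosine similarity between the text embedding of the
# target class and dense patch embeddings from the image encoder gives a
# coarse localization map used as guidance for the decoder.
import torch
import torch.nn.functional as F

def clip_prior_map(patch_feats: torch.Tensor, text_feat: torch.Tensor,
                   hw: tuple[int, int]) -> torch.Tensor:
    """patch_feats: (N, D) patch tokens, text_feat: (D,) class-name embedding."""
    patch_feats = F.normalize(patch_feats, dim=-1)
    text_feat = F.normalize(text_feat, dim=-1)
    sim = patch_feats @ text_feat                     # (N,) cosine similarities
    prior = sim.view(1, 1, *hw)                       # reshape to a spatial map
    prior = (prior - prior.min()) / (prior.max() - prior.min() + 1e-6)
    return prior                                      # values in [0, 1]

prior = clip_prior_map(torch.randn(14 * 14, 512), torch.randn(512), (14, 14))
```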
https://arxiv.org/abs/2405.08458
Due to the concise and structured nature of tables, the knowledge contained therein may be incomplete or missing, posing a significant challenge for table question answering (TableQA) and data analysis systems. Most existing datasets either fail to address the issue of external knowledge in TableQA or only utilize unstructured text as supplementary information for tables. In this paper, we propose to use a knowledge base (KB) as the external knowledge source for TableQA and construct a dataset, KET-QA, with fine-grained gold evidence annotation. Each table in the dataset corresponds to a sub-graph of the entire KB, and every question requires the integration of information from both the table and the sub-graph to be answered. To extract pertinent information from the vast knowledge sub-graph and apply it to TableQA, we design a retriever-reasoner structured pipeline model. Experimental results demonstrate that our model consistently achieves remarkable relative performance improvements ranging from 1.9 to 6.5 times, and absolute improvements of 11.66% to 44.64% in EM scores, across three distinct settings (fine-tuning, zero-shot, and few-shot), compared with relying solely on table information in the traditional TableQA manner. However, even the best model achieves only a 60.23% EM score, which still lags behind human-level performance, highlighting the challenging nature of KET-QA for the question-answering community. We also provide a human evaluation of error cases to further analyze the aspects in which the model can be improved. Project page: this https URL.
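The retriever-reasoner pipeline can be approximated as in the sketch below, where a naive token-overlap retriever stands in for the learned retriever and the reasoner input is a linearized table plus the retrieved triples; all details are illustrative assumptions.

```python
# A simplified sketch, under assumptions, of a retriever-reasoner pipeline for
# KB-augmented TableQA: the retriever scores knowledge-base triples against the
# question (here with naive token overlap instead of a learned encoder) and the
# top triples are concatenated with the linearized table for the reasoner/LLM.
def score(question: str, triple: tuple) -> int:
    q_tokens = set(question.lower().split())
    return len(q_tokens & set(" ".join(triple).lower().split()))

def retrieve(question: str, kb_triples: list, k: int = 3) -> list:
    return sorted(kb_triples, key=lambda t: score(question, t), reverse=True)[:k]

def build_reasoner_input(question: str, table_rows: list, kb_triples: list) -> str:
    table = "\n".join(" | ".join(f"{k}: {v}" for k, v in row.items()) for row in table_rows)
    evidence = "\n".join(f"({h}, {r}, {t})" for h, r, t in retrieve(question, kb_triples))
    return f"Table:\n{table}\nKB evidence:\n{evidence}\nQuestion: {question}\nAnswer:"

rows = [{"Player": "Messi", "Club": "Inter Miami"}]
kb = [("Messi", "country of citizenship", "Argentina"), ("Inter Miami", "league", "MLS")]
print(build_reasoner_input("Which country is Messi a citizen of?", rows, kb))
```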
https://arxiv.org/abs/2405.08099
Despite advancements in English-dominant generative large language models, further development is needed for low-resource languages to enhance global accessibility. The primary methods for representing these languages are monolingual and multilingual pretraining. Monolingual pretraining is expensive due to hardware requirements, and multilingual models often have uneven performance across languages. This study explores an alternative solution by adapting large language models, primarily trained on English, to low-resource languages. We assess various strategies, including continual training, instruction fine-tuning, task-specific fine-tuning, and vocabulary extension. The results show that continual training improves language comprehension, as reflected in perplexity scores, and task-specific tuning generally enhances performance on downstream tasks. However, extending the vocabulary shows no substantial benefit. Additionally, while larger models improve task performance with few-shot tuning, multilingual models perform worse than their monolingual counterparts when adapted.
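The vocabulary-extension strategy can be illustrated with the short Hugging Face sketch below; the base model and the added subword tokens are placeholders, not the ones used in the study.

```python
# A minimal sketch, under assumptions, of the vocabulary-extension step when
# adapting an English-centric LLM to a low-resource language with Hugging Face
# transformers: new subword tokens are added and the embedding matrix resized
# before continual training.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the study adapts larger English-centric LLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

new_tokens = ["ege", "lik", "çalış"]  # illustrative target-language subwords
num_added = tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))  # new rows are randomly initialized
print(f"added {num_added} tokens; vocab size is now {len(tokenizer)}")
# Continual training on target-language text would follow, after which
# perplexity on held-out text measures language comprehension.
```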
https://arxiv.org/abs/2405.07745
Detecting anomalous edges in dynamic graphs aims to identify edges that deviate significantly from the normal pattern and can be applied in various domains, such as cybersecurity, financial transactions, and AIOps. As time evolves, new types of anomalous edges emerge, and labeled anomaly samples for each type are few. Current methods are either designed to detect randomly inserted edges or require sufficient labeled data for model training, which harms their applicability to real-world applications. In this paper, we study this problem by drawing on the rich knowledge encoded in large language models (LLMs) and propose a method, namely AnomalyLLM. To align the dynamic graph with LLMs, AnomalyLLM pre-trains a dynamic-aware encoder to generate the representations of edges and reprograms the edges using the prototypes of word embeddings. Along with the encoder, we design an in-context learning framework that integrates the information of a few labeled samples to achieve few-shot anomaly detection. Experiments on four datasets reveal that AnomalyLLM can not only significantly improve the performance of few-shot anomaly detection, but also achieve superior results on new anomalies without any update of model parameters.
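The reprogramming idea can be sketched roughly as follows; the dimensions, the learned projection, and the prototype initialization are illustrative assumptions rather than the AnomalyLLM architecture.

```python
# A rough sketch, under assumptions, of reprogramming: an edge representation
# from a dynamic-aware encoder attends over a small set of word-embedding
# prototypes, and the attention-weighted combination maps the edge into the
# LLM's token-embedding space.
import torch
import torch.nn as nn

class EdgeReprogrammer(nn.Module):
    def __init__(self, edge_dim: int, llm_dim: int, num_prototypes: int = 32):
        super().__init__()
        # Prototypes would be derived from the LLM's word embeddings; random here.
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, llm_dim))
        self.query_proj = nn.Linear(edge_dim, llm_dim)

    def forward(self, edge_repr: torch.Tensor) -> torch.Tensor:
        q = self.query_proj(edge_repr)                              # (B, llm_dim)
        attn = torch.softmax(q @ self.prototypes.T / q.shape[-1] ** 0.5, dim=-1)
        return attn @ self.prototypes                               # (B, llm_dim) token-like embedding

reprog = EdgeReprogrammer(edge_dim=64, llm_dim=768)
tokens = reprog(torch.randn(4, 64))   # 4 edges mapped into the LLM embedding space
```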
https://arxiv.org/abs/2405.07626
High-quality images are crucial in remote sensing and UAV applications, but atmospheric haze can severely degrade image quality, making image dehazing a critical research area. Since the introduction of deep convolutional neural networks, numerous approaches have been proposed, and even more have emerged with the development of vision transformers and contrastive/few-shot learning. Simultaneously, papers describing dehazing architectures applicable to various Remote Sensing (RS) domains are also being published. This review goes beyond the traditional focus on benchmarked haze datasets, as we also explore the application of dehazing techniques to remote sensing and UAV datasets, providing a comprehensive overview of both deep learning and prior-based approaches in these domains. We identify key challenges, including the lack of large-scale RS datasets and the need for more robust evaluation metrics, and outline potential solutions and future research directions to address them. This review is the first, to our knowledge, to provide comprehensive discussions on both existing and very recent dehazing approaches (as of 2024) on benchmarked and RS datasets, including UAV-based imagery.
https://arxiv.org/abs/2405.07520
In recent years, deep learning based on Convolutional Neural Networks (CNNs) has achieved remarkable success in many applications. However, their heavy reliance on extensive labeled data and limited generalization ability to unseen classes pose challenges to their suitability for medical image processing tasks. Few-shot learning, which utilizes a small amount of labeled data to generalize to unseen classes, has emerged as a critical research area, attracting substantial attention. Currently, most studies employ a prototype-based approach, in which prototypical networks are used to construct prototypes from the support set, guiding the processing of the query set to obtain the final results. Although effective, this approach relies heavily on the support set while neglecting the query set, resulting in notable disparities within the model classes. To mitigate this drawback, we propose a novel Support-Query Prototype Fusion Network (SQPFNet). SQPFNet initially generates several support prototypes for the foreground areas of the support images, thus producing a coarse segmentation mask. Subsequently, a query prototype is constructed based on the coarse segmentation mask, additionally exploiting pattern information in the query set. Thus, SQPFNet constructs high-quality support-query fused prototypes, upon which the query image is segmented to obtain the final refined query mask. Evaluation results on two public datasets, SABS and CMR, show that SQPFNet achieves state-of-the-art performance.
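A stripped-down version of the support-query prototype fusion can be sketched as follows; the single-prototype setup, the fixed threshold, and the averaging fusion are simplifications of SQPFNet's actual design.

```python
# A simplified sketch, under assumptions, of support-query prototype fusion:
# masked average pooling over support features gives a support prototype, its
# similarity to query features yields a coarse query mask, a query prototype is
# pooled under that mask, and the fused prototype re-scores the query features.
import torch
import torch.nn.functional as F

def masked_avg_pool(feats: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """feats: (C, H, W), mask: (H, W) in {0,1} -> prototype (C,)."""
    return (feats * mask).sum(dim=(1, 2)) / (mask.sum() + 1e-6)

def cosine_map(feats: torch.Tensor, proto: torch.Tensor) -> torch.Tensor:
    return F.cosine_similarity(feats, proto[:, None, None], dim=0)   # (H, W)

def segment(support_feats, support_mask, query_feats, thresh: float = 0.5):
    support_proto = masked_avg_pool(support_feats, support_mask)
    coarse = (cosine_map(query_feats, support_proto) > thresh).float()   # coarse query mask
    query_proto = masked_avg_pool(query_feats, coarse)
    fused_proto = 0.5 * (support_proto + query_proto)                    # simple fusion
    return cosine_map(query_feats, fused_proto) > thresh                 # refined query mask

mask = segment(torch.randn(64, 32, 32), (torch.rand(32, 32) > 0.5).float(),
               torch.randn(64, 32, 32))
```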
https://arxiv.org/abs/2405.07516
Large language models (LLMs) perform well at a myriad of tasks, but explaining the processes behind this performance is a challenge. This paper investigates whether LLMs can give faithful high-level explanations of their own internal processes. To explore this, we introduce a dataset, ArticulateRules, of few-shot text-based classification tasks generated by simple rules. Each rule is associated with a simple natural-language explanation. We test whether models that have learned to classify inputs competently (both in- and out-of-distribution) are able to articulate freeform natural language explanations that match their classification behavior. Our dataset can be used for both in-context and finetuning evaluations. We evaluate a range of LLMs, demonstrating that articulation accuracy varies considerably between models, with a particularly sharp increase from GPT-3 to GPT-4. We then investigate whether we can improve GPT-3's articulation accuracy through a range of methods. GPT-3 completely fails to articulate 7/10 rules in our test, even after additional finetuning on correct explanations. We release our dataset, ArticulateRules, which can be used to test self-explanation for LLMs trained either in-context or by finetuning.
https://arxiv.org/abs/2405.07436
We present MedConceptsQA, a dedicated open source benchmark for medical concepts question answering. The benchmark comprises questions about various medical concepts across different vocabularies: diagnoses, procedures, and drugs. The questions are categorized into three levels of difficulty: easy, medium, and hard. We conducted evaluations of the benchmark using various Large Language Models. Our findings show that pre-trained clinical Large Language Models achieved accuracy levels close to random guessing on this benchmark, despite being pre-trained on medical data. However, GPT-4 achieves an absolute average improvement of nearly 27%-37% (27% for zero-shot learning and 37% for few-shot learning) when compared to clinical Large Language Models. Our benchmark serves as a valuable resource for evaluating the understanding and reasoning of medical concepts by Large Language Models. Our benchmark is available at this https URL.
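A few-shot evaluation loop of the kind used here might look like the sketch below; the record format, exemplars, and `generate` stub are illustrative assumptions, not the benchmark's actual interface.

```python
# A minimal sketch, under assumptions, of scoring a few-shot multiple-choice
# benchmark like MedConceptsQA: exemplars are prepended to each question, the
# model's predicted option letter is parsed, and accuracy is averaged.
from typing import Callable

def few_shot_prompt(exemplars: list, item: dict) -> str:
    shots = "\n\n".join(f"Q: {e['question']}\nOptions: {e['options']}\nAnswer: {e['answer']}"
                        for e in exemplars)
    return f"{shots}\n\nQ: {item['question']}\nOptions: {item['options']}\nAnswer:"

def accuracy(items: list, exemplars: list, generate: Callable[[str], str]) -> float:
    correct = sum(generate(few_shot_prompt(exemplars, it)).strip().upper().startswith(it["answer"])
                  for it in items)
    return correct / len(items)

demo = [{"question": "What does ICD-10 code E11 denote?",
         "options": "A) Type 1 diabetes B) Type 2 diabetes", "answer": "B"}]
test = [{"question": "What does ICD-10 code I10 denote?",
         "options": "A) Essential hypertension B) Asthma", "answer": "A"}]
print(accuracy(test, demo, lambda p: "A"))  # stub model that always answers "A"
```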
https://arxiv.org/abs/2405.07348
Generating images from text has become easier because of the scaling of diffusion models and advancements in the field of vision and language. These models are trained using vast amounts of data from the Internet. Hence, they often contain undesirable content such as copyrighted material. As it is challenging to remove such data and retrain the models, methods for erasing specific concepts from pre-trained models have been investigated. We propose a novel concept-erasure method that updates the text encoder using few-shot unlearning, in which a few real images are used. Discussion of the images generated after a concept is erased has been lacking: while there are methods for specifying the transition destination for erased concepts, the validity of the specified concepts is unclear. Our method achieves this implicitly by transitioning to the latent concepts inherent in the model or the images. Our method can erase a concept within 10 s, making concept erasure more accessible than ever before. Implicitly transitioning to related concepts leads to more natural concept erasure. We applied the proposed method to various concepts and confirmed that concept erasure can be achieved tens to hundreds of times faster than with current methods. By varying the parameters to be updated, we obtained results suggesting that, as in previous research, knowledge is primarily accumulated in the feed-forward networks of the text encoder.
https://arxiv.org/abs/2405.07288
Semitic morphologically-rich languages (MRLs) are characterized by extreme word ambiguity. Because most vowels are omitted in standard texts, many of the words are homographs with multiple possible analyses, each with a different pronunciation and different morphosyntactic properties. This ambiguity goes beyond word-sense disambiguation (WSD) and may include token segmentation into multiple word units. Previous research on MRLs claimed that standardly trained pre-trained language models (PLMs) based on word-pieces may not sufficiently capture the internal structure of such tokens in order to distinguish between these analyses. Taking Hebrew as a case study, we investigate the extent to which Hebrew homographs can be disambiguated and analyzed using PLMs. We evaluate all existing models for contextualized Hebrew embeddings on novel Hebrew homograph challenge sets that we deliver. Our empirical results demonstrate that contemporary Hebrew contextualized embeddings outperform non-contextualized embeddings, and that they are most effective for disambiguating segmentation and morphosyntactic features, less so for pure word-sense disambiguation. We show that these embeddings are more effective when the number of word-piece splits is limited, and they are more effective for 2-way and 3-way ambiguities than for 4-way ambiguity. We show that the embeddings are equally effective for homographs of both balanced and skewed distributions, whether calculated as masked or unmasked tokens. Finally, we show that these embeddings are as effective for homograph disambiguation with extensive supervised training as with a few-shot setup.
https://arxiv.org/abs/2405.07099
Garment manipulation (e.g., unfolding, folding, and hanging clothes) is essential for future robots to accomplish home-assistant tasks, yet highly challenging due to the diversity of garment configurations, geometries, and deformations. Although able to manipulate similarly shaped garments within a given task, previous works mostly have to design different policies for different tasks, cannot generalize to garments with diverse geometries, and often rely heavily on human-annotated data. In this paper, we leverage the property that garments in a certain category have similar structures, and we learn the topological dense (point-level) visual correspondence among garments at the category level, across different deformations, in a self-supervised manner. The topological correspondence can be easily adapted to functional correspondence to guide the manipulation policies for various downstream tasks, with only one- or few-shot demonstrations. Experiments on garments from 3 different categories on 3 representative tasks in diverse scenarios, using one or two arms, taking one or more steps, and inputting flat or messy garments, demonstrate the effectiveness of our proposed method. Project page: this https URL.
https://arxiv.org/abs/2405.06903
In today's digital landscape, where cyber attacks have become the norm, the detection of cyber attacks and threats is critically imperative across diverse domains. Our research presents a new empirical framework for cyber threat modeling, adept at parsing and categorizing cyber-related information from news articles and enhancing real-time vigilance for market stakeholders. At the core of this framework is a fine-tuned BERT model, which we call CANAL - Cyber Activity News Alerting Language Model, tailored for cyber categorization using a novel silver-labeling approach powered by Random Forest. We benchmark CANAL against larger, costlier LLMs, including GPT-4, LLaMA, and Zephyr, evaluating their zero- to few-shot learning on cyber news classification. CANAL outperforms all other LLM counterparts in both accuracy and cost-effectiveness. Furthermore, we introduce the Cyber Signal Discovery module, a strategic component designed to efficiently detect emerging cyber signals from news articles. Collectively, CANAL and the Cyber Signal Discovery module equip our framework to provide a robust and cost-effective solution for businesses that require agile responses to cyber intelligence.
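The silver-labeling step can be pictured with the sketch below; the seed examples, features, and confidence handling are illustrative assumptions, and the resulting labels would feed the BERT fine-tuning stage.

```python
# A rough sketch, under assumptions, of silver labeling: a Random Forest
# trained on a small seed set of labeled headlines assigns provisional
# cyber/non-cyber labels to unlabeled news; confident predictions would then
# be used to fine-tune the BERT-based classifier.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

seed_texts = ["Ransomware gang hits hospital network", "Quarterly earnings beat analyst estimates"]
seed_labels = [1, 0]                                   # 1 = cyber-related, 0 = not
unlabeled = ["Phishing campaign hits bank network", "Central bank holds interest rates steady"]

vec = TfidfVectorizer()
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(vec.fit_transform(seed_texts), seed_labels)

# Silver labels with probabilities; in practice only confident ones are kept.
for text, proba in zip(unlabeled, clf.predict_proba(vec.transform(unlabeled))):
    print(f"p(cyber)={proba[1]:.2f}  {text}")
```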
https://arxiv.org/abs/2405.06772
Large Language Models (LLMs) play a crucial role in capturing structured semantics to enhance language understanding, improve interpretability, and reduce bias. Nevertheless, an ongoing controversy exists over the extent to which LLMs can grasp structured semantics. To assess this, we propose using Semantic Role Labeling (SRL) as a fundamental task to explore LLMs' ability to extract structured semantics. In our assessment, we employ the prompting approach, which leads to the creation of our few-shot SRL parser, called PromptSRL. PromptSRL enables LLMs to map natural language to explicit semantic structures, which provides an interpretable window into the properties of LLMs. We find interesting potential: LLMs can indeed capture semantic structures, and scaling up does not always mirror this potential. Additionally, limitations of LLMs are observed in C-arguments, etc. Lastly, we are surprised to discover a significant overlap between the errors made by LLMs and those made by untrained humans, accounting for almost 30% of all errors.
https://arxiv.org/abs/2405.06410
The integration of Large Language Models (LLMs) into healthcare diagnostics offers a promising avenue for clinical decision-making. This study outlines the development of a novel method for zero-shot/few-shot in-context learning (ICL) by integrating medical domain knowledge using a multi-layered structured prompt. We also explore the efficacy of two communication styles between the user and LLMs: the Numerical Conversational (NC) style, which processes data incrementally, and the Natural Language Single-Turn (NL-ST) style, which employs long narrative prompts. Our study systematically evaluates the diagnostic accuracy and risk factors, including gender bias and false negative rates, using a dataset of 920 patient records in various few-shot scenarios. Results indicate that traditional clinical machine learning (ML) models generally outperform LLMs in zero-shot and few-shot settings. However, the performance gap narrows significantly when employing few-shot examples alongside effective explainable AI (XAI) methods as sources of domain knowledge. Moreover, with sufficient time and an increased number of examples, the conversational style (NC) nearly matches the performance of ML models. Most notably, LLMs demonstrate comparable or superior cost-sensitive accuracy relative to ML models. This research confirms that, with appropriate domain knowledge and tailored communication strategies, LLMs can significantly enhance diagnostic processes. The findings highlight the importance of optimizing the number of training examples and communication styles to improve accuracy and reduce biases in LLM applications.
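The two communication styles can be illustrated with the following sketch; the feature names, wording, and message format are assumptions for illustration, not the study's actual prompts.

```python
# A minimal sketch, under assumptions, of the two communication styles compared
# in the study: a Numerical Conversational (NC) exchange that feeds features
# incrementally as short turns, and a Natural Language Single-Turn (NL-ST)
# narrative prompt built from the same record.
record = {"age": 63, "systolic_bp": 152, "cholesterol": 6.1, "smoker": "yes"}

def nc_messages(rec: dict) -> list:
    msgs = [{"role": "system", "content": "You are assisting with a diagnostic assessment."}]
    for name, value in rec.items():                      # one feature per turn
        msgs.append({"role": "user", "content": f"{name} = {value}"})
    msgs.append({"role": "user",
                 "content": "Given all values so far, is the patient high risk? Answer yes/no."})
    return msgs

def nl_st_prompt(rec: dict) -> str:
    narrative = ", ".join(f"{k.replace('_', ' ')} of {v}" for k, v in rec.items())
    return (f"A patient presents with {narrative}. "
            "Based on this description, is the patient at high risk? Answer yes or no and explain briefly.")

print(nc_messages(record))
print(nl_st_prompt(record))
```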
https://arxiv.org/abs/2405.06270
Substance use disorders (SUDs) are a growing concern globally, necessitating enhanced understanding of the problem and its trends through data-driven research. Social media are unique and important sources of information about SUDs, particularly since the data in such sources are often generated by people with lived experiences. In this paper, we introduce Reddit-Impacts, a challenging Named Entity Recognition (NER) dataset curated from subreddits dedicated to discussions on prescription and illicit opioids, as well as medications for opioid use disorder. The dataset specifically concentrates on the lesser-studied, yet critically important, aspects of substance use: its clinical and social impacts. We collected data from chosen subreddits using the publicly available Application Programming Interface for Reddit. We manually annotated text spans representing clinical and social impacts reported by people who also reported personal nonmedical use of substances, including but not limited to opioids, stimulants, and benzodiazepines. Our objective is to create a resource that can enable the development of systems that can automatically detect clinical and social impacts of substance use from text-based social media data. The successful development of such systems may enable us to better understand how nonmedical use of substances affects individual health and societal dynamics, aiding the development of effective public health strategies. In addition to creating the annotated data set, we applied several machine learning models to establish baseline performances. Specifically, we experimented with transformer models such as BERT and RoBERTa, the few-shot learning model DANN (leveraging the full training dataset), and GPT-3.5 with one-shot learning, for automatic NER of clinical and social impacts. The dataset has been made available through the 2024 SMM4H shared tasks.
https://arxiv.org/abs/2405.06145
Open-ended worlds are those in which there are no pre-specified goals or environmental reward signal. As a consequence, an agent must know how to perform a multitude of tasks. However, when a new task is presented to an agent, we expect it to be able to reuse some of what it knows from previous tasks to rapidly learn that new task. We introduce a novel technique whereby policies for different a priori known tasks are combined into a Mixture-of-Experts model with an attention mechanism across a mix of frozen and unfrozen experts. The model learns when to attend to frozen task-specific experts, and it learns new experts to handle novel situations. We work in an open-ended text-based environment in which the agent is tasked with behaving like different types of character roles and must rapidly learn behaviors associated with new character role types. We show that our agent both obtains more reward in the zero-shot setting and discovers these rewards with greater sample efficiency in few-shot learning settings.
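A rough sketch of such a policy is given below; the gating network, expert shapes, and the single trainable expert are illustrative assumptions rather than the paper's exact architecture.

```python
# A rough sketch, under assumptions, of a Mixture-of-Experts policy with an
# attention mechanism over frozen prior-task experts plus one trainable expert
# for novel situations; attention weights come from the current state.
import torch
import torch.nn as nn

class AttentiveMoEPolicy(nn.Module):
    def __init__(self, frozen_experts: list, state_dim: int, action_dim: int):
        super().__init__()
        for e in frozen_experts:                         # keep prior-task experts fixed
            for p in e.parameters():
                p.requires_grad_(False)
        self.new_expert = nn.Linear(state_dim, action_dim)   # learns novel behavior
        self.experts = nn.ModuleList(list(frozen_experts) + [self.new_expert])
        self.attn = nn.Linear(state_dim, len(self.experts))  # attention over experts

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.attn(state), dim=-1)             # (B, E)
        logits = torch.stack([e(state) for e in self.experts], dim=1) # (B, E, A)
        return (weights.unsqueeze(-1) * logits).sum(dim=1)            # (B, A) action logits

frozen = [nn.Linear(32, 8), nn.Linear(32, 8)]        # stand-ins for prior-task policies
policy = AttentiveMoEPolicy(frozen, state_dim=32, action_dim=8)
out = policy(torch.randn(4, 32))
```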
https://arxiv.org/abs/2405.06059
Many tasks related to Computational Social Science and Web Content Analysis involve classifying pieces of text based on the claims they contain. State-of-the-art approaches usually involve fine-tuning models on large annotated datasets, which are costly to produce. In light of this, we propose and release a qualitative and versatile few-shot learning methodology as a common paradigm for any claim-based textual classification task. This methodology involves defining the classes as arbitrarily sophisticated taxonomies of claims and using Natural Language Inference models to obtain the textual entailment between these and a corpus of interest. The performance of these models is then boosted by annotating a minimal sample of data points, dynamically sampled using the well-established statistical heuristic of Probabilistic Bisection. We illustrate this methodology in the context of three tasks: climate change contrarianism detection, topic/stance classification, and depression-related symptom detection. This approach rivals traditional pre-train/fine-tune approaches while drastically reducing the need for data annotation.
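The entailment step can be sketched with an off-the-shelf NLI model as below; the taxonomy entries and the choice of `facebook/bart-large-mnli` are illustrative assumptions, and the Probabilistic Bisection step is only indicated in a comment.

```python
# A minimal sketch, under assumptions, of the entailment step: an off-the-shelf
# NLI model (via the Hugging Face zero-shot-classification pipeline) scores how
# strongly a text entails each claim in the taxonomy. A small annotated sample,
# selected via Probabilistic Bisection, would then calibrate the score threshold.
from transformers import pipeline

taxonomy = [
    "Climate change is not caused by human activity.",
    "Climate policies do more harm than good.",
]
nli = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

text = "CO2 is plant food, so burning fossil fuels cannot be the problem."
result = nli(text, candidate_labels=taxonomy, multi_label=True)
for claim, score in zip(result["labels"], result["scores"]):
    print(f"{score:.2f}  {claim}")   # entailment-style score per claim
```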
https://arxiv.org/abs/2405.05705