This paper presents a comprehensive analysis of explainable fact-checking through a series of experiments, focusing on the ability of large language models to verify public health claims and provide explanations or justifications for their veracity assessments. We examine the effectiveness of zero/few-shot prompting and parameter-efficient fine-tuning across various open and closed-source models, assessing their performance in both isolated and joint tasks of veracity prediction and explanation generation. Importantly, we employ a dual evaluation approach comprising previously established automatic metrics and a novel set of criteria through human evaluation. Our automatic evaluation indicates that, within the zero-shot scenario, GPT-4 emerges as the standout performer, but in few-shot and parameter-efficient fine-tuning contexts, open-source models demonstrate their capacity to not only bridge the performance gap but, in some instances, surpass GPT-4. Human evaluation reveals further nuance and points to potential problems with the gold explanations.
https://arxiv.org/abs/2405.09454
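As a concrete illustration of the zero-shot setup studied above, the sketch below prompts a chat model for a joint verdict and justification. The label set, prompt wording, and OpenAI client usage are illustrative assumptions, not the paper's exact protocol.

```python
# A minimal zero-shot sketch of joint veracity prediction and explanation
# generation, in the spirit of the setup described above. Labels assume a
# PUBHEALTH-style scheme; the actual prompt in the paper may differ.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def check_claim(claim: str, evidence: str) -> str:
    prompt = (
        "You are a fact-checker for public health claims.\n"
        f"Claim: {claim}\n"
        f"Evidence: {evidence}\n"
        "First, classify the claim as one of: true, false, mixture, unproven.\n"
        "Then, justify your verdict in 2-3 sentences grounded in the evidence."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output for evaluation
    )
    return response.choices[0].message.content
```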
Despite the recent ubiquity of large language models and their high zero-shot prompted performance across a wide range of tasks, it is still not known how well they perform on tasks which require processing of potentially idiomatic language. In particular, how well do such models perform in comparison to encoder-only models fine-tuned specifically for idiomaticity tasks? In this work, we attempt to answer this question by looking at the performance of a range of LLMs (both local and software-as-a-service models) on three idiomaticity datasets: SemEval 2022 Task 2a, FLUTE, and MAGPIE. Overall, we find that whilst these models do give competitive performance, they do not match the results of fine-tuned task-specific models, even at the largest scales (e.g. for GPT-4). Nevertheless, we do see consistent performance improvements across model scale. Additionally, we investigate prompting approaches to improve performance, and discuss the practicalities of using LLMs for these tasks.
https://arxiv.org/abs/2405.09279
"How does the person in the bounding box feel?" Achieving human-level recognition of the apparent emotion of a person in real world situations remains an unsolved task in computer vision. Facial expressions are not enough: body pose, contextual knowledge, and commonsense reasoning all contribute to how humans perform this emotional theory of mind task. In this paper, we examine two major approaches enabled by recent large vision language models: 1) image captioning followed by a language-only LLM, and 2) vision language models, under zero-shot and fine-tuned setups. We evaluate the methods on the Emotions in Context (EMOTIC) dataset and demonstrate that a vision language model, fine-tuned even on a small dataset, can significantly outperform traditional baselines. The results of this work aim to help robots and agents perform emotionally sensitive decision-making and interaction in the future.
https://arxiv.org/abs/2405.08992
Autonomous systems often encounter environments and scenarios beyond the scope of their training data, which underscores a critical challenge: the need to generalize and adapt to unseen scenarios in real time. This challenge necessitates new mathematical and algorithmic tools that enable adaptation and zero-shot transfer. To this end, we leverage the theory of function encoders, which enables zero-shot transfer by combining the flexibility of neural networks with the mathematical principles of Hilbert spaces. Using this theory, we first present a method for learning a space of dynamics spanned by a set of neural ODE basis functions. After training, the proposed approach can rapidly identify the dynamics of a new system in the learned space using an efficient inner product calculation. Critically, this calculation requires no gradient computation or retraining during the online phase. This method enables zero-shot transfer for autonomous systems at runtime and opens the door for a new class of adaptable control algorithms. We demonstrate state-of-the-art system modeling accuracy for two MuJoCo robot environments and show that the learned models can be used for more efficient MPC control of a quadrotor.
https://arxiv.org/abs/2405.08954
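The zero-shot identification step above reduces to a linear solve once the basis is fixed. A minimal sketch with a toy scalar system and placeholder basis functions follows; the paper learns neural ODE bases, and the least-squares solve below reduces to inner products when the basis is orthonormal under the empirical measure.

```python
# Identify an unseen system's dynamics as coefficients over learned basis
# functions, without gradients or retraining. The basis here is a toy
# stand-in for the paper's neural ODE basis functions.
import numpy as np

def identify_coefficients(basis_fns, states, derivatives):
    """Estimate c such that f(x) ~ sum_i c_i * g_i(x) from samples."""
    # G[n, i] stacks basis evaluations g_i(x_n); dynamics here are scalar.
    G = np.stack([g(states) for g in basis_fns], axis=-1)  # (N, k)
    coeffs, *_ = np.linalg.lstsq(G, derivatives, rcond=None)
    return coeffs

# Toy example: 1-D dynamics spanned by {x, sin(x)}.
basis = [lambda x: x, lambda x: np.sin(x)]
xs = np.random.uniform(-2, 2, size=200)
dxdt = 0.5 * xs - 1.5 * np.sin(xs)  # "unseen" system inside the span
print(identify_coefficients(basis, xs, dxdt))  # ~ [0.5, -1.5]
```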
CLIP models perform remarkably well on zero-shot classification and retrieval tasks. But recent studies have shown that learnt representations in CLIP are not well suited for dense prediction tasks like object detection, semantic segmentation or depth estimation. More recently, multi-stage training methods for CLIP models were introduced to mitigate the weak performance of CLIP on downstream tasks. In this work, we find that simply improving the quality of captions in image-text datasets improves the quality of CLIP's visual representations, resulting in significant improvement on downstream dense prediction vision tasks. In fact, we find that CLIP pretraining with good quality captions can surpass recent supervised, self-supervised and weakly supervised pretraining methods. We show that when a CLIP model with ViT-B/16 as the image encoder is trained on well-aligned image-text pairs, it obtains 12.1% higher mIoU and 11.5% lower RMSE on semantic segmentation and depth estimation tasks over recent state-of-the-art Masked Image Modeling (MIM) pretraining methods like Masked Autoencoder (MAE). We find that mobile architectures also benefit significantly from CLIP pretraining. A recent mobile vision architecture, MCi2, with CLIP pretraining obtains similar performance to Swin-L, pretrained on ImageNet-22k, on the semantic segmentation task while being 6.1$\times$ smaller. Moreover, we show that improving caption quality results in $10\times$ data efficiency when finetuning for dense prediction tasks.
https://arxiv.org/abs/2405.08911
Navigating the complex landscape of news articles involves understanding the various actors or entities involved, referred to as news stakeholders. These stakeholders, ranging from policymakers to opposition figures, citizens, and more, play pivotal roles in shaping news narratives. Recognizing their stakeholder types, reflecting their roles, political alignments, social standing, and more, is paramount for a nuanced comprehension of news content. Despite existing works focusing on salient entity extraction, coverage variations, and political affiliations through social media data, the automated detection of stakeholder roles within news content remains an underexplored domain. In this paper, we bridge this gap by introducing an effective approach to classify stakeholder types in news articles. Our method involves transforming the stakeholder classification problem into a natural language inference task, utilizing contextual information from news articles and external knowledge to enhance the accuracy of stakeholder type detection. Moreover, our proposed model showcases efficacy in zero-shot settings, further extending its applicability to diverse news contexts.
https://arxiv.org/abs/2405.08751
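A minimal sketch of recasting stakeholder typing as NLI, using an off-the-shelf zero-shot classification pipeline (which runs NLI under the hood). The label set, example sentence, and hypothesis template are illustrative assumptions, not the paper's.

```python
# Stakeholder typing as NLI: each candidate label is scored by how well
# the hypothesis built from it is entailed by the news context.
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

context = ("The finance minister defended the budget, while opposition "
           "leaders called for her resignation.")
labels = ["policymaker", "opposition figure", "citizen", "expert"]

result = classifier(
    context,
    candidate_labels=labels,
    hypothesis_template="This sentence is about a {} as a news stakeholder.",
)
print(result["labels"][0], result["scores"][0])
```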
Addressing multi-label action recognition in videos represents a significant challenge for robotic applications in dynamic environments, especially when the robot is required to cooperate with humans in tasks that involve objects. Existing methods still struggle to recognize unseen actions or require extensive training data. To overcome these problems, we propose Dual-VCLIP, a unified approach for zero-shot multi-label action recognition. Dual-VCLIP enhances VCLIP, a zero-shot action recognition method, with the DualCoOp method for multi-label image classification. The strength of our method is that at training time it only learns two prompts, and it is therefore much simpler than other methods. We validate our method on the Charades dataset, which includes a majority of object-based actions, demonstrating that, despite its simplicity, our method performs favorably with respect to existing methods on the complete dataset and achieves promising performance when tested on unseen actions. Our contribution emphasizes the impact of verb-object class-splits during robots' training for new cooperative tasks, highlighting their influence on performance and giving insights into mitigating biases.
https://arxiv.org/abs/2405.08695
This paper investigates the application of GPT-3.5 for Grammatical Error Correction (GEC) in multiple languages in several settings: zero-shot GEC, fine-tuning for GEC, and using GPT-3.5 to re-rank correction hypotheses generated by other GEC models. In the zero-shot setting, we conduct automatic evaluations of the corrections proposed by GPT-3.5 using several methods: estimating grammaticality with language models (LMs), the Scribendi test, and comparing the semantic embeddings of sentences. GPT-3.5 has a known tendency to over-correct erroneous sentences and propose alternative corrections. For several languages, such as Czech, German, Russian, Spanish, and Ukrainian, GPT-3.5 substantially alters the source sentences, including their semantics, which presents significant challenges for evaluation with reference-based metrics. For English, GPT-3.5 demonstrates high recall, generates fluent corrections, and generally preserves sentence semantics. However, human evaluation for both English and Russian reveals that, despite its strong error-detection capabilities, GPT-3.5 struggles with several error types, including punctuation mistakes, tense errors, syntactic dependencies between words, and lexical compatibility at the sentence level.
https://arxiv.org/abs/2405.08469
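One of the automatic evaluation methods above, reference-free grammaticality estimation with an LM, can be sketched as a perplexity comparison between the source sentence and a proposed correction. The choice of GPT-2 as the scoring model is an assumption for illustration.

```python
# Reference-free grammaticality check: accept a correction if it lowers
# LM perplexity relative to the source sentence.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    # Labels equal to inputs give the average next-token cross-entropy.
    loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

source = "He go to school every days."
hypothesis = "He goes to school every day."
print(perplexity(source), perplexity(hypothesis))  # correction should score lower
```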
Large language models (LLMs) have shown incredible proficiency in performing tasks that require semantic understanding of natural language instructions. Recently, many works have further expanded this capability to perceive multimodal audio and text inputs, but their capabilities are often limited to specific fine-tuned tasks such as automatic speech recognition and translation. We therefore develop SpeechVerse, a robust multi-task training and curriculum learning framework that combines pre-trained speech and text foundation models via a small set of learnable parameters, while keeping the pre-trained models frozen during training. The models are instruction finetuned using continuous latent representations extracted from the speech foundation model to achieve optimal zero-shot performance on a diverse range of speech processing tasks using natural language instructions. We perform extensive benchmarking that includes comparing our model performance against traditional baselines across several datasets and tasks. Furthermore, we evaluate the model's capability for generalized instruction following by testing on out-of-domain datasets, novel prompts, and unseen tasks. Our empirical experiments reveal that our multi-task SpeechVerse model is even superior to conventional task-specific baselines on 9 out of the 11 tasks.
https://arxiv.org/abs/2405.08295
Existing text-to-image models struggle to follow complex text prompts, raising the need for extra grounding inputs for better controllability. In this work, we propose to decompose a scene into visual primitives - denoted as dense blob representations - that contain fine-grained details of the scene while being modular, human-interpretable, and easy-to-construct. Based on blob representations, we develop a blob-grounded text-to-image diffusion model, termed BlobGEN, for compositional generation. Particularly, we introduce a new masked cross-attention module to disentangle the fusion between blob representations and visual features. To leverage the compositionality of large language models (LLMs), we introduce a new in-context learning approach to generate blob representations from text prompts. Our extensive experiments show that BlobGEN achieves superior zero-shot generation quality and better layout-guided controllability on MS-COCO. When augmented by LLMs, our method exhibits superior numerical and spatial correctness on compositional image generation benchmarks. Project page: this https URL.
https://arxiv.org/abs/2405.08246
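A minimal sketch of masked cross-attention in the spirit of the module above: each image token attends only to the blobs whose region covers it, disentangling blob-to-feature fusion. The shapes, masking rule, and residual fusion are illustrative assumptions, not BlobGEN's exact design.

```python
# Masked cross-attention between image tokens and blob embeddings.
import torch

def masked_cross_attention(visual, blobs, region_mask, wq, wk, wv):
    """visual: (N, d) image tokens; blobs: (B, d) blob embeddings;
    region_mask: (N, B) bool, True where token n lies in blob b's region."""
    q, k, v = visual @ wq, blobs @ wk, blobs @ wv
    logits = q @ k.T / k.shape[-1] ** 0.5               # (N, B)
    logits = logits.masked_fill(~region_mask, float("-inf"))
    attn = torch.softmax(logits, dim=-1)
    attn = torch.nan_to_num(attn)                       # tokens in no region
    return visual + attn @ v                            # residual fusion

d, N, B = 64, 16, 3
visual, blobs = torch.randn(N, d), torch.randn(B, d)
region_mask = torch.rand(N, B) > 0.5
wq, wk, wv = (torch.randn(d, d) * d ** -0.5 for _ in range(3))
print(masked_cross_attention(visual, blobs, region_mask, wq, wk, wv).shape)
```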
As training datasets become increasingly drawn from unstructured, uncontrolled environments such as the web, researchers and industry practitioners have increasingly relied upon data filtering techniques to "filter out the noise" of web-scraped data. While datasets have been widely shown to reflect the biases and values of their creators, in this paper we contribute to an emerging body of research that assesses the filters used to create these datasets. We show that image-text data filtering also has biases and is value-laden, encoding specific notions of what is counted as "high-quality" data. In our work, we audit a standard approach of image-text CLIP-filtering on the academic benchmark DataComp's CommonPool by analyzing discrepancies of filtering through various annotation techniques across multiple modalities of image, text, and website source. We find that data relating to several imputed demographic groups -- such as LGBTQ+ people, older women, and younger men -- are associated with higher rates of exclusion. Moreover, we demonstrate cases of exclusion amplification: not only are certain marginalized groups already underrepresented in the unfiltered data, but CLIP-filtering excludes data from these groups at higher rates. The data-filtering step in the machine learning pipeline can therefore exacerbate representation disparities already present in the data-gathering step, especially when existing filters are designed to optimize a specifically-chosen downstream performance metric like zero-shot image classification accuracy. Finally, we show that the NSFW filter fails to remove sexually-explicit content from CommonPool, and that CLIP-filtering includes several categories of copyrighted content at high rates. Our conclusions point to a need for fundamental changes in dataset creation and filtering practices.
https://arxiv.org/abs/2405.08209
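The audited CLIP-filtering step amounts to thresholding image-text cosine similarity. A minimal sketch follows; the checkpoint and threshold are illustrative assumptions and differ from DataComp's actual filter settings.

```python
# Keep an image-text pair only if its CLIP similarity clears a threshold;
# the audit above asks *whose* data such thresholds disproportionately exclude.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_score(image: Image.Image, caption: str) -> float:
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    return torch.cosine_similarity(img, txt).item()

def keep(image, caption, threshold=0.3):
    return clip_score(image, caption) >= threshold

demo = Image.new("RGB", (224, 224), color="red")
print(clip_score(demo, "a plain red square"))
```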
Due to the concise and structured nature of tables, the knowledge contained therein may be incomplete or missing, posing a significant challenge for table question answering (TableQA) and data analysis systems. Most existing datasets either fail to address the issue of external knowledge in TableQA or only utilize unstructured text as supplementary information for tables. In this paper, we propose to use a knowledge base (KB) as the external knowledge source for TableQA and construct a dataset KET-QA with fine-grained gold evidence annotation. Each table in the dataset corresponds to a sub-graph of the entire KB, and every question requires the integration of information from both the table and the sub-graph to be answered. To extract pertinent information from the vast knowledge sub-graph and apply it to TableQA, we design a retriever-reasoner structured pipeline model. Experimental results demonstrate that our model consistently achieves remarkable relative performance improvements ranging from 1.9 to 6.5 times and absolute improvements of 11.66% to 44.64% on EM scores across three distinct settings (fine-tuning, zero-shot, and few-shot), in comparison with solely relying on table information in the traditional TableQA manner. However, even the best model achieves a 60.23% EM score, which still lags behind human-level performance, highlighting the challenging nature of KET-QA for the question-answering community. We also provide a human evaluation of error cases to further analyze the aspects in which the model can be improved. Project page: this https URL.
https://arxiv.org/abs/2405.08099
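A minimal sketch of a retriever-reasoner split in the spirit of the pipeline above: rank KB triples against the question by embedding similarity before passing the top-k, together with the table, to a reader. The encoder choice and triple format are illustrative assumptions.

```python
# Dense retrieval of KB triples relevant to a table question.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def retrieve(question: str, triples: list[str], k: int = 3) -> list[str]:
    q_emb = encoder.encode(question, convert_to_tensor=True)
    t_emb = encoder.encode(triples, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, t_emb)[0]
    top = scores.topk(min(k, len(triples))).indices.tolist()
    return [triples[i] for i in top]  # hand these plus the table to the reasoner

triples = ["Paris | capital_of | France",
           "France | population | 68 million",
           "Berlin | capital_of | Germany"]
print(retrieve("What is the population of France?", triples, k=1))
```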
Zero-shot anomaly segmentation using pre-trained foundation models is a promising approach that enables effective algorithms without expensive, domain-specific training or fine-tuning. Ensuring that these methods work across various environmental conditions and are robust to distribution shifts is an open problem. We investigate the performance of the WinCLIP [14] zero-shot anomaly segmentation algorithm by perturbing test data using three semantic transformations: bounded angular rotations, bounded saturation shifts, and hue shifts. We empirically measure a lower performance bound by aggregating across per-sample worst-case perturbations and find that average performance drops by up to 20% in area under the ROC curve and 40% in area under the per-region overlap curve. We find that performance is consistently lowered across three CLIP backbones, regardless of model architecture or learning objective, demonstrating a need for careful performance evaluation.
https://arxiv.org/abs/2405.07969
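A minimal sketch of the robustness protocol above: sweep the three bounded semantic transformations per test sample and keep the worst-case score. The bounds and grid sizes are illustrative assumptions, and `score_fn` stands in for the WinCLIP-based per-sample metric.

```python
# Per-sample worst case over a grid of semantic perturbations; the reported
# lower bound is the mean of these worst cases across the test set.
import numpy as np
import torchvision.transforms.functional as TF
from PIL import Image

def worst_case_score(image, label_mask, score_fn, n_steps=5):
    """score_fn(image, label_mask) -> scalar metric (e.g., per-sample AUROC)."""
    scores = []
    for angle in np.linspace(-30, 30, n_steps):        # bounded rotation
        scores.append(score_fn(TF.rotate(image, float(angle)), label_mask))
    for sat in np.linspace(0.5, 1.5, n_steps):         # bounded saturation shift
        scores.append(score_fn(TF.adjust_saturation(image, float(sat)),
                               label_mask))
    for hue in np.linspace(-0.25, 0.25, n_steps):      # hue shift
        scores.append(score_fn(TF.adjust_hue(image, float(hue)), label_mask))
    return min(scores)

# Dummy demo with a score function that ignores the mask.
img = Image.new("RGB", (64, 64), "gray")
print(worst_case_score(img, None, lambda im, m: float(np.mean(np.array(im)))))
```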
Text-to-image generative models have become a prominent and powerful tool that excels at generating high-resolution realistic images. However, guiding the generative process of these models to consider detailed forms of conditioning reflecting style and/or structure information remains an open problem. In this paper, we present LoRAdapter, an approach that unifies both style and structure conditioning under the same formulation using a novel conditional LoRA block that enables zero-shot control. LoRAdapter is an efficient, powerful, and architecture-agnostic approach to condition text-to-image diffusion models; it enables fine-grained conditioning control during generation and outperforms recent state-of-the-art approaches.
https://arxiv.org/abs/2405.07913
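A minimal sketch of a conditional LoRA block in the spirit of LoRAdapter: a frozen linear layer augmented with a low-rank update whose intermediate activations are gated by an embedding of the style or structure condition. The exact conditioning mechanism here is an assumption for illustration.

```python
# A frozen base layer plus a condition-modulated low-rank update.
import torch
import torch.nn as nn

class ConditionalLoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int, cond_dim: int):
        super().__init__()
        self.base = base.requires_grad_(False)      # frozen pretrained weight
        self.A = nn.Linear(base.in_features, rank, bias=False)
        self.B = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.B.weight)               # update starts at zero
        self.to_scale = nn.Linear(cond_dim, rank)   # condition -> modulation

    def forward(self, x, cond):
        # Low-rank path is gated per-rank by the condition embedding.
        h = self.A(x) * self.to_scale(cond)
        return self.base(x) + self.B(h)

layer = ConditionalLoRALinear(nn.Linear(64, 64), rank=8, cond_dim=32)
x, cond = torch.randn(2, 64), torch.randn(2, 32)
print(layer(x, cond).shape)  # torch.Size([2, 64])
```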
Language models (LMs) are bound to their tokenizer, which maps raw text to a sequence of vocabulary items (tokens). This restricts their flexibility: for example, LMs trained primarily on English may still perform well in other natural and programming languages, but have vastly decreased efficiency due to their English-centric tokenizer. To mitigate this, we should be able to swap the original LM tokenizer with an arbitrary one, on the fly, without degrading performance. Hence, in this work we define a new problem: Zero-Shot Tokenizer Transfer (ZeTT). The challenge at the core of ZeTT is finding embeddings for the tokens in the vocabulary of the new tokenizer. Since prior heuristics for initializing embeddings often perform at chance level in a ZeTT setting, we propose a new solution: we train a hypernetwork taking a tokenizer as input and predicting the corresponding embeddings. We empirically demonstrate that the hypernetwork generalizes to new tokenizers both with encoder (e.g., XLM-R) and decoder LLMs (e.g., Mistral-7B). Our method comes close to the original models' performance in cross-lingual and coding tasks while markedly reducing the length of the tokenized sequence. We also find that the remaining gap can be quickly closed by continued training on less than 1B tokens. Finally, we show that a ZeTT hypernetwork trained for a base (L)LM can also be applied to fine-tuned variants without extra training. Overall, our results make substantial strides toward detaching LMs from their tokenizer.
https://arxiv.org/abs/2405.07883
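A minimal sketch of the core ZeTT component: a hypernetwork that maps each token of a new vocabulary (here via its raw bytes) to an embedding for the frozen LM. The byte-level encoder below is an illustrative assumption; the paper's hypernetwork takes the tokenizer itself as input and is trained across many tokenizers.

```python
# Predict embeddings for an arbitrary new vocabulary from token bytes.
import torch
import torch.nn as nn

class EmbeddingHypernetwork(nn.Module):
    def __init__(self, d_model=256, max_bytes=16):
        super().__init__()
        self.byte_embed = nn.Embedding(256 + 1, d_model, padding_idx=256)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        self.max_bytes = max_bytes

    def forward(self, tokens: list[str]) -> torch.Tensor:
        ids = torch.full((len(tokens), self.max_bytes), 256)  # pad id
        for i, tok in enumerate(tokens):
            bs = list(tok.encode("utf-8"))[: self.max_bytes]
            ids[i, : len(bs)] = torch.tensor(bs)
        h = self.encoder(self.byte_embed(ids))
        return h.mean(dim=1)  # one predicted embedding per new token

hyper = EmbeddingHypernetwork()
print(hyper(["hello", "##izmus", "编程"]).shape)  # torch.Size([3, 256])
```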
This paper undertakes an empirical study to revisit the latest advancements in Multimodal Large Language Models (MLLMs): the Video Assistant. This study, namely FreeVA, aims to extend an existing image-based MLLM to the video domain in a training-free manner. The study provides an essential, must-know baseline and reveals several surprising findings: 1) FreeVA, leveraging only an offline image-based MLLM without additional training, excels in zero-shot video question-answering (e.g., MSVD-QA, ActivityNet-QA, and MSRVTT-QA), even surpassing state-of-the-art methods that involve video instruction tuning. 2) While mainstream video-based MLLMs typically initialize with an image-based MLLM (e.g., LLaVA) and then fine-tune using video instruction tuning, the study indicates that utilizing the widely adopted VideoInstruct-100K for video instruction tuning doesn't actually lead to better performance compared to not training at all. 3) The commonly used evaluation metrics in existing works are significantly influenced by changes in the GPT API version over time. If ignored, this could affect the fairness and uniformity of comparisons between different methods and impact the analysis and judgment of researchers in the field. The advancement of MLLMs is currently thriving, drawing numerous researchers into the field. We aim for this work to serve as a plug-and-play, simple yet effective baseline, encouraging the direct evaluation of existing MLLMs in the video domain while also standardizing the field of video conversational models to a certain extent. Also, we encourage researchers to reconsider: have current video MLLM methods truly acquired knowledge beyond image MLLMs? Code is available at this https URL
https://arxiv.org/abs/2405.07798
It is an interesting question whether, and how, large language models (LLMs) can understand non-language network data and help us detect unknown malicious flows. This paper takes Carpet Bombing as a case study and shows how to exploit LLMs' powerful capability in the networking area. Carpet Bombing is a new DDoS attack that has dramatically increased in recent years, significantly threatening network infrastructures. It targets multiple victim IPs within subnets, causing congestion on access links and disrupting network services for a vast number of users. Characterized by low rates and multiple attack vectors, these attacks challenge traditional DDoS defenses. We propose DoLLM, a DDoS detection model that utilizes open-source LLMs as its backbone. By reorganizing non-contextual network flows into Flow-Sequences and projecting them into the LLM's semantic space as token embeddings, DoLLM leverages the LLM's contextual understanding to extract flow representations in the overall network context. The representations are used to improve DDoS detection performance. We evaluate DoLLM on the public CIC-DDoS2019 dataset and on real NetFlow traces from a top-3 countrywide ISP. The tests show that DoLLM possesses strong detection capabilities: its F1 score increased by up to 33.3% in zero-shot scenarios and by at least 20.6% on the real ISP traces.
https://arxiv.org/abs/2405.07638
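A minimal sketch of the projection step above: per-flow feature vectors are linearly mapped into the LLM's hidden space and consumed as a Flow-Sequence in place of text tokens, with a small classification head on the pooled output. The feature dimensionality, pooling, and head are illustrative assumptions, not DoLLM's exact design.

```python
# Project flow records into an LLM's embedding space for DDoS detection.
import torch
import torch.nn as nn

class FlowSequenceEncoder(nn.Module):
    def __init__(self, n_flow_features: int, llm_hidden: int):
        super().__init__()
        self.project = nn.Linear(n_flow_features, llm_hidden)  # flow -> "token"
        self.classify = nn.Linear(llm_hidden, 2)               # benign vs. DDoS

    def forward(self, flows: torch.Tensor, llm_backbone: nn.Module):
        # flows: (batch, seq_len, n_flow_features); the frozen backbone is
        # assumed to be a Hugging Face base model accepting inputs_embeds.
        tokens = self.project(flows)
        ctx = llm_backbone(inputs_embeds=tokens).last_hidden_state
        return self.classify(ctx.mean(dim=1))  # pooled flow representation

# Usage sketch (hidden size must match the chosen backbone):
#   backbone = AutoModel.from_pretrained("mistralai/Mistral-7B-v0.1")
#   logits = FlowSequenceEncoder(12, 4096)(flows, backbone)
```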
Linking (aligning) biomedical concepts across diverse data sources enables various integrative analyses, but it is challenging due to the discrepancies in concept naming conventions. Various strategies have been developed to overcome this challenge, such as those based on string-matching rules, manually crafted thesauri, and machine learning models. However, these methods are constrained by limited prior biomedical knowledge and can hardly generalize beyond the limited amounts of rules, thesauri, or training samples. Recently, large language models (LLMs) have exhibited impressive results in diverse biomedical NLP tasks due to their unprecedentedly rich prior knowledge and strong zero-shot prediction abilities. However, LLMs suffer from issues including high costs, limited context length, and unreliable predictions. In this research, we propose PromptLink, a novel biomedical concept linking framework that leverages LLMs. It first employs a biomedical-specialized pre-trained language model to generate candidate concepts that can fit in the LLM context windows. Then it utilizes an LLM to link concepts through two-stage prompts, where the first-stage prompt aims to elicit the biomedical prior knowledge from the LLM for the concept linking task and the second-stage prompt enforces the LLM to reflect on its own predictions to further enhance their reliability. Empirical results on the concept linking task between two EHR datasets and an external biomedical KG demonstrate the effectiveness of PromptLink. Furthermore, PromptLink is a generic framework without reliance on additional prior knowledge, context, or training data, making it well-suited for concept linking across various types of data sources. The source code is available at this https URL.
https://arxiv.org/abs/2405.07500
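A minimal sketch of the two-stage prompting described above: the first prompt elicits a linking judgment from the LLM's biomedical prior knowledge, and the second asks the model to reflect on and confirm its own prediction. The prompt wording and OpenAI backend are illustrative assumptions, not PromptLink's exact prompts.

```python
# Two-stage concept linking: elicit, then self-reflect.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4", temperature=0,
        messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

def link_concept(source: str, candidates: list[str]) -> str:
    # Stage 1: elicit biomedical prior knowledge for the linking decision.
    first = ask(
        f"Which of the following biomedical concepts means the same as "
        f"'{source}'? Candidates: {candidates}. Answer with one candidate.")
    # Stage 2: self-reflection to improve reliability of the prediction.
    return ask(
        f"You previously linked '{source}' to '{first}'. Is this correct? "
        f"If not, choose a better candidate from {candidates}. "
        f"Answer with one candidate only.")
```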
In-hand manipulation is an integral component of human dexterity. Our hands rely on tactile feedback for stable and reactive motions to ensure objects do not slip away unintentionally during manipulation. For a robot hand, this level of dexterity requires extracting and utilizing rich contact information for precise motor control. In this paper, we present AnyRotate, a system for gravity-invariant multi-axis in-hand object rotation using dense featured sim-to-real touch. We construct a continuous contact feature representation to provide tactile feedback for training a policy in simulation and introduce an approach to perform zero-shot policy transfer by training an observation model to bridge the sim-to-real gap. Our experiments highlight the benefit of detailed contact information when handling objects with varying properties. In the real world, we demonstrate successful sim-to-real transfer of the dense tactile policy, generalizing to a diverse range of objects for various rotation axes and hand directions and outperforming other forms of low-dimensional touch. Interestingly, despite not having explicit slip detection, rich multi-fingered tactile sensing can implicitly detect object movement within grasp and provide a reactive behavior that improves the robustness of the policy, highlighting the importance of information-rich tactile sensing for in-hand manipulation.
https://arxiv.org/abs/2405.07391
We present MedConceptsQA, a dedicated open-source benchmark for medical concepts question answering. The benchmark comprises questions about various medical concepts across different vocabularies: diagnoses, procedures, and drugs. The questions are categorized into three levels of difficulty: easy, medium, and hard. We conducted evaluations of the benchmark using various Large Language Models. Our findings show that pre-trained clinical Large Language Models achieved accuracy levels close to random guessing on this benchmark, despite being pre-trained on medical data. However, GPT-4 achieves an absolute average improvement of nearly 27%-37% (27% for zero-shot learning and 37% for few-shot learning) when compared to clinical Large Language Models. Our benchmark serves as a valuable resource for evaluating the understanding and reasoning of medical concepts by Large Language Models. Our benchmark is available at this https URL
https://arxiv.org/abs/2405.07348