Esophageal cancer is one of the most common cancers worldwide and ranks sixth in cancer-related mortality. Accurate computer-assisted diagnosis of cancer progression can help physicians customize effective, personalized treatment plans. Currently, CT-based cancer diagnosis methods have received much attention for their ability to examine a patient's condition comprehensively. However, multi-modal methods can introduce information redundancy, which degrades performance. In addition, efficient and effective interaction between multi-modal representations remains underexplored, and the prognostic correlations among multi-modal features lack insightful analysis. In this work, we introduce a multi-modal heterogeneous graph-based conditional feature-guided diffusion model for lymph node metastasis diagnosis based on CT images together with clinical measurements and radiomics data. To explore the intricate relationships between multi-modal features, we construct a heterogeneous graph. A conditional feature-guided diffusion approach is then applied to eliminate information redundancy. Moreover, we propose a masked relational representation learning strategy to uncover the latent prognostic correlations and priorities of primary tumor and lymph node image representations. Extensive experimental results validate the effectiveness of our proposed method. The code is available at this https URL.
Esophageal cancer is one of the most common cancers worldwide and ranks sixth in cancer-related mortality. Accurate computer-assisted diagnosis of cancer progression can help physicians effectively personalize treatment plans. Currently, CT-based cancer diagnosis methods have received wide attention for their ability to examine a patient's condition comprehensively. However, multi-modal methods may introduce information redundancy, leading to degraded performance. In addition, efficient and effective interactions between multi-modal representations need further exploration, and the prognostic correlations within multi-modal features lack insightful investigation. In this work, we introduce a multi-modal heterogeneous graph-based conditional feature-guided diffusion model for diagnosing lymph node metastasis from CT images together with clinical measurements and radiomics data. To explore the intricate relationships among multi-modal features, we construct a heterogeneous graph. A conditional feature-guided diffusion approach is then applied to eliminate information redundancy. Moreover, we propose a masked relational representation learning strategy to uncover the latent prognostic correlations and priorities of primary tumor and lymph node image representations. Extensive experimental results confirm the effectiveness of the proposed method. The code is available at this https URL.
https://arxiv.org/abs/2405.09539
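As a rough illustration of the multi-modal heterogeneous-graph idea above, the sketch below assembles per-patient CT, clinical, and radiomics features into a heterogeneous graph with PyTorch Geometric. The node and edge type names, feature dimensions, and fully-connected wiring are illustrative assumptions, not the paper's actual design.

```python
# Hypothetical sketch: per-patient multi-modal features as a heterogeneous graph
# (PyTorch Geometric). Node/edge types and feature sizes are illustrative assumptions.
import torch
from torch_geometric.data import HeteroData

def build_patient_graph(ct_feats, clinical_feats, radiomics_feats):
    """ct_feats: [N_ct, 256], clinical_feats: [N_cl, 32], radiomics_feats: [N_rad, 107]."""
    g = HeteroData()
    g["ct"].x = ct_feats
    g["clinical"].x = clinical_feats
    g["radiomics"].x = radiomics_feats

    def fully_connect(n_src, n_dst):
        src = torch.arange(n_src).repeat_interleave(n_dst)
        dst = torch.arange(n_dst).repeat(n_src)
        return torch.stack([src, dst], dim=0)

    # Connect every modality pair so a heterogeneous GNN can exchange information across them.
    g["ct", "relates_to", "clinical"].edge_index = fully_connect(ct_feats.size(0), clinical_feats.size(0))
    g["ct", "relates_to", "radiomics"].edge_index = fully_connect(ct_feats.size(0), radiomics_feats.size(0))
    g["clinical", "relates_to", "radiomics"].edge_index = fully_connect(clinical_feats.size(0), radiomics_feats.size(0))
    return g

graph = build_patient_graph(torch.randn(4, 256), torch.randn(1, 32), torch.randn(1, 107))
print(graph)
```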
"How does the person in the bounding box feel?" Achieving human-level recognition of the apparent emotion of a person in real world situations remains an unsolved task in computer vision. Facial expressions are not enough: body pose, contextual knowledge, and commonsense reasoning all contribute to how humans perform this emotional theory of mind task. In this paper, we examine two major approaches enabled by recent large vision language models: 1) image captioning followed by a language-only LLM, and 2) vision language models, under zero-shot and fine-tuned setups. We evaluate the methods on the Emotions in Context (EMOTIC) dataset and demonstrate that a vision language model, fine-tuned even on a small dataset, can significantly outperform traditional baselines. The results of this work aim to help robots and agents perform emotionally sensitive decision-making and interaction in the future.
"How does the person in the bounding box feel?" Achieving human-level recognition of a person's apparent emotion in real-world situations remains an unsolved task in computer vision. Facial expressions alone are not enough: body pose, contextual knowledge, and commonsense reasoning all contribute to how humans perform this emotional theory-of-mind task. In this paper, we examine two major approaches enabled by recent large vision-language models: 1) image captioning followed by a language-only LLM, and 2) vision-language models, under zero-shot and fine-tuned setups. We evaluate these methods on the Emotions in Context (EMOTIC) dataset and show that a vision-language model, even fine-tuned on a small dataset, can significantly outperform traditional baselines. The results of this work aim to help robots and agents perform emotionally sensitive decision-making and interaction in the future.
https://arxiv.org/abs/2405.08992
Can web-based image processing and visualization tools easily integrate into existing websites without significant time and effort? Our Boostlet.js library addresses this challenge by providing an open-source, JavaScript-based web framework to enable additional image processing functionalities. Boostlet examples include kernel filtering, image captioning, data visualization, segmentation, and web-optimized machine-learning models. To achieve this, Boostlet.js uses a browser bookmark to inject a user-friendly plugin selection tool called PowerBoost into any host website. Boostlet also provides on-site access to a standard API independent of any visualization framework for pixel data and scene manipulation. Web-based Boostlets provide a modular architecture and client-side processing capabilities to apply advanced image-processing techniques using consumer-level hardware. The code is open-source and available.
Can web-based image processing and visualization tools be integrated into existing websites without significant time and effort? Our Boostlet.js library addresses this challenge by providing an open-source, JavaScript-based web framework that enables additional image-processing functionality. Boostlet examples include kernel filtering, image captioning, data visualization, segmentation, and web-optimized machine-learning models. To achieve this, Boostlet.js uses a browser bookmark to inject a user-friendly plugin-selection tool called PowerBoost into any host website. Boostlet also provides on-site access to a standard API, independent of any visualization framework, for pixel data and scene manipulation. Web-based Boostlets provide a modular architecture and client-side processing so that advanced image-processing techniques can be applied on consumer-level hardware. The code is open source and available.
https://arxiv.org/abs/2405.07868
We present SLIP (SAM+CLIP), an enhanced architecture for zero-shot object segmentation. SLIP combines the Segment Anything Model (SAM) \cite{kirillov2023segment} with the Contrastive Language-Image Pretraining (CLIP) \cite{radford2021learning}. By incorporating text prompts into SAM using CLIP, SLIP enables object segmentation without prior training on specific classes or categories. We fine-tune CLIP on a Pokemon dataset, allowing it to learn meaningful image-text representations. SLIP demonstrates the ability to recognize and segment objects in images based on contextual information from text prompts, expanding the capabilities of SAM for versatile object segmentation. Our experiments demonstrate the effectiveness of the SLIP architecture in segmenting objects in images based on textual cues. The integration of CLIP's text-image understanding capabilities into SAM expands the capabilities of the original architecture and enables more versatile and context-aware object segmentation.
We present SLIP (SAM+CLIP), an enhanced architecture for zero-shot object segmentation. SLIP combines the Segment Anything Model (SAM) with Contrastive Language-Image Pretraining (CLIP). By incorporating text prompts into SAM via CLIP, SLIP enables object segmentation without prior training on specific classes or categories. We fine-tune CLIP on a Pokemon dataset so that it learns meaningful image-text representations. SLIP demonstrates the ability to recognize and segment objects in images based on contextual information from text prompts, extending SAM toward more versatile object segmentation. Our experiments demonstrate the effectiveness of the SLIP architecture in segmenting objects in images based on textual cues. Integrating CLIP's text-image understanding into SAM expands the capabilities of the original architecture and enables more versatile, context-aware object segmentation.
https://arxiv.org/abs/2405.07284
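A minimal sketch of how SAM masks could be selected with a CLIP text prompt, in the spirit of the SLIP entry above. It assumes the official segment-anything package and HuggingFace CLIP with a stock (not Pokemon-fine-tuned) checkpoint; the checkpoint path and prompt are placeholders, and this is not the authors' exact pipeline.

```python
# Sketch: generate class-agnostic masks with SAM, then score each masked crop against a text
# prompt with CLIP and keep the best-matching mask. Model/checkpoint choices are assumptions.
import numpy as np
import torch
from PIL import Image
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator
from transformers import CLIPModel, CLIPProcessor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")   # checkpoint path is an assumption
mask_generator = SamAutomaticMaskGenerator(sam)
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def segment_by_text(image_path: str, prompt: str):
    image = np.array(Image.open(image_path).convert("RGB"))
    masks = mask_generator.generate(image)
    crops = []
    for m in masks:
        x, y, w, h = map(int, m["bbox"])                 # SAM returns XYWH boxes per mask
        crops.append(Image.fromarray(image[y:y + h, x:x + w]))
    inputs = clip_proc(text=[prompt], images=crops, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = clip_model(**inputs)
    scores = out.logits_per_text[0]                      # prompt similarity to every crop
    best = int(scores.argmax())
    return masks[best]["segmentation"], float(scores[best])

# mask, score = segment_by_text("pikachu.jpg", "a yellow pokemon")
```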
Despite the significant progress of fully-supervised video captioning, zero-shot methods remain much less explored. In this paper, we propose to take advantage of existing pre-trained large-scale vision and language models to directly generate captions with test time adaptation. Specifically, we bridge video and text using three key models: a general video understanding model XCLIP, a general image understanding model CLIP, and a text generation model GPT-2, due to their source-code availability. The main challenge is how to enable the text generation model to be sufficiently aware of the content in a given video so as to generate corresponding captions. To address this problem, we propose using learnable tokens as a communication medium between frozen GPT-2 and frozen XCLIP as well as frozen CLIP. Differing from the conventional way to train these tokens with training data, we update these tokens with pseudo-targets of the inference data under several carefully crafted loss functions which enable the tokens to absorb video information catered for GPT-2. This procedure can be done in just a few iterations (we use 16 iterations in the experiments) and does not require ground truth data. Extensive experimental results on three widely used datasets, MSR-VTT, MSVD, and VATEX, show 4% to 20% improvements in terms of the main metric CIDEr compared to the existing state-of-the-art methods.
Despite significant progress in fully supervised video captioning, zero-shot methods remain far less explored. In this paper, we propose to take advantage of existing large-scale pre-trained vision and language models to generate captions directly, with test-time adaptation. Specifically, we bridge video and text using three key models: a general video understanding model, XCLIP; a general image understanding model, CLIP; and a text generation model, GPT-2, chosen for their source-code availability. The main challenge is making the text generation model sufficiently aware of the content of a given video so that it can generate corresponding captions. To address this, we propose using learnable tokens as a communication medium between frozen GPT-2 and frozen XCLIP as well as frozen CLIP. Unlike the conventional practice of training such tokens on training data, we update them with pseudo-targets derived from the inference data under several carefully crafted loss functions, which enable the tokens to absorb video information tailored to GPT-2. This procedure takes only a few iterations (16 in our experiments) and requires no ground-truth data. Extensive results on three widely used datasets, MSR-VTT, MSVD, and VATEX, show improvements of 4% to 20% on the main metric, CIDEr, over existing state-of-the-art methods.
https://arxiv.org/abs/2405.07046
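A hedged sketch of the test-time learnable-token idea described above: a small set of prefix embeddings is the only trainable tensor, optimized for a handful of iterations against a frozen GPT-2 on a pseudo-target caption. The pseudo-target string, prefix length, and learning rate are placeholders; the actual method derives its pseudo-targets and additional losses from frozen CLIP/XCLIP video-text scores, which are not shown here.

```python
# Sketch of test-time adaptation of learnable prefix tokens with a frozen GPT-2
# (HuggingFace transformers). All hyperparameters and the pseudo-target are illustrative.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2").eval()
for p in gpt2.parameters():
    p.requires_grad_(False)

n_prefix, dim = 8, gpt2.config.n_embd
prefix = torch.nn.Parameter(0.02 * torch.randn(1, n_prefix, dim))   # the only trainable tensor
optim = torch.optim.AdamW([prefix], lr=1e-3)

pseudo_target = "a man plays guitar on a stage"     # placeholder pseudo-target for one video
target_ids = tok(pseudo_target, return_tensors="pt").input_ids

for _ in range(16):                                  # a handful of iterations, as in the paper
    target_embeds = gpt2.transformer.wte(target_ids)
    inputs_embeds = torch.cat([prefix, target_embeds], dim=1)
    labels = torch.cat([torch.full((1, n_prefix), -100), target_ids], dim=1)  # ignore prefix positions
    loss = gpt2(inputs_embeds=inputs_embeds, labels=labels).loss
    optim.zero_grad()
    loss.backward()
    optim.step()
# After adaptation, a caption is decoded from GPT-2 conditioned on the learned prefix.
```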
In Magnetic Resonance Imaging (MRI), image acquisitions are often undersampled in the measurement domain to accelerate the scanning process, at the expense of image quality. However, image quality is a crucial factor that influences the accuracy of clinical diagnosis; hence, high-quality image reconstruction from undersampled measurements has been a key area of research. Recently, deep learning (DL) methods have emerged as the state-of-the-art for MRI reconstruction, typically involving deep neural networks to transform undersampled MRI images into high-quality MRI images through data-driven processes. Nevertheless, there is clear and significant room for improvement in undersampled DL MRI reconstruction to meet the high standards required for clinical diagnosis, in terms of eliminating aliasing artifacts and reducing image noise. In this paper, we introduce a self-supervised pretraining procedure using contrastive learning to improve the accuracy of undersampled DL MRI reconstruction. We use contrastive learning to transform the MRI image representations into a latent space that maximizes mutual information among different undersampled representations and optimizes the information content at the input of the downstream DL reconstruction models. Our experiments demonstrate improved reconstruction accuracy across a range of acceleration factors and datasets, both quantitatively and qualitatively. Furthermore, our extended experiments validate the proposed framework's robustness under adversarial conditions, such as measurement noise, different k-space sampling patterns, and pathological abnormalities, and also prove the transfer learning capabilities on MRI datasets with completely different anatomy. Additionally, we conducted experiments to visualize and analyze the properties of the proposed MRI contrastive learning latent space.
In Magnetic Resonance Imaging (MRI), acquisitions are often undersampled in the measurement domain to accelerate scanning, at the expense of image quality. Image quality, however, is a crucial factor for the accuracy of clinical diagnosis, so high-quality reconstruction from undersampled measurements has been a key research area. Recently, deep learning (DL) methods have become the state of the art for MRI reconstruction, typically using deep neural networks to transform undersampled MRI images into high-quality ones through data-driven processes. Nevertheless, there remains clear and significant room for improvement in undersampled DL MRI reconstruction before it meets the high standards of clinical diagnosis, in terms of removing aliasing artifacts and reducing image noise. In this paper, we introduce a self-supervised pretraining procedure based on contrastive learning to improve the accuracy of undersampled DL MRI reconstruction. We use contrastive learning to map MRI image representations into a latent space that maximizes mutual information among different undersampled representations and optimizes the information content at the input of the downstream DL reconstruction models. Our experiments show improved reconstruction accuracy across a range of acceleration factors and datasets, both quantitatively and qualitatively. Extended experiments further validate the framework's robustness under adversarial conditions such as measurement noise, different k-space sampling patterns, and pathological abnormalities, and demonstrate transfer learning to MRI datasets with completely different anatomy. We also conduct experiments to visualize and analyze the properties of the proposed MRI contrastive learning latent space.
https://arxiv.org/abs/2306.00530
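As a minimal sketch of the kind of contrastive objective described above, the snippet below computes a SimCLR-style NT-Xent loss between embeddings of two differently undersampled views of the same MRI slices. The encoder, temperature, and batch size are placeholders; the paper's exact loss and latent-space design may differ.

```python
# Minimal NT-Xent (SimCLR-style) contrastive loss between two undersampled views of the
# same slices; positives are the paired views, negatives are all other samples in the batch.
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """z1, z2: [B, D] latent embeddings of two undersampled views of the same B slices."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)           # [2B, D]
    sim = z @ z.t() / temperature                                 # cosine similarities
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))                    # exclude self-pairs
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

# Example: embeddings from an encoder applied to two undersampling patterns of the same batch.
z_a, z_b = torch.randn(8, 128), torch.randn(8, 128)
loss = nt_xent(z_a, z_b)
```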
An all-too-present bottleneck for text classification model development is the need to annotate training data and this need is multiplied for multilingual classifiers. Fortunately, contemporary machine translation models are both easily accessible and have dependable translation quality, making it possible to translate labeled training data from one language into another. Here, we explore the effects of using machine translation to fine-tune a multilingual model for a classification task across multiple languages. We also investigate the benefits of using a novel technique, originally proposed in the field of image captioning, to account for potential negative effects of tuning models on translated data. We show that translated data are of sufficient quality to tune multilingual classifiers and that this novel loss technique is able to offer some improvement over models tuned without it.
A persistent bottleneck in developing text classification models is the need to annotate training data, and this need is multiplied for multilingual classifiers. Fortunately, contemporary machine translation models are easily accessible and offer dependable translation quality, making it possible to translate labeled training data from one language into another. Here, we explore the effects of using machine translation to fine-tune a multilingual model for a classification task across multiple languages. We also investigate the benefit of a novel technique, originally proposed in the field of image captioning, for accounting for the potential negative effects of tuning models on translated data. We show that translated data are of sufficient quality to tune multilingual classifiers and that this novel loss technique offers some improvement over models tuned without it.
https://arxiv.org/abs/2405.05478
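A small sketch of the translate-the-training-data step, assuming a HuggingFace MarianMT model; the English-to-German pair, model name, and toy examples are illustrative, and the paper's captioning-inspired loss is not shown.

```python
# Sketch: translate labeled training data with MarianMT so it can be used to fine-tune a
# multilingual classifier. Language pair and model name are illustrative assumptions.
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-de"
tokenizer = MarianTokenizer.from_pretrained(model_name)
translator = MarianMTModel.from_pretrained(model_name)

def translate(texts: list[str]) -> list[str]:
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    generated = translator.generate(**batch, max_new_tokens=128)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

# Labels are carried over unchanged; only the text is translated.
labeled_en = [("the service was excellent", 1), ("the product broke after a day", 0)]
labeled_de = [(t, y) for (s, y), t in zip(labeled_en, translate([s for s, _ in labeled_en]))]
print(labeled_de)
```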
In this paper, we present LOC-ZSON, a novel Language-driven Object-Centric image representation for object navigation task within complex scenes. We propose an object-centric image representation and corresponding losses for visual-language model (VLM) fine-tuning, which can handle complex object-level queries. In addition, we design a novel LLM-based augmentation and prompt templates for stability during training and zero-shot inference. We implement our method on Astro robot and deploy it in both simulated and real-world environments for zero-shot object navigation. We show that our proposed method can achieve an improvement of 1.38 - 13.38% in terms of text-to-image recall on different benchmark settings for the retrieval task. For object navigation, we show the benefit of our approach in simulation and real world, showing 5% and 16.67% improvement in terms of navigation success rate, respectively.
In this paper, we present LOC-ZSON, a novel language-driven, object-centric image representation for the object navigation task in complex scenes. We propose an object-centric image representation and corresponding losses for visual-language model (VLM) fine-tuning that can handle complex object-level queries. In addition, we design a novel LLM-based augmentation scheme and prompt templates for stability during training and zero-shot inference. We implement our method on the Astro robot and deploy it in both simulated and real-world environments for zero-shot object navigation. Our method achieves an improvement of 1.38 - 13.38% in text-to-image recall on different benchmark settings for the retrieval task. For object navigation, we show the benefit of our approach in simulation and in the real world, with 5% and 16.67% improvements in navigation success rate, respectively.
https://arxiv.org/abs/2405.05363
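For reference, a small sketch of the text-to-image recall@k metric reported above, computed over precomputed text and image embeddings. The embeddings and the one-to-one query/target pairing are assumptions for illustration; this is not the LOC-ZSON representation itself.

```python
# Sketch of text-to-image recall@k over precomputed embeddings; a query counts as a hit if its
# ground-truth image appears in the top-k retrieved images.
import torch
import torch.nn.functional as F

def recall_at_k(text_emb: torch.Tensor, image_emb: torch.Tensor, k: int = 5) -> float:
    """text_emb[i] is the query whose ground-truth match is image_emb[i]."""
    sims = F.normalize(text_emb, dim=1) @ F.normalize(image_emb, dim=1).t()   # [Q, N]
    topk = sims.topk(k, dim=1).indices                                        # [Q, k]
    targets = torch.arange(text_emb.size(0)).unsqueeze(1)
    hits = (topk == targets).any(dim=1).float()
    return hits.mean().item()

print(recall_at_k(torch.randn(100, 512), torch.randn(100, 512), k=5))
```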
The diagnosis and treatment of chest diseases play a crucial role in maintaining human health. X-ray examination has become the most common clinical examination means due to its efficiency and cost-effectiveness. Artificial intelligence analysis methods for chest X-ray images are limited by insufficient annotation data and varying levels of annotation, resulting in weak generalization ability and difficulty in clinical dissemination. Here we present EVA-X, an innovative foundational model based on X-ray images with broad applicability to various chest disease detection tasks. EVA-X is the first X-ray image based self-supervised learning method capable of capturing both semantic and geometric information from unlabeled images for universal X-ray image representation. Through extensive experimentation, EVA-X has demonstrated exceptional performance in chest disease analysis and localization, becoming the first model capable of spanning over 20 different chest diseases and achieving leading results in over 11 different detection tasks in the medical field. Additionally, EVA-X significantly reduces the burden of data annotation in the medical AI field, showcasing strong potential in the domain of few-shot learning. The emergence of EVA-X will greatly propel the development and application of foundational medical models, bringing about revolutionary changes in future medical research and clinical practice. Our codes and models are available at: this https URL.
The diagnosis and treatment of chest diseases play a crucial role in maintaining human health. X-ray examination has become the most common clinical examination because of its efficiency and cost-effectiveness. Artificial-intelligence analysis of chest X-ray images is limited by insufficient annotated data and varying levels of annotation, resulting in weak generalization and difficulty in clinical adoption. Here we present EVA-X, an innovative foundation model based on X-ray images with broad applicability to various chest disease detection tasks. EVA-X is the first X-ray-based self-supervised learning method capable of capturing both semantic and geometric information from unlabeled images for universal X-ray image representation. Through extensive experimentation, EVA-X has demonstrated exceptional performance in chest disease analysis and localization, becoming the first model to span more than 20 different chest diseases and achieve leading results on more than 11 different detection tasks in the medical field. EVA-X also significantly reduces the burden of data annotation in medical AI and shows strong potential for few-shot learning. The emergence of EVA-X will greatly propel the development and application of medical foundation models, bringing revolutionary changes to future medical research and clinical practice. Our code and models are available at this https URL.
https://arxiv.org/abs/2405.05237
Refining 3D LiDAR data has attracted growing interest, motivated by recent techniques such as supervised learning and generative model-based methods. Existing approaches have shown that diffusion models can generate refined LiDAR data with high fidelity, although the performance and speed of such methods have been limited. These limitations make real-time execution difficult, so these approaches struggle in real-world tasks such as autonomous navigation and human-robot interaction. In this work, we introduce a novel approach based on conditional diffusion models for fast and high-quality sparse-to-dense upsampling of 3D scene point clouds through an image representation. Our method employs denoising diffusion probabilistic models trained with conditional inpainting masks, which have been shown to give high performance on image completion tasks. We present a series of experiments, covering multiple datasets, sampling steps, and conditional masks, to determine the ideal configuration, striking a balance between performance and inference speed. This paper shows that our method outperforms the baselines in sampling speed and quality on upsampling tasks using the KITTI-360 dataset. Furthermore, we illustrate the generalization ability of our approach by simultaneously training on real-world and synthetic datasets, introducing variance in quality and environments.
Refining 3D LiDAR data has attracted growing interest, motivated by recent techniques such as supervised learning and generative model-based methods. Although diffusion models have been shown to generate refined LiDAR data with high fidelity, the performance and speed of such methods remain limited. These limitations make real-time execution difficult, so such approaches struggle in real-world tasks like autonomous navigation and human-robot interaction. In this work, we introduce a novel approach based on conditional diffusion models for fast, high-quality sparse-to-dense upsampling of 3D scene point clouds through an image representation. Our method employs denoising diffusion probabilistic models trained with conditional inpainting masks, which have performed well on image completion tasks. We present a series of experiments across multiple datasets, sampling steps, and conditional masks to determine the ideal configuration, balancing performance and inference speed. The paper shows that our method outperforms the baselines in sampling speed and quality on upsampling tasks with the KITTI-360 dataset. Furthermore, we demonstrate the generalization ability of the approach by training simultaneously on real-world and synthetic datasets, introducing variance in quality and environments.
https://arxiv.org/abs/2405.04889
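A sketch of the image representation this kind of LiDAR upsampling relies on: projecting a point cloud into a range image and building a row-subsampling mask as the sparse condition for inpainting. Beam count, vertical field of view, width, and the keep-every-4th-beam pattern are placeholders; the diffusion model itself is not shown.

```python
# Sketch: spherical projection of a LiDAR point cloud to a range image, plus a sparse-to-dense
# conditioning mask. Resolution and FOV values are illustrative assumptions.
import numpy as np

def to_range_image(points: np.ndarray, height: int = 64, width: int = 1024,
                   fov_up: float = 3.0, fov_down: float = -25.0) -> np.ndarray:
    """points: [N, 3] xyz. Returns a [height, width] range image (0 where empty)."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    depth = np.linalg.norm(points, axis=1) + 1e-8
    yaw = np.arctan2(y, x)                       # azimuth in [-pi, pi]
    pitch = np.arcsin(z / depth)                 # elevation
    u = ((1.0 - (yaw / np.pi + 1.0) / 2.0) * width).astype(np.int32) % width
    fov = np.radians(fov_up) - np.radians(fov_down)
    v = np.clip(((np.radians(fov_up) - pitch) / fov * height).astype(np.int32), 0, height - 1)
    img = np.zeros((height, width), dtype=np.float32)
    img[v, u] = depth
    return img

range_img = to_range_image(np.random.randn(20000, 3) * 10)
mask = np.zeros_like(range_img, dtype=bool)
mask[::4, :] = True                 # keep every 4th beam as the sparse condition
sparse_condition = range_img * mask # the remaining rows would be filled in by the diffusion model
```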
In recent years, dual-encoder vision-language models (e.g., CLIP) have achieved remarkable text-to-image retrieval performance. However, we find that these models usually return very different retrievals for a pair of paraphrased queries. Such behavior can make the retrieval system less predictable and lead to user frustration. In this work, we consider the task of paraphrased text-to-image retrieval, where a model aims to return similar results given a pair of paraphrased queries. To start with, we collect a dataset of paraphrased image descriptions to facilitate quantitative evaluation for this task. We then hypothesize that the undesired behavior of existing dual-encoder models is due to their text towers, which are trained on image-sentence pairs and lack the ability to capture the semantic similarity between paraphrased queries. To improve on this, we investigate multiple strategies for training a dual-encoder model starting from a language model pretrained on a large text corpus. Compared to public dual-encoder models such as CLIP and OpenCLIP, the model trained with our best adaptation strategy achieves a significantly higher ranking similarity for paraphrased queries while maintaining similar zero-shot classification and retrieval accuracy.
In recent years, dual-encoder vision-language models (e.g., CLIP) have achieved remarkable text-to-image retrieval performance. However, we find that these models often return very different results for a pair of paraphrased queries. Such behavior makes the retrieval system less predictable and can lead to user frustration. In this work, we consider the task of paraphrased text-to-image retrieval, where a model should return similar results for a pair of paraphrased queries. To begin, we collect a dataset of paraphrased image descriptions to enable quantitative evaluation for this task. We then hypothesize that the undesired behavior of existing dual-encoder models stems from their text towers, which are trained on image-sentence pairs and lack the ability to capture the semantic similarity between paraphrased queries. To improve on this, we investigate multiple strategies for training a dual-encoder model starting from a language model pretrained on a large text corpus. Compared with public dual-encoder models such as CLIP and OpenCLIP, the model trained with our best adaptation strategy achieves significantly higher ranking similarity for paraphrased queries while maintaining comparable zero-shot classification and retrieval accuracy.
https://arxiv.org/abs/2405.03190
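One simple way to quantify the consistency issue described above is to compare the top-k retrievals for two paraphrased queries; the sketch below uses Jaccard overlap of the top-k indices as a stand-in ranking-similarity measure. The embeddings are placeholders and the paper's exact metric may differ.

```python
# Sketch: retrieval consistency for a pair of paraphrased queries, measured as Jaccard overlap
# of their top-k retrieved images. Embeddings are placeholders; the paper's metric may differ.
import torch
import torch.nn.functional as F

def topk_jaccard(query_a: torch.Tensor, query_b: torch.Tensor,
                 gallery: torch.Tensor, k: int = 10) -> float:
    """query_a/query_b: [D] text embeddings of two paraphrases; gallery: [N, D] image embeddings."""
    gallery = F.normalize(gallery, dim=1)
    rank_a = (F.normalize(query_a, dim=0) @ gallery.t()).topk(k).indices
    rank_b = (F.normalize(query_b, dim=0) @ gallery.t()).topk(k).indices
    inter = len(set(rank_a.tolist()) & set(rank_b.tolist()))
    union = len(set(rank_a.tolist()) | set(rank_b.tolist()))
    return inter / union   # 1.0 means identical top-k retrievals for the two paraphrases

print(topk_jaccard(torch.randn(512), torch.randn(512), torch.randn(1000, 512)))
```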
Effective image classification hinges on discerning relevant features from both foreground and background elements, with the foreground typically holding the critical information. While humans adeptly classify images with limited exposure, artificial neural networks often struggle with feature selection from rare samples. To address this challenge, we propose a novel method for selecting class-relevant patch embeddings. Our approach involves splitting support and query images into patches, encoding them using a pre-trained Vision Transformer (ViT) to obtain class embeddings and patch embeddings, respectively. Subsequently, we filter patch embeddings using class embeddings to retain only the class-relevant ones. For each image, we calculate the similarity between class embedding and each patch embedding, sort the similarity sequence in descending order, and only retain top-ranked patch embeddings. By prioritizing similarity between the class embedding and patch embeddings, we select top-ranked patch embeddings to be fused with class embedding to form a comprehensive image representation, enhancing pattern recognition across instances. Our strategy effectively mitigates the impact of class-irrelevant patch embeddings, yielding improved performance in pre-trained models. Extensive experiments on popular few-shot classification benchmarks demonstrate the simplicity, efficacy, and computational efficiency of our approach, outperforming state-of-the-art baselines under both 5-shot and 1-shot scenarios.
Effective image classification hinges on discerning relevant features from both foreground and background elements, with the foreground typically carrying the critical information. While humans can classify images from limited exposure, artificial neural networks often struggle to select features from rare samples. To address this challenge, we propose a novel method for selecting class-relevant patch embeddings. Our approach splits support and query images into patches and encodes them with a pre-trained Vision Transformer (ViT) to obtain class embeddings and patch embeddings, respectively. We then filter the patch embeddings using the class embedding to retain only the class-relevant ones: for each image, we compute the similarity between the class embedding and every patch embedding, sort the similarities in descending order, and keep only the top-ranked patch embeddings. By prioritizing this similarity, we select top-ranked patch embeddings to be fused with the class embedding into a comprehensive image representation, enhancing pattern recognition across instances. This strategy effectively mitigates the impact of class-irrelevant patch embeddings and improves the performance of pre-trained models. Extensive experiments on popular few-shot classification benchmarks demonstrate the simplicity, efficacy, and computational efficiency of the approach, which outperforms state-of-the-art baselines in both 5-shot and 1-shot settings.
https://arxiv.org/abs/2405.03722
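A compact sketch of the selection step described above: rank patch embeddings by cosine similarity to the class embedding, keep the top-ranked ones, and fuse them with the class embedding. The value of k and the mean-pooling fusion are assumptions for illustration, not necessarily the paper's exact choices.

```python
# Sketch of class-relevant patch selection: rank ViT patch embeddings by cosine similarity to
# the class (CLS) embedding, keep the top-k, and fuse with the class embedding.
import torch
import torch.nn.functional as F

def select_and_fuse(class_emb: torch.Tensor, patch_embs: torch.Tensor, k: int = 16) -> torch.Tensor:
    """class_emb: [D]; patch_embs: [P, D] from a frozen pre-trained ViT."""
    sims = F.cosine_similarity(class_emb.unsqueeze(0), patch_embs, dim=1)   # [P]
    top_idx = sims.argsort(descending=True)[:k]                             # top-ranked patches only
    selected = patch_embs[top_idx]
    fused = torch.cat([class_emb.unsqueeze(0), selected], dim=0).mean(dim=0)
    return fused                                                            # comprehensive image representation

image_repr = select_and_fuse(torch.randn(384), torch.randn(196, 384))
```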
Despite the longstanding adage "an image is worth a thousand words," creating accurate and hyper-detailed image descriptions for training Vision-Language models remains challenging. Current datasets typically have web-scraped descriptions that are short, low-granularity, and often contain details unrelated to the visual content. As a result, models trained on such data generate descriptions replete with missing information, visual inconsistencies, and hallucinations. To address these issues, we introduce ImageInWords (IIW), a carefully designed human-in-the-loop annotation framework for curating hyper-detailed image descriptions and a new dataset resulting from this process. We validate the framework through evaluations focused on the quality of the dataset and its utility for fine-tuning with considerations for readability, comprehensiveness, specificity, hallucinations, and human-likeness. Our dataset significantly improves across these dimensions compared to recently released datasets (+66%) and GPT-4V outputs (+48%). Furthermore, models fine-tuned with IIW data excel by +31% against prior work along the same human evaluation dimensions. Given our fine-tuned models, we also evaluate text-to-image generation and vision-language reasoning. Our model's descriptions can generate images closest to the original, as judged by both automated and human metrics. We also find our model produces more compositionally rich descriptions, outperforming the best baseline by up to 6% on ARO, SVO-Probes, and Winoground datasets.
Despite the adage that "an image is worth a thousand words," creating accurate and hyper-detailed image descriptions for training vision-language models remains challenging. Current datasets typically contain web-scraped descriptions that are short, low in granularity, and often include details unrelated to the visual content. As a result, models trained on such data generate descriptions riddled with missing information, visual inconsistencies, and hallucinations. To address these issues, we introduce ImageInWords (IIW), a carefully designed human-in-the-loop annotation framework for curating hyper-detailed image descriptions, together with a new dataset produced by this process. We validate the framework through evaluations of dataset quality and its utility for fine-tuning, considering readability, comprehensiveness, specificity, hallucinations, and human-likeness. Our dataset improves significantly on these dimensions compared with recently released datasets (+66%) and GPT-4V outputs (+48%). Models fine-tuned with IIW data also outperform prior work by +31% on the same human evaluation dimensions. Given our fine-tuned models, we further evaluate text-to-image generation and vision-language reasoning: our model's descriptions generate images closest to the originals, as judged by both automated and human metrics, and are more compositionally rich, outperforming the best baseline by up to 6% on the ARO, SVO-Probes, and Winoground datasets.
https://arxiv.org/abs/2405.02793
The distribution of subpopulations is an important property hidden within a dataset. Uncovering and analyzing the subpopulation distribution within datasets provides a comprehensive understanding of the datasets, standing as a powerful tool beneficial to various downstream tasks, including Dataset Subpopulation Organization, Subpopulation Shift, and Slice Discovery. Despite its importance, there has been no work that systematically explores the subpopulation distribution of datasets to our knowledge. To address the limitation and solve all the mentioned tasks in a unified way, we introduce a novel concept of subpopulation structures to represent, analyze, and utilize subpopulation distributions within datasets. To characterize the structures in an interpretable manner, we propose the Subpopulation Structure Discovery with Large Language Models (SSD-LLM) framework, which employs world knowledge and instruction-following capabilities of Large Language Models (LLMs) to linguistically analyze informative image captions and summarize the structures. Furthermore, we propose complete workflows to address downstream tasks, named Task-specific Tuning, showcasing the application of the discovered structure to a spectrum of subpopulation-related tasks, including dataset subpopulation organization, subpopulation shift, and slice discovery.
The distribution of subpopulations is an important property hidden within a dataset. Uncovering and analyzing the subpopulation distribution provides a comprehensive understanding of a dataset and serves as a powerful tool for various downstream tasks, including dataset subpopulation organization, subpopulation shift, and slice discovery. Despite its importance, to our knowledge no prior work has systematically explored the subpopulation distribution of datasets. To address this limitation and solve all of the above tasks in a unified way, we introduce the concept of subpopulation structures to represent, analyze, and utilize subpopulation distributions within datasets. To characterize these structures in an interpretable manner, we propose the Subpopulation Structure Discovery with Large Language Models (SSD-LLM) framework, which employs the world knowledge and instruction-following capabilities of large language models to linguistically analyze informative image captions and summarize the structures. Furthermore, we propose complete workflows to address downstream tasks, named Task-specific Tuning, showcasing the application of the discovered structures to a spectrum of subpopulation-related tasks, including dataset subpopulation organization, subpopulation shift, and slice discovery.
https://arxiv.org/abs/2405.02363
This report presents the ECO (Ensembled Clip score and cOnsensus score) pipeline from team DSBA LAB, a new framework used to evaluate and rank captions for a given image. ECO selects the most accurate caption describing the image. It does so by combining an Ensembled CLIP score, which considers the semantic alignment between the image and captions, with a Consensus score that accounts for the essentialness of the captions. Using this framework, we achieved notable success in the CVPR 2024 Workshop Challenge on Caption Re-ranking Evaluation at the New Frontiers for Zero-Shot Image Captioning Evaluation (NICE). Specifically, we secured third place on the CIDEr metric, second on both the SPICE and METEOR metrics, and first on the ROUGE-L and all BLEU Score metrics. The code and configuration for the ECO framework are available at this https URL DSBA-Lab/ECO .
This report presents the ECO (Ensembled CLIP score and cOnsensus score) pipeline from team DSBA LAB, a new framework for evaluating and ranking captions for a given image. ECO selects the caption that describes the image most accurately. It does so by combining an ensembled CLIP score, which considers the semantic alignment between the image and the captions, with a consensus score that accounts for the essentialness of the captions. Using this framework, we achieved notable success in the CVPR 2024 Workshop Challenge on Caption Re-ranking Evaluation at the New Frontiers for Zero-Shot Image Captioning Evaluation (NICE): we placed third on the CIDEr metric, second on both SPICE and METEOR, and first on ROUGE-L and all BLEU score metrics. The code and configuration for the ECO framework are available at this https URL DSBA-Lab/ECO.
https://arxiv.org/abs/2405.01028
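A hedged sketch of caption re-ranking in the spirit of ECO: a CLIP image-text alignment score is combined with a consensus score. Here consensus is approximated as a caption's mean CLIP text-text similarity to the other candidates, and the equal weighting is arbitrary; neither is the ECO specification, and the stock CLIP checkpoint is an assumption.

```python
# Sketch: re-rank candidate captions by alpha * CLIP(image, caption) + (1-alpha) * consensus,
# where consensus is approximated as mean text-text similarity to the other candidates.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rerank(image: Image.Image, captions: list[str], alpha: float = 0.5) -> list[tuple[str, float]]:
    inputs = proc(text=captions, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        text_feat = model.get_text_features(input_ids=inputs["input_ids"],
                                            attention_mask=inputs["attention_mask"])
        img_feat = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_feat = text_feat / text_feat.norm(dim=1, keepdim=True)
    img_feat = img_feat / img_feat.norm(dim=1, keepdim=True)
    clip_score = (text_feat @ img_feat.t()).squeeze(1)               # image-caption alignment
    tt = text_feat @ text_feat.t()
    consensus = (tt.sum(dim=1) - tt.diag()) / (len(captions) - 1)    # agreement with other captions
    final = alpha * clip_score + (1 - alpha) * consensus
    order = final.argsort(descending=True)
    return [(captions[i], float(final[i])) for i in order]

# ranked = rerank(Image.open("photo.jpg").convert("RGB"), ["a dog on grass", "a cat indoors", "a dog running"])
```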
Communication is defined as "Who says what to whom with what effect." A message from a communicator generates downstream receiver effects, also known as behavior. Receiver behavior, being a downstream effect of the message, carries rich signals about it; yet despite carrying these signals, behavior data is often ignored when training large language models. We show that training LLMs on receiver behavior can actually help improve their content-understanding abilities. Specifically, we show that training LLMs to predict the receiver behavior of likes and comments improves the LLM's performance on a wide variety of downstream content understanding tasks. We show this performance increase on 40 video and image understanding tasks across 23 benchmark datasets in both zero-shot and fine-tuning settings, outperforming many supervised baselines. Moreover, since receiver behavior, such as likes and comments, is collected by default on the internet and does not need any human annotations to be useful, the performance improvement we get after training on this data is essentially a free lunch. We release cleaned receiver-behavior data (comments and likes) for 750k images and videos collected from multiple platforms, along with our instruction-tuning data.
Communication is defined as "who says what to whom with what effect." A message from a communicator generates downstream receiver effects, also known as behavior. Receiver behavior, being a downstream effect of the message, carries rich signals about it; yet despite carrying such signals, behavior data is often ignored when training large language models. We show that training LLMs on receiver behavior can actually improve their content-understanding abilities. Specifically, training LLMs to predict receiver behavior in the form of likes and comments improves performance on a wide variety of downstream content-understanding tasks. We show this improvement on 40 video and image understanding tasks across 23 benchmark datasets, in both zero-shot and fine-tuning settings, outperforming many supervised baselines. Moreover, since receiver behavior such as likes and comments is collected by default on the internet and needs no human annotation to be useful, the performance improvement from training on this data is essentially a free lunch. We release cleaned receiver-behavior data (comments and likes) for 750k images and videos collected from multiple platforms, along with our instruction-tuning data.
https://arxiv.org/abs/2405.00942
Vision language models (VLMs) have recently emerged and gained the spotlight for their ability to comprehend the dual modality of image and textual data. VLMs such as LLaVA, ChatGPT-4, and Gemini have recently shown impressive performance on tasks such as natural image captioning, visual question answering (VQA), and spatial reasoning. Additionally, a universal segmentation model by Meta AI, Segment Anything Model (SAM) shows unprecedented performance at isolating objects from unforeseen images. Since medical experts, biologists, and materials scientists routinely examine microscopy or medical images in conjunction with textual information in the form of captions, literature, or reports, and draw conclusions of great importance and merit, it is indubitably essential to test the performance of VLMs and foundation models such as SAM, on these images. In this study, we charge ChatGPT, LLaVA, Gemini, and SAM with classification, segmentation, counting, and VQA tasks on a variety of microscopy images. We observe that ChatGPT and Gemini are impressively able to comprehend the visual features in microscopy images, while SAM is quite capable at isolating artefacts in a general sense. However, the performance is not close to that of a domain expert - the models are readily encumbered by the introduction of impurities, defects, artefact overlaps and diversity present in the images.
Vision-language models (VLMs) have recently emerged and gained attention for their ability to comprehend the dual modality of image and text data. VLMs such as LLaVA, ChatGPT-4, and Gemini have shown impressive performance on tasks such as natural image captioning, visual question answering (VQA), and spatial reasoning. In addition, the Segment Anything Model (SAM), a universal segmentation model from Meta AI, shows unprecedented performance at isolating objects from unseen images. Since medical experts, biologists, and materials scientists routinely examine microscopy or medical images alongside textual information in the form of captions, literature, or reports, and draw conclusions of great importance from them, it is essential to test the performance of VLMs and foundation models such as SAM on these images. In this study, we task ChatGPT, LLaVA, Gemini, and SAM with classification, segmentation, counting, and VQA on a variety of microscopy images. We observe that ChatGPT and Gemini are impressively able to comprehend the visual features in microscopy images, while SAM is quite capable of isolating artefacts in a general sense. However, the performance is not close to that of a domain expert: the models are easily thrown off by the impurities, defects, artefact overlaps, and diversity present in the images.
https://arxiv.org/abs/2405.00876
Composed Image Retrieval (CIR) is a complex task that retrieves images using a query, which is configured with an image and a caption that describes desired modifications to that image. Supervised CIR approaches have shown strong performance, but their reliance on expensive manually-annotated datasets restricts their scalability and broader applicability. To address these issues, previous studies have proposed pseudo-word token-based Zero-Shot CIR (ZS-CIR) methods, which utilize a projection module to map images to word tokens. However, we conjecture that this approach has a downside: the projection module distorts the original image representation and confines the resulting composed embeddings to the text-side. In order to resolve this, we introduce a novel ZS-CIR method that uses Spherical Linear Interpolation (Slerp) to directly merge image and text representations by identifying an intermediate embedding of both. Furthermore, we introduce Text-Anchored-Tuning (TAT), a method that fine-tunes the image encoder while keeping the text encoder fixed. TAT closes the modality gap between images and text, making the Slerp process much more effective. Notably, the TAT method is not only efficient in terms of the scale of the training dataset and training time, but it also serves as an excellent initial checkpoint for training supervised CIR models, thereby highlighting its wider potential. The integration of the Slerp-based ZS-CIR with a TAT-tuned model enables our approach to deliver state-of-the-art retrieval performance across CIR benchmarks.
Composed Image Retrieval (CIR) is a complex task that retrieves images using a query composed of an image and a caption describing the desired modifications to that image. Supervised CIR approaches show strong performance, but their reliance on expensive manually annotated datasets restricts their scalability and broader applicability. To address these issues, previous studies have proposed pseudo-word token-based zero-shot CIR (ZS-CIR) methods, which use a projection module to map images to word tokens. However, we conjecture that this approach has a downside: the projection module distorts the original image representation and confines the resulting composed embeddings to the text side. To resolve this, we introduce a novel ZS-CIR method that uses spherical linear interpolation (Slerp) to merge image and text representations directly by identifying an intermediate embedding of both. Furthermore, we introduce Text-Anchored Tuning (TAT), a method that fine-tunes the image encoder while keeping the text encoder fixed; TAT closes the modality gap between images and text and makes the Slerp process much more effective. Notably, TAT is efficient in terms of training-dataset scale and training time, and it also serves as an excellent initial checkpoint for training supervised CIR models, highlighting its wider potential. The integration of Slerp-based ZS-CIR with a TAT-tuned model enables our approach to deliver state-of-the-art retrieval performance across CIR benchmarks.
https://arxiv.org/abs/2405.00571
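A small sketch of the spherical linear interpolation step referenced above, applied to normalized image and text embeddings; the interpolation weight t is a free parameter (t = 0.5 gives an equal blend), and how it is chosen in practice is not shown here.

```python
# Spherical linear interpolation (Slerp) between normalized image and text embeddings,
# yielding an intermediate embedding usable as a composed retrieval query.
import torch
import torch.nn.functional as F

def slerp(img_emb: torch.Tensor, txt_emb: torch.Tensor, t: float = 0.5) -> torch.Tensor:
    a = F.normalize(img_emb, dim=-1)
    b = F.normalize(txt_emb, dim=-1)
    omega = torch.acos((a * b).sum(-1, keepdim=True).clamp(-1 + 1e-7, 1 - 1e-7))  # angle between them
    sin_omega = torch.sin(omega)
    blended = (torch.sin((1 - t) * omega) / sin_omega) * a + (torch.sin(t * omega) / sin_omega) * b
    return F.normalize(blended, dim=-1)   # composed query embedding

query = slerp(torch.randn(512), torch.randn(512), t=0.5)
```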
We introduce a formal information-theoretic framework for image captioning by regarding it as a representation learning task. Our framework defines three key objectives: task sufficiency, minimal redundancy, and human interpretability. Building upon this foundation, we propose a novel Pyramid of Captions (PoCa) method, which constructs caption pyramids by generating localized captions for zoomed-in image patches and integrating them with global caption information using large language models. This approach leverages the intuition that detailed examination of local patches can reduce error risks and address inaccuracies in global captions, either by correcting hallucinations or by adding missing details. Based on our theoretical framework, we formalize this intuition and provide a formal proof demonstrating the effectiveness of PoCa under certain assumptions. Empirical tests with various image captioning models and large language models show that PoCa consistently yields more informative and semantically aligned captions, maintaining brevity and interpretability.
We introduce a formal information-theoretic framework for image captioning, regarding it as a representation learning task. The framework defines three key objectives: task sufficiency, minimal redundancy, and human interpretability. Building on this foundation, we propose a novel Pyramid of Captions (PoCa) method, which constructs caption pyramids by generating localized captions for zoomed-in image patches and integrating them with global caption information using large language models. The approach leverages the intuition that detailed examination of local patches can reduce error risks and address inaccuracies in global captions, either by correcting hallucinations or by adding missing details. Based on our theoretical framework, we formalize this intuition and provide a formal proof of PoCa's effectiveness under certain assumptions. Empirical tests with various image captioning models and large language models show that PoCa consistently yields more informative and semantically aligned captions while maintaining brevity and interpretability.
https://arxiv.org/abs/2405.00485
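A hedged sketch of the caption-pyramid idea: caption the whole image and each zoomed-in quadrant with an off-the-shelf captioner, then assemble a merging prompt for a large language model. BLIP and the 2x2 grid are assumptions (not necessarily the models or granularity used in the paper), and the final LLM call is left out, with the prompt returned instead.

```python
# Sketch: localized captions for zoomed-in patches plus a global caption, assembled into a
# prompt that an LLM would merge into one final caption. BLIP and the 2x2 grid are assumptions.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption(img: Image.Image) -> str:
    inputs = blip_proc(images=img, return_tensors="pt")
    out = blip.generate(**inputs, max_new_tokens=30)
    return blip_proc.decode(out[0], skip_special_tokens=True)

def caption_pyramid_prompt(image_path: str, grid: int = 2) -> str:
    image = Image.open(image_path).convert("RGB")
    w, h = image.size
    global_caption = caption(image)                       # global view
    local_captions = []
    for i in range(grid):
        for j in range(grid):
            box = (j * w // grid, i * h // grid, (j + 1) * w // grid, (i + 1) * h // grid)
            local_captions.append(caption(image.crop(box)))   # localized caption per zoomed-in patch
    return ("Global caption: " + global_caption + "\n"
            + "\n".join(f"Patch {k + 1}: {c}" for k, c in enumerate(local_captions))
            + "\nMerge these into one concise, accurate caption, preferring patch-level details "
              "when they add to or correct the global caption.")
# The returned prompt would be sent to a large language model to produce the final caption.
```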
Vision-language datasets are vital for both text-to-image (T2I) and image-to-text (I2T) research. However, current datasets lack descriptions with fine-grained detail that would allow for richer associations to be learned by models. To fill the gap, we introduce Descriptions of Connected and Contrasting Images (DOCCI), a dataset with long, human-annotated English descriptions for 15k images that were taken, curated and donated by a single researcher intent on capturing key challenges such as spatial relations, counting, text rendering, world knowledge, and more. We instruct human annotators to create comprehensive descriptions for each image; these average 136 words in length and are crafted to clearly distinguish each image from those that are related or similar. Each description is highly compositional and typically encompasses multiple challenges. Through both quantitative and qualitative analyses, we demonstrate that DOCCI serves as an effective training resource for image-to-text generation -- a PaLI 5B model finetuned on DOCCI shows equal or superior results compared to highly-performant larger models like LLaVA-1.5 7B and InstructBLIP 7B. Furthermore, we show that DOCCI is a useful testbed for text-to-image generation, highlighting the limitations of current text-to-image models in capturing long descriptions and fine details.
Vision-language datasets are vital for both text-to-image (T2I) and image-to-text (I2T) research. However, current datasets lack descriptions with the fine-grained detail that would allow models to learn richer associations. To fill this gap, we introduce Descriptions of Connected and Contrasting Images (DOCCI), a dataset with long, human-annotated English descriptions for 15k images that were taken, curated, and donated by a single researcher intent on capturing key challenges such as spatial relations, counting, text rendering, and world knowledge. Human annotators were instructed to create comprehensive descriptions for each image; these average 136 words in length and are crafted to clearly distinguish each image from related or similar ones. Each description is highly compositional and typically covers multiple challenges. Through quantitative and qualitative analyses, we demonstrate that DOCCI serves as an effective training resource for image-to-text generation: a PaLI 5B model fine-tuned on DOCCI shows results equal or superior to highly performant larger models such as LLaVA-1.5 7B and InstructBLIP 7B. We also show that DOCCI is a useful testbed for text-to-image generation, highlighting the limitations of current text-to-image models in capturing long descriptions and fine details.
https://arxiv.org/abs/2404.19753