While content-based image retrieval (CBIR) has been extensively studied for natural images, its application to medical images presents ongoing challenges, primarily due to the 3D nature of medical imaging data. Recent studies have shown the potential of pre-trained vision embeddings for CBIR in the context of radiology image retrieval. However, a benchmark for the retrieval of 3D volumetric medical images is still lacking, hindering the ability to objectively evaluate and compare proposed CBIR approaches in medical imaging. In this study, we extend previous work and establish a benchmark for region-based and multi-organ retrieval using the TotalSegmentator dataset (TS) with detailed multi-organ annotations. We benchmark embeddings derived from models pre-trained with supervision on medical images against embeddings derived from models pre-trained without supervision on non-medical images, covering 29 coarse and 104 detailed anatomical structures at the volume and region levels. We adopt a late interaction re-ranking method inspired by text matching and compare it against the original method proposed for volume and region retrieval, achieving a retrieval recall of 1.0 for diverse anatomical regions spanning a wide range of sizes. The findings and methodologies presented in this paper provide essential insights and benchmarks for the development and evaluation of CBIR approaches in medical imaging.
https://arxiv.org/abs/2405.09334
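As a minimal illustration of the late-interaction re-ranking idea adopted above, the following sketch scores a candidate volume by matching each query region embedding against its best counterpart among the candidate's region embeddings (ColBERT-style MaxSim) and summing; the embedding dimensions, region counts, and names are illustrative assumptions, not the paper's implementation.

import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def late_interaction_score(query_regions, candidate_regions):
    """query_regions: (Q, D); candidate_regions: (R, D)."""
    sims = l2_normalize(query_regions) @ l2_normalize(candidate_regions).T
    return sims.max(axis=1).sum()   # best match per query region, then sum

# Re-rank a shortlist of candidate volumes for one query volume.
rng = np.random.default_rng(0)
query = rng.normal(size=(5, 128))                                   # 5 query regions
candidates = [rng.normal(size=(int(rng.integers(3, 8)), 128)) for _ in range(10)]
scores = [late_interaction_score(query, c) for c in candidates]
print(np.argsort(scores)[::-1])                                     # best candidate first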
Deep image hashing aims to map input images into compact binary hash codes via deep neural networks, thus enabling effective large-scale image retrieval. Recently, hybrid networks that combine convolution and Transformer have achieved superior performance on various computer vision tasks and have attracted extensive attention from researchers. Nevertheless, the potential benefits of such hybrid networks for image retrieval still need to be verified. To this end, we propose a hybrid convolutional and self-attention deep hashing method known as HybridHash. Specifically, we propose a backbone network with a stage-wise architecture in which a block aggregation function is introduced to achieve the effect of local self-attention and reduce the computational complexity. An interaction module is carefully designed to promote the communication of information between image blocks and to enhance the visual representations. We have conducted comprehensive experiments on three widely used datasets: CIFAR-10, NUS-WIDE, and IMAGENET. The experimental results demonstrate that the method proposed in this paper has superior performance with respect to state-of-the-art deep hashing methods. Source code is available at this https URL.
https://arxiv.org/abs/2405.07524
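For orientation, the retrieval step that deep hashing methods like HybridHash feed into can be sketched generically: the network's continuous outputs are binarized by sign, and the database is ranked by Hamming distance to the query code. This is a generic sketch under assumed shapes, not the HybridHash architecture itself.

import numpy as np

def binarize(features):
    """Map real-valued network outputs to {0, 1} hash codes."""
    return (features > 0).astype(np.uint8)

def hamming_rank(query_code, db_codes):
    """query_code: (B,); db_codes: (N, B) -> database indices, nearest first."""
    return np.argsort((db_codes != query_code).sum(axis=1))

rng = np.random.default_rng(0)
db_feats = rng.normal(size=(1000, 64))   # stand-in for network outputs
q_feat = rng.normal(size=64)
print(hamming_rank(binarize(q_feat), binarize(db_feats))[:10])  # top-10 by Hamming distance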
Breast cancer is a significant global health concern, particularly for women. Early detection and appropriate treatment are crucial in mitigating its impact, with histopathology examinations playing a vital role in swift diagnosis. However, these examinations often require a substantial workforce and experienced medical experts for proper recognition and cancer grading. Automated image retrieval systems have the potential to assist pathologists in identifying cancerous tissues, thereby accelerating the diagnostic process. Nevertheless, due to the considerable variability of tissue and cell patterns in histological images, developing an accurate image retrieval model is very challenging. This work introduces a novel attention-based adversarially regularized variational graph autoencoder model for breast histological image retrieval. Additionally, we incorporate cluster-guided contrastive learning as the graph feature extractor to boost the retrieval performance. We evaluated the proposed model's performance on two publicly available datasets of breast cancer histological images and achieved superior or highly competitive retrieval performance, with average mAP scores of 96.5% on the BreakHis dataset and 94.7% on the BACH dataset, and mVP scores of 91.9% and 91.3%, respectively. Our proposed retrieval model has the potential to be used in clinical settings to enhance diagnostic performance and ultimately benefit patients.
https://arxiv.org/abs/2405.04211
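The mAP figures above are averages of per-query average precision, the standard retrieval metric; a minimal sketch of that computation (not the authors' evaluation code) follows.

import numpy as np

def average_precision(ranked_relevance):
    """ranked_relevance: binary list, 1 where the i-th retrieved item is relevant."""
    rel = np.asarray(ranked_relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    precision_at_hits = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((precision_at_hits * rel).sum() / rel.sum())

print(average_precision([1, 0, 1, 0, 0, 1]))  # (1/1 + 2/3 + 3/6) / 3 ≈ 0.722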
Image-based retrieval in large Earth observation archives is challenging because one needs to navigate across thousands of candidate matches with only the query image as a guide. By using text as information supporting the visual query, the retrieval system gains usability, but at the same time faces difficulties because the diversity of visual signals cannot be summarized by a short caption alone. For this reason, as a matching-based task, cross-modal text-image retrieval often suffers from information asymmetry between texts and images. To address this challenge, we propose a Knowledge-aware Text-Image Retrieval (KTIR) method for remote sensing images. By mining relevant information from an external knowledge graph, KTIR enriches the text scope available in the search query and alleviates the information gaps between texts and images for better matching. Moreover, by integrating domain-specific knowledge, KTIR also enhances the adaptation of pre-trained vision-language models to remote sensing applications. Experimental results on three commonly used remote sensing text-image retrieval benchmarks show that the proposed knowledge-aware method leads to varied and consistent retrievals, outperforming state-of-the-art retrieval methods.
https://arxiv.org/abs/2405.03373
In recent years, dual-encoder vision-language models (e.g., CLIP) have achieved remarkable text-to-image retrieval performance. However, we discover that these models usually produce very different retrievals for a pair of paraphrased queries. Such behavior might render the retrieval system less predictable and lead to user frustration. In this work, we consider the task of paraphrased text-to-image retrieval, where a model aims to return similar results given a pair of paraphrased queries. To start with, we collect a dataset of paraphrased image descriptions to facilitate quantitative evaluation for this task. We then hypothesize that the undesired behavior of existing dual-encoder models is due to their text towers, which are trained on image-sentence pairs and lack the ability to capture the semantic similarity between paraphrased queries. To improve on this, we investigate multiple strategies for training a dual-encoder model starting from a language model pretrained on a large text corpus. Compared to public dual-encoder models such as CLIP and OpenCLIP, the model trained with our best adaptation strategy achieves a significantly higher ranking similarity for paraphrased queries while maintaining similar zero-shot classification and retrieval accuracy.
https://arxiv.org/abs/2405.03190
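One simple way to quantify the inconsistency this paper targets is to compare the rankings returned for two paraphrases of the same query, e.g., via top-k Jaccard overlap; the paper's actual ranking-similarity metric may differ, and the embeddings below are synthetic stand-ins.

import numpy as np

def rank_images(text_emb, image_embs):
    sims = image_embs @ text_emb / (
        np.linalg.norm(image_embs, axis=1) * np.linalg.norm(text_emb))
    return np.argsort(sims)[::-1]

def topk_jaccard(rank_a, rank_b, k=10):
    a, b = set(rank_a[:k].tolist()), set(rank_b[:k].tolist())
    return len(a & b) / len(a | b)

rng = np.random.default_rng(0)
images = rng.normal(size=(500, 256))
q1 = rng.normal(size=256)              # e.g., "a dog on the beach"
q2 = q1 + 0.1 * rng.normal(size=256)   # a paraphrase, as a nearby embedding
print(topk_jaccard(rank_images(q1, images), rank_images(q2, images)))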
Given a query consisting of a reference image and a relative caption, Composed Image Retrieval (CIR) aims to retrieve target images visually similar to the reference one while incorporating the changes specified in the relative caption. The reliance of supervised methods on labor-intensive manually labeled datasets hinders their broad applicability. In this work, we introduce a new task, Zero-Shot CIR (ZS-CIR), that addresses CIR without the need for a labeled training dataset. We propose an approach named iSEARLE (improved zero-Shot composEd imAge Retrieval with textuaL invErsion) that involves mapping the visual information of the reference image into a pseudo-word token in CLIP token embedding space and combining it with the relative caption. To foster research on ZS-CIR, we present an open-domain benchmarking dataset named CIRCO (Composed Image Retrieval on Common Objects in context), the first CIR dataset where each query is labeled with multiple ground truths and a semantic categorization. The experimental results illustrate that iSEARLE obtains state-of-the-art performance on three different CIR datasets -- FashionIQ, CIRR, and the proposed CIRCO -- and two additional evaluation settings, namely domain conversion and object composition. The dataset, the code, and the model are publicly available at this https URL.
https://arxiv.org/abs/2405.02951
Composed Image Retrieval (CIR) is a complex task that retrieves images using a query composed of an image and a caption describing desired modifications to that image. Supervised CIR approaches have shown strong performance, but their reliance on expensive manually annotated datasets restricts their scalability and broader applicability. To address these issues, previous studies have proposed pseudo-word token-based Zero-Shot CIR (ZS-CIR) methods, which utilize a projection module to map images to word tokens. However, we conjecture that this approach has a downside: the projection module distorts the original image representation and confines the resulting composed embeddings to the text side. In order to resolve this, we introduce a novel ZS-CIR method that uses Spherical Linear Interpolation (Slerp) to directly merge image and text representations by identifying an intermediate embedding of both. Furthermore, we introduce Text-Anchored-Tuning (TAT), a method that fine-tunes the image encoder while keeping the text encoder fixed. TAT closes the modality gap between images and text, making the Slerp process much more effective. Notably, the TAT method is not only efficient in terms of the scale of the training dataset and training time, but it also serves as an excellent initial checkpoint for training supervised CIR models, thereby highlighting its wider potential. The integration of the Slerp-based ZS-CIR with a TAT-tuned model enables our approach to deliver state-of-the-art retrieval performance across CIR benchmarks.
https://arxiv.org/abs/2405.00571
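The Slerp operation at the core of the method has a standard closed form; a minimal sketch follows, with the interpolation weight t=0.5 and the embedding size chosen as illustrative assumptions.

import numpy as np

def slerp(u, v, t=0.5, eps=1e-7):
    """Interpolate along the great circle between unit vectors u and v."""
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    omega = np.arccos(np.clip(u @ v, -1 + eps, 1 - eps))
    return (np.sin((1 - t) * omega) * u + np.sin(t * omega) * v) / np.sin(omega)

rng = np.random.default_rng(0)
img_emb, txt_emb = rng.normal(size=512), rng.normal(size=512)
composed = slerp(img_emb, txt_emb, t=0.5)   # composed query for retrieval
print(np.linalg.norm(composed))             # stays on the unit sphere (≈ 1.0)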
In patent prosecution, image-based retrieval systems for identifying similarities between current patent images and prior art are pivotal to ensure the novelty and non-obviousness of patent applications. Despite their growing popularity in recent years, existing attempts, while effective at recognizing images within the same patent, fail to deliver practical value due to their limited generalizability in retrieving relevant prior art. Moreover, this task inherently involves the challenges posed by the abstract visual features of patent images, the skewed distribution of image classifications, and the semantic information of image descriptions. Therefore, we propose a language-informed, distribution-aware multimodal approach to patent image feature learning, which enriches the semantic understanding of patent image by integrating Large Language Models and improves the performance of underrepresented classes with our proposed distribution-aware contrastive losses. Extensive experiments on DeepPatent2 dataset show that our proposed method achieves state-of-the-art or comparable performance in image-based patent retrieval with mAP +53.3%, Recall@10 +41.8%, and MRR@10 +51.9%. Furthermore, through an in-depth user analysis, we explore our model in aiding patent professionals in their image retrieval efforts, highlighting the model's real-world applicability and effectiveness.
https://arxiv.org/abs/2404.19360
Image search stands as a pivotal task in multimedia and computer vision, finding applications across diverse domains, ranging from internet search to medical diagnostics. Conventional image search systems operate by accepting textual or visual queries and retrieving the top-relevant candidate results from the database. However, prevalent methods often rely on single-turn procedures, introducing potential inaccuracies and limited recall. These methods also face challenges such as vocabulary mismatch and the semantic gap, which constrain their overall effectiveness. To address these issues, we propose an interactive image retrieval system capable of refining queries based on user relevance feedback in a multi-turn setting. This system incorporates a vision language model (VLM) based image captioner to enhance the quality of text-based queries, resulting in more informative queries with each iteration. Moreover, we introduce a large language model (LLM) based denoiser to refine text-based query expansions, mitigating inaccuracies in the image descriptions generated by captioning models. To evaluate our system, we curate a new dataset by adapting the MSR-VTT video retrieval dataset to the image retrieval task, offering multiple relevant ground truth images for each query. Through comprehensive experiments, we validate the effectiveness of our proposed system against baseline methods, achieving state-of-the-art performance with a notable 10% improvement in terms of recall. Our contributions encompass the development of an innovative interactive image retrieval system, the integration of an LLM-based denoiser, the curation of a meticulously designed evaluation dataset, and thorough experimental validation.
https://arxiv.org/abs/2404.18746
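Structurally, the multi-turn loop described above can be sketched as follows; the captioner, denoiser, and retriever are toy stubs standing in for the VLM, the LLM, and an embedding index, so every name here is an assumption rather than part of the authors' system.

def caption(image):                  # stub for the VLM-based captioner
    return f"photo of {image}"

def denoise(query, expansions):      # stub for the LLM-based denoiser
    return query + " " + " ".join(expansions)

def retrieve(query, db, k=3):        # stub for embedding-based retrieval
    return sorted(db, key=lambda x: -sum(w in x for w in query.split()))[:k]

db = ["red car", "blue car", "red bicycle", "dog", "cat"]
query = "red vehicle"
for turn in range(3):
    results = retrieve(query, db)
    relevant = [r for r in results if "red" in r]     # simulated user feedback
    query = denoise(query, [caption(r) for r in relevant])
    print(turn, results)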
Sketch-based image retrieval (SBIR) associates hand-drawn sketches with their corresponding realistic images. In this study, we aim to tackle two major challenges of this task simultaneously: i) zero-shot, dealing with unseen categories, and ii) fine-grained, referring to intra-category instance-level retrieval. Our key innovation lies in the realization that solely addressing this cross-category and fine-grained recognition task from the generalization perspective may be inadequate since the knowledge accumulated from limited seen categories might not be fully valuable or transferable to unseen target categories. Inspired by this, in this work, we propose a dual-modal prompting CLIP (DP-CLIP) network, in which an adaptive prompting strategy is designed. Specifically, to facilitate the adaptation of our DP-CLIP toward unpredictable target categories, we employ a set of images within the target category and the textual category label to respectively construct a set of category-adaptive prompt tokens and channel scales. By integrating the generated guidance, DP-CLIP could gain valuable category-centric insights, efficiently adapting to novel categories and capturing unique discriminative clues for effective retrieval within each target category. With these designs, our DP-CLIP outperforms the state-of-the-art fine-grained zero-shot SBIR method by 7.3% in Acc.@1 on the Sketchy dataset. Meanwhile, in the other two category-level zero-shot SBIR benchmarks, our method also achieves promising performance.
https://arxiv.org/abs/2404.18695
This paper proposes a novel algorithm, called the semantic line combination detector (SLCD), to find an optimal combination of semantic lines. It processes all lines in each line combination at once to assess the overall harmony of the lines. First, we generate various line combinations from reliable lines. Second, we estimate the score of each line combination and determine the best one. Experimental results demonstrate that the proposed SLCD outperforms existing semantic line detectors on various datasets. Moreover, it is shown that SLCD can be applied effectively to three vision tasks: vanishing point detection, symmetry axis detection, and composition-based image retrieval. Our codes are available at this https URL.
https://arxiv.org/abs/2404.18399
We describe a protocol to study text-to-video retrieval training with unlabeled videos, where we assume (i) no access to labels for any videos, i.e., no access to the set of ground-truth captions, but (ii) access to labeled images in the form of text. Using image expert models is a realistic scenario given that annotating images is cheaper and therefore more scalable than expensive video labeling schemes. Recently, zero-shot image experts such as CLIP have established a new strong baseline for video understanding tasks. In this paper, we make use of this progress and instantiate the image experts from two types of models: a text-to-image retrieval model to provide an initial backbone, and image captioning models to provide a supervision signal for unlabeled videos. We show that automatically labeling video frames with image captions enables text-to-video retrieval training. This process adapts the features to the target domain at no manual annotation cost, consequently outperforming the strong zero-shot CLIP baseline. During training, we sample captions from multiple video frames that best match the visual content, and perform temporal pooling over frame representations by scoring frames according to their relevance to each caption. We conduct extensive ablations to provide insights and demonstrate the effectiveness of this simple framework by outperforming the CLIP zero-shot baselines on text-to-video retrieval on three standard datasets, namely ActivityNet, MSR-VTT, and MSVD.
https://arxiv.org/abs/2404.17498
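A minimal sketch of the caption-conditioned temporal pooling described above: frames are scored by similarity to a caption, and the video representation is the softmax-weighted average of frame features. Shapes and the temperature are illustrative assumptions.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def pooled_video_embedding(frame_feats, caption_feat, temperature=0.07):
    """frame_feats: (T, D) unit vectors; caption_feat: (D,) unit vector."""
    scores = frame_feats @ caption_feat         # per-frame relevance to the caption
    weights = softmax(scores / temperature)     # sharpen toward the best frames
    return weights @ frame_feats                # (D,) pooled video representation

rng = np.random.default_rng(0)
frames = rng.normal(size=(16, 256))
frames /= np.linalg.norm(frames, axis=1, keepdims=True)
cap = rng.normal(size=256); cap /= np.linalg.norm(cap)
print(pooled_video_embedding(frames, cap).shape)   # (256,)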
Shoeprints are a common type of evidence found at crime scenes and are used regularly in forensic investigations. However, existing methods cannot effectively employ deep learning techniques to match noisy and occluded crime-scene shoeprints to a shoe database due to a lack of training data. Moreover, all existing methods match crime-scene shoeprints to clean reference prints, yet our analysis shows that matching to more informative tread depth maps yields better retrieval results. The matching task is further complicated by the necessity to identify similarities only in corresponding regions (heels, toes, etc.) of prints and shoe treads. To overcome these challenges, we leverage shoe tread images from online retailers and utilize an off-the-shelf predictor to estimate depth maps and clean prints. Our method, named CriSp, matches crime-scene shoeprints to tread depth maps by training on this data. CriSp incorporates data augmentation to simulate crime-scene shoeprints, an encoder to learn spatially-aware features, and a masking module to ensure that only visible regions of crime-scene prints affect retrieval results. To validate our approach, we introduce two validation sets by reprocessing existing datasets of crime-scene shoeprints and establish a benchmarking protocol for comparison. On this benchmark, CriSp significantly outperforms state-of-the-art methods in both automated shoeprint matching and image retrieval tailored to this task.
https://arxiv.org/abs/2404.16972
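The masking idea can be sketched as a similarity computed only over visible regions, so occluded areas cannot influence the ranking; the feature-map shapes and mask below are illustrative assumptions, not CriSp's implementation.

import numpy as np

def masked_similarity(query_feat, db_feat, visibility_mask, eps=1e-12):
    """query_feat, db_feat: (H, W, D); visibility_mask: (H, W) in {0, 1}."""
    m = visibility_mask[..., None]
    q = (query_feat * m).reshape(-1)
    d = (db_feat * m).reshape(-1)
    return q @ d / (np.linalg.norm(q) * np.linalg.norm(d) + eps)

rng = np.random.default_rng(0)
query = rng.normal(size=(8, 8, 32))
mask = (rng.random((8, 8)) > 0.4).astype(float)    # visible print regions only
db = [rng.normal(size=(8, 8, 32)) for _ in range(5)]
print(int(np.argmax([masked_similarity(query, d, mask) for d in db])))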
Many image retrieval studies use metric learning to train an image encoder. However, metric learning cannot handle differences in users' preferences, and requires data to train an image encoder. To overcome these limitations, we revisit relevance feedback, a classic technique for interactive retrieval systems, and propose an interactive CLIP-based image retrieval system with relevance feedback. Our retrieval system first executes the retrieval, collects each user's unique preferences through binary feedback, and returns images the user prefers. Even when users have various preferences, our retrieval system learns each user's preference through the feedback and adapts to the preference. Moreover, our retrieval system leverages CLIP's zero-shot transferability and achieves high accuracy without training. We empirically show that our retrieval system competes well with state-of-the-art metric learning in category-based image retrieval, despite not training image encoders specifically for each dataset. Furthermore, we set up two additional experimental settings where users have various preferences: one-label-based image retrieval and conditioned image retrieval. In both cases, our retrieval system effectively adapts to each user's preferences, resulting in improved accuracy compared to image retrieval without feedback. Overall, our work highlights the potential benefits of integrating CLIP with classic relevance feedback techniques to enhance image retrieval.
https://arxiv.org/abs/2404.16398
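A classic way to fold binary relevance feedback into an embedding query is a Rocchio-style update, sketched below: the query moves toward liked images and away from disliked ones. This is a generic illustration with assumed coefficients, not necessarily the paper's exact update rule.

import numpy as np

def rocchio_update(query, positives, negatives, alpha=1.0, beta=0.75, gamma=0.25):
    """query: (D,); positives: (P, D); negatives: (N, D) image embeddings."""
    q = alpha * query
    if len(positives):
        q += beta * positives.mean(axis=0)
    if len(negatives):
        q -= gamma * negatives.mean(axis=0)
    return q / np.linalg.norm(q)

rng = np.random.default_rng(0)
query = rng.normal(size=512); query /= np.linalg.norm(query)
liked = rng.normal(size=(3, 512))      # images the user marked relevant
disliked = rng.normal(size=(2, 512))   # images the user rejected
print(rocchio_update(query, liked, disliked).shape)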
Fine-grained image retrieval (FGIR) aims to learn visual representations that distinguish visually similar objects while maintaining generalization. Existing methods propose to generate discriminative features, but rarely consider the particularity of the FGIR task itself. This paper presents a meticulous analysis leading to the proposal of practical guidelines for identifying subcategory-specific discrepancies and generating discriminative features to design effective FGIR models. These guidelines include emphasizing the object (G1), highlighting subcategory-specific discrepancies (G2), and employing an effective training strategy (G3). Following G1 and G2, we design a novel Dual Visual Filtering mechanism for the plain vision transformer, denoted as DVF, to capture subcategory-specific discrepancies. Specifically, the dual visual filtering mechanism comprises an object-oriented module and a semantic-oriented module. These components serve to magnify objects and identify discriminative regions, respectively. Following G3, we implement a discriminative model training strategy to improve the discriminability and generalization ability of DVF. Extensive analysis and ablation studies confirm the efficacy of our proposed guidelines. Without bells and whistles, the proposed DVF achieves state-of-the-art performance on three widely-used fine-grained datasets in closed-set and open-set settings.
https://arxiv.org/abs/2404.15771
Composed Image Retrieval (CIR) is a task that retrieves images similar to a query image while incorporating a provided textual modification. Current techniques rely on supervised learning for CIR models using labeled triplets of (reference image, text, target image). These specific triplets are not as commonly available as simple image-text pairs, limiting the widespread use of CIR and its scalability. On the other hand, zero-shot CIR can be trained relatively easily with image-caption pairs, without considering the image-to-image relation, but this approach tends to yield lower accuracy. We propose a new semi-supervised CIR approach in which we search for a reference image and its related target images in auxiliary data and train our large language model-based Visual Delta Generator (VDG) to generate text describing the visual difference (i.e., visual delta) between the two. VDG, equipped with fluent language knowledge and being model agnostic, can generate pseudo triplets to boost the performance of CIR models. Our approach significantly improves on existing supervised learning approaches and achieves state-of-the-art results on the CIR benchmarks.
https://arxiv.org/abs/2404.15516
The main objective of this paper is to address the mobile robot localization problem with Triplet Convolutional Neural Networks and to test their robustness against changes in lighting conditions. We have used omnidirectional images from real indoor environments, captured in dynamic conditions, that have been converted to panoramic format. Two approaches are proposed to address localization by means of triplet neural networks. The first is hierarchical localization, which estimates the robot position in two stages: a coarse localization, which involves a room retrieval task, followed by a fine localization addressed by means of image retrieval within the previously selected room. The second is global localization, which estimates the position of the robot within the entire map in a single step. Besides, an exhaustive study of the influence of the loss function on the network learning process has been made. The experimental section proves that triplet neural networks are an efficient and robust tool to address the localization of mobile robots in indoor environments under real operating conditions.
https://arxiv.org/abs/2404.14117
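The two ingredients above can be sketched compactly: a triplet margin loss on (anchor, positive, negative) embeddings, and the coarse-to-fine lookup (room retrieval, then image retrieval inside the chosen room). The data layout, margin, and room prototypes are illustrative assumptions.

import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)   # pull positives in, push negatives out

def hierarchical_localize(query, room_prototypes, images_by_room):
    """Coarse stage: nearest room prototype. Fine stage: nearest image in that room."""
    room = min(room_prototypes, key=lambda r: np.linalg.norm(query - room_prototypes[r]))
    embs, labels = images_by_room[room]
    return room, labels[int(np.argmin(np.linalg.norm(embs - query, axis=1)))]

rng = np.random.default_rng(0)
rooms = {"kitchen": rng.normal(size=64), "lab": rng.normal(size=64)}
images_by_room = {r: (rng.normal(size=(20, 64)), [f"{r}_{i}" for i in range(20)])
                  for r in rooms}
print(hierarchical_localize(rng.normal(size=64), rooms, images_by_room))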
Visual Place Recognition (VPR) aims to estimate the location of an image by treating it as a retrieval problem. VPR uses a database of geo-tagged images and leverages deep neural networks to extract a global representation, called a descriptor, from each image. While the training data for VPR models often originates from diverse, geographically scattered sources (geo-tagged images), the training process itself is typically assumed to be centralized. This research revisits the task of VPR through the lens of Federated Learning (FL), addressing several key challenges associated with this adaptation. VPR data inherently lacks well-defined classes, and models are typically trained using contrastive learning, which necessitates a data mining step on a centralized database. Additionally, client devices in federated systems can be highly heterogeneous in terms of their processing capabilities. The proposed FedVPR framework not only presents a novel approach for VPR but also introduces a new, challenging, and realistic task for FL research, paving the way for other image retrieval tasks in FL.
https://arxiv.org/abs/2404.13324
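For orientation, the canonical aggregation step that federated training builds on is FedAvg-style weight averaging; the sketch below shows that generic building block (with made-up parameter names), not FedVPR itself.

import numpy as np

def fed_avg(client_weights, client_sizes):
    """client_weights: list of {param_name: ndarray}; sizes weight the average."""
    total = sum(client_sizes)
    return {name: sum(w[name] * (n / total)
                      for w, n in zip(client_weights, client_sizes))
            for name in client_weights[0]}

rng = np.random.default_rng(0)
clients = [{"conv.w": rng.normal(size=(3, 3))} for _ in range(4)]   # local updates
print(fed_avg(clients, [100, 50, 200, 25])["conv.w"].shape)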
The Composed Image Retrieval (CIR) task aims to retrieve target images using a composed query consisting of a reference image and a modification text. Advanced methods often utilize contrastive learning as the optimization objective, which benefits from adequate positive and negative examples. However, triplets for CIR incur high manual annotation costs, resulting in limited positive examples. Furthermore, existing methods commonly use in-batch negative sampling, which reduces the number of negatives available to the model. To address the lack of positives, we propose a data generation method that leverages a multi-modal large language model to construct triplets for CIR. To introduce more negatives during fine-tuning, we design a two-stage fine-tuning framework for CIR, whose second stage introduces plenty of static representations of negatives to optimize the representation space rapidly. The above two improvements can be effectively stacked and are designed to be plug-and-play, easily applied to existing CIR models without changing their original architectures. Extensive experiments and ablation analysis demonstrate that our method effectively scales positives and negatives and achieves state-of-the-art results on both the FashionIQ and CIRR datasets. In addition, our method also performs well in zero-shot composed image retrieval, providing a new CIR solution for low-resource scenarios.
https://arxiv.org/abs/2404.11317
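The second-stage idea can be sketched as an InfoNCE-style loss whose negative set is enlarged with precomputed ("static") embeddings from a bank, so the effective negative count is no longer capped by the batch; the temperature, sizes, and bank are illustrative assumptions rather than the paper's exact formulation.

import numpy as np

def unit(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def info_nce_with_bank(query, positive, batch_negs, bank_negs, tau=0.07):
    """All inputs are unit vectors; returns the scalar loss for one query."""
    negs = np.concatenate([batch_negs, bank_negs], axis=0)
    logits = np.concatenate([[query @ positive], negs @ query]) / tau
    logits -= logits.max()   # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

rng = np.random.default_rng(0)
q, p = unit(rng.normal(size=256)), unit(rng.normal(size=256))
in_batch = unit(rng.normal(size=(31, 256)))        # usual in-batch negatives
static_bank = unit(rng.normal(size=(4096, 256)))   # frozen negative bank
print(info_nce_with_bank(q, p, in_batch, static_bank))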
In the face of burgeoning image data, efficiently retrieving similar images poses a formidable challenge. Past research has focused on refining hash functions to distill images into compact indicators of resemblance. Initial attempts used shallow models, later evolving from Convolutional Neural Networks (CNNs) to attention-based and more advanced architectures. Recognizing the limitations of gradient-based models for spatial information embedding, we propose an innovative image hashing method, NeuroHash, which leverages Hyperdimensional Computing (HDC). HDC symbolically encodes spatial information into high-dimensional vectors, reshaping image representation. Our approach combines pre-trained large vision models with HDC operations, enabling spatially encoded feature representations. Hashing with locality-sensitive hashing (LSH) ensures swift and efficient image retrieval. Notably, our framework allows dynamic hash manipulation for conditional image retrieval. Our work introduces a transformative image hashing framework enabling spatial-aware conditional retrieval. By seamlessly combining DNN-based neural and HDC-based symbolic models, our methodology breaks from traditional training, offering flexible and conditional image retrieval. Performance evaluations signify a paradigm shift in image-hashing methodologies, demonstrating enhanced retrieval accuracy.
https://arxiv.org/abs/2404.11025
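A minimal sketch of the LSH retrieval step with random hyperplanes (illustrating the hashing/lookup side only, not the HDC encoding): the sign pattern of random projections forms a binary code, and a lookup probes the query's bucket. Sizes are illustrative assumptions.

import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
planes = rng.normal(size=(16, 512))   # 16 random hyperplanes -> 16-bit codes

def lsh_code(vec):
    return tuple((planes @ vec > 0).astype(int))

db = rng.normal(size=(10000, 512))
buckets = defaultdict(list)
for i, v in enumerate(db):             # index the database once
    buckets[lsh_code(v)].append(i)

query = db[42] + 0.05 * rng.normal(size=512)   # slightly perturbed copy of item 42
candidates = buckets[lsh_code(query)]          # probe only the matching bucket
print(len(candidates), 42 in candidates)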