Image search is a pivotal task in multimedia and computer vision, with applications across diverse domains ranging from internet search to medical diagnostics. Conventional image search systems accept textual or visual queries and retrieve the most relevant candidates from a database. However, prevalent methods often rely on single-turn procedures, which introduce inaccuracies and limit recall. These methods also face challenges such as vocabulary mismatch and the semantic gap, constraining their overall effectiveness. To address these issues, we propose an interactive image retrieval system capable of refining queries based on user relevance feedback in a multi-turn setting. The system incorporates a vision language model (VLM) based image captioner to enhance the quality of text-based queries, yielding more informative queries with each iteration. Moreover, we introduce a large language model (LLM) based denoiser to refine text-based query expansions, mitigating inaccuracies in the image descriptions generated by captioning models. To evaluate our system, we curate a new dataset by adapting the MSR-VTT video retrieval dataset to the image retrieval task, offering multiple relevant ground-truth images for each query. Through comprehensive experiments, we validate the effectiveness of the proposed system against baseline methods, achieving state-of-the-art performance with a notable 10% improvement in recall. Our contributions encompass an innovative interactive image retrieval system, an LLM-based denoiser, a carefully designed evaluation dataset, and thorough experimental validation.
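The multi-turn loop described above can be sketched in a few lines. This is a hypothetical illustration only: `embed`, `caption`, `denoise`, and `get_feedback` stand in for the text encoder, the VLM captioner, the LLM denoiser, and the user; none of these names come from the paper, and the real components are neural models rather than simple callables.

```python
def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return num / den if den else 0.0

def retrieve(query_vec, gallery, top_k=3):
    """Rank gallery images (name -> feature vector) by similarity to the query."""
    ranked = sorted(gallery, key=lambda name: cosine(query_vec, gallery[name]),
                    reverse=True)
    return ranked[:top_k]

def interactive_search(query, gallery, embed, caption, denoise, get_feedback,
                       rounds=3, top_k=3):
    """Each turn: retrieve, collect a relevant hit from the user, and expand
    the text query with a denoised caption of that hit."""
    results = retrieve(embed(query), gallery, top_k)
    for _ in range(rounds - 1):
        relevant = get_feedback(results)  # user marks a relevant image, or None
        if relevant is None:
            break
        query = query + " " + denoise(caption(relevant))
        results = retrieve(embed(query), gallery, top_k)
    return results
```

With a toy gallery and a bag-of-words `embed`, repeated feedback on a relevant image steers the ranking toward it.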
https://arxiv.org/abs/2404.18746
We describe a protocol to study text-to-video retrieval training with unlabeled videos, where we assume (i) no access to labels for any videos, i.e., no access to the set of ground-truth captions, but (ii) access to labeled images in the form of text. Using image expert models is a realistic scenario given that annotating images is cheaper, and therefore more scalable, than expensive video labeling schemes. Recently, zero-shot image experts such as CLIP have established a new strong baseline for video understanding tasks. In this paper, we make use of this progress and instantiate the image experts from two types of models: a text-to-image retrieval model to provide an initial backbone, and image captioning models to provide a supervision signal for unlabeled videos. We show that automatically labeling video frames with image captioning enables text-to-video retrieval training. This process adapts the features to the target domain at no manual annotation cost, consequently outperforming the strong zero-shot CLIP baseline. During training, we sample captions from multiple video frames that best match the visual content, and perform temporal pooling over frame representations by scoring frames according to their relevance to each caption. We conduct extensive ablations to provide insights and demonstrate the effectiveness of this simple framework by outperforming the CLIP zero-shot baselines on text-to-video retrieval on three standard datasets, namely ActivityNet, MSR-VTT, and MSVD.
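As an illustration of relevance-scored temporal pooling, the sketch below softmax-weights precomputed frame features by their similarity to a caption feature. It is a minimal stand-in assuming generic feature vectors, not the CLIP embeddings or training pipeline the paper uses.

```python
import math

def temporal_pool(frame_feats, caption_feat, temperature=0.1):
    """Pool frame features into a single video feature, weighting each frame
    by its softmax-normalized dot-product similarity to the caption."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    scores = [dot(f, caption_feat) / temperature for f in frame_feats]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    weights = [w / z for w in weights]
    dim = len(frame_feats[0])
    return [sum(w * f[d] for w, f in zip(weights, frame_feats))
            for d in range(dim)]
```

A low temperature concentrates the pooled feature on the frames most relevant to the caption.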
https://arxiv.org/abs/2404.17498
The user base of short-video apps has experienced unprecedented growth in recent years, resulting in a significant demand for video content analysis. In particular, text-video retrieval, which aims to find the top-matching videos in a vast video corpus for a given text description, is an essential function, the primary challenge of which is to bridge the modality gap. Nevertheless, most existing approaches treat texts merely as discrete tokens and neglect their syntax structures. Moreover, the abundant spatial and temporal clues in videos are often underutilized due to the lack of interaction with text. To address these issues, we argue that using text as guidance to focus on relevant temporal frames and spatial regions within videos is beneficial. In this paper, we propose a novel Syntax-Hierarchy-Enhanced text-video retrieval method (SHE-Net) that exploits the inherent semantic and syntax hierarchy of texts to bridge the modality gap from two perspectives. First, to facilitate a more fine-grained integration of visual content, we employ the text syntax hierarchy, which reveals the grammatical structure of text descriptions, to guide the visual representations. Second, to further enhance multi-modal interaction and alignment, we also utilize the syntax hierarchy to guide the similarity calculation. We evaluate our method on four public text-video retrieval datasets: MSR-VTT, MSVD, DiDeMo, and ActivityNet. The experimental results and ablation studies confirm the advantages of our proposed method.
https://arxiv.org/abs/2404.14066
Text-video retrieval aims to find the most relevant cross-modal samples for a given query. Recent methods focus on modeling the whole spatial-temporal relations. However, since video clips contain more diverse content than captions, a model aligning these asymmetric video-text pairs has a high risk of retrieving many false positive results. In this paper, we propose Probabilistic Token Aggregation (ProTA) to handle cross-modal interaction with content asymmetry. Specifically, we propose dual partial-related aggregation to disentangle and re-aggregate token representations in both low-dimension and high-dimension spaces. We propose token-based probabilistic alignment to generate token-level probabilistic representations and maintain feature representation diversity. In addition, an adaptive contrastive loss is proposed to learn a compact cross-modal distribution space. Based on extensive experiments, ProTA achieves significant improvements on MSR-VTT (50.9%), LSMDC (25.8%), and DiDeMo (47.2%).
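One way to picture a token-level probabilistic representation is to treat a text token as a diagonal Gaussian and score it against a video token by a sampled expectation of the dot product. This is loosely inspired by, not taken from, ProTA; `mu_t`, `var_t`, and `mu_v` are hypothetical inputs, and the paper's actual alignment is more elaborate.

```python
import random

def prob_token_sim(mu_t, var_t, mu_v, samples=100, rng=None):
    """Monte-Carlo estimate of E[<z, mu_v>] where z ~ N(mu_t, diag(var_t)).
    With zero variance this reduces to the ordinary dot product."""
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(samples):
        z = [m + (v ** 0.5) * rng.gauss(0.0, 1.0) for m, v in zip(mu_t, var_t)]
        total += sum(a * b for a, b in zip(z, mu_v))
    return total / samples
```

The variance term lets a token cover a region of the embedding space rather than a single point, which is the intuition behind maintaining representation diversity.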
https://arxiv.org/abs/2404.12216
The increasing prevalence of video clips has sparked growing interest in text-video retrieval. Recent advances focus on establishing a joint embedding space for text and video, relying on consistent embedding representations to compute similarity. However, the text content in existing datasets is generally short and concise, making it hard to fully describe the redundant semantics of a video. Correspondingly, a single text embedding may be insufficiently expressive to capture the video embedding and empower retrieval. In this study, we propose a new stochastic text modeling method, T-MASS, in which text is modeled as a stochastic embedding, enriching the text embedding with a flexible and resilient semantic range and yielding a text mass. Specifically, we introduce a similarity-aware radius module to adapt the scale of the text mass for a given text-video pair. In addition, we design a support text regularization to further control the text mass during training. The inference pipeline is also tailored to fully exploit the text mass for accurate retrieval. Empirical evidence suggests that T-MASS not only effectively attracts relevant text-video pairs while distancing irrelevant ones, but also enables the determination of precise text embeddings for relevant pairs. Our experimental results show a substantial improvement of T-MASS over baselines (3% to 6.3% by R@1). T-MASS also achieves state-of-the-art performance on five benchmark datasets, including MSRVTT, LSMDC, DiDeMo, VATEX, and Charades.
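A heavily simplified caricature of the "text mass" idea: sample a stochastic text embedding around the point embedding, with a radius scaled by text-video similarity. The function below is a hypothetical sketch, not the paper's similarity-aware radius module, and assumes raw (unnormalized) feature vectors.

```python
import random

def text_mass_sample(text_emb, video_emb, base_radius=0.5, rng=random):
    """Sample one point of the 'text mass': the point text embedding plus
    Gaussian noise whose scale grows with text-video similarity."""
    sim = sum(t * v for t, v in zip(text_emb, video_emb))
    radius = base_radius * abs(sim)
    return [t + radius * rng.gauss(0.0, 1.0) for t in text_emb]
```

When text and video are orthogonal the radius collapses to zero and the sample degenerates to the deterministic point embedding; a similar pair yields a genuinely stochastic sample.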
https://arxiv.org/abs/2403.17998
Composed video retrieval (CoVR) is a challenging problem in computer vision which has recently highlighted the integration of modification text with visual queries for more sophisticated video search in large databases. Existing works predominantly rely on visual queries combined with modification text to distinguish relevant videos. However, such a strategy struggles to fully preserve the rich query-specific context in retrieved target videos and represents the target video using a visual embedding only. We introduce a novel CoVR framework that leverages detailed language descriptions to explicitly encode query-specific contextual information and learns discriminative vision-only, text-only, and vision-text embeddings for better alignment, to accurately retrieve matched target videos. Our proposed framework can be flexibly employed for both composed video (CoVR) and composed image (CoIR) retrieval tasks. Experiments on three datasets show that our approach obtains state-of-the-art performance for both CoVR and zero-shot CoIR tasks, achieving gains as high as around 7% in terms of recall@K=1. Our code, models, and detailed language descriptions for the WebVid-CoVR dataset are available at this https URL.
https://arxiv.org/abs/2403.16997
Video corpus moment retrieval (VCMR) is a new video retrieval task aimed at retrieving a relevant moment from a large corpus of untrimmed videos using a natural language text as the query. The relevance between the video and the query is partial, mainly evident in two aspects: (1) Scope: an untrimmed video contains information-rich frames, and not all are relevant to the query. Strong correlation is typically observed only within the relevant moment, emphasizing the importance of capturing key content. (2) Modality: the relevance of the query to different modalities varies; action descriptions align more with visual elements, while character conversations are more related to textual information. Recognizing and addressing these modality-specific nuances is crucial for effective retrieval in VCMR. However, existing methods often treat all video contents equally, leading to sub-optimal moment retrieval. We argue that effectively capturing the partial relevance between the query and the video is essential for the VCMR task. To this end, we propose a Partial Relevance Enhanced Model (PREM) to improve VCMR. VCMR involves two sub-tasks: video retrieval and moment localization. To align with their distinct objectives, we implement specialized partial relevance enhancement strategies. For video retrieval, we introduce a multi-modal collaborative video retriever that generates distinct query representations tailored for different modalities by modality-specific pooling, ensuring a more effective match. For moment localization, we propose a focus-then-fuse moment localizer that utilizes modality-specific gates to capture essential content, followed by fusing multi-modal information for moment localization. Experimental results on the TVR and DiDeMo datasets show that the proposed model outperforms the baselines, achieving a new state of the art for VCMR.
https://arxiv.org/abs/2402.13576
Video Corpus Moment Retrieval (VCMR) is a practical video retrieval task focused on identifying a specific moment within a vast corpus of untrimmed videos using a natural language query. Existing methods for VCMR typically rely on frame-aware video retrieval, calculating similarities between the query and video frames to rank videos based on maximum frame similarity. However, this approach overlooks the semantic structure embedded within the information between frames, namely the event, a crucial element for human comprehension of videos. Motivated by this, we propose EventFormer, a model that explicitly utilizes events within videos as fundamental units for video retrieval. The model extracts event representations through event reasoning and hierarchical event encoding. The event reasoning module groups consecutive and visually similar frame representations into events, while the hierarchical event encoding encodes information at both the frame and event levels. We also introduce anchor multi-head self-attention to encourage the Transformer to capture the relevance of adjacent content in the video. EventFormer is trained with two-branch contrastive learning and dual optimization for the two sub-tasks of VCMR. Extensive experiments on the TVR, ANetCaps, and DiDeMo benchmarks show the effectiveness and efficiency of EventFormer in VCMR, achieving new state-of-the-art results. Its effectiveness is also validated on the partially relevant video retrieval task.
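The event reasoning step — grouping consecutive, visually similar frames into events — can be sketched with a simple similarity threshold. This toy version merges frames greedily and averages each group; EventFormer's actual grouping and hierarchical encoding are learned, and `threshold` is a hypothetical parameter.

```python
def _cos(a, b):
    """Cosine similarity between two frame feature vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return num / den if den else 0.0

def group_events(frame_feats, threshold=0.9):
    """Merge runs of consecutive similar frames into events, returning one
    averaged representation per event."""
    events = [[frame_feats[0]]]
    for f in frame_feats[1:]:
        if _cos(events[-1][-1], f) >= threshold:
            events[-1].append(f)  # extend the current event
        else:
            events.append([f])    # start a new event
    dim = len(frame_feats[0])
    return [[sum(f[d] for f in g) / len(g) for d in range(dim)] for g in events]
```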
https://arxiv.org/abs/2402.13566
Though pre-trained vision-language models have demonstrated significant benefits in boosting video-text retrieval performance from large-scale web videos, fine-tuning still plays a critical role and relies on clips manually annotated with start and end times, which requires considerable human effort. To address this issue, we explore an alternative, cheaper source of annotations for video-text retrieval: single timestamps. We initialise clips from timestamps in a heuristic way to warm up a retrieval model. Then a video clip editing method is proposed to refine the initial rough boundaries and improve retrieval performance. A student-teacher network is introduced for video clip editing. The teacher model edits the clips in the training set whereas the student model trains on the edited clips. The teacher weights are updated from the student's whenever the student's performance improves. Our method is model-agnostic and applicable to any retrieval model. We conduct experiments based on three state-of-the-art retrieval models: COOT, VideoCLIP, and CLIP4Clip. Experiments conducted on three video retrieval datasets, YouCook2, DiDeMo, and ActivityNet-Captions, show that our edited clips consistently improve retrieval performance over the initial clips across all three retrieval models.
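The boundary-refinement idea — starting from rough, timestamp-initialized clips and editing them toward higher relevance — can be caricatured as a greedy one-step boundary search. The paper's teacher is a learned network; this toy only conveys the notion of editing boundaries, with `frame_scores` standing in as hypothetical per-frame relevance values.

```python
def edit_clip(start, end, frame_scores, step=1):
    """Nudge each clip boundary by at most one step in the direction that
    raises the mean per-frame relevance of the clip."""
    def mean_score(s, e):
        return sum(frame_scores[s:e]) / max(e - s, 1)
    best = (start, end)
    for s in (start - step, start, start + step):
        for e in (end - step, end, end + step):
            if 0 <= s < e <= len(frame_scores) and mean_score(s, e) > mean_score(*best):
                best = (s, e)
    return best
```

Applied to a clip whose rough boundaries include irrelevant frames on both ends, the edit trims them away.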
https://arxiv.org/abs/2402.02335
Existing video-language studies mainly focus on learning short video clips, leaving long-term temporal dependencies rarely explored due to the prohibitively high computational cost of modeling long videos. To address this issue, one feasible solution is learning the correspondence between video clips and captions, which, however, inevitably encounters the multi-granularity noisy correspondence (MNC) problem. To be specific, MNC refers to clip-caption misalignment (coarse-grained) and frame-word misalignment (fine-grained), both of which hinder temporal learning and video understanding. In this paper, we propose NOise Robust Temporal Optimal traNsport (Norton), which addresses MNC in a unified optimal transport (OT) framework. In brief, Norton employs video-paragraph and clip-caption contrastive losses to capture long-term dependencies based on OT. To address coarse-grained misalignment in the video-paragraph contrast, Norton filters out irrelevant clips and captions through an alignable prompt bucket and realigns asynchronous clip-caption pairs based on transport distance. To address fine-grained misalignment, Norton incorporates a soft-maximum operator to identify crucial words and key frames. Additionally, Norton handles potential faulty negative samples in the clip-caption contrast by rectifying the alignment target with the OT assignment to ensure precise temporal modeling. Extensive experiments on video retrieval, videoQA, and action segmentation verify the effectiveness of our method. Code is available at this https URL.
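Norton builds on entropic optimal transport. A minimal Sinkhorn iteration between two uniform marginals — the standard OT machinery underneath, not the paper's full method with prompt buckets and soft-maximum operators — can be written as:

```python
import math

def sinkhorn(cost, eps=0.1, iters=100):
    """Entropic OT between uniform marginals via Sinkhorn iterations.
    Returns the transport plan: rows sum to 1/n, columns to 1/m."""
    n, m = len(cost), len(cost[0])
    K = [[math.exp(-c / eps) for c in row] for row in cost]  # Gibbs kernel
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # alternate scaling so the plan matches the row/column marginals
        u = [(1.0 / n) / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
```

Given a clip-caption cost matrix, the resulting plan concentrates mass on low-cost (well-aligned) pairs, which is what realignment by transport distance exploits.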
https://arxiv.org/abs/2401.16702
There has been a long-standing quest for a unified audio-visual-text model to enable various multimodal understanding tasks, mimicking the listening, seeing, and reading process of human beings. Humans tend to represent knowledge using two separate systems: one for representing verbal (textual) information and one for representing non-verbal (visual and auditory) information. These two systems can operate independently but can also interact with each other. Motivated by this understanding of human cognition, in this paper we introduce CoAVT, a novel cognition-inspired Correlated Audio-Visual-Text pre-training model to connect the three modalities. It contains a joint audio-visual encoder that learns to encode audio-visual synchronization information together with the audio and visual content for non-verbal information, and a text encoder to handle textual input for verbal information. To bridge the gap between modalities, CoAVT employs a query encoder, which contains a set of learnable query embeddings, and extracts the audiovisual features most informative for the corresponding text. Additionally, to leverage the correspondences of audio and vision with language, we also establish audio-text and visual-text bi-modal alignments upon the foundational audiovisual-text tri-modal alignment to enhance multimodal representation learning. Finally, we jointly optimize the CoAVT model with three multimodal objectives: contrastive loss, matching loss, and language modeling loss. Extensive experiments show that CoAVT can learn strong multimodal correlations and generalize to various downstream tasks. CoAVT establishes new state-of-the-art performance on the text-video retrieval task on AudioCaps in both zero-shot and fine-tuning settings, and on audio-visual event classification and audio-visual retrieval tasks on AudioSet and VGGSound.
https://arxiv.org/abs/2401.12264
Text-video retrieval is a critical multi-modal task to find the most relevant video for a text query. Although pretrained models like CLIP have demonstrated impressive potential in this area, the rising cost of fully finetuning these models due to increasing model size continues to pose a problem. To address this challenge, prompt tuning has emerged as an alternative. However, existing works still face two problems when adapting pretrained image-text models to downstream video-text tasks: (1) the visual encoder can only encode frame-level features and fails to extract global-level general video information; (2) equipping the visual and text encoders with separate prompts fails to mitigate the visual-text modality gap. To this end, we propose DGL, a cross-modal Dynamic prompt tuning method with Global-Local video attention. In contrast to previous prompt tuning methods, we employ a shared latent space to generate local-level text and frame prompts that encourage inter-modal interaction. Furthermore, we propose modeling video with a global-local attention mechanism to capture global video information from the perspective of prompt tuning. Extensive experiments reveal that when only 0.67% of parameters are tuned, our cross-modal prompt tuning strategy DGL outperforms or is comparable to fully finetuning methods on the MSR-VTT, VATEX, LSMDC, and ActivityNet datasets. Code will be available at this https URL
https://arxiv.org/abs/2401.10588
The recent advance in vision-language models is largely attributed to the abundance of image-text data. We aim to replicate this success for video-language models, but there simply is not enough human-curated video-text data available. We thus resort to fine-tuning a video-language model from a strong image-language baseline with synthesized instructional data. The resulting video-language model is then used to auto-label millions of videos to generate high-quality captions. We show the adapted video-language model performs well on a wide range of video-language benchmarks. For instance, it surpasses the best prior result on open-ended NExT-QA by 2.8%. Besides, our model generates detailed descriptions for previously unseen videos, which provide better textual supervision than existing methods. Experiments show that a video-language dual-encoder model contrastively trained on these auto-generated captions is 3.8% better than the strongest baseline that also leverages vision-language models. Our best model outperforms state-of-the-art methods on MSR-VTT zero-shot text-to-video retrieval by 6%.
https://arxiv.org/abs/2401.06129
Text-video retrieval is a challenging task that aims to identify relevant videos given textual queries. Compared to conventional textual retrieval, the main obstacle for text-video retrieval is the semantic gap between the textual nature of queries and the visual richness of video content. Previous works primarily focus on aligning the query and the video by finely aggregating word-frame matching signals. Inspired by the human cognitive process of modularly judging the relevance between text and video, we observe that this judgment requires high-order matching signals due to the consecutive and complex nature of video content. In this paper, we propose chunk-level text-video matching, where query chunks are extracted to describe a specific retrieval unit, and video chunks are segmented into distinct clips from videos. We formulate chunk-level matching as n-ary correlation modeling between words of the query and frames of the video and introduce a multi-modal hypergraph for n-ary correlation modeling. By representing textual units and video frames as nodes and using hyperedges to depict their relationships, a multi-modal hypergraph is constructed. In this way, the query and the video can be aligned in a high-order semantic space. In addition, to enhance the model's generalization ability, the extracted features are fed into a variational inference component, obtaining a variational representation under a Gaussian distribution. The incorporation of hypergraphs and variational inference allows our model to capture complex, n-ary interactions among textual and visual contents. Experimental results demonstrate that our proposed method achieves state-of-the-art performance on the text-video retrieval task.
https://arxiv.org/abs/2401.03177
We introduce the video detours problem for navigating instructional videos. Given a source video and a natural language query asking to alter the how-to video's current path of execution in a certain way, the goal is to find a related "detour video" that satisfies the requested alteration. To address this challenge, we propose VidDetours, a novel video-language approach that learns to retrieve the targeted temporal segments from a large repository of how-to's using video-and-text conditioned queries. Furthermore, we devise a language-based pipeline that exploits how-to video narration text to create weakly supervised training data. We demonstrate our idea applied to the domain of how-to cooking videos, where a user can detour from their current recipe to find steps with alternate ingredients, tools, and techniques. Validating on a ground-truth annotated dataset of 16K samples, we show our model's significant improvements over the best available methods for video retrieval and question answering, with recall rates exceeding the state of the art by 35%.
https://arxiv.org/abs/2401.01823
In recent years, text-to-video retrieval methods based on CLIP have experienced rapid development. The primary direction of evolution is to exploit a much wider gamut of visual and textual cues to achieve alignment. Concretely, methods with impressive performance often design a heavy fusion block for sentence (words)-video (frames) interaction, despite the prohibitive computational complexity. Nevertheless, these approaches are not optimal in terms of feature utilization and retrieval efficiency. To address this issue, we adopt multi-granularity visual feature learning, ensuring the model's comprehensiveness in capturing visual content features spanning from abstract to detailed levels during the training phase. To better leverage the multi-granularity features, we devise a two-stage retrieval architecture for the retrieval phase. This solution ingeniously balances the coarse and fine granularity of retrieval content and strikes a harmonious equilibrium between retrieval effectiveness and efficiency. Specifically, in the training phase, we design a parameter-free text-gated interaction block (TIB) for fine-grained video representation learning and embed an extra Pearson constraint to optimize cross-modal representation learning. In the retrieval phase, we use coarse-grained video representations for fast recall of top-k candidates, which are then reranked by fine-grained video representations. Extensive experiments on four benchmarks demonstrate the efficiency and effectiveness of our method. Notably, our method achieves performance comparable to the current state-of-the-art methods while being nearly 50 times faster.
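The coarse-recall-then-fine-rerank pipeline is easy to sketch in isolation. Here `coarse_sim` and `fine_sim` are placeholder scoring functions, not the paper's TIB-based representations; the point is only that the expensive scorer runs on k candidates instead of the whole corpus.

```python
def two_stage_retrieve(query, videos, coarse_sim, fine_sim, k=10):
    """Stage 1: rank all videos with a cheap coarse score and keep top-k.
    Stage 2: rerank only those k candidates with the expensive fine score."""
    candidates = sorted(videos, key=lambda v: coarse_sim(query, v),
                        reverse=True)[:k]
    return sorted(candidates, key=lambda v: fine_sim(query, v), reverse=True)
```

With a corpus of size N, the fine scorer is invoked O(k) instead of O(N) times, which is where the reported speedup of such cascades comes from.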
https://arxiv.org/abs/2401.00701
Self-supervised approaches for video have shown impressive results in video understanding tasks. However, unlike early works that leverage temporal self-supervision, current state-of-the-art methods primarily rely on tasks from the image domain (e.g., contrastive learning) that do not explicitly promote the learning of temporal features. We identify two factors that limit existing temporal self-supervision: 1) tasks are too simple, resulting in saturated training performance, and 2) we uncover shortcuts based on local appearance statistics that hinder the learning of high-level features. To address these issues, we propose 1) a more challenging reformulation of temporal self-supervision as frame-level (rather than clip-level) recognition tasks and 2) an effective augmentation strategy to mitigate shortcuts. Our model extends a representation of single video frames, pre-trained through contrastive learning, with a transformer that we train through temporal self-supervision. We demonstrate experimentally that our more challenging frame-level task formulations and the removal of shortcuts drastically improve the quality of features learned through temporal self-supervision. The generalization capability of our self-supervised video method is evidenced by its state-of-the-art performance in a wide range of high-level semantic tasks, including video retrieval, action classification, and video attribute recognition (such as object and scene identification), as well as low-level temporal correspondence tasks like video object segmentation and pose tracking. Additionally, we show that the video representations learned through our method exhibit increased robustness to the input perturbations.
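A frame-level (rather than clip-level) temporal pretext task can be as simple as asking the model to recover each frame's original position after shuffling. This is a generic illustration in the spirit of the reformulation described above, not the paper's exact task set.

```python
import random

def make_order_task(num_frames, rng):
    """Build one training instance for a frame-order pretext task.
    `order[k]` is the original index of the frame shown in slot k;
    `target[i]` is the slot where original frame i ended up."""
    order = list(range(num_frames))
    rng.shuffle(order)
    target = [order.index(i) for i in range(num_frames)]
    return order, target
```

A model trained on such instances must attend to temporal cues per frame, rather than solving the task once per clip.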
https://arxiv.org/abs/2312.13008
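As a toy illustration of moving from clip-level to frame-level temporal self-supervision, the sketch below builds a per-frame target (whether each displayed frame is temporally displaced), so every frame contributes its own label rather than one label per clip; the task design and names are hypothetical, not the paper's formulation:

```python
# Frame-level temporal task sketch: swap a few frames and label each
# position by whether the frame shown there is out of temporal order.
import random

def make_frame_level_task(num_frames, num_swaps, seed=0):
    """Return (order, labels): order is a permutation of frame indices,
    labels[i] = 1 if the frame at position i is displaced."""
    rng = random.Random(seed)
    order = list(range(num_frames))
    for _ in range(num_swaps):
        i, j = rng.sample(range(num_frames), 2)
        order[i], order[j] = order[j], order[i]
    labels = [int(order[i] != i) for i in range(num_frames)]
    return order, labels
```

A frame-level formulation like this yields `num_frames` supervision signals per clip instead of one, which is one way a task can be made harder and less prone to saturation.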
A short video clip may contain the progression of multiple events and an interesting story line. A human needs to capture the event in every shot and associate the shots to understand the story behind them. In this work, we present Shot2Story20K, a new multi-shot video understanding benchmark with detailed shot-level captions and comprehensive video summaries. To facilitate better semantic understanding of videos, we provide captions for both visual signals and human narrations. We design several distinct tasks, including single-shot video and narration captioning, multi-shot video summarization, and video retrieval with shot descriptions. Preliminary experiments show that generating a long and comprehensive video summary remains challenging. Nevertheless, the generated imperfect summaries can already significantly boost the performance of existing video understanding tasks such as video question answering, promoting an under-explored setting of video understanding with detailed summaries.
https://arxiv.org/abs/2312.10300
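One way to picture the benchmark's structure is a record with per-shot visual and narration captions plus a whole-video summarization input; the field names below are illustrative and are not the dataset's actual schema:

```python
# Illustrative record layout for a multi-shot sample: each shot carries
# a visual caption and a narration caption, and shot captions can be
# concatenated as input for video summarization.
from dataclasses import dataclass, field

@dataclass
class Shot:
    start_sec: float
    end_sec: float
    visual_caption: str     # what is seen in the shot
    narration_caption: str  # what the narrator says

@dataclass
class MultiShotVideo:
    video_id: str
    shots: list = field(default_factory=list)

    def summary_prompt(self):
        # Concatenate shot-level captions as summarization input.
        return " ".join(s.visual_caption for s in self.shots)
```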
Visual retrieval aims to search for the most relevant visual items, e.g., images and videos, from a candidate gallery given a query item. Accuracy and efficiency are two competing objectives in retrieval tasks. Instead of crafting a new method to pursue further improvements in accuracy, in this paper we propose Whiten-MTD, a multi-teacher distillation framework that transfers knowledge from off-the-shelf pre-trained retrieval models to a lightweight student model for efficient visual retrieval. Furthermore, we discover that the similarities produced by different retrieval models are diversified and incommensurable, which makes it challenging to jointly distill knowledge from multiple models. We therefore propose to whiten the outputs of the teacher models before fusion, which enables effective multi-teacher distillation for retrieval models. Whiten-MTD is conceptually simple and practically effective. Extensive experiments on two landmark image retrieval datasets and one video retrieval dataset demonstrate the effectiveness of our proposed method and its good balance of retrieval performance and efficiency. Our source code is released at this https URL.
https://arxiv.org/abs/2312.09716
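The "whiten before fusion" idea can be illustrated by standardizing each teacher's similarity scores so their incommensurable scales become comparable before averaging. Note this is a deliberately simplified sketch: the paper's whitening may be a full decorrelating transform over teacher outputs, whereas here only per-teacher score standardization is shown, with illustrative names throughout:

```python
# Sketch: make per-teacher similarity scores commensurable by
# standardizing each teacher's scores before fusing them.
import numpy as np

def whiten_and_fuse(teacher_scores):
    """teacher_scores: (num_teachers, num_candidates) raw similarities.
    Returns a fused (num_candidates,) score vector."""
    scores = np.asarray(teacher_scores, dtype=float)
    # Zero-mean, unit-variance per teacher, so that a teacher whose raw
    # similarities live on a larger scale does not dominate the fusion.
    mean = scores.mean(axis=1, keepdims=True)
    std = scores.std(axis=1, keepdims=True) + 1e-8
    whitened = (scores - mean) / std
    return whitened.mean(axis=0)
```

Without the standardization step, simply averaging raw scores would let the teacher with the widest score range dominate, which is exactly the incommensurability problem the abstract describes.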
Text-video retrieval, a prominent sub-field within the broader domain of multimedia content management, has witnessed remarkable growth and innovation over the past decade. However, existing methods assume the video scenes are consistent and the description annotators are unbiased. These assumptions fail to align with fluid real-world scenarios, where descriptions can be influenced by annotator biases, diverse writing styles, and varying textual perspectives. To overcome the aforementioned problems, we introduce WAVER, a cross-domain knowledge distillation mechanism designed to tackle the challenge of writing-style agnosticism. WAVER capitalizes on the open-vocabulary properties inherent in pre-trained vision-language models and employs an implicit knowledge distillation approach to transfer text-based knowledge from a teacher model to a vision-based student. Empirical studies conducted across four standard benchmark datasets, encompassing various settings, provide compelling evidence that WAVER can achieve state-of-the-art performance in text-video retrieval tasks while handling writing-style variations.
https://arxiv.org/abs/2312.09507
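Knowledge distillation of this kind is commonly realized as a softened divergence between teacher and student similarity distributions; the sketch below is a generic KL-based version with illustrative names and temperature, not WAVER's exact loss:

```python
# Generic distillation objective sketch: KL divergence between the
# teacher's and student's softened similarity distributions.
import math

def softmax(xs, temp=1.0):
    exps = [math.exp(x / temp) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_kl(teacher_logits, student_logits, temp=2.0):
    """KL(teacher || student) over temperature-softened distributions.
    Minimizing this pushes the student's similarity ranking toward
    the teacher's."""
    p = softmax(teacher_logits, temp)
    q = softmax(student_logits, temp)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

The temperature softens both distributions so the student also learns from the teacher's relative preferences among non-top candidates, a standard motivation for distillation losses of this shape.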