Abstract
We describe a protocol for studying text-to-video retrieval training with unlabeled videos, where we assume (i) no access to labels for any videos, i.e., no access to the set of ground-truth captions, but (ii) access to labeled images in the form of text. Using image expert models is a realistic scenario, given that annotating images is cheaper and therefore more scalable than expensive video labeling schemes. Recently, zero-shot image experts such as CLIP have established a strong new baseline for video understanding tasks. In this paper, we build on this progress and instantiate the image experts with two types of models: a text-to-image retrieval model to provide an initial backbone, and image captioning models to provide a supervision signal for unlabeled videos. We show that automatically labeling video frames with image captions enables text-to-video retrieval training. This process adapts the features to the target domain at no manual annotation cost, and consequently outperforms the strong zero-shot CLIP baseline. During training, we sample, from multiple video frames, the captions that best match the visual content, and perform temporal pooling over frame representations by scoring frames according to their relevance to each caption. We conduct extensive ablations to provide insights, and demonstrate the effectiveness of this simple framework by outperforming CLIP zero-shot baselines on text-to-video retrieval on three standard datasets: ActivityNet, MSR-VTT, and MSVD.
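The caption-conditioned temporal pooling described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's exact formulation: the softmax weighting and the temperature `tau` are assumptions introduced here for concreteness.

```python
import numpy as np

def temporal_pool(frame_embs: np.ndarray, caption_emb: np.ndarray,
                  tau: float = 0.07) -> np.ndarray:
    """Pool per-frame embeddings into one video embedding, weighting each
    frame by its (softmax-normalized) similarity to a given caption.

    frame_embs: (num_frames, dim) array of frame representations.
    caption_emb: (dim,) caption representation.
    """
    # Normalize to unit length so dot products are cosine similarities.
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    c = caption_emb / np.linalg.norm(caption_emb)
    sims = f @ c                   # (num_frames,) frame-to-caption relevance
    w = np.exp(sims / tau)
    w /= w.sum()                   # softmax over frames
    return w @ frame_embs          # caption-conditioned video embedding
```

With a low temperature, frames whose content matches the caption dominate the pooled representation, while irrelevant frames are suppressed.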
URL
https://arxiv.org/abs/2404.17498