Paper Reading AI Learner

Learning text-to-video retrieval from image captioning

2024-04-26 15:56:08
Lucas Ventura, Cordelia Schmid, Gül Varol

Abstract

We describe a protocol to study text-to-video retrieval training with unlabeled videos, where we assume (i) no access to labels for any videos, i.e., no access to the set of ground-truth captions, but (ii) access to labeled images in the form of text. Using image expert models is a realistic scenario given that annotating images is cheaper and therefore more scalable than expensive video labeling schemes. Recently, zero-shot image experts such as CLIP have established a new strong baseline for video understanding tasks. In this paper, we make use of this progress and instantiate the image experts from two types of models: a text-to-image retrieval model to provide an initial backbone, and image captioning models to provide a supervision signal for unlabeled videos. We show that automatically labeling video frames with image captions enables text-to-video retrieval training. This process adapts the features to the target domain at no manual annotation cost, consequently outperforming the strong zero-shot CLIP baseline. During training, we sample captions from the multiple video frames that best match the visual content, and perform temporal pooling over frame representations by scoring frames according to their relevance to each caption. We conduct extensive ablations to provide insights, and demonstrate the effectiveness of this simple framework by outperforming the CLIP zero-shot baselines on text-to-video retrieval on three standard datasets, namely ActivityNet, MSR-VTT, and MSVD.
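To make the two training-time components in the abstract concrete — selecting the frame-level pseudo-captions that best match the visual content, and caption-conditioned temporal pooling of frame representations — here is a minimal PyTorch sketch. It is not the authors' implementation: the function names (`select_pseudo_captions`, `caption_conditioned_pooling`), the similarity-based selection rule, and the softmax temperature are illustrative assumptions, and random tensors stand in for CLIP frame and caption embeddings.

```python
import torch
import torch.nn.functional as F

def select_pseudo_captions(frame_feats, caption_feats, k=3):
    """Keep the k frame captions that best match the visual content.

    frame_feats:   (T, D) CLIP-style image embeddings, one per sampled frame.
    caption_feats: (T, D) text embeddings of the caption generated per frame.
    Returns the indices of the k captions with the highest cosine
    similarity to their own frame (an assumed selection criterion).
    """
    frame_feats = F.normalize(frame_feats, dim=-1)
    caption_feats = F.normalize(caption_feats, dim=-1)
    per_frame_sim = (frame_feats * caption_feats).sum(dim=-1)  # (T,)
    return per_frame_sim.topk(k).indices

def caption_conditioned_pooling(frame_feats, caption_feat, temperature=0.07):
    """Temporal pooling: weight frames by their relevance to one caption.

    frame_feats:  (T, D) frame embeddings.
    caption_feat: (D,)   embedding of a single (pseudo-)caption.
    Returns a single (D,) video embedding, pooled with softmax weights
    over frame-caption similarities.
    """
    frame_feats = F.normalize(frame_feats, dim=-1)
    caption_feat = F.normalize(caption_feat, dim=-1)
    scores = frame_feats @ caption_feat / temperature  # (T,)
    weights = scores.softmax(dim=0)                    # frame relevance
    return (weights.unsqueeze(-1) * frame_feats).sum(dim=0)

# Toy usage: random features stand in for CLIP outputs.
T, D = 8, 512
frames = torch.randn(T, D)
captions = torch.randn(T, D)
kept = select_pseudo_captions(frames, captions, k=3)
video_emb = caption_conditioned_pooling(frames, captions[kept[0]])
```

In a full pipeline, `video_emb` and the pseudo-caption embedding would feed a standard contrastive retrieval loss; the softmax pooling lets captions that describe only part of the video attend to the frames they actually match.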


URL

https://arxiv.org/abs/2404.17498

PDF

https://arxiv.org/pdf/2404.17498.pdf

