Current datasets for long-form video understanding often fall short of providing genuine long-form comprehension challenges, as many tasks derived from these datasets can be successfully tackled by analyzing just one or a few random frames from a video. To address this issue, we present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding. This paper details our innovative approach for creating a question-answer dataset, utilizing advanced LLMs with human-in-the-loop and building upon human-generated raw data. Our comprehensive dataset comprises 305,000 multiple-choice questions (MCQs), covering various visual and multimodal aspects, including temporal comprehension, understanding human-object interactions, and reasoning about events or actions within a scene. Additionally, we evaluate recent video-centric LLMs, both open-source and proprietary, on the test split of our dataset. The findings reveal that even state-of-the-art video-centric LLMs significantly lag behind human performance in these tasks, highlighting the complexity and challenge inherent in video understanding. The dataset is available at this https URL
https://arxiv.org/abs/2405.08813
Current architectures for video understanding mainly build upon 3D convolutional blocks or 2D convolutions with additional operations for temporal modeling. However, these methods all regard the temporal axis as a separate dimension of the video sequence, which requires large computation and memory budgets and thus limits their usage on mobile devices. In this paper, we propose to squeeze the time axis of a video sequence into the channel dimension and present a lightweight video recognition network, termed \textit{SqueezeTime}, for mobile video understanding. To enhance the temporal modeling capability of the proposed network, we design a Channel-Time Learning (CTL) Block to capture temporal dynamics of the sequence. This module has two complementary branches: one branch for temporal importance learning, and another branch with temporal position restoring capability to enhance inter-temporal object modeling. The proposed SqueezeTime is highly lightweight and fast while achieving high accuracy for mobile video understanding. Extensive experiments on various video recognition and action detection benchmarks, i.e., Kinetics400, Kinetics600, HMDB51, AVA2.1 and THUMOS14, demonstrate the superiority of our model. For example, our SqueezeTime achieves a $+1.2\%$ accuracy and $+80\%$ GPU throughput gain on Kinetics400 over prior methods. Codes are publicly available at this https URL and this https URL.
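The core trick — folding the temporal axis into the channel dimension so that the rest of the network can stay 2D — can be illustrated with a minimal PyTorch sketch. The module name, kernel size, and shapes below are illustrative assumptions, not the actual SqueezeTime or CTL implementation.

```python
import torch
import torch.nn as nn

class TimeToChannelStem(nn.Module):
    """Minimal sketch: fold the temporal axis of a clip into the channel
    dimension so that subsequent layers can be cheap 2D convolutions."""
    def __init__(self, in_channels: int, num_frames: int, out_channels: int):
        super().__init__()
        # After folding, the 2D conv sees C*T input channels per spatial location.
        self.conv = nn.Conv2d(in_channels * num_frames, out_channels,
                              kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W) video clip
        b, c, t, h, w = x.shape
        x = x.reshape(b, c * t, h, w)   # squeeze time into channels
        return self.conv(x)             # (B, out_channels, H, W)

clip = torch.randn(2, 3, 16, 112, 112)                     # 16-frame RGB clip
stem = TimeToChannelStem(in_channels=3, num_frames=16, out_channels=64)
print(stem(clip).shape)                                     # torch.Size([2, 64, 112, 112])
```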
https://arxiv.org/abs/2405.08344
Despite the significant progress of fully-supervised video captioning, zero-shot methods remain much less explored. In this paper, we propose to take advantage of existing pre-trained large-scale vision and language models to directly generate captions with test time adaptation. Specifically, we bridge video and text using three key models: a general video understanding model XCLIP, a general image understanding model CLIP, and a text generation model GPT-2, due to their source-code availability. The main challenge is how to enable the text generation model to be sufficiently aware of the content in a given video so as to generate corresponding captions. To address this problem, we propose using learnable tokens as a communication medium between frozen GPT-2 and frozen XCLIP as well as frozen CLIP. Differing from the conventional way to train these tokens with training data, we update these tokens with pseudo-targets of the inference data under several carefully crafted loss functions which enable the tokens to absorb video information catered for GPT-2. This procedure can be done in just a few iterations (we use 16 iterations in the experiments) and does not require ground truth data. Extensive experimental results on three widely used datasets, MSR-VTT, MSVD, and VATEX, show 4% to 20% improvements in terms of the main metric CIDEr compared to the existing state-of-the-art methods.
https://arxiv.org/abs/2405.07046
In this paper, we show that transferring knowledge from other domains of video understanding combined with large-scale learning can improve robustness of Video Object Segmentation (VOS) under complex circumstances. Namely, we focus on integrating scene global motion knowledge to improve large-scale semi-supervised Video Object Segmentation. Prior works on VOS mostly rely on direct comparison of semantic and contextual features to perform dense matching between current and past frames, passing over actual motion structure. On the other hand, the Optical Flow Estimation task aims to approximate the scene motion field, exposing global motion patterns which are typically undiscoverable during all-pairs similarity search. We present WarpFormer, an architecture for semi-supervised Video Object Segmentation that exploits existing knowledge in motion understanding to conduct smoother propagation and more accurate matching. Our framework employs a generic pretrained Optical Flow Estimation network whose prediction is used to warp both past frames and instance segmentation masks to the current frame domain. Consequently, the warped segmentation masks are refined and fused together, aiming to inpaint occluded regions and eliminate artifacts caused by flow field imperfections. Additionally, we employ the novel large-scale MOSE 2023 dataset to train the model on various complex scenarios. Our method demonstrates strong performance on DAVIS 2016/2017 validation (93.0% and 85.9%), DAVIS 2017 test-dev (80.6%) and YouTube-VOS 2019 validation (83.8%), which is competitive with alternative state-of-the-art methods while using a much simpler memory mechanism and instance understanding logic.
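The central warping step — using an estimated flow field to bring a past frame or mask into the current frame's coordinate system — is a standard backward-warping operation. The sketch below shows only this step, assuming a dense backward flow is given; WarpFormer's refinement and fusion of the warped masks are not reproduced.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(prev: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp a past frame or segmentation mask into the current frame.

    prev: (B, C, H, W) past frame or mask (probabilities/logits).
    flow: (B, 2, H, W) backward flow in pixels; flow[:, 0] is the x offset and
          flow[:, 1] the y offset from each current-frame pixel to its source
          location in `prev`.
    """
    b, _, h, w = prev.shape
    # Base pixel-coordinate grid of the current frame.
    ys, xs = torch.meshgrid(
        torch.arange(h, device=prev.device, dtype=prev.dtype),
        torch.arange(w, device=prev.device, dtype=prev.dtype),
        indexing="ij")
    grid = torch.stack((xs, ys), dim=0).unsqueeze(0).expand(b, -1, -1, -1)  # (B, 2, H, W)
    src = grid + flow                                   # where to sample from in `prev`
    # Normalize to [-1, 1] as required by grid_sample (x along width, y along height).
    src_x = 2.0 * src[:, 0] / (w - 1) - 1.0
    src_y = 2.0 * src[:, 1] / (h - 1) - 1.0
    sample_grid = torch.stack((src_x, src_y), dim=-1)   # (B, H, W, 2)
    return F.grid_sample(prev, sample_grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```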
https://arxiv.org/abs/2405.07031
Action recognition is a key technology in building interactive metaverses. With the rapid development of deep learning, action recognition methods have also achieved great advancement. Researchers design and implement backbones from multiple standpoints, which leads to a diversity of methods and new challenges. This paper reviews several action recognition methods based on deep neural networks. We introduce these methods in three parts: 1) Two-stream networks and their variants, which, specifically in this paper, use RGB video frames and the optical flow modality as input; 2) 3D convolutional networks, which exploit the RGB modality directly so that extracting separate motion information is no longer necessary; 3) Transformer-based methods, which bring models from natural language processing into computer vision and video understanding. We offer objective insights in this review and hopefully provide a reference for future research.
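As a concrete reference point for the first family, here is a schematic two-stream classifier in PyTorch: one stream sees an RGB frame, the other a stack of optical-flow fields, and class scores are fused late by averaging. It is a simplified illustration of the general design, not any specific model from the survey.

```python
import torch
import torch.nn as nn

class TwoStreamClassifier(nn.Module):
    """Schematic two-stream action classifier with late score fusion."""
    def __init__(self, num_classes: int, flow_stack: int = 10):
        super().__init__()
        def make_stream(in_ch: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Conv2d(in_ch, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, num_classes),
            )
        self.spatial = make_stream(3)                 # single RGB frame
        self.temporal = make_stream(2 * flow_stack)   # stacked x/y flow fields

    def forward(self, rgb: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
        # rgb: (B, 3, H, W), flow: (B, 2*flow_stack, H, W)
        return 0.5 * (self.spatial(rgb) + self.temporal(flow))

logits = TwoStreamClassifier(num_classes=51)(torch.randn(2, 3, 224, 224),
                                             torch.randn(2, 20, 224, 224))
print(logits.shape)   # torch.Size([2, 51])
```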
https://arxiv.org/abs/2405.05584
State Space Model (SSM) is a mathematical model used to describe and analyze the behavior of dynamic systems. This model has witnessed numerous applications in several fields, including control theory, signal processing, economics and machine learning. In the field of deep learning, state space models are used to process sequence data, such as time series analysis, natural language processing (NLP) and video understanding. By mapping sequence data to state space, long-term dependencies in the data can be better captured. In particular, modern SSMs have shown strong representational capabilities in NLP, especially in long sequence modeling, while maintaining linear time complexity. Notably, building on the latest state space models, Mamba merges time-varying parameters into SSMs and formulates a hardware-aware algorithm for efficient training and inference. Given its impressive efficiency and strong long-range dependency modeling capability, Mamba is expected to become a new AI architecture that may outperform the Transformer. Recently, a number of works have attempted to study the potential of Mamba in various fields, such as general vision, multi-modal learning, medical image analysis and remote sensing image analysis, by extending Mamba from the natural language domain to the visual domain. To fully understand Mamba in the visual domain, we conduct a comprehensive survey and present a taxonomy study. This survey focuses on Mamba's application to a variety of visual tasks and data types, and discusses its predecessors, recent advances and far-reaching impact on a wide range of domains. Since Mamba is now on an upward trend, please actively notify us if you have new findings; new progress on Mamba will be included in this survey in a timely manner and updated on the Mamba project at this https URL.
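For reference, the linear state space model underlying this line of work, and the zero-order-hold discretization used by S4/Mamba-style models, can be written as:

$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t)$

$\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\,\Delta B$

$h_k = \bar{A}\,h_{k-1} + \bar{B}\,x_k, \qquad y_k = C\,h_k$

where $A$, $B$, $C$ are the (learned) state matrices and $\Delta$ is the discretization step size; Mamba additionally makes $B$, $C$, and $\Delta$ input-dependent, which is what the abstract above refers to as merging time-varying parameters into the SSM.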
https://arxiv.org/abs/2405.04404
Video Foundation Models (ViFMs) aim to learn a general-purpose representation for various video understanding tasks. Leveraging large-scale datasets and powerful models, ViFMs achieve this by capturing robust and generic features from video data. This survey analyzes over 200 video foundation models, offering a comprehensive overview of benchmarks and evaluation metrics across 14 distinct video tasks categorized into 3 main categories. Additionally, we offer an in-depth performance analysis of these models for the 6 most common video tasks. We categorize ViFMs into three categories: 1) Image-based ViFMs, which adapt existing image models for video tasks, 2) Video-based ViFMs, which utilize video-specific encoding methods, and 3) Universal Foundation Models (UFMs), which combine multiple modalities (image, video, audio, text, etc.) within a single framework. By comparing the performance of various ViFMs on different tasks, this survey offers valuable insights into their strengths and weaknesses, guiding future advancements in video understanding. Our analysis surprisingly reveals that image-based foundation models consistently outperform video-based models on most video understanding tasks. Additionally, UFMs, which leverage diverse modalities, demonstrate superior performance on video tasks. We share the comprehensive list of ViFMs studied in this work at: \url{this https URL}
https://arxiv.org/abs/2405.03770
Recent advancements in Large Language Models (LLMs) have led to the development of Video Large Multi-modal Models (Video-LMMs) that can handle a wide range of video understanding tasks. These models have the potential to be deployed in real-world applications such as robotics, AI assistants, medical imaging, and autonomous vehicles. The widespread adoption of Video-LMMs in our daily lives underscores the importance of ensuring and evaluating their robust performance in mirroring human-like reasoning and interaction capabilities in complex, real-world contexts. However, existing benchmarks for Video-LMMs primarily focus on general video comprehension abilities and neglect assessing their reasoning capabilities over complex videos in the real-world context, and robustness of these models through the lens of user prompts as text queries. In this paper, we present the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES), a novel benchmark that comprehensively assesses the performance of Video-LMMs across 11 diverse real-world video dimensions. We evaluate 9 recent models, including both open-source and closed-source variants, and find that most of the Video-LMMs, especially open-source ones, struggle with robustness and reasoning when dealing with complex videos. Based on our analysis, we develop a training-free Dual-Step Contextual Prompting (DSCP) technique to enhance the performance of existing Video-LMMs. Our findings provide valuable insights for building the next generation of human-centric AI systems with advanced robustness and reasoning capabilities. Our dataset and code are publicly available at: this https URL.
https://arxiv.org/abs/2405.03690
Multimodal information, together with our knowledge, helps us to understand the complex and dynamic world. Large language models (LLMs) and large multimodal models (LMMs), however, still struggle to emulate this capability. In this paper, we present WorldQA, a video understanding dataset designed to push the boundaries of multimodal world models with three appealing properties: (1) Multimodal Inputs: The dataset comprises 1007 question-answer pairs and 303 videos, necessitating the analysis of both auditory and visual data for successful interpretation. (2) World Knowledge: We identify five essential types of world knowledge for question formulation. This approach challenges models to extend their capabilities beyond mere perception. (3) Long-Chain Reasoning: Our dataset requires an average of 4.45 reasoning steps, notably surpassing other videoQA datasets. Furthermore, we introduce WorldRetriever, an agent designed to synthesize expert knowledge into a coherent reasoning chain, thereby facilitating accurate responses to WorldQA queries. Extensive evaluations of 13 prominent LLMs and LMMs reveal that WorldRetriever, although the most effective model, achieved only 70% of human-level performance on multiple-choice questions. This finding highlights the necessity for further advancement in the reasoning and comprehension abilities of models. Our experiments also yield several key insights. For instance, while humans tend to perform better with more frames, current LMMs, including WorldRetriever, show diminished performance under similar conditions. We hope that WorldQA, our methodology, and these insights could contribute to the future development of multimodal world models.
https://arxiv.org/abs/2405.03272
We describe a protocol to study text-to-video retrieval training with unlabeled videos, where we assume (i) no access to labels for any videos, i.e., no access to the set of ground-truth captions, but (ii) access to labeled images in the form of text. Using image expert models is a realistic scenario given that annotating images is cheaper and therefore scalable, in contrast to expensive video labeling schemes. Recently, zero-shot image experts such as CLIP have established a new strong baseline for video understanding tasks. In this paper, we make use of this progress and instantiate the image experts from two types of models: a text-to-image retrieval model to provide an initial backbone, and image captioning models to provide a supervision signal for unlabeled videos. We show that automatically labeling video frames with image captioning allows text-to-video retrieval training. This process adapts the features to the target domain at no manual annotation cost, consequently outperforming the strong zero-shot CLIP baseline. During training, we sample captions from multiple video frames that best match the visual content, and perform a temporal pooling over frame representations by scoring frames according to their relevance to each caption. We conduct extensive ablations to provide insights and demonstrate the effectiveness of this simple framework by outperforming the CLIP zero-shot baselines on text-to-video retrieval on three standard datasets, namely ActivityNet, MSR-VTT, and MSVD.
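The frame-scoring and temporal-pooling step described above can be sketched as relevance-weighted pooling over CLIP frame embeddings; the temperature and exact weighting below are assumptions rather than the paper's reported configuration.

```python
import torch
import torch.nn.functional as F

def caption_conditioned_pooling(frame_feats: torch.Tensor,
                                caption_feats: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Schematic temporal pooling conditioned on caption relevance.

    frame_feats:   (T, D) per-frame CLIP visual embeddings of one video.
    caption_feats: (K, D) CLIP text embeddings of pseudo-captions for that video.
    Returns one pooled video embedding per caption, (K, D): each caption
    attends more strongly to the frames it best describes.
    """
    frame_feats = F.normalize(frame_feats, dim=-1)
    caption_feats = F.normalize(caption_feats, dim=-1)
    scores = caption_feats @ frame_feats.t() / temperature   # (K, T) relevance
    weights = scores.softmax(dim=-1)                         # per-caption frame weights
    return weights @ frame_feats                             # (K, D) pooled features
```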
https://arxiv.org/abs/2404.17498
Recently, integrating video foundation models and large language models to build a video understanding system can overcome the limitations of specific pre-defined vision tasks. Yet, existing methods either employ complex spatial-temporal modules or rely heavily on additional perception models to extract temporal features for video understanding, and they only perform well on short videos. For long videos, the computational complexity and memory costs associated with long-term temporal connections are significantly increased, posing additional challenges. Taking advantage of the Atkinson-Shiffrin memory model, with tokens in Transformers being employed as the carriers of memory in combination with our specially designed memory mechanism, we propose MovieChat to overcome these challenges. We lift pre-trained multi-modal large language models to understanding long videos without incorporating additional trainable temporal modules, employing a zero-shot approach. MovieChat achieves state-of-the-art performance in long video understanding, along with the released MovieChat-1K benchmark with 1K long videos, 2K temporal grounding labels, and 14K manual annotations for validation of the effectiveness of our method. The code along with the dataset can be accessed at this https URL.
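As a rough illustration of keeping a token-based memory within a fixed budget, the sketch below repeatedly merges the most similar pair of temporally adjacent visual tokens; MovieChat's actual short-term/long-term memory design differs in detail, so treat this as an assumption-laden simplification.

```python
import torch
import torch.nn.functional as F

def consolidate_memory(tokens: torch.Tensor, capacity: int) -> torch.Tensor:
    """Keep a time-ordered token memory under a fixed size by merging the
    most redundant (most similar) adjacent pair until the budget is met.

    tokens: (N, D) frame/visual tokens ordered in time.
    """
    while tokens.shape[0] > capacity:
        normed = F.normalize(tokens, dim=-1)
        sims = (normed[:-1] * normed[1:]).sum(dim=-1)   # similarity of adjacent pairs
        i = int(torch.argmax(sims))                      # most redundant pair
        merged = (tokens[i] + tokens[i + 1]) / 2
        tokens = torch.cat([tokens[:i], merged.unsqueeze(0), tokens[i + 2:]], dim=0)
    return tokens

memory = consolidate_memory(torch.randn(512, 768), capacity=64)
print(memory.shape)   # torch.Size([64, 768])
```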
https://arxiv.org/abs/2404.17176
In Video-based Facial Expression Recognition (V-FER), models are typically trained on closed-set datasets with a fixed number of known classes. However, these V-FER models cannot deal with unknown classes that are prevalent in real-world scenarios. In this paper, we introduce a challenging Open-set Video-based Facial Expression Recognition (OV-FER) task, aiming at identifying not only known classes but also new, unknown human facial expressions not encountered during training. While existing approaches address open-set recognition by leveraging large-scale vision-language models like CLIP to identify unseen classes, we argue that these methods may not adequately capture the nuanced and subtle human expression patterns required by the OV-FER task. To address this limitation, we propose a novel Human Expression-Sensitive Prompting (HESP) mechanism to significantly enhance CLIP's ability to model video-based facial expression details effectively, thereby presenting a new CLIP-based OV-FER approach. Our proposed HESP comprises three components: 1) a textual prompting module with learnable prompt representations to complement the original CLIP textual prompts and enhance the textual representations of both known and unknown emotions, 2) a visual prompting module that encodes temporal emotional information from video frames using expression-sensitive attention, equipping CLIP with a new visual modeling ability to extract emotion-rich information, 3) a delicately designed open-set multi-task learning scheme that facilitates prompt learning and encourages interactions between the textual and visual prompting modules. Extensive experiments conducted on four OV-FER task settings demonstrate that HESP can significantly boost CLIP's performance (a relative improvement of 17.93% on AUROC and 106.18% on OSCR) and outperform other state-of-the-art open-set video understanding methods by a large margin.
https://arxiv.org/abs/2404.17100
Vision-language pre-training has significantly elevated performance across a wide range of image-language applications. Yet, the pre-training process for video-related tasks demands exceptionally large computational and data resources, which hinders the progress of video-language models. This paper investigates a straightforward, highly efficient, and resource-light approach to adapting an existing image-language pre-trained model for dense video understanding. Our preliminary experiments reveal that directly fine-tuning pre-trained image-language models with multiple frames as inputs on video datasets leads to performance saturation or even a drop. Our further investigation reveals that it is largely attributed to the bias of learned high-norm visual features. Motivated by this finding, we propose a simple but effective pooling strategy to smooth the feature distribution along the temporal dimension and thus reduce the dominant impacts from the extreme features. The new model is termed Pooling LLaVA, or PLLaVA in short. PLLaVA achieves new state-of-the-art performance on modern benchmark datasets for both video question-answering and captioning tasks. Notably, on the recent popular Video ChatGPT benchmark, PLLaVA achieves a score of 3.48 out of 5, averaged over five evaluated dimensions, exceeding the previous SOTA results from GPT4V (IG-VLM) by 9\%. On the latest multi-choice benchmark MVBench, PLLaVA achieves 58.1\% accuracy on average across 20 sub-tasks, 14.5\% higher than GPT4V (IG-VLM). Code is available at \url{this https URL}.
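The pooling idea itself is nearly parameter-free and easy to sketch: average-pool the per-frame patch features over space and time before handing them to the language model, so that a few extreme high-norm frame features no longer dominate. The pooled output size below is an arbitrary example, not PLLaVA's reported setting.

```python
import torch
import torch.nn.functional as F

def temporal_spatial_pool(feats: torch.Tensor, out_t: int, out_hw: int) -> torch.Tensor:
    """Smooth per-frame visual features by adaptive average pooling.

    feats: (B, T, H, W, D) patch features from an image encoder, per frame.
    Returns: (B, out_t * out_hw * out_hw, D) pooled visual tokens.
    """
    b, t, h, w, d = feats.shape
    x = feats.permute(0, 4, 1, 2, 3)                          # (B, D, T, H, W)
    x = F.adaptive_avg_pool3d(x, (out_t, out_hw, out_hw))     # pool over time and space
    return x.permute(0, 2, 3, 4, 1).reshape(b, out_t * out_hw * out_hw, d)

feats = torch.randn(1, 16, 24, 24, 1024)
print(temporal_spatial_pool(feats, out_t=16, out_hw=12).shape)  # torch.Size([1, 2304, 1024])
```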
https://arxiv.org/abs/2404.16994
Spatiotemporal action localization in chaotic scenes is a challenging task on the path toward advanced video understanding. High-quality video feature extraction and improved precision of detector-predicted anchors can effectively improve model performance. To this end, we propose SFMViT, a high-performance dual-stream spatiotemporal feature extraction network with an anchor pruning strategy. The backbone of our SFMViT is composed of ViT and SlowFast with prior knowledge of spatiotemporal action localization, which fully utilizes ViT's excellent global feature extraction capabilities and SlowFast's spatiotemporal sequence modeling capabilities. Secondly, we introduce a confidence maximum heap to prune the anchors detected in each frame and filter out the effective anchors. These designs enable our SFMViT to achieve a mAP of 26.62% on the Chaotic World dataset, far exceeding existing models. Code is available at this https URL.
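Confidence-based anchor pruning with a heap amounts to a standard per-frame top-k selection; the paper's exact "confidence maximum heap" criterion may include additional rules beyond this generic sketch.

```python
import heapq
from typing import List, Tuple

Anchor = Tuple[float, List[float]]   # (confidence, [x1, y1, x2, y2])

def prune_anchors(anchors: List[Anchor], k: int) -> List[Anchor]:
    """Keep only the k most confident anchors of a frame using a size-k heap."""
    heap: List[Anchor] = []
    for conf, box in anchors:
        if len(heap) < k:
            heapq.heappush(heap, (conf, box))
        elif conf > heap[0][0]:               # better than the weakest kept anchor
            heapq.heapreplace(heap, (conf, box))
    return sorted(heap, reverse=True)         # most confident first

frame_anchors = [(0.91, [10, 12, 50, 80]), (0.33, [5, 5, 20, 20]),
                 (0.76, [40, 42, 90, 120]), (0.12, [0, 0, 8, 8])]
print(prune_anchors(frame_anchors, k=2))
```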
https://arxiv.org/abs/2404.16609
Video anomaly detection (VAD) is a challenging task aiming to recognize anomalies in video frames, and existing large-scale VAD research primarily focuses on road traffic and human activity scenes. In industrial scenes, there are often a variety of unpredictable anomalies, and the VAD method can play a significant role in these scenarios. However, there is a lack of applicable datasets and methods specifically tailored for industrial production scenarios due to concerns regarding privacy and security. To bridge this gap, we propose a new dataset, IPAD, specifically designed for VAD in industrial scenarios. The industrial processes in our dataset are chosen through on-site factory research and discussions with engineers. This dataset covers 16 different industrial devices and contains over 6 hours of both synthetic and real-world video footage. Moreover, we annotate the key feature of the industrial process, i.e., periodicity. Based on the proposed dataset, we introduce a period memory module and a sliding window inspection mechanism to effectively exploit the periodic information in a basic reconstruction model. Our framework leverages a LoRA adapter to explore the effective migration of pretrained models, which are initially trained using synthetic data, into real-world scenarios. Our proposed dataset and method will fill the gap in the field of industrial video anomaly detection and advance video understanding tasks as well as smart factory deployment.
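A minimal sketch of a sliding-window inspection over per-frame reconstruction errors is given below; the window length is set to one process period and the thresholding rule is an illustrative assumption, not IPAD's period memory module or scoring function.

```python
import numpy as np

def sliding_window_flags(frame_errors: np.ndarray, period: int) -> np.ndarray:
    """Flag anomalous windows from per-frame reconstruction errors.

    frame_errors: (N,) reconstruction error of each frame.
    period:       window length, roughly one cycle of the industrial process.
    Returns a boolean array over windows (one window per starting frame).
    """
    windows = np.lib.stride_tricks.sliding_window_view(frame_errors, period)
    window_scores = windows.mean(axis=1)                    # one score per window
    threshold = np.median(window_scores) + 3 * window_scores.std()
    return window_scores > threshold

errors = np.abs(np.random.randn(300)) * 0.1
errors[120:150] += 1.0                                      # injected anomaly
print(np.where(sliding_window_flags(errors, period=30))[0])
```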
https://arxiv.org/abs/2404.15033
In this paper, we investigate a new problem called narrative action evaluation (NAE). NAE aims to generate professional commentary that evaluates the execution of an action. Unlike traditional tasks such as score-based action quality assessment and video captioning involving superficial sentences, NAE focuses on creating detailed narratives in natural language. These narratives provide intricate descriptions of actions along with objective evaluations. NAE is a more challenging task because it requires both narrative flexibility and evaluation rigor. One existing possible solution is to use multi-task learning, where narrative language and evaluative information are predicted separately. However, this approach results in reduced performance for individual tasks because of variations between tasks and differences in modality between language information and evaluation information. To address this, we propose a prompt-guided multimodal interaction framework. This framework utilizes a pair of transformers to facilitate the interaction between different modalities of information. It also uses prompts to transform the score regression task into a video-text matching task, thus enabling task interactivity. To support further research in this field, we re-annotate the MTL-AQA and FineGym datasets with high-quality and comprehensive action narration. Additionally, we establish benchmarks for NAE. Extensive experiment results prove that our method outperforms separate learning methods and naive multi-task learning methods. Data and code are released at \href{this https URL }{here}.
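The prompt-based reformulation of score regression as video-text matching can be sketched as follows: verbalize candidate score bins as prompts, score them against the video embedding, and read the prediction off as a similarity-weighted average. The prompt wording, bins, temperature, and encoders here are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

def score_from_matching(video_feat: torch.Tensor,
                        bin_text_feats: torch.Tensor,
                        bin_values: torch.Tensor,
                        temperature: float = 0.05) -> torch.Tensor:
    """Predict a quality score by matching the video against score-bin prompts.

    video_feat:     (D,)   pooled video embedding.
    bin_text_feats: (M, D) text embeddings of prompts like
                    "the dive is awarded a score of about 7".
    bin_values:     (M,)   numeric score represented by each bin.
    """
    sims = F.normalize(bin_text_feats, dim=-1) @ F.normalize(video_feat, dim=-1)
    weights = (sims / temperature).softmax(dim=-1)   # soft assignment over bins
    return (weights * bin_values).sum()              # expected score
```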
https://arxiv.org/abs/2404.14471
Automatic movie narration aims to create video-aligned plot descriptions to assist visually impaired audiences. It differs from standard video captioning in that it requires not only describing key visual details but also inferring the plots developed across multiple movie shots, thus posing unique and ongoing challenges. To advance the development of automatic movie narrating systems, we first revisit the limitations of existing datasets and develop a large-scale, bilingual movie narration dataset, Movie101v2. Second, taking into account the essential difficulties in achieving applicable movie narration, we break the long-term goal into three progressive stages and tentatively focus on the initial stages, which feature understanding within individual clips. We also introduce a new narration assessment to align with our staged task goals. Third, using our new dataset, we establish baselines with several leading large vision-language models, including GPT-4V, and conduct in-depth investigations into the challenges current models face for movie narration generation. Our findings reveal that achieving applicable movie narration generation is a fascinating goal that requires thorough research.
https://arxiv.org/abs/2404.13370
Multimodal Large Language Models (MLLMs) have demonstrated profound capabilities in understanding multimodal information, ranging from Image LLMs to the more complex Video LLMs. Numerous studies have illustrated their exceptional cross-modal comprehension. Recently, integrating video foundation models with large language models to build a comprehensive video understanding system has been proposed to overcome the limitations of specific pre-defined vision tasks. However, the current advancements in Video LLMs tend to overlook the foundational contributions of Image LLMs, often opting for more complicated structures and a wide variety of multimodal data for pre-training. This approach significantly increases the costs associated with these models. In response to these challenges, this work introduces an efficient method that strategically leverages the priors of Image LLMs, facilitating a resource-efficient transition from Image to Video LLMs. We propose RED-VILLM, a Resource-Efficient Development pipeline for Video LLMs from Image LLMs, which utilizes a temporal adaptation plug-and-play structure within the image fusion module of Image LLMs. This adaptation extends their understanding capabilities to include temporal information, enabling the development of Video LLMs that not only surpass baseline performances but also do so with minimal instructional data and training resources. Our approach highlights the potential for a more cost-effective and scalable advancement in multimodal models, effectively building upon the foundational work of Image LLMs.
https://arxiv.org/abs/2404.11865
Pretrained vision-language models have shown effectiveness in video understanding. However, recent studies have not sufficiently leveraged essential temporal information from videos, simply averaging frame-wise representations or referencing consecutive frames. We introduce Temporally Contextualized CLIP (TC-CLIP), a pioneering framework for video understanding that effectively and efficiently leverages comprehensive video information. We propose Temporal Contextualization (TC), a novel layer-wise temporal information infusion mechanism for video that extracts core information from each frame, interconnects relevant information across the video to summarize into context tokens, and ultimately leverages the context tokens during the feature encoding process. Furthermore, our Video-conditional Prompting (VP) module manufactures context tokens to generate informative prompts in the text modality. We conduct extensive experiments in zero-shot, few-shot, base-to-novel, and fully-supervised action recognition to validate the superiority of our TC-CLIP. Ablation studies for TC and VP confirm our design choices. Code is available at this https URL.
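A much-simplified sketch of summarizing a video into context tokens is shown below: keep the most salient patch tokens of each frame (ranked by a given saliency score, e.g. attention to the [CLS] token) and concatenate them across time so that later encoding can attend to video-wide context. The selection rule and aggregation are placeholders; TC-CLIP's actual Temporal Contextualization mechanism is more involved.

```python
import torch

def build_context_tokens(frame_tokens: torch.Tensor,
                         saliency: torch.Tensor,
                         tokens_per_frame: int = 4) -> torch.Tensor:
    """Summarize per-frame patch tokens into a small set of context tokens.

    frame_tokens: (T, N, D) patch tokens per frame.
    saliency:     (T, N)    per-token importance scores (assumed given).
    Returns:      (T * tokens_per_frame, D) context tokens.
    """
    t, n, d = frame_tokens.shape
    top_idx = saliency.topk(tokens_per_frame, dim=1).indices           # (T, k)
    gathered = torch.gather(frame_tokens, 1,
                            top_idx.unsqueeze(-1).expand(-1, -1, d))   # (T, k, D)
    return gathered.reshape(t * tokens_per_frame, d)

ctx = build_context_tokens(torch.randn(8, 196, 512), torch.rand(8, 196))
print(ctx.shape)   # torch.Size([32, 512])
```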
https://arxiv.org/abs/2404.09490
The eighth AI City Challenge highlighted the convergence of computer vision and artificial intelligence in areas like retail, warehouse settings, and Intelligent Traffic Systems (ITS), presenting significant research opportunities. The 2024 edition featured five tracks, attracting unprecedented interest from 726 teams in 47 countries and regions. Track 1 dealt with multi-target multi-camera (MTMC) people tracking, highlighting significant enhancements in camera count, character number, 3D annotation, and camera matrices, alongside new rules for 3D tracking and online tracking algorithm encouragement. Track 2 introduced dense video captioning for traffic safety, focusing on pedestrian accidents using multi-camera feeds to improve insights for insurance and prevention. Track 3 required teams to classify driver actions in a naturalistic driving analysis. Track 4 explored fish-eye camera analytics using the FishEye8K dataset. Track 5 focused on motorcycle helmet rule violation detection. The challenge utilized two leaderboards to showcase methods, with participants setting new benchmarks, some surpassing existing state-of-the-art achievements.
https://arxiv.org/abs/2404.09432