Addressing multi-label action recognition in videos represents a significant challenge for robotic applications in dynamic environments, especially when the robot is required to cooperate with humans in tasks that involve objects. Existing methods still struggle to recognize unseen actions or require extensive training data. To overcome these problems, we propose Dual-VCLIP, a unified approach for zero-shot multi-label action recognition. Dual-VCLIP enhances VCLIP, a zero-shot action recognition method, with the DualCoOp method for multi-label image classification. The strength of our method is that it learns only two prompts at training time, making it much simpler than other methods. We validate our method on the Charades dataset, in which the majority of actions are object-based, demonstrating that, despite its simplicity, our method performs favorably with respect to existing methods on the complete dataset and achieves promising performance when tested on unseen actions. Our contribution emphasizes the impact of verb-object class splits when training robots for new cooperative tasks, highlighting their influence on performance and offering insights into mitigating biases.
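To make the two-prompt idea concrete, below is a minimal sketch of a DualCoOp-style head in PyTorch: only a positive and a negative prompt context are trainable, while the video backbone and text encoder stay frozen. The toy linear text encoder, the dimensions, and the class-token buffer are illustrative placeholders, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualPromptHead(nn.Module):
    def __init__(self, embed_dim=512, ctx_len=16, num_classes=157):
        super().__init__()
        # The only trainable parameters: a positive and a negative context.
        self.pos_ctx = nn.Parameter(torch.randn(ctx_len, embed_dim) * 0.02)
        self.neg_ctx = nn.Parameter(torch.randn(ctx_len, embed_dim) * 0.02)
        # Placeholder for a frozen CLIP-style text encoder.
        self.text_encoder = nn.Linear(embed_dim, embed_dim, bias=False)
        for p in self.text_encoder.parameters():
            p.requires_grad = False
        # Frozen class-name token embeddings (precomputed in practice).
        self.register_buffer("cls_tokens", torch.randn(num_classes, embed_dim))

    def forward(self, video_emb):  # video_emb: (B, D) from a frozen VCLIP backbone
        def encode(ctx):
            # Add the pooled shared context to each class embedding, then encode.
            # (Real DualCoOp concatenates context tokens; this is a simplification.)
            tokens = ctx.mean(dim=0, keepdim=True) + self.cls_tokens   # (C, D)
            return F.normalize(self.text_encoder(tokens), dim=-1)
        pos_t, neg_t = encode(self.pos_ctx), encode(self.neg_ctx)      # (C, D) each
        v = F.normalize(video_emb, dim=-1)                             # (B, D)
        pos_logit, neg_logit = v @ pos_t.T, v @ neg_t.T                # (B, C)
        # Per-class softmax over (negative, positive): probability the class is present.
        return torch.softmax(torch.stack([neg_logit, pos_logit], -1), -1)[..., 1]
```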
https://arxiv.org/abs/2405.08695
We focus on the problem of recognising the end state of an action in an image, which is critical for understanding what action is performed and in which manner. We study this problem by focusing on the task of predicting the coarseness of a cut, i.e., deciding whether an object was cut "coarsely" or "finely". No dataset with these annotated end states is available, so we propose an augmentation method to synthesise training data. We apply this method to cutting actions extracted from an existing action recognition dataset. Our method is object agnostic, i.e., it presupposes the location of the object but not its identity. Starting from fewer than a hundred images of a whole object, we can generate several thousand images simulating visually diverse cuts of different coarseness. We use our synthetic data to train a model based on UNet and test it on real images showing coarsely/finely cut objects. Results demonstrate that the model successfully recognises the end state of the cutting action despite the domain gap between training and testing, and that the model generalises well to unseen objects.
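A rough sketch of this kind of object-agnostic cut synthesis might look as follows: slice the object's bounding-box region into pieces separated by small gaps, with the slice count controlling coarse vs. fine. The parameters, the jitter, and the corner-patch background fill are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def synthesize_cut(image, box, coarse=True, rng=None):
    """image: (H, W, 3) uint8; box: (x0, y0, x1, y1) around the object."""
    rng = rng or np.random.default_rng()
    x0, y0, x1, y1 = box
    out = image.copy()
    # Coarse cuts: few wide slices; fine cuts: many narrow slices.
    n_slices = int(rng.integers(3, 6)) if coarse else int(rng.integers(10, 20))
    gap = max(2, (x1 - x0) // (n_slices * 8))     # pixel gap between pieces
    # Background colour estimated from a corner patch (a simplistic stand-in
    # for proper inpainting of the revealed background).
    bg = image[:10, :10].reshape(-1, 3).mean(axis=0).astype(np.uint8)
    edges = np.linspace(x0, x1, n_slices + 1).astype(int)[1:-1]
    for e in edges:
        jitter = int(rng.integers(-gap, gap + 1))  # irregular, hand-cut look
        c = int(np.clip(e + jitter, x0 + gap, x1 - gap))
        out[y0:y1, c - gap // 2:c + gap // 2 + 1] = bg
    return out
```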
https://arxiv.org/abs/2405.07723
Recognizing human actions from point cloud sequences has attracted tremendous attention from both academia and industry due to its wide applications. However, most previous studies on point cloud action recognition typically require complex networks to extract intra-frame spatial features and inter-frame temporal features, resulting in an excessive number of redundant computations. This leads to high latency, rendering them impractical for real-world applications. To address this problem, we propose a Plane-Fit Redundancy Encoding point cloud sequence network named PRENet. The primary concept of our approach is to use plane fitting to mitigate spatial redundancy within the sequence, while encoding the temporal redundancy of the entire sequence to minimize redundant computations. Specifically, our network comprises two principal modules: a Plane-Fit Embedding module and a Spatio-Temporal Consistency Encoding module. The Plane-Fit Embedding module capitalizes on the observation that successive point cloud frames exhibit unique geometric features in physical space, allowing spatially encoded data to be reused for temporal stream encoding. The Spatio-Temporal Consistency Encoding module amalgamates the temporal structure of the temporally redundant part with its corresponding spatial arrangement, thereby enhancing recognition accuracy. We conducted extensive experiments to verify the effectiveness of our network. The experimental results demonstrate that our method achieves almost identical recognition accuracy while being nearly four times faster than other state-of-the-art methods.
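As a concrete illustration of the plane-fitting step, the sketch below fits a least-squares plane to a point set via SVD and flags near-planar points as spatially redundant; the tolerance and the downstream reuse policy are assumptions, not PRENet's exact design.

```python
import numpy as np

def fit_plane(points):
    """points: (N, 3). Returns unit normal n and offset d with n.p + d ~ 0."""
    centroid = points.mean(axis=0)
    # SVD of centred points: the normal is the right singular vector with the
    # smallest singular value (least-squares plane fit).
    _, _, vt = np.linalg.svd(points - centroid)
    normal = vt[-1]
    return normal, -normal @ centroid

def redundancy_mask(points, tol=0.01):
    normal, d = fit_plane(points)
    residual = np.abs(points @ normal + d)   # point-to-plane distance
    return residual < tol                    # True = near-planar, i.e. redundant
```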
https://arxiv.org/abs/2405.06929
Action recognition is a key technology in building interactive metaverses. With the rapid development of deep learning, action recognition methods have also achieved great advancement. Researchers design and implement backbones from multiple standpoints, which leads to a diversity of methods and to new challenges. This paper reviews several action recognition methods based on deep neural networks. We introduce these methods in three parts: 1) Two-stream networks and their variants, which, specifically in this paper, use RGB video frames and the optical flow modality as input; 2) 3D convolutional networks, which exploit the RGB modality directly so that extracting separate motion information is no longer necessary; 3) Transformer-based methods, which introduce models from natural language processing into computer vision and video understanding. We offer objective insights in this review and hope to provide a reference for future research.
https://arxiv.org/abs/2405.05584
To build a cross-modal latent space between 3D human motion and language, acquiring large-scale and high-quality human motion data is crucial. However, unlike the abundance of image data, the scarcity of motion data has limited the performance of existing motion-language models. To counter this, we introduce "motion patches", a new representation of motion sequences, and propose using Vision Transformers (ViT) as motion encoders via transfer learning, aiming to extract useful knowledge from the image domain and apply it to the motion domain. These motion patches, created by dividing and sorting skeleton joints based on body parts in motion sequences, are robust to varying skeleton structures, and can be regarded as color image patches in ViT. We find that transfer learning with pre-trained weights of ViT obtained through training with 2D image data can boost the performance of motion analysis, presenting a promising direction for addressing the issue of limited motion data. Our extensive experiments show that the proposed motion patches, used jointly with ViT, achieve state-of-the-art performance in the benchmarks of text-to-motion retrieval, and other novel challenging tasks, such as cross-skeleton recognition, zero-shot motion classification, and human interaction recognition, which are currently impeded by the lack of data.
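A sketch of how such motion patches might be constructed: joints are grouped by body part, each group's (x, y, z) trajectories are resampled onto a square grid, and the result is treated as a 3-channel image patch for a ViT. The NTU-style part grouping and nearest-neighbour resampling below are assumed details, not the paper's exact recipe.

```python
import numpy as np

BODY_PARTS = {                      # joint indices per part (illustrative)
    "torso": [0, 1, 2, 3],
    "left_arm": [4, 5, 6, 7],
    "right_arm": [8, 9, 10, 11],
    "left_leg": [12, 13, 14, 15],
    "right_leg": [16, 17, 18, 19],
}

def motion_patches(seq, patch_size=16):
    """seq: (T, J, 3) joint coordinates -> (num_parts, 3, patch, patch) array."""
    patches = []
    for joints in BODY_PARTS.values():
        part = seq[:, joints, :]                       # (T, j, 3)
        # Resize the (T, j) grid to (patch, patch) by nearest-neighbour indexing.
        t_idx = np.linspace(0, part.shape[0] - 1, patch_size).astype(int)
        j_idx = np.linspace(0, part.shape[1] - 1, patch_size).astype(int)
        grid = part[np.ix_(t_idx, j_idx)]              # (patch, patch, 3)
        # Normalise to [0, 1] like pixel intensities.
        g_min, g_max = grid.min(), grid.max()
        grid = (grid - g_min) / (g_max - g_min + 1e-8)
        patches.append(grid.transpose(2, 0, 1))        # channels first
    return np.stack(patches)
```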
https://arxiv.org/abs/2405.04771
Due to the ever-increasing availability of video surveillance cameras and the growing need for crime prevention, the violence detection task is attracting greater attention from the research community. With respect to other action recognition tasks, violence detection in surveillance videos presents additional issues, such as the presence of a significant variety of real fight scenes. Unfortunately, available datasets seem very small compared with other action recognition datasets. Moreover, in surveillance applications, the people in the scenes differ for each video and the background of the footage differs for each camera. Also, violent actions in real-life surveillance videos must be detected quickly to prevent unwanted consequences, so models benefit substantially from reduced memory usage and computational costs. These problems make classical action recognition methods difficult to adopt. To tackle all these issues, we introduce JOSENet, a novel self-supervised framework that provides outstanding performance for violence detection in surveillance videos. The proposed model receives two spatiotemporal video streams, i.e., RGB frames and optical flows, and involves a new regularized self-supervised learning approach for videos. JOSENet provides improved performance compared to self-supervised state-of-the-art methods, while requiring one-fourth of the number of frames per video segment and a reduced frame rate. The source code and the instructions to reproduce our experiments are available at this https URL.
https://arxiv.org/abs/2405.02961
Recent few-shot action recognition (FSAR) methods achieve promising performance by performing semantic matching on learned discriminative features. However, most FSAR methods focus on single-scale (e.g., frame-level, segment-level, etc.) feature alignment, which ignores the fact that human actions with the same semantics may appear at different velocities. To this end, we develop a novel Multi-Velocity Progressive-alignment (MVP-Shot) framework to progressively learn and align semantic-related action features at multi-velocity levels. Concretely, a Multi-Velocity Feature Alignment (MVFA) module is designed to measure the similarity between features from support and query videos with different velocity scales and then merge all similarity scores in a residual fashion. To avoid the multiple velocity features deviating from the underlying motion semantics, our proposed Progressive Semantic-Tailored Interaction (PSTI) module injects velocity-tailored text information into the video feature via feature interaction on channel and temporal domains at different velocities. The above two modules complement each other to predict query categories more accurately under few-shot settings. Experimental results show our method outperforms current state-of-the-art methods on multiple standard few-shot benchmarks (i.e., HMDB51, UCF101, Kinetics, and SSv2-small).
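A simplified sketch of the multi-velocity alignment idea: compare support and query clips at several temporal strides and merge the per-velocity cosine similarities. The plain average below stands in for the residual fusion and text-guided interaction of the full method; shapes and strides are assumptions.

```python
import torch
import torch.nn.functional as F

def velocity_similarity(support, query, strides=(1, 2, 4)):
    """support, query: (T, D) frame features. Returns a fused scalar score."""
    score = torch.tensor(0.0)
    for s in strides:
        sup = F.normalize(support[::s].mean(dim=0), dim=0)  # pooled at stride s
        qry = F.normalize(query[::s].mean(dim=0), dim=0)
        # Accumulate per-velocity cosine similarity (stand-in for residual merge).
        score = score + torch.dot(sup, qry)
    return score / len(strides)
```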
https://arxiv.org/abs/2405.02077
Action recognition has become one of the popular research topics in computer vision. Various methods based on convolutional networks and self-attention mechanisms such as Transformers address the spatial and temporal dimensions of action recognition tasks and achieve competitive performance. However, these methods lack a guarantee of the correctness of the action subject that the models attend to, i.e., how to ensure an action recognition model focuses on the proper action subject to make a reasonable action prediction. In this paper, we propose a multi-view attention consistency method that computes the similarity between two attentions from two different views of the action videos using Directed Gromov-Wasserstein Discrepancy. Furthermore, our approach applies the idea of Neural Radiance Fields to implicitly render features from novel views when training on single-view datasets. The contributions of this work are thus threefold. First, we introduce multi-view attention consistency to address the problem of reasonable prediction in action recognition. Second, we define a new metric for multi-view consistent attention using Directed Gromov-Wasserstein Discrepancy. Third, we build an action recognition model based on Video Transformers and Neural Radiance Fields. Compared to recent action recognition methods, the proposed approach achieves state-of-the-art results on three large-scale datasets, i.e., Jester, Something-Something V2, and Kinetics-400.
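As a heavily simplified illustration of comparing two attention maps with a Gromov-Wasserstein-style discrepancy: a real (directed) GW solver optimizes over couplings between the two views, whereas the sketch below fixes the coupling to the identity, assuming tokens are aligned across views. It is an illustrative special case only, not the paper's Directed Gromov-Wasserstein Discrepancy.

```python
import torch

def gw_discrepancy_identity(attn_a, attn_b):
    """attn_a, attn_b: (N,) attention weights over N tokens, each summing to 1."""
    # Intra-view "structure" matrices: pairwise differences of attention mass.
    c_a = (attn_a[:, None] - attn_a[None, :]).abs()
    c_b = (attn_b[:, None] - attn_b[None, :]).abs()
    # With an identity coupling weighted by the attention marginals, the GW
    # objective reduces to a weighted squared difference of the two structures.
    w = attn_a[:, None] * attn_b[None, :]
    return ((c_a - c_b) ** 2 * w).sum()
```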
https://arxiv.org/abs/2405.01337
Human action video recognition has recently attracted more attention in applications such as video security and sports posture correction. Popular solutions, including graph convolutional networks (GCNs) that model the human skeleton as a spatiotemporal graph, have proven very effective. GCN-based methods with stacked blocks usually utilize top-layer semantics for classification/annotation purposes. Although the global features learned through this procedure are suitable for general classification, they have difficulty capturing fine-grained action changes across adjacent frames, which are decisive factors in sports actions. In this paper, we propose a novel "Cross-block Fine-grained Semantic Cascade (CFSC)" module to overcome this challenge. In summary, the proposed CFSC progressively integrates shallow visual knowledge into high-level blocks to allow networks to focus on action details. In particular, the CFSC module utilizes the GCN feature maps produced at different levels, as well as aggregated features from preceding levels, to consolidate fine-grained features. In addition, a dedicated temporal convolution is applied at each level to learn short-term temporal features, which are carried over from shallow to deep layers to maximize the use of low-level details. This cross-block feature aggregation methodology, capable of mitigating the loss of fine-grained information, results in improved performance. Last, FD-7, a new action recognition dataset for fencing sports, was collected and will be made publicly available. Experimental results and empirical analysis on the public benchmark (FSD-10) and our self-collected dataset (FD-7) demonstrate the advantage of our CFSC module over alternatives in learning discriminative patterns for action classification.
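A sketch of what such a cross-block cascade could look like: shallow-block features pass through a small depthwise temporal convolution, then get projected and added into the deeper block's input, so frame-adjacent detail survives into high-level layers. The channel sizes and the 1x1 projection are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossBlockCascade(nn.Module):
    def __init__(self, shallow_ch, deep_ch, kernel=3):
        super().__init__()
        # Depthwise temporal convolution: short-term motion cues per channel.
        self.temporal = nn.Conv1d(shallow_ch, shallow_ch, kernel,
                                  padding=kernel // 2, groups=shallow_ch)
        self.project = nn.Conv1d(shallow_ch, deep_ch, kernel_size=1)

    def forward(self, shallow, deep):
        """shallow: (B, C_s, T), deep: (B, C_d, T), pooled over joints."""
        detail = self.project(self.temporal(shallow))  # fine-grained temporal detail
        return deep + detail                           # cascade into the deep block
```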
https://arxiv.org/abs/2404.19383
Skeleton-based action recognition is vital for comprehending human-centric videos and has applications in diverse domains. One of the challenges of skeleton-based action recognition is dealing with low-quality data, such as skeletons that have missing or inaccurate joints. This paper addresses the issue of enhancing action recognition using low-quality skeletons through a general knowledge distillation framework. The proposed framework employs a teacher-student model setup, where a teacher model trained on high-quality skeletons guides the learning of a student model that handles low-quality skeletons. To bridge the gap between heterogeneous high-quality and low-quality skeletons, we present a novel part-based skeleton matching strategy, which exploits shared body parts to facilitate local action pattern learning. An action-specific part matrix is developed to emphasize critical parts for different actions, enabling the student model to distill discriminative part-level knowledge. A novel part-level multi-sample contrastive loss achieves knowledge transfer from multiple high-quality skeletons to low-quality ones, which enables the proposed knowledge distillation framework to include training low-quality skeletons that lack corresponding high-quality matches. Comprehensive experiments conducted on the NTU-RGB+D, Penn Action, and SYSU 3D HOI datasets demonstrate the effectiveness of the proposed knowledge distillation framework.
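A sketch of a part-level multi-sample contrastive loss in this spirit: a student part embedding from a low-quality skeleton is pulled towards several high-quality teacher embeddings of the same body part and pushed away from the other parts. The temperature and tensor layout are illustrative choices, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def part_contrastive_loss(student, teachers, part_id, tau=0.07):
    """student: (D,) one part embedding; teachers: (P, K, D), K samples per part."""
    s = F.normalize(student, dim=0)
    t = F.normalize(teachers, dim=-1)                    # (P, K, D)
    logits = (t @ s) / tau                               # (P, K) similarities
    # Multi-positive InfoNCE: all K teacher samples of the matching part are
    # positives; every other part's samples act as negatives.
    log_prob = logits - torch.logsumexp(logits.flatten(), dim=0)
    return -log_prob[part_id].mean()
```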
https://arxiv.org/abs/2404.18206
Semi-supervised action recognition aims to improve spatio-temporal reasoning ability with a small amount of labeled data in conjunction with a large amount of unlabeled data. Despite recent advancements, existing powerful methods are still prone to making ambiguous predictions under scarce labeled data, manifested as a limited ability to distinguish different actions with similar spatio-temporal information. In this paper, we approach this problem by endowing the model with two capabilities, namely discriminative spatial modeling and temporal structure modeling, for learning discriminative spatio-temporal representations. Specifically, we propose an Adaptive Contrastive Learning (ACL) strategy. It assesses the confidence of all unlabeled samples via the class prototypes of the labeled data, and adaptively selects positive-negative samples from a pseudo-labeled sample bank to construct contrastive learning. Additionally, we introduce a Multi-scale Temporal Learning (MTL) strategy. It highlights informative semantics from long-term clips and integrates them into the short-term clip while suppressing noisy information. Both techniques are then integrated into a unified framework to encourage the model to make accurate predictions. Extensive experiments on UCF101, HMDB51 and Kinetics400 show the superiority of our method over prior state-of-the-art approaches.
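A sketch of the prototype-based confidence scoring behind ACL: prototypes are the mean labeled features per class, and an unlabeled sample's confidence is its peak softmax similarity to those prototypes, which can then drive positive/negative selection. The temperature is an assumed hyperparameter, and every class is assumed to have at least one labeled sample.

```python
import torch
import torch.nn.functional as F

def class_prototypes(feats, labels, num_classes):
    """feats: (N, D) labeled features, labels: (N,) -> (C, D) unit prototypes."""
    protos = torch.zeros(num_classes, feats.size(1))
    for c in range(num_classes):
        protos[c] = feats[labels == c].mean(dim=0)  # assumes class c is non-empty
    return F.normalize(protos, dim=1)

def confidence(unlabeled, protos, tau=0.1):
    """unlabeled: (M, D) -> (confidence, pseudo_label) for each sample."""
    sims = F.normalize(unlabeled, dim=1) @ protos.T / tau   # (M, C)
    probs = sims.softmax(dim=1)
    return probs.max(dim=1)                                 # values, indices
```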
https://arxiv.org/abs/2404.16416
Pooling is a crucial operation in computer vision, yet the unique structure of skeletons hinders the application of existing pooling strategies to skeleton graph modelling. In this paper, we propose an Improved Graph Pooling Network, referred to as IGPN. The main innovations are as follows. Our method incorporates a region-aware pooling strategy based on structural partitioning. The correlation matrix of the original features is used to adaptively adjust the weight of information in different regions of the newly generated features, resulting in more flexible and effective processing. To prevent the irreversible loss of discriminative information, we propose a cross-fusion module and an information supplement module to provide block-level and input-level information respectively. As a plug-and-play structure, the proposed operation can be seamlessly combined with existing GCN-based models. We conducted extensive evaluations on several challenging benchmarks, and the experimental results indicate the effectiveness of our proposed solutions. For example, in the cross-subject evaluation of the NTU-RGB+D 60 dataset, IGPN achieves a significant improvement in accuracy compared to the baseline while reducing FLOPs by nearly 70%; a heavier version has also been introduced to further boost accuracy.
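A sketch of region-aware pooling over a skeleton graph, loosely following the description: joints are pooled within predefined structural partitions, with per-joint weights derived from a feature correlation matrix. The partition table and the weighting scheme are illustrative, not IGPN's exact design.

```python
import torch

PARTITIONS = [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]   # joint groups (example)

def region_pool(x):
    """x: (B, C, T, V) skeleton features -> (B, C, T, len(PARTITIONS))."""
    flat = x.mean(dim=2)                             # (B, C, V) time-averaged
    corr = torch.einsum("bcv,bcw->bvw", flat, flat)  # (B, V, V) joint correlation
    weight = corr.mean(dim=2).softmax(dim=1)         # (B, V) per-joint weight
    pooled = []
    for part in PARTITIONS:
        w = weight[:, part]                                     # (B, |part|)
        w = (w / w.sum(dim=1, keepdim=True))[:, None, None, :]  # renormalise
        pooled.append((x[:, :, :, part] * w).sum(dim=-1))       # (B, C, T)
    return torch.stack(pooled, dim=-1)
```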
https://arxiv.org/abs/2404.16359
Skeleton-based action recognition has gained considerable traction thanks to its utilization of succinct and robust skeletal representations. Nonetheless, current methodologies often rely on a single backbone to model the skeleton modality, which can be limited by the backbone's inherent flaws. To address this and fully leverage the complementary characteristics of various network architectures, we propose a novel Hybrid Dual-Branch Network (HDBN) for robust skeleton-based action recognition, which benefits from the graph convolutional network's proficiency in handling graph-structured data and the powerful modeling capabilities of Transformers for global information. In detail, our proposed HDBN is divided into two trunk branches: MixGCN and MixFormer. The two branches utilize GCNs and Transformers to model the 2D and 3D skeletal modalities respectively. Our proposed HDBN emerged as one of the top solutions in the Multi-Modal Video Reasoning and Analyzing Competition (MMVRAC) of the 2024 ICME Grand Challenge, achieving accuracies of 47.95% and 75.36% on two benchmarks of the UAV-Human dataset, outperforming most existing methods. Our code will be publicly available at: this https URL.
https://arxiv.org/abs/2404.15719
Understanding videos that contain multiple modalities is crucial, especially in egocentric videos, where combining various sensory inputs significantly improves tasks like action recognition and moment localization. However, real-world applications often face challenges with incomplete modalities due to privacy concerns, efficiency needs, or hardware issues. Current methods, while effective, often necessitate retraining the model entirely to handle missing modalities, making them computationally intensive, particularly with large training datasets. In this study, we propose a novel approach to address this issue at test time without requiring retraining. We frame the problem as a test-time adaptation task, where the model adjusts to the available unlabeled data at test time. Our method, MiDl (Mutual information with self-Distillation), encourages the model to be insensitive to the specific modality source present during testing by minimizing the mutual information between the prediction and the available modality. Additionally, we incorporate self-distillation to maintain the model's original performance when both modalities are available. MiDl represents the first self-supervised, online solution for handling missing modalities exclusively at test time. Through experiments with various pretrained models and datasets, MiDl demonstrates substantial performance improvement without the need for retraining.
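A sketch of the MiDl objective as described: the mutual information between the prediction and the available modality decomposes into the entropy of the marginal prediction minus the mean entropy of the per-modality predictions, and a KL self-distillation term anchors the adapted model to the frozen pretrained one when both modalities are present. The loss weighting is an assumption.

```python
import torch
import torch.nn.functional as F

def entropy(p, eps=1e-8):
    return -(p * (p + eps).log()).sum(dim=-1)

def midl_loss(preds_per_modality, pred_full, pred_teacher_full, alpha=1.0):
    """preds_per_modality: list of (B, C) softmax outputs, one per modality."""
    stacked = torch.stack(preds_per_modality)        # (M, B, C)
    marginal = stacked.mean(dim=0)                   # E_m[p(y|m)]
    # I(Y; M) = H(marginal) - E_m[H(conditional)], minimised w.r.t. the model.
    mi = entropy(marginal).mean() - entropy(stacked).mean()
    # Self-distillation: keep full-modality predictions close to the frozen
    # pretrained teacher (KL divergence expects log-probs as input).
    distill = F.kl_div(pred_full.log(), pred_teacher_full, reduction="batchmean")
    return mi + alpha * distill
```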
https://arxiv.org/abs/2404.15161
Driver activity classification is crucial for ensuring road safety, with applications ranging from driver assistance systems to autonomous vehicle control transitions. In this paper, we present a novel approach leveraging generalizable representations from vision-language models for driver activity classification. Our method employs a Semantic Representation Late Fusion Neural Network (SRLF-Net) to process synchronized video frames from multiple perspectives. Each frame is encoded using a pretrained vision-language encoder, and the resulting embeddings are fused to generate class probability predictions. By leveraging contrastively-learned vision-language representations, our approach achieves robust performance across diverse driver activities. We evaluate our method on the Naturalistic Driving Action Recognition Dataset, demonstrating strong accuracy across many classes. Our results suggest that vision-language representations offer a promising avenue for driver monitoring systems, providing both accuracy and interpretability through natural language descriptors.
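A minimal sketch of the late-fusion step: each synchronized view is encoded with a frozen CLIP-style encoder, scored against text embeddings of the class names, and the per-view class probabilities are averaged. The temperature is an illustrative value, not SRLF-Net's exact design.

```python
import torch
import torch.nn.functional as F

def late_fusion_predict(view_embeddings, class_text_embeddings, tau=0.01):
    """view_embeddings: (V, D); class_text_embeddings: (C, D) -> (C,) probs."""
    v = F.normalize(view_embeddings, dim=-1)
    t = F.normalize(class_text_embeddings, dim=-1)
    logits = v @ t.T / tau                 # (V, C) similarity per view
    probs = logits.softmax(dim=-1)         # per-view class probabilities
    return probs.mean(dim=0)               # fuse views by averaging
```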
https://arxiv.org/abs/2404.14906
As one of the fundamental video tasks in computer vision, Open-Vocabulary Action Recognition (OVAR) has recently gained increasing attention with the development of vision-language pre-training. To enable generalization to arbitrary classes, existing methods treat class labels as text descriptions, then formulate OVAR as evaluating embedding similarity between visual samples and textual classes. However, one crucial issue is completely ignored: the class descriptions given by users may be noisy, e.g., misspellings and typos, limiting the real-world practicality of vanilla OVAR. To fill this research gap, this paper is the first to evaluate existing methods by simulating multi-level noises of various types, and it reveals their poor robustness. To tackle the noisy OVAR task, we further propose a novel DENOISER framework covering two parts: generation and discrimination. Concretely, the generative part denoises noisy class-text names via a decoding process: it proposes text candidates, then utilizes inter-modal and intra-modal information to vote for the best one. In the discriminative part, we use vanilla OVAR models to assign visual samples to class-text names, thus obtaining more semantics. For optimization, we alternately iterate between the generative and discriminative parts for progressive refinement. The denoised text classes help OVAR models classify visual samples more accurately; in return, classified visual samples help better denoising. On three datasets, we carry out extensive experiments to show our superior robustness, and thorough ablations to dissect the effectiveness of each component.
https://arxiv.org/abs/2404.14890
Action Quality Assessment (AQA) is pivotal for quantifying actions across domains like sports and medical care. Existing methods often rely on pre-trained backbones from large-scale action recognition datasets to boost performance on smaller AQA datasets. However, this common strategy yields suboptimal results due to the inherent struggle of these backbones to capture the subtle cues essential for AQA. Moreover, fine-tuning on smaller datasets risks overfitting. To address these issues, we propose Coarse-to-Fine Instruction Alignment (CoFInAl). Inspired by recent advances in large language model tuning, CoFInAl aligns AQA with broader pre-trained tasks by reformulating it as a coarse-to-fine classification task. Initially, it learns grade prototypes for coarse assessment and then utilizes fixed sub-grade prototypes for fine-grained assessment. This hierarchical approach mirrors the judging process, enhancing interpretability within the AQA framework. Experimental results on two long-term AQA datasets demonstrate CoFInAl achieves state-of-the-art performance with significant correlation gains of 5.49% and 3.55% on Rhythmic Gymnastics and Fis-V, respectively. Our code is available at this https URL.
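A sketch of the coarse-to-fine scoring idea: a clip feature is first matched to learned grade prototypes, then refined against fixed sub-grade prototypes within the selected grade, and grade plus sub-grade combine into a final score. Prototype counts and the score mapping are assumptions, not CoFInAl's exact formulation.

```python
import torch
import torch.nn.functional as F

def coarse_to_fine_score(feat, grade_protos, subgrade_protos):
    """feat: (D,); grade_protos: (G, D); subgrade_protos: (G, S, D)."""
    f = F.normalize(feat, dim=0)
    grade_sim = F.normalize(grade_protos, dim=1) @ f        # (G,) coarse match
    g = grade_sim.argmax()                                  # selected grade
    sub_sim = F.normalize(subgrade_protos[g], dim=1) @ f    # (S,) fine match
    s = sub_sim.argmax()                                    # selected sub-grade
    num_sub = subgrade_protos.size(1)
    # Map (grade, sub-grade) to a scalar quality score.
    return g.float() + (s.float() + 0.5) / num_sub
```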
https://arxiv.org/abs/2404.13999
Deep neural networks have made significant advancements in accurately estimating scene flow using point clouds, which is vital for many applications like video analysis, action recognition, and navigation. Robustness of these techniques, however, remains a concern, particularly in the face of adversarial attacks that have been proven to deceive state-of-the-art deep neural networks in many domains. Surprisingly, the robustness of scene flow networks against such attacks has not been thoroughly investigated. To address this problem, the proposed approach aims to bridge this gap by introducing adversarial white-box attacks specifically tailored for scene flow networks. Experimental results show that the generated adversarial examples cause up to 33.7 relative degradation in average end-point error on the KITTI and FlyingThings3D datasets. The study also reveals the significant impact that attacks targeting point clouds in only one dimension or color channel have on average end-point error. Analyzing the success and failure of these attacks on the scene flow networks and their 2D optical flow network variants shows a higher vulnerability for the optical flow networks.
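A sketch of a PGD-style white-box attack in this spirit: perturb the first point cloud with projected gradient steps that push the predicted flow away from the clean prediction, under an L-infinity budget. The step size, budget, and model call signature are illustrative assumptions, not the paper's exact attack.

```python
import torch

def pgd_attack_scene_flow(model, pc1, pc2, eps=0.02, alpha=0.005, steps=10):
    """pc1, pc2: (N, 3) consecutive point clouds; model(pc1, pc2) -> (N, 3) flow."""
    with torch.no_grad():
        clean_flow = model(pc1, pc2)            # reference prediction
    adv = pc1.clone().detach().requires_grad_(True)
    for _ in range(steps):
        flow = model(adv, pc2)
        # End-point error relative to the clean prediction, to be maximised.
        epe = (flow - clean_flow).norm(dim=1).mean()
        grad, = torch.autograd.grad(epe, adv)
        with torch.no_grad():
            adv += alpha * grad.sign()                       # ascent step
            adv.copy_(pc1 + (adv - pc1).clamp(-eps, eps))    # project to budget
        adv.requires_grad_(True)
    return adv.detach()
```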
https://arxiv.org/abs/2404.13621
Large Language Models (LLMs) have demonstrated remarkable capabilities in various domains, including data augmentation and synthetic data generation. This work explores the use of LLMs to generate rich textual descriptions for motion sequences, encompassing both actions and walking patterns. We leverage the expressive power of LLMs to align motion representations with high-level linguistic cues, addressing two distinct tasks: action recognition and retrieval of walking sequences based on appearance attributes. For action recognition, we employ LLMs to generate textual descriptions of actions in the BABEL-60 dataset, facilitating the alignment of motion sequences with linguistic representations. In the domain of gait analysis, we investigate the impact of appearance attributes on walking patterns by generating textual descriptions of motion sequences from the DenseGait dataset using LLMs. These descriptions capture subtle variations in walking styles influenced by factors such as clothing choices and footwear. Our approach demonstrates the potential of LLMs in augmenting structured motion attributes and aligning multi-modal representations. The findings contribute to the advancement of comprehensive motion understanding and open up new avenues for leveraging LLMs in multi-modal alignment and data augmentation for motion analysis. We make the code publicly available at this https URL
https://arxiv.org/abs/2404.12192
The interactions between humans and objects are important for recognizing object-centric actions. Existing methods usually adopt a two-stage pipeline, where object proposals are first detected using a pretrained detector, and then fed to an action recognition model for extracting video features and learning the object relations for action recognition. However, since the action prior is unknown in the object detection stage, important objects can easily be overlooked, leading to inferior action recognition performance. In this paper, we propose an end-to-end object-centric action recognition framework that simultaneously performs Detection And Interaction Reasoning in one stage. In particular, after extracting video features with a base network, we create three modules for concurrent object detection and interaction reasoning. First, a Patch-based Object Decoder generates proposals from video patch tokens. Then, an Interactive Object Refining and Aggregation module identifies important objects for action recognition, adjusts proposal scores based on position and appearance, and aggregates object-level information into a global video representation. Lastly, an Object Relation Modeling module encodes object relations. These three modules, together with the video feature extractor, can be trained jointly in an end-to-end fashion, thus avoiding heavy reliance on an off-the-shelf object detector and reducing the multi-stage training burden. We conduct experiments on two datasets, Something-Else and Ikea-Assembly, to evaluate the performance of our proposed approach on conventional, compositional, and few-shot action recognition tasks. Through in-depth experimental analysis, we show the crucial role of interactive objects in learning for action recognition, and we outperform state-of-the-art methods on both datasets.
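A sketch of a one-stage patch-token object decoder in this spirit: DETR-like learnable queries attend over video patch tokens to form proposals, and proposal scores weight an object-level aggregate that is added to the global representation before classification. The dimensions and the single decoder layer are illustrative simplifications, not the paper's architecture.

```python
import torch
import torch.nn as nn

class PatchObjectDecoder(nn.Module):
    def __init__(self, dim=256, num_queries=10, num_classes=174):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.decoder = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.score = nn.Linear(dim, 1)          # proposal importance score
        self.classify = nn.Linear(dim, num_classes)

    def forward(self, patch_tokens):
        """patch_tokens: (B, N, dim) from the video backbone -> action logits."""
        B = patch_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        obj = self.decoder(q, patch_tokens)             # (B, Q, dim) proposals
        w = self.score(obj).softmax(dim=1)              # (B, Q, 1) refined scores
        # Aggregate object-level information into the global representation.
        video_repr = patch_tokens.mean(dim=1) + (w * obj).sum(dim=1)
        return self.classify(video_repr)
```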
https://arxiv.org/abs/2404.11903