The task of generating dance from music is crucial, yet current methods, which mainly produce joint sequences, lead to outputs that lack intuitiveness and complicate data collection due to the necessity for precise joint annotations. We introduce a Dance Any Beat Diffusion model, namely DabFusion, that employs music as a conditional input to directly create dance videos from still images, utilizing conditional image-to-video generation principles. This approach pioneers the use of music as a conditioning factor in image-to-video synthesis. Our method unfolds in two stages: training an auto-encoder to predict latent optical flow between reference and driving frames, eliminating the need for joint annotation, and training a U-Net-based diffusion model to produce these latent optical flows guided by music rhythm encoded by CLAP. Although capable of producing high-quality dance videos, the baseline model struggles with rhythm alignment. We enhance the model by adding beat information, improving synchronization. We introduce a 2D motion-music alignment score (2D-MM Align) for quantitative assessment. Evaluated on the AIST++ dataset, our enhanced model shows marked improvements in 2D-MM Align score and established metrics. Video results can be found on our project page: this https URL.
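As a rough illustration of the second stage's conditioning, the sketch below shows a FiLM-style block that modulates a latent-flow feature map with a precomputed music embedding (e.g., from CLAP) concatenated with a beat feature. The layer sizes, the FiLM mechanism, and the beat encoding are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MusicConditionedBlock(nn.Module):
    """One denoiser block conditioned on a music embedding plus a beat feature
    via FiLM-style modulation. A loose sketch under assumed dimensions."""
    def __init__(self, channels=128, music_dim=512, beat_dim=16):
        super().__init__()
        self.film = nn.Linear(music_dim + beat_dim, 2 * channels)
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, latent_flow, music_emb, beat_feat):
        # music_emb: (B, music_dim) precomputed, e.g. by CLAP; beat_feat: (B, beat_dim)
        scale, shift = self.film(torch.cat([music_emb, beat_feat], -1)).chunk(2, -1)
        h = self.conv(latent_flow)                      # (B, C, H, W) latent flow
        return h * (1 + scale[:, :, None, None]) + shift[:, :, None, None]
```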
https://arxiv.org/abs/2405.09266
Tissue tracking in echocardiography is challenging due to the complex cardiac motion and the inherent nature of ultrasound acquisitions. Although optical flow methods are considered state-of-the-art (SOTA), they struggle with long-range tracking, noise, occlusions, and drift throughout the cardiac cycle. Recently, novel learning-based point tracking techniques have been introduced to tackle some of these issues. In this paper, we build upon these techniques and introduce EchoTracker, a two-fold coarse-to-fine model that facilitates the tracking of queried points on a tissue surface across ultrasound image sequences. The architecture contains a preliminary coarse initialization of the trajectories, followed by reinforcement iterations based on fine-grained appearance changes. It is efficient, light, and can run on mid-range GPUs. Experiments demonstrate that the model outperforms SOTA methods, with an average position accuracy of 67% and a median trajectory error of 2.86 pixels. Furthermore, we show a relative improvement of 25% when using our model to calculate the global longitudinal strain (GLS) in a clinical test-retest dataset compared to other methods. This implies that learning-based point tracking can potentially improve performance and yield higher diagnostic and prognostic value for clinical measurements than current techniques. Our source code is available at: this https URL.
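To make the reported numbers concrete, here is a minimal sketch of how such tracking metrics can be computed, assuming TAP-style definitions (position accuracy as the fraction of predictions within a set of pixel thresholds, and the median of per-trajectory mean errors); the paper's exact definitions may differ.

```python
import numpy as np

def tracking_metrics(pred, gt, thresholds=(1, 2, 4, 8)):
    """pred, gt: (T, N, 2) trajectories of N points over T frames, in pixels."""
    err = np.linalg.norm(pred - gt, axis=-1)            # (T, N) per-point error
    avg_accuracy = np.mean([(err < t).mean() for t in thresholds])
    median_traj_error = np.median(err.mean(axis=0))     # median over trajectories
    return avg_accuracy, median_traj_error

# e.g. acc, mte = tracking_metrics(model_tracks, annotated_tracks)
```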
https://arxiv.org/abs/2405.08587
From a feature matching perspective, optical flow estimation for event cameras involves identifying event correspondences by comparing feature similarity across accompanying event frames. In this work, we introduce an effective and robust high-dimensional (HD) feature descriptor for event frames, utilizing Vector Symbolic Architectures (VSA). The topological similarity among neighboring variables within VSA enhances the representational similarity of feature descriptors for flow-matching points, while its structured symbolic representation capacity facilitates feature fusion from both event polarities and multiple spatial scales. Based on this HD feature descriptor, we propose a novel feature matching framework for event-based optical flow, encompassing both a model-based method (VSA-Flow) and a self-supervised learning method (VSA-SM). In VSA-Flow, accurate optical flow estimation validates the effectiveness of the HD feature descriptor. In VSA-SM, a novel similarity maximization method based on the HD feature descriptor is proposed to learn optical flow in a self-supervised way from events alone, eliminating the need for auxiliary grayscale images. Evaluation results demonstrate that our VSA-based method achieves superior accuracy compared to both model-based and self-supervised learning methods on the DSEC benchmark, while remaining competitive with both on the MVSEC benchmark. This contribution marks a significant advancement in event-based optical flow within the feature-matching methodology.
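The descriptor rests on standard VSA operations. The sketch below shows generic binding (elementwise product) and bundling (majority vote) of bipolar hypervectors to fuse features from two polarities and two spatial scales; the role vectors and fusion layout are illustrative, and the topology-preserving encoding the paper relies on is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8192  # hypervector dimensionality

def random_hv():                       # random bipolar hypervector
    return rng.choice([-1, 1], size=D)

def bind(a, b):                        # binding: elementwise product
    return a * b

def bundle(*hvs):                      # bundling: elementwise majority vote
    return np.sign(np.sum(hvs, axis=0) + 0.1 * random_hv())  # noise breaks ties

# role vectors for the two event polarities and two spatial scales (assumed design)
POS, NEG = random_hv(), random_hv()
FINE, COARSE = random_hv(), random_hv()

def descriptor(f_pos_fine, f_neg_fine, f_pos_coarse, f_neg_coarse):
    """Fuse per-polarity, per-scale feature hypervectors into one descriptor."""
    return bundle(bind(POS, bind(FINE, f_pos_fine)),
                  bind(NEG, bind(FINE, f_neg_fine)),
                  bind(POS, bind(COARSE, f_pos_coarse)),
                  bind(NEG, bind(COARSE, f_neg_coarse)))

def similarity(a, b):                  # normalized dot product for flow matching
    return a @ b / D
```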
https://arxiv.org/abs/2405.08300
Accurate and robust camera tracking in dynamic environments presents a significant challenge for visual SLAM (Simultaneous Localization and Mapping). Recent progress in this field often involves the use of deep learning techniques to generate masks for dynamic objects, which usually require GPUs to operate in real time (30 fps). Therefore, this paper proposes a novel visual SLAM system for dynamic environments that achieves real-time performance on a CPU by incorporating a mask prediction mechanism, which allows the deep learning method and the camera tracking to run entirely in parallel at different frequencies, such that neither waits for the result of the other. Based on this, it further introduces a dual-stage optical flow tracking approach and employs a hybrid of optical flow and ORB features, which significantly enhances the efficiency and robustness of the system. Compared with state-of-the-art methods, this system maintains high localization accuracy in dynamic environments while achieving a tracking frame rate of 56 fps on a single laptop CPU without any hardware acceleration, thus proving that deep learning methods are still feasible for dynamic SLAM even without GPU support. To the best of our knowledge, this is the first SLAM system to achieve this.
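The decoupling idea can be illustrated with two threads that never wait on each other: a slow segmentation worker publishes the latest mask, and the fast tracker consumes whichever mask is newest. Here `get_frame`, `predict_mask`, `next_frame`, and `track` are hypothetical callbacks; this is a schematic of the mechanism, not the paper's implementation.

```python
import threading
import time

latest_mask = None            # most recent dynamic-object mask
mask_lock = threading.Lock()

def segmentation_worker(get_frame, predict_mask, period=0.2):
    """Low-frequency deep-learning thread (here ~5 fps): publishes masks
    without ever blocking the tracker."""
    global latest_mask
    while True:
        mask = predict_mask(get_frame())     # slow network inference on CPU
        with mask_lock:
            latest_mask = mask
        time.sleep(period)

def tracking_loop(next_frame, track):
    """High-frequency camera tracking (~56 fps): consumes whichever mask is
    newest, which may lag the current frame by a few frames."""
    while True:
        frame = next_frame()
        with mask_lock:
            mask = latest_mask
        track(frame, mask)                   # mask gates out dynamic features
```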
https://arxiv.org/abs/2405.07392
In this paper, we show that transferring knowledge from other domains of video understanding, combined with large-scale learning, can improve the robustness of Video Object Segmentation (VOS) under complex circumstances. Namely, we focus on integrating scene global motion knowledge to improve large-scale semi-supervised Video Object Segmentation. Prior works on VOS mostly rely on direct comparison of semantic and contextual features to perform dense matching between current and past frames, overlooking the actual motion structure. On the other hand, the Optical Flow Estimation task aims to approximate the scene motion field, exposing global motion patterns that are typically undiscoverable during all-pairs similarity search. We present WarpFormer, an architecture for semi-supervised Video Object Segmentation that exploits existing knowledge in motion understanding to conduct smoother propagation and more accurate matching. Our framework employs a generic pretrained Optical Flow Estimation network whose prediction is used to warp both past frames and instance segmentation masks to the current frame domain. Consequently, the warped segmentation masks are refined and fused together, aiming to inpaint occluded regions and eliminate artifacts caused by flow field imperfections. Additionally, we employ the novel large-scale MOSE 2023 dataset to train the model on various complex scenarios. Our method demonstrates strong performance on DAVIS 2016/2017 validation (93.0% and 85.9%), DAVIS 2017 test-dev (80.6%), and YouTube-VOS 2019 validation (83.8%), which is competitive with alternative state-of-the-art methods while using a much simpler memory mechanism and instance understanding logic.
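The warping step is standard backward warping with optical flow; a minimal PyTorch sketch follows, assuming the flow maps current-frame pixels to source-frame locations.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(src, flow):
    """Backward-warp `src` (B,C,H,W) to the current frame using optical flow
    (B,2,H,W) mapping current-frame pixels to source-frame locations."""
    B, _, H, W = src.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(src.device)   # (2,H,W), x first
    coords = grid.unsqueeze(0) + flow                            # sample positions
    gx = 2.0 * coords[:, 0] / (W - 1) - 1.0                      # normalize to [-1,1]
    gy = 2.0 * coords[:, 1] / (H - 1) - 1.0
    grid_norm = torch.stack((gx, gy), dim=-1)                    # (B,H,W,2)
    return F.grid_sample(src, grid_norm, align_corners=True)

# Past segmentation masks and frames are warped the same way, e.g.:
# warped_mask = warp_with_flow(past_mask, flow_from_current_to_past)
```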
https://arxiv.org/abs/2405.07031
Recent works on Video Object Segmentation have achieved remarkable results by matching dense semantic and instance-level features between the current and previous frames for long-term propagation. Nevertheless, global feature matching ignores scene motion context and fails to satisfy temporal consistency. Even though some methods introduce a local matching branch to achieve smooth propagation, they fail to model complex appearance changes due to the constraints of the local window. In this paper, we present DeVOS (Deformable VOS), an architecture for Video Object Segmentation that combines memory-based matching with motion-guided propagation, resulting in stable long-term modeling and strong temporal consistency. For short-term local propagation, we propose a novel attention mechanism, ADVA (Adaptive Deformable Video Attention), which adapts the similarity search region to query-specific semantic features and thus ensures robust tracking of complex shape and scale changes. DeVOS employs optical flow to obtain scene motion features, which are further injected into the deformable attention as strong priors for the learnable offsets. Our method achieves top-rank performance on DAVIS 2017 val and test-dev (88.1%, 83.0%) and YouTube-VOS 2019 val (86.6%), while featuring consistent run-time speed and stable memory consumption.
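A loose sketch of flow-guided deformable attention in the spirit of ADVA is given below: per-pixel sampling offsets are predicted from the query and shifted by the optical flow prior, and the sampled memory features are combined with learned attention weights. The module layout and the number of sampling points are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlowGuidedDeformableAttn(nn.Module):
    """Each query pixel attends to K sampled points in the memory features;
    the flow acts as a strong prior on where those points land."""
    def __init__(self, dim, n_points=4):
        super().__init__()
        self.n_points = n_points
        self.offset_head = nn.Conv2d(dim, 2 * n_points, 1)   # residual offsets
        self.weight_head = nn.Conv2d(dim, n_points, 1)       # attention weights
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, query, memory, flow):
        B, C, H, W = query.shape
        offsets = self.offset_head(query).view(B, self.n_points, 2, H, W)
        offsets = offsets + flow.unsqueeze(1)                # inject motion prior
        weights = self.weight_head(query).softmax(dim=1)     # (B,K,H,W)
        ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
        base = torch.stack((xs, ys), 0).float().to(query.device)
        out = 0.0
        for k in range(self.n_points):
            coords = base + offsets[:, k]                    # (B,2,H,W)
            gx = 2 * coords[:, 0] / (W - 1) - 1
            gy = 2 * coords[:, 1] / (H - 1) - 1
            sampled = F.grid_sample(memory, torch.stack((gx, gy), dim=-1),
                                    align_corners=True)
            out = out + weights[:, k:k + 1] * sampled        # weighted aggregation
        return self.proj(out)
```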
https://arxiv.org/abs/2405.08715
Action recognition is a key technology in building interactive metaverses. With the rapid development of deep learning, methods in action recognition have achieved great advancement. Researchers design and implement backbones from multiple standpoints, which leads to a diversity of methods and new challenges. This paper reviews several action recognition methods based on deep neural networks. We introduce these methods in three parts: 1) Two-Stream networks and their variants, which, specifically in this paper, use RGB video frames and the optical flow modality as input; 2) 3D convolutional networks, which strive to take advantage of the RGB modality directly, so that extracting separate motion information is no longer necessary; 3) Transformer-based methods, which introduce models from natural language processing into computer vision and video understanding. We offer an objective view in this review and hope to provide a reference for future research.
https://arxiv.org/abs/2405.05584
Weakly supervised multimodal violence detection aims to learn a violence detection model by leveraging multiple modalities such as RGB, optical flow, and audio, while only video-level annotations are available. In the pursuit of effective multimodal violence detection (MVD), information redundancy, modality imbalance, and modality asynchrony are identified as three key challenges. In this work, we propose a new weakly supervised MVD method that explicitly addresses these challenges. Specifically, we introduce a multi-scale bottleneck transformer (MSBT) based fusion module that employs a reduced number of bottleneck tokens to gradually condense information and fuse each pair of modalities and utilizes a bottleneck token-based weighting scheme to highlight more important fused features. Furthermore, we propose a temporal consistency contrast loss to semantically align pairwise fused features. Experiments on the largest-scale XD-Violence dataset demonstrate that the proposed method achieves state-of-the-art performance. Code is available at this https URL.
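The bottleneck-token pattern can be sketched as follows: two modalities exchange information only through a handful of learned tokens, which forces the fusion to condense information. This is a simplified single-pair, single-scale sketch under assumed shapes; the paper's multi-scale design and its bottleneck token-based weighting scheme are not shown, and sharing one write-attention across modalities is a simplification.

```python
import torch
import torch.nn as nn

class BottleneckFusion(nn.Module):
    """Fuse two token sequences (B, N, dim) through a few bottleneck tokens."""
    def __init__(self, dim=256, n_bottleneck=4, n_heads=4):
        super().__init__()
        self.bottleneck = nn.Parameter(torch.randn(1, n_bottleneck, dim))
        self.read = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.write = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, tok_a, tok_b):
        B = tok_a.shape[0]
        z = self.bottleneck.expand(B, -1, -1)
        kv = torch.cat([tok_a, tok_b], dim=1)
        # bottleneck tokens read (condense) from both modalities...
        z, _ = self.read(z, kv, kv)
        # ...then each modality reads the condensed information back
        tok_a, _ = self.write(tok_a, z, z)
        tok_b, _ = self.write(tok_b, z, z)
        return tok_a, tok_b
```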
https://arxiv.org/abs/2405.05130
We describe a method for recovering the irradiance underlying a collection of images corrupted by atmospheric turbulence. Since supervised data is often technically impossible to obtain, assumptions and biases have to be imposed to solve this inverse problem, and we choose to model them explicitly. Rather than initializing a latent irradiance ("template") by heuristics to estimate deformation, we select one of the images as a reference and model the deformation in this image by the aggregation of the optical flow from it to other images, exploiting a prior imposed by the Central Limit Theorem. Then, with a novel flow inversion module, the model registers each image TO the template but WITHOUT the template, avoiding artifacts related to poor template initialization. To illustrate the robustness of the method, we simply (i) select the first frame as the reference and (ii) use the simplest optical flow to estimate the warpings; yet the improvement in registration is decisive in the final reconstruction, as we achieve state-of-the-art performance despite this simplicity. The method establishes a strong baseline that can be further improved by integrating it seamlessly into more sophisticated pipelines, or with domain-specific methods if so desired.
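A minimal sketch of the flow-aggregation step, assuming grayscale uint8 frames and OpenCV's Farneback flow standing in for the "simplest optical flow"; the novel flow inversion module is not shown.

```python
import numpy as np
import cv2

def estimate_reference_deformation(frames, ref_idx=0):
    """Average the optical flow from the reference frame to every other frame.
    Under a zero-mean (CLT) prior on turbulent deformations, this mean flow
    estimates the deformation of the reference frame itself."""
    ref = frames[ref_idx]
    flows = []
    for i, frame in enumerate(frames):
        if i == ref_idx:
            continue
        flow = cv2.calcOpticalFlowFarneback(ref, frame, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(flow)
    return np.mean(flows, axis=0)   # (H, W, 2) mean flow ~ reference deformation
```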
https://arxiv.org/abs/2405.03662
Micro-expression recognition (MER) aims to recognize the short and subtle facial movements in micro-expression (ME) video clips, which reveal real emotions. Recent MER methods mostly utilize only special frames from ME video clips, or extract optical flow from these special frames. However, they neglect the relationship between movements and space-time, while facial cues are hidden within these relationships. To solve this issue, we propose Hierarchical Space-Time Attention (HSTA). Specifically, we first process ME video frames and special frames or data in parallel with our cascaded Unimodal Space-Time Attention (USTA) to establish connections between subtle facial movements and specific facial areas. Then, we design Crossmodal Space-Time Attention (CSTA) to achieve a higher-quality fusion of crossmodal data. Finally, we hierarchically integrate USTA and CSTA to grasp deeper facial cues. Our model emphasizes temporal modeling without neglecting the processing of special data, and it fuses the contents of different modalities while maintaining their respective uniqueness. Extensive experiments on four benchmarks show the effectiveness of our proposed HSTA. Specifically, compared with the latest method on the CASME3 dataset, it achieves an improvement of about 3% in seven-category classification.
https://arxiv.org/abs/2405.03202
Due to the ever-increasing availability of video surveillance cameras and the growing need for crime prevention, the violence detection task is attracting greater attention from the research community. Compared with other action recognition tasks, violence detection in surveillance videos presents additional issues, such as the presence of a significant variety of real fight scenes. Unfortunately, available datasets seem to be very small compared with other action recognition datasets. Moreover, in surveillance applications, the people in the scenes differ for each video and the background of the footage differs for each camera. Also, violent actions in real-life surveillance videos must be detected quickly to prevent unwanted consequences, so models would clearly benefit from a reduction in memory usage and computational costs. Such problems make classical action recognition methods difficult to adopt. To tackle all these issues, we introduce JOSENet, a novel self-supervised framework that provides outstanding performance for violence detection in surveillance videos. The proposed model receives two spatiotemporal video streams, i.e., RGB frames and optical flows, and involves a new regularized self-supervised learning approach for videos. JOSENet provides improved performance compared to self-supervised state-of-the-art methods, while requiring one-fourth of the number of frames per video segment and a reduced frame rate. The source code and the instructions to reproduce our experiments are available at this https URL.
https://arxiv.org/abs/2405.02961
Traditional unsupervised optical flow methods are vulnerable to occlusions and motion boundaries due to lack of object-level information. Therefore, we propose UnSAMFlow, an unsupervised flow network that also leverages object information from the latest foundation model Segment Anything Model (SAM). We first include a self-supervised semantic augmentation module tailored to SAM masks. We also analyze the poor gradient landscapes of traditional smoothness losses and propose a new smoothness definition based on homography instead. A simple yet effective mask feature module has also been added to further aggregate features on the object level. With all these adaptations, our method produces clear optical flow estimation with sharp boundaries around objects, which outperforms state-of-the-art methods on both KITTI and Sintel datasets. Our method also generalizes well across domains and runs very efficiently.
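For context, the sketch below implements the traditional edge-aware first-order smoothness loss whose gradient landscape the paper analyzes; the paper's homography-based replacement is not shown.

```python
import torch

def edge_aware_smoothness(flow, image, alpha=10.0):
    """Classic first-order smoothness loss: penalize flow gradients,
    down-weighted at image edges. flow: (B,2,H,W); image: (B,3,H,W)."""
    dI_dx = (image[:, :, :, 1:] - image[:, :, :, :-1]).abs().mean(1, keepdim=True)
    dI_dy = (image[:, :, 1:, :] - image[:, :, :-1, :]).abs().mean(1, keepdim=True)
    dF_dx = (flow[:, :, :, 1:] - flow[:, :, :, :-1]).abs()
    dF_dy = (flow[:, :, 1:, :] - flow[:, :, :-1, :]).abs()
    wx = torch.exp(-alpha * dI_dx)      # small weight at strong image edges
    wy = torch.exp(-alpha * dI_dy)
    return (wx * dF_dx).mean() + (wy * dF_dy).mean()
```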
https://arxiv.org/abs/2405.02608
Existing VLMs can track in-the-wild 2D video objects while current generative models provide powerful visual priors for synthesizing novel views for the highly under-constrained 2D-to-3D object lifting. Building upon this exciting progress, we present DreamScene4D, the first approach that can generate three-dimensional dynamic scenes of multiple objects from monocular in-the-wild videos with large object motion across occlusions and novel viewpoints. Our key insight is to design a "decompose-then-recompose" scheme to factorize both the whole video scene and each object's 3D motion. We first decompose the video scene by using open-vocabulary mask trackers and an adapted image diffusion model to segment, track, and amodally complete the objects and background in the video. Each object track is mapped to a set of 3D Gaussians that deform and move in space and time. We also factorize the observed motion into multiple components to handle fast motion. The camera motion can be inferred by re-rendering the background to match the video frames. For the object motion, we first model the object-centric deformation of the objects by leveraging rendering losses and multi-view generative priors in an object-centric frame, then optimize object-centric to world-frame transformations by comparing the rendered outputs against the perceived pixel and optical flow. Finally, we recompose the background and objects and optimize for relative object scales using monocular depth prediction guidance. We show extensive results on the challenging DAVIS, Kubric, and self-captured videos, detail some limitations, and provide future directions. Besides 4D scene generation, our results show that DreamScene4D enables accurate 2D point motion tracking by projecting the inferred 3D trajectories to 2D, while never explicitly trained to do so.
https://arxiv.org/abs/2405.02280
In this paper, we consider two challenging issues in reference-based super-resolution (RefSR) for smartphones: (i) how to choose a proper reference image, and (ii) how to learn RefSR in a self-supervised manner. In particular, we propose a novel self-supervised learning approach for real-world RefSR from observations at dual and multiple camera zooms. First, considering the popularity of multiple cameras in modern smartphones, the more zoomed (telephoto) image can be naturally leveraged as the reference to guide the super-resolution (SR) of the lesser zoomed (ultra-wide) image, which gives us a chance to learn a deep network that performs SR from the dual zoomed observations (DZSR). Second, for self-supervised learning of DZSR, we take the telephoto image instead of an additional high-resolution image as the supervision information, and select a center patch from it as the reference to super-resolve the corresponding ultra-wide image patch. To mitigate the effect of the misalignment between the ultra-wide low-resolution (LR) patch and the telephoto ground-truth (GT) image during training, we first adopt patch-based optical flow alignment and then design an auxiliary-LR to guide the deformation of the warped LR features. To generate visually pleasing results, we present a local overlapped sliced Wasserstein loss to better represent the perceptual difference between GT and output in the feature space. During testing, DZSR can be directly deployed to super-resolve the whole ultra-wide image with the reference of the telephoto image. In addition, we further take multiple zoomed observations to explore self-supervised RefSR, and present a progressive fusion scheme for the effective utilization of reference images. Experiments show that our methods achieve better quantitative and qualitative performance against the state of the art. Codes are available at this https URL.
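A plain sliced Wasserstein distance between two feature sets (equal point counts assumed) looks as follows; the paper's local overlapped variant applies this over overlapping local feature slices, which is not shown.

```python
import torch

def sliced_wasserstein(feat_a, feat_b, n_proj=64):
    """Sliced Wasserstein distance between two feature sets of shape (N, C):
    project onto random unit directions, sort, and compare the 1D marginals."""
    C = feat_a.shape[1]
    proj = torch.randn(C, n_proj, device=feat_a.device)
    proj = proj / proj.norm(dim=0, keepdim=True)
    pa = (feat_a @ proj).sort(dim=0).values     # sorted 1D projections
    pb = (feat_b @ proj).sort(dim=0).values
    return (pa - pb).abs().mean()
```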
https://arxiv.org/abs/2405.02171
Current state-of-the-art methods for video inpainting typically rely on optical flow or attention-based approaches to inpaint masked regions by propagating visual information across frames. While such approaches have led to significant progress on standard benchmarks, they struggle with tasks that require the synthesis of novel content that is not present in other frames. In this paper we reframe video inpainting as a conditional generative modeling problem and present a framework for solving such problems with conditional video diffusion models. We highlight the advantages of using a generative approach for this task, showing that our method is capable of generating diverse, high-quality inpaintings and synthesizing new content that is spatially, temporally, and semantically consistent with the provided context.
https://arxiv.org/abs/2405.00251
The joint optimization of the sensor trajectory and 3D map is a crucial characteristic of bundle adjustment (BA), essential for autonomous driving. This paper presents $\nu$-DBA, a novel framework implementing geometric dense bundle adjustment (DBA) using 3D neural implicit surfaces for map parametrization, which optimizes both the map surface and trajectory poses using geometric error guided by dense optical flow prediction. Additionally, we fine-tune the optical flow model with per-scene self-supervision to further improve the quality of the dense mapping. Our experimental results on multiple driving scene datasets demonstrate that our method achieves superior trajectory optimization and dense reconstruction accuracy. We also investigate the influences of photometric error and different neural geometric priors on the performance of surface reconstruction and novel view synthesis. Our method stands as a significant step towards leveraging neural implicit representations in dense bundle adjustment for more accurate trajectories and detailed environmental mapping.
https://arxiv.org/abs/2404.18439
Video frame interpolation, the process of synthesizing intermediate frames between sequential video frames, has made remarkable progress with the use of event cameras. These sensors, with microsecond-level temporal resolution, fill information gaps between frames by providing precise motion cues. However, contemporary Event-Based Video Frame Interpolation (E-VFI) techniques often neglect the fact that event data primarily supply high-confidence features at scene edges during multi-modal feature fusion, thereby diminishing the role of event signals in optical flow (OF) estimation and warping refinement. To address this overlooked aspect, we introduce an end-to-end E-VFI learning method (referred to as EGMR) to efficiently utilize edge features from event signals for motion flow and warping enhancement. Our method incorporates an Edge Guided Attentive (EGA) module, which rectifies estimated video motion through attentive aggregation based on the local correlation of multi-modal features in a coarse-to-fine strategy. Moreover, given that event data can provide accurate visual references at scene edges between consecutive frames, we introduce a learned visibility map derived from event data to adaptively mitigate the occlusion problem in the warping refinement process. Extensive experiments on both synthetic and real datasets show the effectiveness of the proposed approach, demonstrating its potential for higher quality video frame interpolation.
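A hedged sketch of visibility-guided fusion: a small head (assumed here, not the paper's exact architecture) predicts a per-pixel visibility map from event features and blends the two flow-warped boundary frames so occlusions in one direction are filled from the other.

```python
import torch
import torch.nn as nn

class VisibilityFusion(nn.Module):
    """Predict a visibility map from event features and blend two warped frames."""
    def __init__(self, event_channels=32):
        super().__init__()
        self.head = nn.Sequential(nn.Conv2d(event_channels, 16, 3, padding=1),
                                  nn.ReLU(),
                                  nn.Conv2d(16, 1, 3, padding=1),
                                  nn.Sigmoid())

    def forward(self, warp_prev, warp_next, event_feat):
        v = self.head(event_feat)               # (B,1,H,W) visibility in [0,1]
        return v * warp_prev + (1 - v) * warp_next
```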
https://arxiv.org/abs/2404.18156
Motion analysis plays a critical role in various applications, from virtual reality and augmented reality to assistive visual navigation. Traditional self-driving technologies, while advanced, typically do not translate directly to pedestrian applications due to their reliance on extensive sensor arrays and infeasible computational frameworks. This highlights a significant gap in applying these solutions to human users, since human navigation introduces unique challenges, including the unpredictable nature of human movement, the limited processing capabilities of portable devices, and the need for directional responsiveness due to the limited perception range of humans. In this project, we introduce an image-only method that applies motion analysis using optical flow with ego-motion compensation to predict Motor Focus: where and how humans or machines focus their movement intentions. Meanwhile, this paper addresses the camera shake in handheld and body-mounted devices, which can severely degrade performance and accuracy, by applying a Gaussian aggregation to stabilize the predicted motor focus area and enhance the prediction accuracy of the movement direction. This also provides a robust, real-time solution that adapts to the user's immediate environment. Furthermore, in the experiments section, we present a qualitative analysis of motor focus estimation comparing the conventional dense optical flow-based method and the proposed method. In quantitative tests, we show the performance of the proposed method on a small collected dataset specialized for motor focus estimation tasks.
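A schematic of the idea, under stated assumptions (Farneback flow, median flow as the ego-motion estimate, a percentile threshold for salient motion, and exponential smoothing standing in for the Gaussian aggregation); this is not the paper's exact pipeline.

```python
import numpy as np
import cv2

def motor_focus(prev_gray, cur_gray, prior, sigma=0.9):
    """Estimate where motion intention points: dense flow, minus the global
    (ego) component, then a temporally stabilized salient-motion centroid."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, cur_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    ego = np.median(flow.reshape(-1, 2), axis=0)     # global motion estimate
    residual = np.linalg.norm(flow - ego, axis=2)    # object/intent motion
    mask = residual > np.percentile(residual, 90)    # keep salient motion only
    if not mask.any():
        return prior
    ys, xs = np.nonzero(mask)
    raw = np.array([xs.mean(), ys.mean()])           # salient-motion centroid
    return sigma * prior + (1.0 - sigma) * raw       # temporal stabilization
```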
https://arxiv.org/abs/2404.17031
We introduce a new system for Multi-Session SLAM, which tracks camera motion across multiple disjoint videos under a single global reference. Our approach couples the prediction of optical flow with solver layers to estimate camera pose. The backbone is trained end-to-end using a novel differentiable solver for wide-baseline two-view pose. The full system can connect disjoint sequences, perform visual odometry, and global optimization. Compared to existing approaches, our design is accurate and robust to catastrophic failures. Code is available at this http URL
https://arxiv.org/abs/2404.15263
This paper introduces FlowMap, an end-to-end differentiable method that solves for precise camera poses, camera intrinsics, and per-frame dense depth of a video sequence. Our method performs per-video gradient-descent minimization of a simple least-squares objective that compares the optical flow induced by depth, intrinsics, and poses against correspondences obtained via off-the-shelf optical flow and point tracking. Alongside the use of point tracks to encourage long-term geometric consistency, we introduce differentiable re-parameterizations of depth, intrinsics, and pose that are amenable to first-order optimization. We empirically show that camera parameters and dense depth recovered by our method enable photo-realistic novel view synthesis on 360-degree trajectories using Gaussian Splatting. Our method not only far outperforms prior gradient-descent based bundle adjustment methods, but surprisingly performs on par with COLMAP, the state-of-the-art SfM method, on the downstream task of 360-degree novel view synthesis (even though our method is purely gradient-descent based, fully differentiable, and presents a complete departure from conventional SfM).
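The core residual can be sketched directly: given per-pixel depth, intrinsics, and a relative pose, unproject, transform, and reproject to obtain the induced flow, then compare it to off-the-shelf flow correspondences. The differentiable re-parameterizations and point-track terms are omitted from this sketch.

```python
import torch

def induced_flow(depth, K, T_rel):
    """Flow induced by per-pixel depth (H,W), intrinsics K (3,3), and relative
    pose T_rel (4,4) from frame i to frame i+1: unproject, transform, reproject."""
    H, W = depth.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).float()  # (H,W,3)
    rays = pix @ torch.inverse(K).T                  # unproject to camera rays
    pts = rays * depth.unsqueeze(-1)                 # 3D points in frame i
    pts_h = torch.cat([pts, torch.ones(H, W, 1)], dim=-1)
    pts2 = (pts_h @ T_rel.T)[..., :3]                # points in frame i+1
    reproj = pts2 @ K.T
    uv = reproj[..., :2] / reproj[..., 2:].clamp(min=1e-6)
    return uv - pix[..., :2]                         # induced flow field

def flow_loss(depth, K, T_rel, flow_obs):
    """Least-squares objective comparing induced flow to off-the-shelf flow."""
    return (induced_flow(depth, K, T_rel) - flow_obs).square().mean()
```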
https://arxiv.org/abs/2404.15259