We investigate the problem of pixelwise correspondence for deformable objects, namely cloth and rope, by comparing both classical and learning-based methods. We choose cloth and rope because they are traditionally among the most difficult deformable objects to model analytically, given their large configuration spaces, and because they are meaningful in the context of robotic tasks such as cloth folding, rope knot-tying, T-shirt folding, and curtain closing. The correspondence problem is heavily motivated in robotics, with wide-ranging applications including semantic grasping, object tracking, and manipulation policies built on top of correspondences. We present an exhaustive survey of existing classical feature-matching methods for correspondence, including SIFT, SURF, and ORB, and two recently published learning-based methods, TimeCycle and Dense Object Nets. We make three main contributions: (1) a framework for simulating and rendering synthetic images of deformable objects, with qualitative results demonstrating transfer between our simulated and real domains, (2) a new learning-based correspondence method extending Dense Object Nets, and (3) a standardized comparison across state-of-the-art correspondence methods. Our proposed method provides a flexible, general formulation for learning temporally and spatially continuous correspondences for nonrigid (and rigid) objects. We report root mean squared error statistics for all methods and find that Dense Object Nets outperforms the classical baselines for correspondence, while our proposed extension of Dense Object Nets performs comparably.
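As a concrete illustration of the correspondence read-out being evaluated, the sketch below shows how a match is typically obtained from a Dense-Object-Nets-style dense descriptor map (nearest neighbour in descriptor space) and how pixel RMSE against ground-truth pairs could be computed. The descriptor maps, image size, and ground-truth pairs here are synthetic placeholders, not the authors' data or code.

```python
import numpy as np

def match_pixel(desc_a, desc_b, u, v):
    """Find the pixel in image B whose descriptor is nearest (L2) to A's pixel (u, v).

    desc_a, desc_b: (H, W, D) dense descriptor maps produced by the network.
    """
    query = desc_a[v, u]                                   # (D,)
    dists = np.linalg.norm(desc_b - query, axis=-1)        # (H, W)
    vb, ub = np.unravel_index(np.argmin(dists), dists.shape)
    return ub, vb

def correspondence_rmse(desc_a, desc_b, gt_pairs):
    """RMSE in pixels between predicted and ground-truth matches."""
    errs = []
    for (ua, va), (ub_gt, vb_gt) in gt_pairs:
        ub, vb = match_pixel(desc_a, desc_b, ua, va)
        errs.append((ub - ub_gt) ** 2 + (vb - vb_gt) ** 2)
    return float(np.sqrt(np.mean(errs)))

# Toy usage with random descriptor maps standing in for network outputs.
rng = np.random.default_rng(0)
da, db = rng.normal(size=(64, 64, 3)), rng.normal(size=(64, 64, 3))
pairs = [((10, 12), (11, 13)), ((40, 20), (39, 22))]
print(correspondence_rmse(da, db, pairs))
```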
https://arxiv.org/abs/2405.08996
Many query-based approaches for 3D Multi-Object Tracking (MOT) adopt the tracking-by-attention paradigm, utilizing track queries for identity-consistent detection and object queries for identity-agnostic track spawning. Tracking-by-attention, however, entangles detection and tracking queries in one embedding for both the detection and tracking tasks, which is sub-optimal. Other approaches resemble the tracking-by-detection paradigm, detecting objects using decoupled track and detection queries followed by a subsequent association step. These methods, however, do not leverage synergies between the detection and association tasks. Combining the strengths of both paradigms, we introduce ADA-Track, a novel end-to-end framework for 3D MOT from multi-view cameras. We introduce a learnable data association module based on edge-augmented cross-attention, leveraging appearance and geometric features. Furthermore, we integrate this association module into the decoder layer of a DETR-based 3D detector, enabling simultaneous DETR-like query-to-image cross-attention for detection and query-to-query cross-attention for data association. By stacking these decoder layers, queries are refined for the detection and association tasks alternately, effectively harnessing the task dependencies. We evaluate our method on the nuScenes dataset and demonstrate the advantage of our approach compared to the two previous paradigms. Code is available at this https URL.
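A minimal sketch of what query-to-query attention with an edge (pairwise) term can look like for association, intended only to make the mechanism concrete. The module name, dimensions, single-head formulation, and the use of BEV-centre offsets as the geometric edge feature are all illustrative assumptions, not the ADA-Track implementation.

```python
import torch
import torch.nn as nn

class EdgeAugmentedAssociation(nn.Module):
    """Simplified sketch of query-to-query attention with an edge-feature bias."""
    def __init__(self, d_model=64):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Edge features built from pairwise geometric offsets (an assumption).
        self.edge_mlp = nn.Sequential(nn.Linear(2, d_model), nn.ReLU(),
                                      nn.Linear(d_model, 1))
        self.scale = d_model ** -0.5

    def forward(self, track_q, det_q, track_xy, det_xy):
        # track_q: (T, D), det_q: (N, D); *_xy: (T, 2) / (N, 2) BEV centres.
        q, k, v = self.q_proj(track_q), self.k_proj(det_q), self.v_proj(det_q)
        logits = (q @ k.t()) * self.scale                      # (T, N)
        offsets = track_xy[:, None, :] - det_xy[None, :, :]    # (T, N, 2)
        logits = logits + self.edge_mlp(offsets).squeeze(-1)   # edge-augmented logits
        attn = logits.softmax(dim=-1)                          # soft association scores
        return attn, attn @ v                                  # scores, refined track queries

# Toy usage: 3 track queries attending to 5 detection queries.
m = EdgeAugmentedAssociation()
scores, refined = m(torch.randn(3, 64), torch.randn(5, 64),
                    torch.randn(3, 2), torch.randn(5, 2))
print(scores.shape, refined.shape)  # (3, 5), (3, 64)
```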
https://arxiv.org/abs/2405.08909
With the advancement of video analysis technology, the multi-object tracking (MOT) problem in complex scenes involving pedestrians is gaining increasing importance. This challenge primarily involves two key tasks: pedestrian detection and re-identification. While significant progress has been achieved in pedestrian detection tasks in recent years, enhancing the effectiveness of re-identification tasks remains a persistent challenge. This difficulty arises from the large total number of pedestrian samples in multi-object tracking datasets and the scarcity of individual instance samples. Motivated by recent rapid advancements in meta-learning techniques, we introduce MAML MOT, a meta-learning-based training approach for multi-object tracking. This approach leverages the rapid learning capability of meta-learning to tackle the issue of sample scarcity in pedestrian re-identification tasks, aiming to improve the model's generalization performance and robustness. Experimental results demonstrate that the proposed method achieves high accuracy on mainstream datasets in the MOT Challenge. This offers new perspectives and solutions for research in the field of pedestrian multi-object tracking.
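The MAML-style inner/outer loop underlying such a training approach can be sketched as follows: each "task" provides a support set for fast adaptation and a query set for the meta-update. The tiny linear embedding, the toy pairwise loss, the learning rates, and the task construction are illustrative assumptions, not the MAML MOT setup from the paper.

```python
import torch

def embed(x, w):
    # Tiny linear embedding "network": x (B, F), w (F, D) -> L2-normalised (B, D).
    z = x @ w
    return z / z.norm(dim=-1, keepdim=True).clamp_min(1e-8)

def pair_loss(z, labels):
    # Toy re-ID loss: pull same-identity embeddings together, push others apart.
    sim = z @ z.t()
    same = (labels[:, None] == labels[None, :]).float()
    return ((sim - same) ** 2).mean()

def maml_step(w, tasks, inner_lr=0.1, outer_lr=0.01, inner_steps=1):
    """One meta-update over a batch of re-ID 'tasks' (support/query splits)."""
    meta_grad = torch.zeros_like(w)
    for xs, ys, xq, yq in tasks:
        w_fast = w
        for _ in range(inner_steps):                        # inner adaptation
            loss = pair_loss(embed(xs, w_fast), ys)
            g, = torch.autograd.grad(loss, w_fast, create_graph=True)
            w_fast = w_fast - inner_lr * g
        outer_loss = pair_loss(embed(xq, w_fast), yq)        # evaluate adapted params
        meta_grad += torch.autograd.grad(outer_loss, w)[0]   # meta-gradient w.r.t. w
    return w - outer_lr * meta_grad / len(tasks)

# Toy usage: two tasks, each with support/query sets of 4 samples and 8 features.
torch.manual_seed(0)
w = torch.randn(8, 4, requires_grad=True)
tasks = [(torch.randn(4, 8), torch.tensor([0, 0, 1, 1]),
          torch.randn(4, 8), torch.tensor([0, 1, 0, 1])) for _ in range(2)]
w = maml_step(w, tasks)
print(w.shape)
```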
https://arxiv.org/abs/2405.07272
The main barrier to achieving fully autonomous flights lies in autonomous aircraft navigation. Managing non-cooperative traffic presents the most important challenge in this problem. The most efficient strategy for handling non-cooperative traffic is based on monocular video processing through deep learning models. This study contributes to the vision-based deep learning aircraft detection and tracking literature by investigating the impact of data corruption arising from environmental and hardware conditions on the effectiveness of these methods. More specifically, we designed 7 types of common corruptions for camera inputs, taking into account real-world flight conditions. By applying these corruptions to the Airborne Object Tracking (AOT) dataset, we constructed the first robustness benchmark dataset, named AOT-C, for air-to-air aerial object detection. The corruptions included in this dataset cover a wide range of challenging conditions such as adverse weather and sensor noise. The second main contribution of this letter is an extensive experimental evaluation involving 8 diverse object detectors to explore the degradation in performance under escalating levels of corruption (domain shifts). Based on the evaluation results, the key observations that emerge are the following: 1) one-stage detectors of the YOLO family demonstrate better robustness, 2) transformer-based and multi-stage detectors like Faster R-CNN are extremely vulnerable to corruptions, and 3) robustness against corruptions is related to the generalization ability of models. The third main contribution is to show that fine-tuning on our augmented synthetic data improves the generalization ability of the object detector in real-world flight experiments.
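To make the idea of synthetic corruptions concrete, here is a small sketch of three generic camera corruptions (sensor noise, under-exposure, motion blur) applied to a float image. The specific corruption types, severities, and parameters of the paper's benchmark may differ, so everything below is illustrative.

```python
import numpy as np

def gaussian_noise(img, sigma=0.08):
    """Sensor-noise corruption: additive Gaussian noise on a float image in [0, 1]."""
    return np.clip(img + np.random.normal(0.0, sigma, img.shape), 0.0, 1.0)

def low_light(img, gamma=2.5):
    """Under-exposure corruption: gamma curve that darkens the frame."""
    return np.clip(img, 0.0, 1.0) ** gamma

def motion_blur_horizontal(img, k=9):
    """Crude horizontal motion blur via a 1-D box filter along the width axis."""
    kernel = np.ones(k) / k
    out = np.empty_like(img)
    for c in range(img.shape[2]):
        out[..., c] = np.apply_along_axis(
            lambda row: np.convolve(row, kernel, mode="same"), 1, img[..., c])
    return out

# Toy usage on a random "frame"; severity could be swept to build escalating levels.
frame = np.random.rand(120, 160, 3)
for corrupt in (gaussian_noise, low_light, motion_blur_horizontal):
    print(corrupt.__name__, corrupt(frame).shape)
```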
https://arxiv.org/abs/2405.06765
In the last twenty years, unmanned aerial vehicles (UAVs) have garnered growing interest due to their expanding applications in both military and civilian domains. Detecting non-cooperative aerial vehicles efficiently and estimating collisions accurately are pivotal for achieving fully autonomous aircraft and facilitating Advanced Air Mobility (AAM). This paper presents a deep-learning framework that utilizes optical sensors for the detection, tracking, and distance estimation of non-cooperative aerial vehicles. In implementing this comprehensive sensing framework, the availability of depth information is essential for enabling autonomous aerial vehicles to perceive and navigate around obstacles. In this work, we propose a method for estimating the distance of a detected aerial object in real time using only the input of a monocular camera. To train our deep learning components for the object detection, tracking, and depth estimation tasks, we utilize the Amazon Airborne Object Tracking (AOT) dataset. In contrast to previous approaches that integrate the depth estimation module into the object detector, our method formulates the problem as image-to-image translation. We employ a separate lightweight encoder-decoder network for efficient and robust depth estimation. In a nutshell, the object detection module identifies and localizes obstacles, conveying this information both to the tracking module for monitoring obstacle movement and to the depth estimation module for calculating distances. Our approach is evaluated on the Airborne Object Tracking (AOT) dataset, which is, to the best of our knowledge, the largest air-to-air airborne object dataset.
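A minimal sketch of the image-to-image formulation: a small encoder-decoder that maps an RGB frame to a dense, positive depth map, trained with a simple supervised loss. The architecture, layer sizes, and loss are placeholders and not the network described in the paper.

```python
import torch
import torch.nn as nn

class TinyDepthNet(nn.Module):
    """Lightweight encoder-decoder mapping an RGB frame to a dense depth map."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.Softplus(),
        )

    def forward(self, rgb):                     # rgb: (B, 3, H, W)
        return self.decoder(self.encoder(rgb))  # depth: (B, 1, H, W), positive

net = TinyDepthNet()
depth = net(torch.randn(1, 3, 128, 128))
loss = nn.functional.l1_loss(depth, torch.rand(1, 1, 128, 128))  # e.g. a supervised L1 loss
print(depth.shape, float(loss))
```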
https://arxiv.org/abs/2405.06749
Knowing the locations of nearby moving objects is important for a mobile robot to operate safely in a dynamic environment. Dynamic object tracking performance can be improved if robots share observations of tracked objects with nearby team members in real-time. To share observations, a robot must make up-to-date estimates of the transformation from its coordinate frame to the frame of each neighbor, which can be challenging because of odometry drift. We present Multiple Object Tracking with Localization Error Elimination (MOTLEE), a complete system for a multi-robot team to accurately estimate frame transformations and collaboratively track dynamic objects. To accomplish this, robots use open-set image-segmentation methods to build object maps of their environment and then use our Temporally Consistent Alignment of Frames Filter (TCAFF) to align maps and estimate coordinate frame transformations without any initial knowledge of neighboring robot poses. We show that our method for aligning frames enables a team of four robots to collaboratively track six pedestrians with accuracy similar to that of a system with ground truth localization in a challenging hardware demonstration. The code and hardware dataset are available at this https URL.
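For intuition on frame alignment from shared object maps, the sketch below estimates a planar rotation and translation between two robots' coordinate frames from associated landmark positions using a standard least-squares (Kabsch/Umeyama-style) fit. The temporal-consistency filtering that TCAFF adds, and the data association itself, are omitted; the example data are synthetic.

```python
import numpy as np

def align_frames_2d(pts_a, pts_b):
    """Estimate R, t with pts_b ~= R @ pts_a + t from associated object-map points.

    pts_a, pts_b: (N, 2) arrays of matched landmark positions in two frames.
    """
    ca, cb = pts_a.mean(axis=0), pts_b.mean(axis=0)
    H = (pts_a - ca).T @ (pts_b - cb)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                 # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = cb - R @ ca
    return R, t

# Toy usage: frame B is frame A rotated by 30 degrees and shifted.
rng = np.random.default_rng(1)
a = rng.uniform(-5, 5, size=(6, 2))
theta = np.deg2rad(30)
R_true = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
b = a @ R_true.T + np.array([2.0, -1.0])
R_est, t_est = align_frames_2d(a, b)
print(np.allclose(R_est, R_true), np.round(t_est, 3))
```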
https://arxiv.org/abs/2405.05210
There is currently strong interest in improving visual object tracking by augmenting the RGB modality with the output of a visual event camera that is particularly informative about the scene motion. However, existing approaches perform event feature extraction for RGB-E tracking using traditional appearance models, which have been optimised for RGB-only tracking, without adapting them to the intrinsic characteristics of the event data. To address this problem, we propose an Event backbone (Pooler), designed to obtain a high-quality feature representation that is cognisant of the innate characteristics of the event data, namely its sparsity. In particular, Multi-Scale Pooling is introduced to capture all the motion feature trends within event data through the utilisation of diverse pooling kernel sizes. The association between the derived RGB and event representations is established by an innovative module performing adaptive Mutually Guided Fusion (MGF). Extensive experimental results show that our method significantly outperforms state-of-the-art trackers on two widely used RGB-E tracking datasets, VisEvent and COESOT, where the precision and success rates on COESOT are improved by 4.9% and 5.2%, respectively. Our code will be available at this https URL.
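A tiny sketch of the multi-scale pooling idea: the same event feature map is average-pooled at several kernel sizes and the results are concatenated, so that motion cues at different spatial extents are all represented. The kernel sizes, the use of average pooling, and the channel layout are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn.functional as F

def multi_scale_pool(event_feat, kernel_sizes=(3, 5, 7)):
    """Pool an event feature map at several kernel sizes and concatenate.

    event_feat: (B, C, H, W) tensor built from the event stream.
    """
    pooled = [event_feat]
    for k in kernel_sizes:
        pooled.append(F.avg_pool2d(event_feat, kernel_size=k,
                                   stride=1, padding=k // 2))
    return torch.cat(pooled, dim=1)           # (B, C * (1 + len(kernel_sizes)), H, W)

x = torch.randn(1, 8, 64, 64)
print(multi_scale_pool(x).shape)              # torch.Size([1, 32, 64, 64])
```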
https://arxiv.org/abs/2405.05004
Vision-centric autonomous driving has recently attracted wide attention due to its lower cost. Pre-training is essential for extracting a universal representation. However, current vision-centric pre-training typically relies on either 2D or 3D pretext tasks, overlooking the temporal characteristics of autonomous driving as a 4D scene understanding task. In this paper, we address this challenge by introducing a world model-based autonomous driving 4D representation learning framework, dubbed DriveWorld, which is capable of pre-training from multi-camera driving videos in a spatio-temporal fashion. Specifically, we propose a Memory State-Space Model for spatio-temporal modelling, which consists of a Dynamic Memory Bank module for learning temporal-aware latent dynamics to predict future changes and a Static Scene Propagation module for learning spatial-aware latent statics to offer comprehensive scene contexts. We additionally introduce a Task Prompt to decouple task-aware features for various downstream tasks. The experiments demonstrate that DriveWorld delivers promising results on various autonomous driving tasks. When pre-trained with the OpenScene dataset, DriveWorld achieves a 7.5% increase in mAP for 3D object detection, a 3.0% increase in IoU for online mapping, a 5.0% increase in AMOTA for multi-object tracking, a 0.1m decrease in minADE for motion forecasting, a 3.0% increase in IoU for occupancy prediction, and a 0.34m reduction in average L2 error for planning.
https://arxiv.org/abs/2405.04390
High-definition maps with accurate lane-level information are crucial for autonomous driving, but the creation of these maps is a resource-intensive process. To this end, we present a cost-effective solution to create lane-level roadmaps using only the global navigation satellite system (GNSS) and a camera on customer vehicles. Our proposed solution utilizes a prior standard-definition (SD) map, GNSS measurements, visual odometry, and lane marking edge detection points to simultaneously estimate the vehicle's 6D pose, its position within the SD map, and the 3D geometry of traffic lines. This is achieved using a Bayesian simultaneous localization and multi-object tracking filter, where the estimation of traffic lines is formulated as a multiple extended object tracking problem, solved using a trajectory Poisson multi-Bernoulli mixture (TPMBM) filter. In TPMBM filtering, traffic lines are modeled using B-spline trajectories, and each trajectory is parameterized by a sequence of control points. The proposed solution has been evaluated using experimental data collected by a test vehicle driving on a highway. Preliminary results show that the traffic line estimates, overlaid on the satellite image, generally align with the lane markings up to some lateral offsets.
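To illustrate the control-point parameterisation, the sketch below evaluates a 2-D B-spline traffic line from a handful of control points with scipy. The degree, the clamped uniform knot vector, and the example control points are assumptions chosen for illustration; the filter-side estimation of the control points is not shown.

```python
import numpy as np
from scipy.interpolate import BSpline

def traffic_line_from_control_points(ctrl_pts, degree=3, n_samples=50):
    """Evaluate a 2-D B-spline trajectory from its control points.

    ctrl_pts: (M, 2) control points (as used to parameterise each trajectory).
    Returns (n_samples, 2) points along the curve.
    """
    m = len(ctrl_pts)
    # Clamped knot vector: degree+1 repeated knots at both ends.
    inner = np.linspace(0, 1, m - degree + 1)
    knots = np.concatenate(([0.0] * degree, inner, [1.0] * degree))
    spline = BSpline(knots, np.asarray(ctrl_pts, dtype=float), degree)
    u = np.linspace(0, 1, n_samples)
    return spline(u)

ctrl = np.array([[0, 0], [10, 0.2], [20, 0.1], [30, 0.8], [40, 1.0]])
pts = traffic_line_from_control_points(ctrl)
print(pts.shape, pts[0], pts[-1])   # a clamped spline passes through the end control points
```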
https://arxiv.org/abs/2405.04290
We apply multi-agent deep reinforcement learning (RL) to train end-to-end robot soccer policies with fully onboard computation and sensing via egocentric RGB vision. This setting reflects many challenges of real-world robotics, including active perception, agile full-body control, and long-horizon planning in a dynamic, partially-observable, multi-agent domain. We rely on large-scale, simulation-based data generation to obtain complex behaviors from egocentric vision which can be successfully transferred to physical robots using low-cost sensors. To achieve adequate visual realism, our simulation combines rigid-body physics with learned, realistic rendering via multiple Neural Radiance Fields (NeRFs). We combine teacher-based multi-agent RL and cross-experiment data reuse to enable the discovery of sophisticated soccer strategies. We analyze active-perception behaviors, including object tracking and ball seeking, that emerge when simply optimizing perception-agnostic soccer play. The agents display levels of performance and agility equivalent to those of policies with access to privileged, ground-truth state. To our knowledge, this paper constitutes a first demonstration of end-to-end training for multi-agent robot soccer, mapping raw pixel observations to joint-level actions, that can be deployed in the real world. Videos of the game play and analyses can be seen on our website at this https URL.
https://arxiv.org/abs/2405.02425
In the rapidly advancing field of robotics, the fusion of state-of-the-art visual technologies with mobile robotic arms has emerged as a critical integration. This paper introduces a novel system that combines the Segment Anything Model (SAM) -- a transformer-based visual foundation model -- with a robotic arm on a mobile platform. Integrating a depth camera on the robotic arm's end-effector ensures continuous object tracking, significantly mitigating environmental uncertainties. By deploying on a mobile platform, our grasping system has enhanced mobility, playing a key role in dynamic environments where adaptability is critical. This synthesis enables dynamic object segmentation, tracking, and grasping. It also elevates user interaction, allowing the robot to respond intuitively to various modalities such as clicks, drawings, or voice commands, going beyond traditional robotic systems. Empirical assessments in both simulated and real-world settings demonstrate the system's capabilities. This configuration opens avenues for wide-ranging applications, from industrial settings, agriculture, and household tasks to specialized assignments and beyond.
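As a concrete illustration of turning a segmentation result plus end-effector depth into a grasp target, the sketch below back-projects the depth at a mask's centroid through a pinhole camera model to obtain a 3-D point in the camera frame. The mask source, the intrinsics, and the centroid heuristic are illustrative assumptions; the paper's full pipeline (orientation estimation, tracking, motion planning) is not reproduced here.

```python
import numpy as np

def mask_to_grasp_point(mask, depth, fx, fy, cx, cy):
    """Turn a segmentation mask + depth image into a 3-D point in the camera frame.

    Takes the depth at the mask centroid and back-projects it with the pinhole model.
    mask: (H, W) bool, e.g. from a click-prompted segmentation; depth: (H, W) metres.
    """
    vs, us = np.nonzero(mask)
    if len(us) == 0:
        return None
    u, v = int(us.mean()), int(vs.mean())      # mask centroid (pixel coordinates)
    z = float(depth[v, u])
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])

# Toy usage with a synthetic circular mask and a flat depth image.
H, W = 480, 640
vv, uu = np.mgrid[0:H, 0:W]
mask = (uu - 400) ** 2 + (vv - 200) ** 2 < 30 ** 2
depth = np.full((H, W), 0.8)
print(mask_to_grasp_point(mask, depth, fx=600, fy=600, cx=320, cy=240))
```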
https://arxiv.org/abs/2404.18720
Visual object tracking and segmentation in omnidirectional videos are challenging due to the wide field-of-view and large spherical distortion brought by 360° images. To alleviate these problems, we introduce a novel representation, extended bounding field-of-view (eBFoV), for target localization and use it as the foundation of a general 360 tracking framework which is applicable for both omnidirectional visual object tracking and segmentation tasks. Building upon our previous work on omnidirectional visual object tracking (360VOT), we propose a comprehensive dataset and benchmark that incorporates a new component called omnidirectional video object segmentation (360VOS). The 360VOS dataset includes 290 sequences accompanied by dense pixel-wise masks and covers a broader range of target categories. To support both the development and evaluation of algorithms in this domain, we divide the dataset into a training subset with 170 sequences and a testing subset with 120 sequences. Furthermore, we tailor evaluation metrics for both omnidirectional tracking and segmentation to ensure rigorous assessment. Through extensive experiments, we benchmark state-of-the-art approaches and demonstrate the effectiveness of our proposed 360 tracking framework and training dataset. Homepage: this https URL
https://arxiv.org/abs/2404.13953
Multi-object tracking (MOT) is a critical and challenging task in computer vision, particularly in situations involving objects with similar appearances but diverse movements, as seen in team sports. Current methods, largely reliant on object detection and appearance, often fail to track targets in such complex scenarios accurately. This limitation is further exacerbated by the lack of comprehensive and diverse datasets covering the full view of sports pitches. Addressing these issues, we introduce TeamTrack, a pioneering benchmark dataset specifically designed for MOT in sports. TeamTrack is an extensive collection of full-pitch video data from various sports, including soccer, basketball, and handball. Furthermore, we perform a comprehensive analysis and benchmarking effort to underscore TeamTrack's utility and potential impact. Our work signifies a crucial step forward, promising to elevate the precision and effectiveness of MOT in complex, dynamic settings such as team sports. The dataset, project code, and competition are released at this https URL.
https://arxiv.org/abs/2404.13868
With the current demand for automation in the agro-food industry, accurately detecting and localizing relevant objects in 3D is essential for successful robotic operations. However, this is a challenge due to the presence of occlusions. Multi-view perception approaches allow robots to overcome occlusions, but a tracking component is needed to associate the objects detected by the robot over multiple viewpoints. Multi-object tracking (MOT) algorithms can be categorized into two-stage and single-stage methods. Two-stage methods tend to be simpler to adapt and implement for custom applications, while single-stage methods present a more complex end-to-end tracking approach that can yield better results in occluded situations at the cost of more training data. The potential advantages of single-stage methods over two-stage methods depend on the complexity of the sequence of viewpoints that a robot needs to process. In this work, we compare a 3D two-stage MOT algorithm, 3D-SORT, against a 3D single-stage MOT algorithm, MOT-DETR, in three different types of sequences with varying levels of complexity. The sequences represent simpler and more complex motions that a robot arm can perform in a tomato greenhouse. Our experiments in a tomato greenhouse show that the single-stage algorithm consistently yields better tracking accuracy, especially in the more challenging sequences where objects are fully occluded or non-visible during several viewpoints.
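For readers unfamiliar with the two-stage setup, here is a minimal sketch of the detect-then-associate step: detections from a new viewpoint are matched to existing tracks by solving a Hungarian assignment over 3-D centroid distances. The distance gate, cost choice, and data are illustrative assumptions and do not reproduce 3D-SORT.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_centers, det_centers, max_dist=0.15):
    """Optimal assignment between existing tracks and new 3-D detections.

    Cost is the Euclidean distance between 3-D centroids; pairs farther apart
    than max_dist (metres, an illustrative gate) are rejected.
    """
    if len(track_centers) == 0 or len(det_centers) == 0:
        return [], list(range(len(track_centers))), list(range(len(det_centers)))
    cost = np.linalg.norm(track_centers[:, None, :] - det_centers[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_dist]
    matched_t = {r for r, _ in matches}
    matched_d = {c for _, c in matches}
    unmatched_tracks = [i for i in range(len(track_centers)) if i not in matched_t]
    unmatched_dets = [j for j in range(len(det_centers)) if j not in matched_d]
    return matches, unmatched_tracks, unmatched_dets

# Toy usage: two tracked fruit positions and two detections from a new viewpoint.
tracks = np.array([[0.10, 0.50, 1.20], [0.40, 0.45, 1.10]])
dets = np.array([[0.12, 0.49, 1.21], [0.90, 0.20, 1.00]])
print(associate(tracks, dets))
```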
https://arxiv.org/abs/2404.12963
Today, most methods for image understanding tasks rely on feed-forward neural networks. While this approach has allowed for empirical accuracy, efficiency, and task adaptation via fine-tuning, it also comes with fundamental disadvantages. Existing networks often struggle to generalize across different datasets, even on the same task. By design, these networks ultimately reason about high-dimensional scene features, which are challenging to analyze. This is especially true when attempting to predict 3D information from 2D images. We propose to recast 3D multi-object tracking from RGB cameras as an Inverse Rendering (IR) problem: we optimize, via a differentiable rendering pipeline, over the latent space of pre-trained 3D object representations and retrieve the latents that best represent the object instances in a given input image. To this end, we optimize an image loss over generative latent spaces that inherently disentangle shape and appearance properties. Beyond offering an alternate take on tracking, our method also enables examining the generated objects, reasoning about failure situations, and resolving ambiguous cases. We validate the generalization and scaling capabilities of our method by learning the generative prior exclusively from synthetic data and assessing camera-based 3D tracking on the nuScenes and Waymo datasets. Both datasets are completely unseen by our method and require no fine-tuning. Videos and code are available at this https URL.
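The core loop (optimizing latents through a differentiable renderer against an image loss) can be sketched with a toy stand-in: here the "renderer" draws a Gaussian blob parameterised by a three-dimensional latent, and gradient descent on the photometric loss moves the latent toward the target's position and scale. The real pipeline decodes pre-trained 3-D shape/appearance latents; everything below is a self-contained toy.

```python
import torch

def toy_render(latent, size=32):
    """Stand-in differentiable 'renderer': latent = (cx, cy, log_radius) -> image."""
    cx, cy, log_r = latent
    ys, xs = torch.meshgrid(torch.arange(size, dtype=torch.float32),
                            torch.arange(size, dtype=torch.float32), indexing="ij")
    r2 = (xs - cx) ** 2 + (ys - cy) ** 2
    return torch.exp(-r2 / (2 * torch.exp(log_r) ** 2))

# Target "observation" and latent optimisation via the image loss.
target = toy_render(torch.tensor([20.0, 12.0, 1.2])).detach()
z = torch.tensor([14.0, 10.0, 1.5], requires_grad=True)
opt = torch.optim.Adam([z], lr=0.3)
for _ in range(300):
    opt.zero_grad()
    loss = ((toy_render(z) - target) ** 2).mean()
    loss.backward()
    opt.step()
print(z.detach().numpy(), float(loss))   # the latent should move toward (20, 12, 1.2)
```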
https://arxiv.org/abs/2404.12359
A recent trend in multi-object tracking is to track objects of interest using natural language. However, the scarcity of paired prompt-instance data hinders its progress. To address this challenge, we propose a high-quality yet low-cost data generation method based on Unreal Engine 5 and construct a brand-new benchmark dataset, named Refer-UE-City, which primarily includes scenes from intersection surveillance videos, detailing the appearance and actions of people and vehicles. Specifically, it provides 14 videos with a total of 714 expressions, and is comparable in scale to the Refer-KITTI dataset. Additionally, we propose a multi-level semantic-guided multi-object tracking framework called MLS-Track, in which the interaction between the model and the text is enhanced layer by layer through the introduction of a Semantic Guidance Module (SGM) and a Semantic Correlation Branch (SCB). Extensive experiments on the Refer-UE-City and Refer-KITTI datasets demonstrate the effectiveness of our proposed framework, which achieves state-of-the-art performance. Code and datasets will be made available.
https://arxiv.org/abs/2404.12031
Vision sensors are versatile and can capture a wide range of visual cues, such as color, texture, shape, and depth. This versatility, along with the relatively inexpensive availability of machine vision cameras, has played an important role in the adoption of vision-based environment perception systems in autonomous vehicles (AVs). However, vision-based perception systems can be easily affected by glare in the presence of a bright source of light, such as the sun or the headlights of an oncoming vehicle at night, or simply by light reflecting off snow- or ice-covered surfaces; such scenarios are encountered frequently during driving. In this paper, we investigate various glare reduction techniques, including the proposed saturated-pixel-aware glare reduction technique, for improved performance of the computer vision (CV) tasks employed by the perception layer of AVs. We evaluate these glare reduction methods based on various performance metrics of the CV algorithms used by the perception layer. Specifically, we consider object detection, object recognition, object tracking, depth estimation, and lane detection, which are crucial for autonomous driving. The experimental findings validate the efficacy of the proposed glare reduction approach, showcasing enhanced performance across diverse perception tasks and remarkable resilience against varying levels of glare.
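As a rough illustration of a saturated-pixel-aware operation (not the paper's proposed method), the sketch below flags pixels whose luminance exceeds a saturation threshold and blends them toward a heavily blurred local estimate. The threshold, blur window, and blend strength are arbitrary illustrative parameters.

```python
import numpy as np

def reduce_glare(img, sat_thresh=0.98, strength=0.6):
    """Very simple saturated-pixel-aware glare attenuation on a float RGB image."""
    lum = img @ np.array([0.299, 0.587, 0.114])        # (H, W) luminance
    glare_mask = lum > sat_thresh                      # saturated / glare-affected pixels
    # Local context: a heavily blurred copy via a box filter built from shifted copies.
    pad = 7
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    blurred = np.zeros_like(img)
    for dy in range(-pad, pad + 1):
        for dx in range(-pad, pad + 1):
            blurred += padded[pad + dy: pad + dy + img.shape[0],
                              pad + dx: pad + dx + img.shape[1]]
    blurred /= (2 * pad + 1) ** 2
    out = img.copy()
    out[glare_mask] = (1 - strength) * img[glare_mask] + strength * blurred[glare_mask]
    return out

# Toy usage: a random frame with a few artificially saturated pixels.
frame = np.clip(np.random.rand(100, 100, 3) + 0.5 * (np.random.rand(100, 100, 1) > 0.95), 0, 1)
print(reduce_glare(frame).shape)
```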
https://arxiv.org/abs/2404.10992
State-of-the-art (SOTA) trackers have shown remarkable Multiple Object Tracking (MOT) performance when trained and evaluated on current benchmarks. However, these benchmarks primarily consist of clear scenarios, overlooking adverse atmospheric conditions such as fog, haze, smoke, and dust. As a result, the robustness of SOTA trackers remains underexplored. To address these limitations, we propose a pipeline for physics-based volumetric fog simulation in arbitrary real-world MOT datasets, utilizing frame-by-frame monocular depth estimation and a fog formation optical model. Moreover, we enhance our simulation by rendering both homogeneous and heterogeneous fog effects. We propose to use the dark channel prior method to estimate the fog (smoke) color, which shows promising results even in night and indoor scenes. We present the leading tracking benchmark MOTChallenge (MOT17 dataset) overlaid with fog (smoke for indoor scenes) of various intensity levels and conduct a comprehensive evaluation of SOTA MOT methods, revealing their limitations under fog and fog-like challenges.
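The fog formation optical model and the dark-channel-prior colour estimate mentioned above can be sketched directly. Below, fog is composited as I = J*t + A*(1 - t) with transmission t = exp(-beta * depth), and the atmospheric colour A is taken from the brightest dark-channel pixels; the patch size, beta, and the synthetic depth map are illustrative assumptions.

```python
import numpy as np

def dark_channel(img, patch=15):
    """Dark channel prior: per-pixel minimum over channels and a local patch."""
    mins = img.min(axis=2)
    pad = patch // 2
    padded = np.pad(mins, pad, mode="edge")
    out = np.empty_like(mins)
    for v in range(mins.shape[0]):
        for u in range(mins.shape[1]):
            out[v, u] = padded[v:v + patch, u:u + patch].min()
    return out

def estimate_fog_color(img, top_frac=0.001):
    """Pick the fog (atmospheric light) colour from the brightest dark-channel pixels."""
    dc = dark_channel(img)
    n = max(1, int(top_frac * dc.size))
    idx = np.argsort(dc.ravel())[-n:]
    return img.reshape(-1, 3)[idx].mean(axis=0)

def add_fog(img, depth, beta=0.8, fog_color=None):
    """Optical fog model: I = J * t + A * (1 - t), with t = exp(-beta * depth)."""
    A = np.ones(3) if fog_color is None else fog_color
    t = np.exp(-beta * depth)[..., None]
    return img * t + A * (1.0 - t)

# Toy usage: homogeneous fog on a random frame with a synthetic depth ramp.
img = np.random.rand(80, 120, 3)
depth = np.tile(np.linspace(0.5, 5.0, 120), (80, 1))
foggy = add_fog(img, depth, beta=0.5, fog_color=estimate_fog_color(img))
print(foggy.shape, foggy.min(), foggy.max())
```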
https://arxiv.org/abs/2404.10534
We introduce Gaga, a framework that reconstructs and segments open-world 3D scenes by leveraging inconsistent 2D masks predicted by zero-shot segmentation models. In contrast to prior 3D scene segmentation approaches that heavily rely on video object tracking, Gaga utilizes spatial information and effectively associates object masks across diverse camera poses. By eliminating the assumption of continuous view changes in training images, Gaga demonstrates robustness to variations in camera poses, which is particularly beneficial for sparsely sampled images, ensuring precise mask label consistency. Furthermore, Gaga accommodates 2D segmentation masks from diverse sources and demonstrates robust performance with different open-world zero-shot segmentation models, enhancing its versatility. Extensive qualitative and quantitative evaluations demonstrate that Gaga performs favorably against state-of-the-art methods, emphasizing its potential for real-world applications such as scene understanding and manipulation.
https://arxiv.org/abs/2404.07977
This paper introduces SFSORT, the world's fastest multi-object tracking system based on experiments conducted on MOT Challenge datasets. To achieve an accurate and computationally efficient tracker, this paper employs a tracking-by-detection method, following the online real-time tracking approach established in prior literature. By introducing a novel cost function called the Bounding Box Similarity Index, this work eliminates the Kalman Filter, leading to reduced computational requirements. Additionally, this paper demonstrates the impact of scene features on enhancing object-track association and improving track post-processing. Using a 2.2 GHz Intel Xeon CPU, the proposed method achieves an HOTA of 61.7% with a processing speed of 2242 Hz on the MOT17 dataset and an HOTA of 60.9% with a processing speed of 304 Hz on the MOT20 dataset. The tracker's source code, fine-tuned object detection model, and tutorials are available at this https URL.
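To make the Kalman-filter-free association idea concrete, the sketch below matches last-frame track boxes directly to current detections by maximising a box-similarity matrix with the Hungarian algorithm. Plain IoU stands in for the paper's Bounding Box Similarity Index, whose exact formula is not given here, and the similarity gate is an arbitrary illustrative value.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def associate_without_kalman(track_boxes, det_boxes, min_sim=0.2):
    """Associate last-frame track boxes directly with current detections.

    Without a Kalman filter, the last observed box itself serves as the track's
    prediction; plain IoU is used here as the similarity measure.
    """
    sim = np.array([[iou(t, d) for d in det_boxes] for t in track_boxes])
    rows, cols = linear_sum_assignment(-sim)          # maximise total similarity
    return [(r, c) for r, c in zip(rows, cols) if sim[r, c] >= min_sim]

# Toy usage: two tracks and three detections in pixel coordinates.
tracks = [(10, 10, 50, 80), (200, 40, 260, 120)]
dets = [(202, 42, 258, 118), (12, 12, 52, 82), (400, 10, 430, 60)]
print(associate_without_kalman(tracks, dets))         # [(0, 1), (1, 0)]
```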
https://arxiv.org/abs/2404.07553