Point cloud filtering is a fundamental 3D vision task that aims to remove noise while recovering the underlying clean surfaces. State-of-the-art methods remove noise by moving noisy points along stochastic trajectories toward the clean surfaces. These methods often require regularization within the training objective and/or during post-processing to ensure fidelity. In this paper, we introduce StraightPCF, a new deep learning based method for point cloud filtering. It moves noisy points along straight paths, thus reducing discretization error while ensuring faster convergence to the clean surfaces. We model noisy patches as intermediate states between high-noise patch variants and their clean counterparts, and design a VelocityModule to infer a constant flow velocity from the former to the latter. This constant flow leads to straight filtering trajectories. In addition, we introduce a DistanceModule that scales the straight trajectory using an estimated distance scalar so that points converge near the clean surface. Our network is lightweight, with only $\sim530K$ parameters, about 17% of the size of IterativePFN (a recent state-of-the-art point cloud filtering network). Extensive experiments on both synthetic and real-world data show that our method achieves state-of-the-art results. It also yields well-distributed filtered points without the need for regularization. The implementation code can be found at: this https URL.
https://arxiv.org/abs/2405.08322
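A minimal sketch of the straight-trajectory filtering idea behind StraightPCF: a constant (unit) flow direction scaled by an estimated distance, integrated in equal sub-steps. The `velocity_net` and `distance_net` callables below are hypothetical stand-ins for the paper's VelocityModule and DistanceModule, not its actual networks.

```python
import numpy as np

def straight_filter(noisy_patch, velocity_net, distance_net, n_steps=4):
    """Move noisy points along a straight path: a constant unit velocity
    scaled by an estimated distance, integrated in equal sub-steps."""
    v = velocity_net(noisy_patch)             # (N, 3) constant flow direction
    v = v / (np.linalg.norm(v, axis=1, keepdims=True) + 1e-8)
    d = distance_net(noisy_patch)              # (N, 1) distance to the clean surface
    step = (v * d) / n_steps                   # equal sub-steps along the straight path
    x = noisy_patch.copy()
    for _ in range(n_steps):                   # straight trajectory: direction never changes
        x = x + step
    return x

# Toy stand-ins for the learned modules (assumptions, not the paper's networks).
if __name__ == "__main__":
    pts = np.random.randn(128, 3) * 0.05 + np.array([0.0, 0.0, 1.0])
    vel = lambda p: np.tile([0.0, 0.0, -1.0], (len(p), 1))   # push points toward the z = 0.95 plane
    dist = lambda p: (p[:, 2:3] - 0.95)
    filtered = straight_filter(pts, vel, dist)
    print(filtered[:, 2].mean())   # ~0.95
```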
Place recognition is the foundation for enabling autonomous systems to achieve independent decision-making and safe operation. It is also crucial for tasks such as loop closure detection and global localization within SLAM. Previous methods utilize plain point cloud representations as input, while deep learning-based LiDAR Place Recognition (LPR) approaches employ different point cloud image inputs with convolutional neural networks (CNNs) or transformer architectures. However, the recently proposed Mamba deep learning model, combined with state space models (SSMs), holds great potential for long sequence modeling. Therefore, we developed OverlapMamba, a novel network for place recognition that represents input range views (RVs) as sequences. In a novel way, we employ a stochastic reconstruction approach to build shift state space models, compressing the visual representation. Evaluated on three different public datasets, our method effectively detects loop closures and remains robust even when previously visited locations are traversed from different directions. Relying on raw range view inputs, it outperforms typical LiDAR and multi-view combination methods in time complexity and speed, indicating strong place recognition capability and real-time efficiency.
https://arxiv.org/abs/2405.07966
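OverlapMamba consumes range views (RVs). Below is a minimal sketch of the standard spherical projection that turns a LiDAR point cloud into an RV image; the field-of-view bounds and resolution are generic assumptions, not values taken from the paper.

```python
import numpy as np

def to_range_view(points, h=64, w=900, fov_up_deg=3.0, fov_down_deg=-25.0):
    """Project an (N, 3) point cloud into an (h, w) range image (spherical projection)."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1) + 1e-8
    yaw = np.arctan2(y, x)                            # [-pi, pi]
    pitch = np.arcsin(z / r)
    fov_up, fov_down = np.radians(fov_up_deg), np.radians(fov_down_deg)
    u = 0.5 * (1.0 - yaw / np.pi) * w                 # column from azimuth
    v = (fov_up - pitch) / (fov_up - fov_down) * h    # row from elevation
    u = np.clip(np.floor(u), 0, w - 1).astype(int)
    v = np.clip(np.floor(v), 0, h - 1).astype(int)
    rv = np.zeros((h, w), dtype=np.float32)
    order = np.argsort(-r)                            # write far points first, keep nearest
    rv[v[order], u[order]] = r[order]
    return rv

rv = to_range_view(np.random.randn(10000, 3) * 10)
print(rv.shape)  # (64, 900)
```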
We present SceneFactory, a workflow-centric and unified framework for incremental scene modeling that conveniently supports a wide range of applications, such as (unposed and/or uncalibrated) multi-view depth estimation, LiDAR completion, (dense) RGB-D/RGB-L/Mono/Depth-only reconstruction, and SLAM. The workflow-centric design uses multiple blocks as the basis for building different production lines. The supported applications, i.e., productions, avoid redundancy in their designs, so the focus lies on each block itself for independent expansion. To support all input combinations, our implementation consists of four building blocks in SceneFactory: (1) Mono-SLAM, (2) depth estimation, (3) flexion, and (4) scene reconstruction. Furthermore, we propose an unposed & uncalibrated multi-view depth estimation model (U2-MVD) to estimate dense geometry. U2-MVD exploits dense bundle adjustment to solve for poses, intrinsics, and inverse depth, after which a semantic-aware ScaleCov step completes the multi-view depth. Relying on U2-MVD, SceneFactory both supports user-friendly 3D creation (with just images) and bridges the applications of Dense RGB-D and Dense Mono. For high-quality surface and color reconstruction, we propose dual-purpose Multi-resolutional Neural Points (DM-NPs) for the first surface-accessible Surface Color Field design, where we introduce Improved Point Rasterization (IPR) for point cloud based surface query. We implement and experiment with SceneFactory to demonstrate its broad practicability and high flexibility. Its quality also matches or exceeds tightly-coupled state-of-the-art approaches in all tasks. We contribute the code to the community (this https URL).
https://arxiv.org/abs/2405.07847
Point cloud registration is a fundamental task for estimating rigid transformations between point clouds. Previous studies have used geometric information for feature extraction, matching, and transformation estimation. Recently, owing to the advancement of RGB-D sensors, researchers have attempted to utilize visual information to improve registration performance. However, these studies focus on extracting distinctive features through deep feature fusion, which cannot effectively mitigate the negative effects of each feature's weaknesses, nor sufficiently leverage the valid information. In this paper, we propose a new feature combination framework that applies a looser but more effective fusion and achieves better performance. An explicit filter based on transformation consistency is designed for the combination framework and can overcome each feature's weaknesses. We further propose an adaptive threshold determined by the error distribution to extract more valid information from the two types of features. Owing to this distinctive design, our proposed framework can estimate more accurate correspondences and is applicable to both hand-crafted and learning-based feature descriptors. Experiments on ScanNet show that our method achieves state-of-the-art performance with a rotation accuracy of 99.1%.
https://arxiv.org/abs/2405.07594
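A minimal sketch of the transformation-consistency filtering idea with an adaptive threshold derived from the error distribution. The Kabsch solver and the mean-plus-k-sigma threshold rule are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def kabsch(src, dst):
    """Least-squares rigid transform (R, t) mapping src to dst."""
    cs, cd = src.mean(0), dst.mean(0)
    H = (src - cs).T @ (dst - cd)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                 # fix reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, cd - R @ cs

def consistency_filter(src, dst, k=1.0):
    """Keep correspondences whose residual under the estimated transform is
    below an adaptive threshold (mean + k * std of the residuals)."""
    R, t = kabsch(src, dst)
    residuals = np.linalg.norm((src @ R.T + t) - dst, axis=1)
    thresh = residuals.mean() + k * residuals.std()
    return residuals < thresh

# Correspondences from two feature types (geometric / visual) could each be
# filtered this way and then combined.
src = np.random.rand(100, 3)
dst = src + np.array([0.1, -0.2, 0.3])
dst[::10] += 0.5                             # inject outliers
print(consistency_filter(src, dst).sum())
```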
High-resolution road representations are a key factor for the success of (highly) automated driving functions. These representations, for example high-definition (HD) maps, contain accurate information on a multitude of factors, among others road geometry, lane information, and traffic signs. As the complexity and functionality of automated driving functions grow, the requirements on testing and evaluation grow continuously as well, leading to an increasing interest in virtual test drives for evaluation purposes. As roads play a crucial role in traffic flow, accurate real-world representations are needed, especially when deriving realistic driving behavior data. This paper proposes a novel approach to generate realistic road representations based solely on point cloud information, independent of the LiDAR sensor and mounting position, and without the need for odometry data, multi-sensor fusion, machine learning, or highly accurate calibration. As the primary use case is simulation, we use the OpenDRIVE format for evaluation.
https://arxiv.org/abs/2405.07544
In the character animation field, modern supervised keyframe interpolation models have demonstrated exceptional performance in constructing natural human motions from sparse pose definitions. As supervised models, they require large motion datasets to facilitate the learning process; however, since motion is represented with fixed hierarchical skeletons, such datasets are incompatible with skeletons outside their native configurations. Consequently, the limited availability of motion datasets for a desired skeleton severely hinders the feasibility of learned interpolation in practice. To combat this limitation, we propose Point Cloud-based Motion Representation Learning (PC-MRL), an unsupervised approach that enables cross-compatibility between skeletons for motion interpolation learning. PC-MRL consists of a skeleton obfuscation strategy using temporal point cloud sampling and an unsupervised skeleton reconstruction method from point clouds. We devise a temporal point-wise K-nearest neighbors loss for unsupervised learning. Moreover, we propose First-frame Offset Quaternion (FOQ) and Rest Pose Augmentation (RPA) strategies to overcome inherent limitations of our unsupervised point cloud-to-skeletal motion process. Comprehensive experiments demonstrate the effectiveness of PC-MRL in motion interpolation for desired skeletons without supervision from native datasets.
https://arxiv.org/abs/2405.07444
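A minimal sketch of a temporal point-wise K-nearest-neighbors loss of the kind PC-MRL describes: for each sampled point in each frame, average its distance to the K closest target points in the same frame. The value of K and the mean reduction are assumptions.

```python
import numpy as np

def temporal_knn_loss(pred_seq, target_seq, k=4):
    """pred_seq: (T, N, 3), target_seq: (T, M, 3) point cloud sequences.
    For every predicted point, penalize the mean distance to its k nearest
    target points in the corresponding frame."""
    loss = 0.0
    for pred, target in zip(pred_seq, target_seq):
        d = np.linalg.norm(pred[:, None, :] - target[None, :, :], axis=-1)  # (N, M)
        knn = np.sort(d, axis=1)[:, :k]                                     # (N, k) nearest distances
        loss += knn.mean()
    return loss / len(pred_seq)

pred = np.random.rand(8, 64, 3)
target = pred + 0.01 * np.random.randn(8, 64, 3)
print(temporal_knn_loss(pred, target))
```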
Global point clouds that correctly represent the static environment features can facilitate accurate localization and robust path planning. However, dynamic objects introduce undesired ghost tracks that are mixed into the static environment. Existing dynamic removal methods normally fail to balance computational efficiency and accuracy. In response, we present BeautyMap to efficiently remove dynamic points while retaining static features for high-fidelity global maps. Our approach utilizes a binary-encoded matrix to efficiently extract environment features. With a bit-wise comparison between the matrix of each frame and that of the corresponding map region, we can extract potential dynamic regions. We then use coarse-to-fine hierarchical segmentation along the $z$-axis to handle terrain variations. The final static restoration module accounts for the range visibility of each single scan and protects static points that are out of sight. Comparative experiments underscore BeautyMap's superior accuracy and efficiency against other dynamic point removal methods. The code is open-sourced at this https URL.
https://arxiv.org/abs/2405.07283
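A minimal sketch of the binary-encoding idea in BeautyMap: each grid cell stores the occupancy of its vertical bins as bits in one integer, so comparing a scan against the map reduces to bit-wise operations. The grid resolution and the simple XOR comparison are assumptions; the paper's rule additionally accounts for range visibility and terrain.

```python
import numpy as np

def encode_binary_matrix(points, xy_res=1.0, z_res=0.5, z_min=-2.0, n_bits=32):
    """Return {(ix, iy): bitmask} where bit b is set if any point falls in z-bin b."""
    grid = {}
    ix = np.floor(points[:, 0] / xy_res).astype(int)
    iy = np.floor(points[:, 1] / xy_res).astype(int)
    iz = np.clip(np.floor((points[:, 2] - z_min) / z_res).astype(int), 0, n_bits - 1)
    for cx, cy, cz in zip(ix, iy, iz):
        grid[(cx, cy)] = grid.get((cx, cy), 0) | (1 << int(cz))
    return grid

def potential_dynamic_cells(scan_grid, map_grid):
    """Cells whose z-bin occupancy differs between the scan and the map."""
    cells = set(scan_grid) & set(map_grid)
    return {c for c in cells if scan_grid[c] ^ map_grid[c]}

map_pts = np.random.rand(5000, 3) * [50, 50, 2]
scan_pts = np.vstack([map_pts[:500], np.array([[10.0, 10.0, 1.5]])])
print(len(potential_dynamic_cells(encode_binary_matrix(scan_pts),
                                  encode_binary_matrix(map_pts))))
```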
This paper presents an approach to teleoperate a manipulator using a mobile phone as the leader device. Using its IMU and camera, the phone estimates its Cartesian pose, which is then used to control the Cartesian pose of the robot's tool. The user receives visual feedback in the form of multi-view video: a point cloud rendered in a virtual reality environment. This enables the user to observe the scene from any position. To increase immersion, the robot's estimate of external forces is relayed through the phone's haptic actuator. Leader and follower are connected through wireless networks such as 5G or Wi-Fi. The paper describes the setup and analyzes its performance.
https://arxiv.org/abs/2405.07128
To substantially enhance robot intelligence, there is a pressing need to develop a large model that enables general-purpose robots to proficiently undertake a broad spectrum of manipulation tasks, akin to the versatile task-planning ability exhibited by LLMs. The vast diversity in objects, robots, and manipulation tasks presents huge challenges. Our work introduces a comprehensive framework to develop a foundation model for general robotic manipulation that formalizes a manipulation task as contact synthesis. Specifically, our model takes as input object and robot manipulator point clouds, object physical attributes, target motions, and manipulation region masks. It outputs contact points on the object and the associated contact forces or post-contact motions for robots to achieve the desired manipulation task. We perform extensive experiments in both simulation and real-world settings, manipulating articulated rigid objects, rigid objects, and deformable objects of varying dimensionality, ranging from one-dimensional objects like ropes to two-dimensional objects like cloth and extending to three-dimensional objects such as plasticine. Our model achieves average success rates of around 90%. Supplementary materials and videos are available on our project website at this https URL.
https://arxiv.org/abs/2405.06964
Recognizing human actions from point cloud sequence has attracted tremendous attention from both academia and industry due to its wide applications. However, most previous studies on point cloud action recognition typically require complex networks to extract intra-frame spatial features and inter-frame temporal features, resulting in an excessive number of redundant computations. This leads to high latency, rendering them impractical for real-world applications. To address this problem, we propose a Plane-Fit Redundancy Encoding point cloud sequence network named PRENet. The primary concept of our approach involves the utilization of plane fitting to mitigate spatial redundancy within the sequence, concurrently encoding the temporal redundancy of the entire sequence to minimize redundant computations. Specifically, our network comprises two principal modules: a Plane-Fit Embedding module and a Spatio-Temporal Consistency Encoding module. The Plane-Fit Embedding module capitalizes on the observation that successive point cloud frames exhibit unique geometric features in physical space, allowing for the reuse of spatially encoded data for temporal stream encoding. The Spatio-Temporal Consistency Encoding module amalgamates the temporal structure of the temporally redundant part with its corresponding spatial arrangement, thereby enhancing recognition accuracy. We have done numerous experiments to verify the effectiveness of our network. The experimental results demonstrate that our method achieves almost identical recognition accuracy while being nearly four times faster than other state-of-the-art methods.
https://arxiv.org/abs/2405.06929
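A minimal sketch of least-squares plane fitting via SVD, the geometric primitive the Plane-Fit Embedding module builds on; the neighborhood size and the residual test for "planar, hence spatially redundant" regions are assumptions for illustration.

```python
import numpy as np

def fit_plane(points):
    """Fit a plane to (N, 3) points; return (unit normal, centroid, rms residual)."""
    centroid = points.mean(axis=0)
    _, s, vt = np.linalg.svd(points - centroid, full_matrices=False)
    normal = vt[-1]                          # direction of least variance
    rms = s[-1] / np.sqrt(len(points))       # rms point-to-plane distance
    return normal, centroid, rms

def is_planar(points, tol=0.02):
    """A local neighborhood is well explained by a plane (redundant) if the
    rms distance of its points to the fitted plane is below tol."""
    _, _, rms = fit_plane(points)
    return rms < tol

patch = np.random.rand(200, 3)
patch[:, 2] = 0.5 + 0.005 * np.random.randn(200)   # nearly flat patch
print(fit_plane(patch)[0], is_planar(patch))
```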
Deep learning has shown promising results for multiple 3D point cloud registration datasets. However, in the underwater domain, most registration of multibeam echo-sounder (MBES) point cloud data is still performed using classical methods in the iterative closest point (ICP) family. In this work, we curate and release the DotsonEast Dataset, a semi-synthetic MBES registration dataset constructed from data collected by an autonomous underwater vehicle in West Antarctica. Using this dataset, we systematically benchmark the performance of 2 classical and 4 learning-based methods. The experimental results show that the learning-based methods work well for coarse alignment and are better at consistently recovering rough transforms at high overlap (20-50%). In comparison, GICP (a variant of ICP) performs well for fine alignment and is better across all metrics at extremely low overlap (10%). To the best of our knowledge, this is the first work to benchmark both learning-based and classical registration methods on an AUV-based MBES dataset. To facilitate future research, both the code and data are made available online.
https://arxiv.org/abs/2405.06279
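A minimal sketch of the classical fine-alignment baseline using Open3D's point-to-point ICP; GICP, the variant highlighted above, follows the same interface in recent Open3D releases. The voxel size and correspondence threshold are assumptions, not values from the benchmark.

```python
import numpy as np
import open3d as o3d  # assumes Open3D is installed

def fine_align(src_xyz, dst_xyz, voxel=0.5, max_dist=2.0, init=np.eye(4)):
    """Refine a coarse alignment between two (N, 3) MBES point clouds with ICP."""
    src = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(src_xyz))
    dst = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(dst_xyz))
    src, dst = src.voxel_down_sample(voxel), dst.voxel_down_sample(voxel)
    result = o3d.pipelines.registration.registration_icp(
        src, dst, max_dist, init,
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    return result.transformation, result.fitness

# A learning-based method would supply `init` (the coarse transform);
# ICP/GICP then handles the fine alignment, as reported above.
```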
In this paper, we tackle the problem of grasping transparent and specular objects. This issue is important, yet it remains unsolved in robotics because depth cameras fail to recover the accurate geometry of such objects. For the first time, we propose ASGrasp, a 6-DoF grasp detection network that uses an RGB-D active stereo camera. ASGrasp utilizes a two-layer learning-based stereo network for transparent object reconstruction, enabling material-agnostic object grasping in cluttered environments. In contrast to existing RGB-D based grasp detection methods, which heavily depend on depth restoration networks and the quality of depth maps generated by depth cameras, our system distinguishes itself by directly utilizing raw IR and RGB images for transparent object geometry reconstruction. We create an extensive synthetic dataset through domain randomization based on GraspNet-1Billion. Our experiments demonstrate that ASGrasp can achieve over 90% success rate for generalizable transparent object grasping in both simulation and the real world via seamless sim-to-real transfer. Our method significantly outperforms SOTA networks and even surpasses the performance upper bound set by perfect visible point cloud inputs. Project page: this https URL
https://arxiv.org/abs/2405.05648
Autonomous systems often employ multiple LiDARs to leverage their integrated advantages, enhancing perception and robustness. The most critical prerequisite under this setting is estimating the extrinsics between the LiDARs, i.e., calibration. Despite exciting progress in multi-LiDAR calibration, a universal, sensor-agnostic calibration method remains elusive. Following a coarse-to-fine framework, we first design a spherical descriptor, TERRA, for 3-DoF rotation initialization with no prior knowledge. To further optimize, we present JEEP for the joint estimation of extrinsics and pose, integrating geometric and motion information to overcome factors that hinder point cloud registration. Finally, the LiDAR poses optimized by the hierarchical optimization module are fed into the time synchronization module to produce the ultimate calibration results, including the time offset. To verify the effectiveness, we conduct extensive experiments on eight datasets, covering 16 diverse types of LiDARs in total and dozens of calibration tasks. In the challenging tasks, the calibration errors can still be kept within 5 cm and 1° with a high success rate.
https://arxiv.org/abs/2405.05589
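A minimal sketch of one common way to recover a time offset between two sensors once their per-sensor ego-motion is known: cross-correlate the angular-rate magnitudes of the two pose streams. This is a generic technique for illustration, not necessarily the time synchronization module used in the paper.

```python
import numpy as np

def estimate_time_offset(rate_a, rate_b, dt):
    """Delay of stream b relative to stream a (positive if b lags a), in seconds.
    rate_a, rate_b: equally sampled angular-rate magnitudes of the two sensors."""
    a = rate_a - rate_a.mean()
    b = rate_b - rate_b.mean()
    corr = np.correlate(a, b, mode="full")
    lag = np.argmax(corr) - (len(b) - 1)
    return -lag * dt

t = np.arange(0, 20, 0.01)
sig = np.abs(np.sin(0.7 * t)) + 0.3 * np.sin(2.3 * t) ** 2
shifted = np.interp(t - 0.25, t, sig)               # b lags a by 0.25 s
print(estimate_time_offset(sig, shifted, 0.01))     # ~0.25
```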
Efficient data utilization is crucial for advancing 3D scene understanding in autonomous driving, where reliance on heavily human-annotated LiDAR point clouds challenges fully supervised methods. To address this, our study extends semi-supervised learning to LiDAR semantic segmentation, leveraging the intrinsic spatial priors of driving scenes and multi-sensor complements to augment the efficacy of unlabeled datasets. We introduce LaserMix++, an evolved framework that integrates laser beam manipulations from disparate LiDAR scans and incorporates LiDAR-camera correspondences to further assist data-efficient learning. Our framework is tailored to enhance 3D scene consistency regularization by incorporating multi-modality, including 1) a multi-modal LaserMix operation for fine-grained cross-sensor interactions; 2) camera-to-LiDAR feature distillation that enhances LiDAR feature learning; and 3) language-driven knowledge guidance that generates auxiliary supervision using open-vocabulary models. The versatility of LaserMix++ enables applications across LiDAR representations, establishing it as a universally applicable solution. Our framework is rigorously validated through theoretical analysis and extensive experiments on popular driving perception datasets. Results demonstrate that LaserMix++ markedly outperforms fully supervised alternatives, achieving comparable accuracy with five times fewer annotations and significantly improving upon supervised-only baselines. This substantial advancement underscores the potential of semi-supervised approaches in reducing the reliance on extensive labeled data in LiDAR-based 3D scene understanding systems.
https://arxiv.org/abs/2405.05258
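The core LaserMix operation partitions two scans by laser inclination and swaps alternating bands; below is a minimal sketch of that mixing step as we understand it from the original LaserMix recipe (LaserMix++ additionally adds camera features and distillation). The number of areas and the bin boundaries are assumptions.

```python
import numpy as np

def laser_mix(points_a, points_b, n_areas=6):
    """Swap alternating inclination bands between two LiDAR scans.
    points_*: (N, 3) arrays. Returns two mixed scans."""
    def inclination(p):
        return np.arctan2(p[:, 2], np.linalg.norm(p[:, :2], axis=1))
    lo = min(inclination(points_a).min(), inclination(points_b).min())
    hi = max(inclination(points_a).max(), inclination(points_b).max())
    edges = np.linspace(lo, hi + 1e-6, n_areas + 1)
    def bands(p):
        return np.digitize(inclination(p), edges) - 1
    ba, bb = bands(points_a), bands(points_b)
    mixed_a = np.vstack([points_a[ba % 2 == 0], points_b[bb % 2 == 1]])
    mixed_b = np.vstack([points_b[bb % 2 == 0], points_a[ba % 2 == 1]])
    return mixed_a, mixed_b

m1, m2 = laser_mix(np.random.randn(1000, 3), np.random.randn(1000, 3))
print(m1.shape, m2.shape)
```

In the semi-supervised setting, labels (or pseudo-labels) are mixed with the same band masks, and a consistency loss ties the predictions on the mixed scans to those on the originals.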
The 4D millimeter-wave (mmWave) radar, with its robustness in extreme environments, extensive detection range, and capabilities for measuring velocity and elevation, has demonstrated significant potential for enhancing the perception abilities of autonomous driving systems in corner-case scenarios. Nevertheless, the inherent sparsity and noise of 4D mmWave radar point clouds restrict its further development and practical application. In this paper, we introduce a novel 4D mmWave radar point cloud detector, which leverages high-resolution dense LiDAR point clouds. Our approach constructs dense 3D occupancy ground truth from stitched LiDAR point clouds, and employs a specially designed network named DenserRadar. The proposed method surpasses existing probability-based and learning-based radar point cloud detectors in terms of both point cloud density and accuracy on the K-Radar dataset.
https://arxiv.org/abs/2405.05131
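A minimal sketch of building a dense occupancy ground truth by stitching LiDAR scans into a common frame and voxelizing them; the voxel size and bounds are assumptions, and the paper's pipeline is more elaborate than this.

```python
import numpy as np

def stitch_and_voxelize(scans, poses, voxel=0.2,
                        bounds=((-50, 50), (-50, 50), (-3, 3))):
    """scans: list of (N, 3) arrays; poses: list of (4, 4) world-from-sensor matrices.
    Returns a boolean occupancy grid over the given bounds."""
    world_pts = []
    for pts, T in zip(scans, poses):
        homog = np.hstack([pts, np.ones((len(pts), 1))])
        world_pts.append((homog @ T.T)[:, :3])
    pts = np.vstack(world_pts)
    mins = np.array([b[0] for b in bounds], dtype=float)
    maxs = np.array([b[1] for b in bounds], dtype=float)
    keep = np.all((pts >= mins) & (pts < maxs), axis=1)
    idx = np.floor((pts[keep] - mins) / voxel).astype(int)
    shape = np.ceil((maxs - mins) / voxel).astype(int)
    occ = np.zeros(shape, dtype=bool)
    occ[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return occ

occ = stitch_and_voxelize([np.random.randn(1000, 3)], [np.eye(4)])
print(occ.shape, occ.sum())
```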
Refining 3D LiDAR data has attracted growing interest, motivated by recent techniques such as supervised learning and generative model-based methods. Existing approaches have shown the possibility of using diffusion models to generate refined LiDAR data with high fidelity, although the performance and speed of such methods have been limited. These limitations make real-time execution difficult, causing the approaches to struggle in real-world tasks such as autonomous navigation and human-robot interaction. In this work, we introduce a novel approach based on conditional diffusion models for fast and high-quality sparse-to-dense upsampling of 3D scene point clouds through an image representation. Our method employs denoising diffusion probabilistic models trained with conditional inpainting masks, which have been shown to perform well on image completion tasks. We present a series of experiments, covering multiple datasets, sampling steps, and conditional masks, to determine the ideal configuration, striking a balance between performance and inference speed. This paper shows that our method outperforms the baselines in sampling speed and quality on upsampling tasks using the KITTI-360 dataset. Furthermore, we illustrate the generalization ability of our approach by simultaneously training on real-world and synthetic datasets, introducing variance in quality and environments.
https://arxiv.org/abs/2405.04889
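A minimal sketch of the mask-conditioned composite step commonly used for diffusion-based inpainting (RePaint-style): at every reverse step, known sparse pixels are taken from the forward-noised condition and unknown pixels from the model's reverse sample. This is a generic illustration under that assumption; the noise schedule values and the denoiser are placeholders, not the paper's network.

```python
import numpy as np

def inpaint_composite(x_unknown_tm1, known_image, mask, alpha_bar_tm1, rng):
    """One masked composite within a reverse diffusion step.
    mask == 1 marks known (sparse) range pixels; known pixels are re-noised to
    the current level, unknown pixels come from the model's reverse sample."""
    noise = rng.standard_normal(known_image.shape)
    x_known_tm1 = (np.sqrt(alpha_bar_tm1) * known_image
                   + np.sqrt(1.0 - alpha_bar_tm1) * noise)
    return mask * x_known_tm1 + (1.0 - mask) * x_unknown_tm1

rng = np.random.default_rng(0)
sparse = np.zeros((64, 1024)); mask = np.zeros((64, 1024))
mask[::4] = 1.0; sparse[::4] = 1.0              # every 4th scan line is known
x = rng.standard_normal((64, 1024))             # model's current reverse sample
x = inpaint_composite(x, sparse, mask, alpha_bar_tm1=0.9, rng=rng)
print(x.shape)
```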
Mapping agencies are increasingly adopting Aerial Lidar Scanning (ALS) as a new tool to monitor territory and support public policies. Processing ALS data at scale requires efficient point classification methods that perform well over highly diverse territories. To evaluate them, researchers need large annotated Lidar datasets; however, current Lidar benchmark datasets have restricted scope and often cover a single urban area. To bridge this data gap, we present the FRench ALS Clouds from TArgeted Landscapes (FRACTAL) dataset: an ultra-large-scale aerial Lidar dataset made of 100,000 dense point clouds with high-quality labels for 7 semantic classes, spanning 250 km$^2$. FRACTAL is built upon France's nationwide open Lidar data. It achieves spatial and semantic diversity via a sampling scheme that explicitly concentrates rare classes and challenging landscapes from five French regions. It should support the development of 3D deep learning approaches for large-scale land monitoring. We describe the nature of the source data, the sampling workflow, and the content of the resulting dataset, and provide an initial evaluation of segmentation performance using a performant 3D neural architecture.
https://arxiv.org/abs/2405.04634
Cross-modal knowledge transfer enhances point cloud representation learning in LiDAR semantic segmentation. Despite its potential, the "weak teacher challenge" arises due to repetitive and non-diverse car camera images and sparse, inaccurate ground truth labels. To address this, we propose the Efficient Image-to-LiDAR Knowledge Transfer (ELiTe) paradigm. ELiTe introduces Patch-to-Point Multi-Stage Knowledge Distillation, transferring comprehensive knowledge from the Vision Foundation Model (VFM), extensively trained on diverse open-world images. This enables effective knowledge transfer to a lightweight student model across modalities. ELiTe employs Parameter-Efficient Fine-Tuning to strengthen the VFM teacher and expedite large-scale model training with minimal costs. Additionally, we introduce a Segment Anything Model-based Pseudo-Label Generation approach to enhance low-quality image labels, facilitating robust semantic representations. Efficient knowledge transfer in ELiTe yields state-of-the-art results on the SemanticKITTI benchmark, outperforming real-time inference models. Our approach achieves this with significantly fewer parameters, confirming its effectiveness and efficiency.
https://arxiv.org/abs/2405.04121
In this paper, we investigate an open research task of cross-modal retrieval between 3D shapes and textual descriptions. Previous approaches mainly rely on point cloud encoders for feature extraction, which may ignore key inherent features of 3D shapes, including depth, spatial hierarchy, geometric continuity, etc. To address this issue, we propose COM3D, making the first attempt to exploit the cross-view correspondence and cross-modal mining to enhance the retrieval performance. Notably, we augment the 3D features through a scene representation transformer, to generate cross-view correspondence features of 3D shapes, which enrich the inherent features and enhance their compatibility with text matching. Furthermore, we propose to optimize the cross-modal matching process based on the semi-hard negative example mining method, in an attempt to improve the learning efficiency. Extensive quantitative and qualitative experiments demonstrate the superiority of our proposed COM3D, achieving state-of-the-art results on the Text2Shape dataset.
https://arxiv.org/abs/2405.04103
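A minimal sketch of semi-hard negative mining for a triplet-style cross-modal matching loss, the selection rule COM3D builds its optimization on: negatives farther from the anchor than the positive, but still within the margin. The margin value and Euclidean distance are assumptions.

```python
import numpy as np

def semi_hard_negatives(anchor, positive, candidates, margin=0.2):
    """anchor, positive: (D,) embeddings; candidates: (N, D) negative pool.
    Returns indices of semi-hard negatives: farther from the anchor than the
    positive, but closer than (positive distance + margin)."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(candidates - anchor, axis=1)
    return np.where((d_neg > d_pos) & (d_neg < d_pos + margin))[0]

# Toy 2D example: only the negative at distance 1.1 is semi-hard.
a = np.array([0.0, 0.0])
p = np.array([1.0, 0.0])                                  # positive at distance 1.0
negs = np.array([[0.5, 0.0], [1.1, 0.0], [3.0, 0.0]])
print(semi_hard_negatives(a, p, negs, margin=0.2))        # -> [1]
```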
Construction data and robotic sensing data originate from disparate sources and are associated with distinct frames of reference. The primary objective of this study is to align LiDAR point clouds with building information modeling (BIM) using a global point cloud registration approach, aimed at establishing a shared understanding between the two modalities, i.e., "speaking the same language". To achieve this, we design a cross-modality registration method spanning from the front end to the back end. At the front end, we extract descriptors by identifying walls and capturing their intersected corners. At the back end, we employ the Hough transform to estimate multiple pose candidates, and the final pose is verified by wall-pixel correlation. To evaluate the effectiveness of our method, we conducted real-world multi-session experiments in a large-scale university building, involving two different types of LiDAR sensors. We report our findings and plan to make our collected dataset open-source.
https://arxiv.org/abs/2405.03969
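A minimal sketch of Hough-style voting for a 2D pose (yaw, then translation) from putative corner correspondences between a LiDAR scan and a BIM floor plan. The bin resolution and the two-stage rotation-then-translation recovery are assumptions for illustration; the paper's back end additionally verifies pose candidates by wall-pixel correlation.

```python
import numpy as np

def hough_pose_2d(scan_pts, bim_pts, yaw_bins=360):
    """scan_pts, bim_pts: (N, 2) putative corner correspondences (index-aligned).
    Vote for yaw using pairs of correspondences, then recover the translation."""
    n = len(scan_pts)
    votes = np.zeros(yaw_bins)
    for i in range(n):
        for j in range(i + 1, n):
            ds, db = scan_pts[j] - scan_pts[i], bim_pts[j] - bim_pts[i]
            yaw = np.arctan2(db[1], db[0]) - np.arctan2(ds[1], ds[0])
            yaw = (yaw + np.pi) % (2 * np.pi) - np.pi         # wrap to [-pi, pi)
            votes[int((yaw + np.pi) / (2 * np.pi) * yaw_bins) % yaw_bins] += 1
    yaw = -np.pi + (np.argmax(votes) + 0.5) * 2 * np.pi / yaw_bins
    c, s = np.cos(yaw), np.sin(yaw)
    t = (bim_pts - scan_pts @ np.array([[c, -s], [s, c]]).T).mean(axis=0)
    return yaw, t

# Toy check: corners rotated by 30 degrees and shifted by (3, -2).
rng = np.random.default_rng(0)
corners = rng.uniform(-10, 10, size=(12, 2))
yaw_true = np.radians(30); c, s = np.cos(yaw_true), np.sin(yaw_true)
bim = corners @ np.array([[c, -s], [s, c]]).T + np.array([3.0, -2.0])
print(hough_pose_2d(corners, bim))   # approximately (0.52, [3, -2]); limited by 1-degree yaw bins
```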