Although face analysis has achieved remarkable improvements in the past few years, designing a multi-task face analysis model remains challenging. Most face analysis tasks are studied as separate problems and do not benefit from the synergy among related tasks. In this work, we propose a novel task-adaptive multi-task face analysis method named Q-Face, which performs multiple face analysis tasks simultaneously with a unified model. We fuse the features from multiple layers of a large-scale pre-trained model so that the whole model can use both local and global facial information to support multiple tasks. Furthermore, we design a task-adaptive module that performs cross-attention between a set of query vectors and the fused multi-stage features, adaptively extracting the desired features for each face analysis task. Extensive experiments show that our method can perform multiple tasks simultaneously and achieves state-of-the-art performance on facial expression recognition, action unit detection, face attribute analysis, age estimation, and face pose estimation. Compared to conventional methods, our method opens up new possibilities for multi-task face analysis and shows potential gains in both accuracy and efficiency.
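As a rough illustration of the task-adaptive idea, the sketch below implements cross-attention between learnable per-task query vectors and fused multi-stage features in PyTorch. The module name, dimensions, and the way tasks are indexed are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of task-query cross-attention (illustrative; not the paper's code).
import torch
import torch.nn as nn

class TaskAdaptiveModule(nn.Module):
    """Cross-attention between learnable per-task queries and fused backbone features."""

    def __init__(self, num_tasks: int, queries_per_task: int, dim: int = 256, heads: int = 8):
        super().__init__()
        # One set of learnable query vectors per face-analysis task.
        self.queries = nn.Parameter(torch.randn(num_tasks, queries_per_task, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, fused_feats: torch.Tensor, task_id: int) -> torch.Tensor:
        # fused_feats: (B, N, dim) -- multi-stage features flattened over positions.
        B = fused_feats.size(0)
        q = self.queries[task_id].unsqueeze(0).expand(B, -1, -1)  # (B, Q, dim)
        out, _ = self.attn(q, fused_feats, fused_feats)           # queries attend to features
        return out  # task-specific features, fed to that task's head

# Usage: features from several backbone stages, projected to one width and concatenated.
module = TaskAdaptiveModule(num_tasks=5, queries_per_task=4)
feats = torch.randn(2, 196 * 3, 256)   # e.g., 3 stages of 14x14 tokens
expr_feats = module(feats, task_id=0)  # features for expression recognition
```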
https://arxiv.org/abs/2405.09059
In this work, we introduce a novel method for calculating the 6DoF pose of an object from a single RGB-D image. Unlike existing methods that either directly predict objects' poses or rely on sparse keypoints for pose recovery, our approach addresses this challenging task using dense correspondence, i.e., we regress the object coordinates for each visible pixel. Our method leverages existing object detection methods. We incorporate a re-projection mechanism that adjusts the camera's intrinsic matrix to accommodate cropping of the RGB-D image. Moreover, we transform the 3D object coordinates into a residual representation, which effectively reduces the output space and yields superior performance. We conducted extensive experiments to validate the efficacy of our approach for 6D pose estimation. Our approach outperforms most previous methods, especially in occlusion scenarios, and demonstrates notable improvements over state-of-the-art methods. Our code is available at this https URL.
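The intrinsics adjustment for cropping can be illustrated with the standard pinhole-camera update: cropping shifts the principal point and resizing scales the focal lengths. Below is a minimal sketch assuming a plain crop-then-resize pipeline; the paper's exact re-projection mechanism may differ.

```python
# Sketch: adjust a pinhole intrinsic matrix after cropping and resizing a detection
# window, so that back-projected object coordinates stay metrically consistent.
import numpy as np

def crop_intrinsics(K: np.ndarray, x0: float, y0: float, scale: float) -> np.ndarray:
    """K: 3x3 intrinsics; (x0, y0): top-left corner of the crop in pixels;
    scale: resize factor applied to the crop (out_size / crop_size)."""
    K_new = K.copy()
    K_new[0, 2] -= x0      # principal point shifts with the crop origin
    K_new[1, 2] -= y0
    K_new[:2, :] *= scale  # focal lengths and principal point scale with resizing
    return K_new

K = np.array([[572.4, 0.0, 325.3],
              [0.0, 573.6, 242.0],
              [0.0, 0.0, 1.0]])
K_crop = crop_intrinsics(K, x0=200.0, y0=120.0, scale=256 / 192)
```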
https://arxiv.org/abs/2405.08483
Image matching remains challenging in scenes with large viewpoint changes, illumination changes, or low texture. In this paper, we propose a Transformer-based pseudo-3D image matching method. It upgrades the 2D features extracted from the source image to 3D features with the help of a reference image and matches them to the 2D features extracted from the destination image via coarse-to-fine 3D matching. Our key discovery is that by introducing the reference image, the source image's fine points are screened and their feature descriptors are further enriched from 2D to 3D, which improves matching performance with the destination image. Experimental results on multiple datasets show that the proposed method achieves state-of-the-art performance on homography estimation, pose estimation, and visual localization, especially in challenging scenes.
https://arxiv.org/abs/2405.08434
Object pose estimation is a fundamental computer vision problem with broad applications in augmented reality and robotics. Over the past decade, deep learning models, due to their superior accuracy and robustness, have increasingly supplanted conventional algorithms reliant on engineered point pair features. Nevertheless, several challenges persist in contemporary methods, including their dependency on labeled training data, model compactness, robustness under challenging conditions, and their ability to generalize to novel unseen objects. A recent survey discussing the progress made on different aspects of this area, outstanding challenges, and promising future directions has been missing. To fill this gap, we discuss the recent advances in deep learning-based object pose estimation, covering all three formulations of the problem, i.e., instance-level, category-level, and unseen object pose estimation. Our survey also covers multiple input data modalities, degrees of freedom of output poses, object properties, and downstream tasks, providing readers with a holistic understanding of this field. Additionally, it discusses training paradigms of different domains, inference modes, application areas, evaluation metrics, and benchmark datasets, and reports the performance of current state-of-the-art methods on these benchmarks, thereby helping readers select the most suitable method for their application. Finally, the survey identifies key challenges, reviews prevailing trends along with their pros and cons, and identifies promising directions for future research. We continue to track the latest works at this https URL.
https://arxiv.org/abs/2405.07801
Visual localization of unmanned aerial vehicles (UAVs) in planetary scenes aims to estimate the absolute pose of the UAV in the world coordinate system from satellite maps and images captured by on-board cameras. However, since planetary scenes often lack distinctive landmarks and there are modal differences between satellite maps and UAV images, the accuracy and real-time performance of UAV positioning are reduced. In order to accurately determine the position of the UAV in a planetary scene in the absence of a global navigation satellite system (GNSS), this paper proposes JointLoc, which estimates the real-time UAV position in the world coordinate system by adaptively fusing the absolute 2-degree-of-freedom (2-DoF) pose and the relative 6-degree-of-freedom (6-DoF) pose. Extensive comparative experiments were conducted on a proposed planetary UAV image cross-modal localization dataset, which contains three types of typical Martian topography generated via a simulation engine as well as real Martian UAV images from the Ingenuity helicopter. JointLoc achieved a root-mean-square error of 0.237 m on trajectories of up to 1,000 m, compared to 0.594 m and 0.557 m for ORB-SLAM2 and ORB-SLAM3 respectively. The source code will be available at this https URL.
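For intuition only, the toy sketch below blends an absolute 2-DoF (x, y) fix with an integrated 6-DoF odometry pose using a single confidence weight. JointLoc's adaptive fusion is learned, so this hand-written stand-in is an assumption, not the paper's method.

```python
# Toy sketch of fusing an absolute 2-DoF (x, y) fix from map matching with a
# relative 6-DoF odometry chain via a confidence-weighted correction.
import numpy as np

def fuse(pose_odom: np.ndarray, xy_abs: np.ndarray, w_abs: float) -> np.ndarray:
    """pose_odom: 4x4 world-frame pose from integrated relative motion;
    xy_abs: absolute (x, y) from cross-modal map matching; w_abs in [0, 1]."""
    fused = pose_odom.copy()
    fused[:2, 3] = (1.0 - w_abs) * pose_odom[:2, 3] + w_abs * xy_abs
    return fused

pose = np.eye(4)
pose[:3, 3] = [10.2, -4.1, 30.0]
print(fuse(pose, xy_abs=np.array([10.5, -3.8]), w_abs=0.6)[:2, 3])
```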
https://arxiv.org/abs/2405.07429
The reliance on accurate camera poses is a significant barrier to the widespread deployment of Neural Radiance Fields (NeRF) models for 3D reconstruction and SLAM tasks. Existing methods introduce monocular depth priors to jointly optimize the camera poses and NeRF, but fail to fully exploit the depth priors and neglect the impact of their inherent noise. In this paper, we propose Truncated Depth NeRF (TD-NeRF), a novel approach that enables training NeRF from unknown camera poses by jointly optimizing learnable parameters of the radiance field and camera poses. Our approach explicitly utilizes monocular depth priors through three key advancements: 1) we propose a novel depth-based ray sampling strategy based on the truncated normal distribution, which improves the convergence speed and accuracy of pose estimation; 2) to circumvent local minima and refine depth geometry, we introduce a coarse-to-fine training strategy that progressively improves depth precision; 3) we propose a more robust inter-frame point constraint that enhances robustness against depth noise during training. Experimental results on three datasets demonstrate that TD-NeRF achieves superior performance in the joint optimization of camera poses and NeRF, surpassing prior works, and generates more accurate depth geometry. The implementation of our method has been released at this https URL.
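The truncated-normal ray sampling in advancement 1) can be sketched with inverse-CDF sampling of a Gaussian centred on the per-ray monocular depth prior, truncated to the near/far range. The function name and the fixed sigma are assumptions for illustration.

```python
# Sketch of depth-guided ray sampling from a truncated normal centred on a
# monocular depth prior (the idea behind TD-NeRF's sampling; details assumed).
import torch

def sample_depths(depth_prior: torch.Tensor, sigma: float, near: float, far: float,
                  n_samples: int) -> torch.Tensor:
    """depth_prior: (R,) per-ray monocular depth; returns (R, n_samples) depths
    drawn from N(depth_prior, sigma^2) truncated to [near, far] via inverse CDF."""
    normal = torch.distributions.Normal(depth_prior[:, None], sigma)
    # CDF bounds of the truncation interval for each ray.
    cdf_lo = normal.cdf(torch.full_like(depth_prior[:, None], near))
    cdf_hi = normal.cdf(torch.full_like(depth_prior[:, None], far))
    u = torch.rand(depth_prior.shape[0], n_samples)
    # Map uniforms into [cdf_lo, cdf_hi], then invert the Gaussian CDF.
    return normal.icdf(cdf_lo + u * (cdf_hi - cdf_lo))

depths = sample_depths(torch.tensor([2.0, 3.5]), sigma=0.3,
                       near=0.1, far=10.0, n_samples=64)
```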
https://arxiv.org/abs/2405.07027
To address the limitations inherent to conventional automated harvesting robots, specifically their suboptimal success rates and risk of crop damage, we design a novel robot named AHPPEBot, which is capable of autonomous harvesting based on crop phenotyping and pose estimation. Specifically, in phenotyping, the detection, association, and maturity estimation of tomato trusses and individual fruits are accomplished through a multi-task YOLOv5 model coupled with a detection-based adaptive DBSCAN clustering algorithm. In pose estimation, we employ a deep learning model to predict seven semantic keypoints on the pedicel. These keypoints assist in the robot's path planning, minimize target contact, and facilitate the use of our specialized end effector for harvesting. In autonomous tomato harvesting experiments conducted in commercial greenhouses, our proposed robot achieved a harvesting success rate of 86.67%, with an average successful harvest time of 32.46 s, showcasing its continuous and robust harvesting capabilities. The results underscore the potential of harvesting robots to bridge the labor gap in agriculture.
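As a sketch of the detection-based clustering step, the snippet below groups fruit detections into trusses with scikit-learn's DBSCAN, adapting the neighbourhood radius to the mean detection size. The scaling factor and box format are illustrative, not the paper's settings.

```python
# Sketch: associate individual fruit detections to trusses with DBSCAN, with eps
# adapted to detection size (a paraphrase of "detection-based adaptive" clustering).
import numpy as np
from sklearn.cluster import DBSCAN

boxes = np.array([  # (cx, cy, w, h) of fruit detections from the detector
    [100, 200, 30, 30], [125, 210, 28, 29], [400, 180, 32, 31], [430, 190, 30, 30]])
centers = boxes[:, :2]
eps = 1.5 * boxes[:, 2:].mean()  # neighbourhood radius scaled by mean box size
labels = DBSCAN(eps=eps, min_samples=1).fit_predict(centers)
print(labels)  # fruits sharing a label are grouped into the same truss
```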
https://arxiv.org/abs/2405.06959
This letter introduces a novel framework for dense Visual Simultaneous Localization and Mapping (VSLAM) based on Gaussian Splatting. Recently, Gaussian Splatting-based SLAM has yielded promising results, but it relies on RGB-D input and is weak in tracking. To address these limitations, we integrate, for the first time, advanced sparse visual odometry with a dense Gaussian Splatting scene representation, thereby eliminating the dependency on depth maps typical of Gaussian Splatting-based SLAM systems and enhancing tracking robustness. Here, the sparse visual odometry tracks camera poses in the RGB stream, while Gaussian Splatting handles map reconstruction. These components are interconnected through a Multi-View Stereo (MVS) depth estimation network. We also propose a depth smoothness loss to reduce the negative effect of the estimated depth maps. Furthermore, consistency in scale between the sparse visual odometry and the dense Gaussian map is preserved by a Sparse-Dense Adjustment Ring (SDAR). We have evaluated our system across various synthetic and real-world datasets. Our pose estimation accuracy surpasses existing methods and achieves state-of-the-art performance. Additionally, it outperforms previous monocular methods in terms of novel view synthesis fidelity, matching the results of neural SLAM systems that utilize RGB-D input.
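A depth smoothness loss of the kind mentioned here is commonly implemented as an edge-aware penalty on depth gradients; the sketch below shows one standard formulation, assumed here as a stand-in for the paper's exact loss.

```python
# Sketch of an edge-aware depth smoothness loss: penalise depth gradients
# except where the image itself has strong gradients (likely true edges).
import torch

def depth_smooth_loss(depth: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
    """depth: (B, 1, H, W); image: (B, 3, H, W) in [0, 1]."""
    dx_d = (depth[..., :, 1:] - depth[..., :, :-1]).abs()
    dy_d = (depth[..., 1:, :] - depth[..., :-1, :]).abs()
    # Down-weight penalties at image edges.
    wx = torch.exp(-(image[..., :, 1:] - image[..., :, :-1]).abs().mean(1, keepdim=True))
    wy = torch.exp(-(image[..., 1:, :] - image[..., :-1, :]).abs().mean(1, keepdim=True))
    return (dx_d * wx).mean() + (dy_d * wy).mean()
```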
https://arxiv.org/abs/2405.06241
We propose an approach for reconstructing a free-moving object from a monocular RGB video. Most existing methods either assume a scene prior, a hand pose prior, or an object category pose prior, or rely on local optimization with multiple sequence segments. We propose a method that allows free interaction with the object in front of a moving camera without relying on any prior, and optimizes the sequence globally without any segments. We progressively optimize the object shape and pose simultaneously based on an implicit neural representation. A key aspect of our method is a virtual camera system that significantly reduces the search space of the optimization. We evaluate our method on the standard HO3D dataset and a collection of egocentric RGB sequences captured with a head-mounted device. We demonstrate that our approach outperforms most methods significantly, and is on par with recent techniques that assume prior information.
https://arxiv.org/abs/2405.05858
In this study, we introduce a novel shared-control system for keyhole docking operations, combining a commercial camera with occlusion-robust pose estimation and a hand-eye information fusion technique. This system is used to enhance docking precision and force-compliance safety. To train the hand-eye information fusion network model, we generated a self-supervised dataset using this docking system. After training, our pose estimation method showed improved accuracy compared to traditional methods, including observation-only approaches, hand-eye calibration, and conventional state estimation filters. In real-world phantom experiments, our approach demonstrated its effectiveness with reduced position dispersion (1.23 ± 0.81 mm vs. 2.47 ± 1.22 mm) and force dispersion (0.78 ± 0.57 N vs. 1.15 ± 0.97 N) compared to the control group. These advancements in semi-autonomous co-manipulation scenarios enhance interaction and stability. The study presents an anti-interference, steady, and precise solution with potential applications extending beyond laparoscopic surgery to other minimally invasive procedures.
https://arxiv.org/abs/2405.05817
Neural Radiance Fields (NeRF) have emerged as a powerful paradigm for 3D scene representation, offering high-fidelity renderings and reconstructions from a set of sparse and unstructured sensor data. In the context of autonomous robotics, where perception and understanding of the environment are pivotal, NeRF holds immense promise for improving performance. In this paper, we present a comprehensive survey and analysis of the state-of-the-art techniques for utilizing NeRF to enhance the capabilities of autonomous robots. We especially focus on the perception, localization and navigation, and decision-making modules of autonomous robots and delve into tasks crucial for autonomous operation, including 3D reconstruction, segmentation, pose estimation, simultaneous localization and mapping (SLAM), navigation and planning, and interaction. Our survey meticulously benchmarks existing NeRF-based methods, providing insights into their strengths and limitations. Moreover, we explore promising avenues for future research and development in this domain. Notably, we discuss the integration of advanced techniques such as 3D Gaussian splatting (3DGS), large language models (LLMs), and generative AI, envisioning enhanced reconstruction efficiency, scene understanding, and decision-making capabilities. This survey serves as a roadmap for researchers seeking to leverage NeRFs to empower autonomous robots, paving the way for innovative solutions that can navigate and interact seamlessly in complex environments.
https://arxiv.org/abs/2405.05526
Skeleton-based motion visualization is a rising field in computer vision, especially for virtual reality (VR). With further advancements in human pose estimation and skeleton-extracting sensors, more and more applications that utilize skeleton data have come about. These skeletons may appear anonymous, but they contain embedded personally identifiable information (PII). In this paper we present a new anonymization technique that is based on motion retargeting, utilizing adversarial classifiers to further remove PII embedded in the skeleton. Motion retargeting is effective for anonymization because it transfers the user's movement onto a dummy skeleton; any PII linked to the skeleton is then based on the dummy skeleton instead of the user we are protecting. We propose a Privacy-centric Deep Motion Retargeting model (PMR) which aims to further clear the retargeted skeleton of PII through adversarial learning. In our experiments, PMR achieves motion retargeting utility performance on par with state-of-the-art models while also reducing the performance of privacy attacks.
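One common way to realize the adversarial PII removal described here is to alternate between training an identity classifier on retargeted motion and training the retargeter to defeat it. The sketch below assumes generic retargeter/classifier modules and a placeholder reconstruction term for utility; it is not PMR's actual objective.

```python
# Sketch: adversarial objective for scrubbing identity from retargeted motion.
import torch
import torch.nn as nn

def privacy_step(retargeter, id_classifier, motion, identity, opt_r, opt_c, lam=1.0):
    ce = nn.CrossEntropyLoss()
    # 1) Train the adversary to recognise the user from retargeted motion.
    opt_c.zero_grad()
    anon = retargeter(motion)
    ce(id_classifier(anon.detach()), identity).backward()
    opt_c.step()
    # 2) Train the retargeter to preserve motion while defeating the adversary.
    opt_r.zero_grad()
    anon = retargeter(motion)
    recon = nn.functional.mse_loss(anon, motion)  # utility term (placeholder)
    adv = -ce(id_classifier(anon), identity)      # maximise the adversary's loss
    (recon + lam * adv).backward()
    opt_r.step()
```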
https://arxiv.org/abs/2405.05428
The 3D Human Pose Estimation (3D HPE) task uses 2D images or videos to predict human joint coordinates in 3D space. Despite recent advancements in deep learning-based methods, they mostly ignore the capability of coupling accessible texts and naturally feasible knowledge of humans, missing out on valuable implicit supervision to guide the 3D HPE task. Moreover, previous efforts often study this task from the perspective of the whole human body, neglecting fine-grained guidance hidden in different body parts. To this end, we present a new Fine-Grained Prompt-Driven Denoiser based on a diffusion model for 3D HPE, named FinePOSE. It consists of three core blocks enhancing the reverse process of the diffusion model: (1) the Fine-grained Part-aware Prompt learning (FPP) block constructs fine-grained part-aware prompts by coupling accessible texts and naturally feasible knowledge of body parts with learnable prompts to model implicit guidance. (2) The Fine-grained Prompt-pose Communication (FPC) block establishes fine-grained communications between learned part-aware prompts and poses to improve the denoising quality. (3) The Prompt-driven Timestamp Stylization (PTS) block integrates learned prompt embedding and temporal information related to the noise level to enable adaptive adjustment at each denoising step. Extensive experiments on public single-human pose estimation datasets show that FinePOSE outperforms state-of-the-art methods. We further extend FinePOSE to multi-human pose estimation. Achieving 34.3 mm average MPJPE on the EgoHumans dataset demonstrates the potential of FinePOSE to deal with complex multi-human scenarios. Code is available at this https URL.
https://arxiv.org/abs/2405.05216
Millimetre wave (mmWave) radar is a non-intrusive, privacy-preserving, and relatively convenient and inexpensive device that has been demonstrated to be applicable in place of RGB cameras for human indoor pose estimation tasks. However, mmWave radar relies on collecting reflected signals from the target, and the information these signals carry is difficult to exploit fully. This has been a long-standing hindrance to improving pose estimation accuracy. To address this major challenge, this paper introduces a probability-map-guided multi-format feature fusion model, ProbRadarM3F. This is a novel radar feature extraction framework that uses a traditional FFT method in parallel with a probability-map-based positional encoding method. ProbRadarM3F fuses the traditional heatmap features and the positional features, then effectively estimates 14 keypoints of the human body. Experimental evaluation on the HuPR dataset proves the effectiveness of the proposed model, which outperforms other methods tested on this dataset with an AP of 69.9%. The emphasis of our study is the positional information in the radar signal that has not been exploited before. This provides a direction for investigating other potential non-redundant information in mmWave radar.
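The multi-format fusion can be pictured as concatenating the FFT-based heatmap features with the probability-map positional encoding and projecting them back to a common width; the fusion scheme below is an assumption for illustration, not the paper's architecture.

```python
# Sketch: fusing FFT-based radar heatmap features with probability-map
# positional encodings by concatenation + 1x1 convolution.
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    def __init__(self, c_heat: int, c_pos: int, c_out: int):
        super().__init__()
        self.proj = nn.Conv2d(c_heat + c_pos, c_out, kernel_size=1)

    def forward(self, heat: torch.Tensor, pos: torch.Tensor) -> torch.Tensor:
        # heat: (B, c_heat, H, W) FFT heatmap features;
        # pos:  (B, c_pos, H, W) probability-map positional encoding.
        return self.proj(torch.cat([heat, pos], dim=1))

fused = FeatureFusion(64, 16, 128)(torch.randn(1, 64, 32, 32),
                                   torch.randn(1, 16, 32, 32))
```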
https://arxiv.org/abs/2405.05164
For autonomous robotics applications, it is crucial that robots are able to accurately measure their own state and perceive their environment, including other agents within it (e.g., cobots interacting with humans). The redundancy of these measurements is important, as it allows for planning and execution of recovery protocols in the event of sensor failure or external disturbances. Visual estimation can provide this redundancy through the use of low-cost sensors and serve as a standalone source of proprioception when no encoder-based sensing is available. Therefore, we estimate the configuration of the robot jointly with its pose, which provides a complete spatial understanding of the observed robot. We present GISR, a method for deep configuration and robot-to-camera pose estimation that prioritizes real-time execution. GISR comprises two modules: (i) a geometric initialization module that efficiently computes an approximate robot pose and configuration, and (ii) an iterative silhouette-based refinement module that refines the initial solution in only a few iterations. We evaluate our method on a publicly available dataset and show that GISR performs competitively with state-of-the-art approaches while being significantly faster than existing methods of the same class. Our code is available at this https URL.
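The silhouette-based refinement stage can be sketched as a few gradient steps on a soft-IoU loss between a rendered robot mask and the observed mask. `render_silhouette` is a hypothetical differentiable renderer; GISR's actual refinement may differ.

```python
# Sketch of iterative silhouette refinement: perturb pose/configuration to
# maximise overlap between a rendered robot mask and the observed mask.
import torch

def refine(pose, config, observed_mask, render_silhouette, iters=5, lr=1e-2):
    pose = pose.clone().requires_grad_(True)
    config = config.clone().requires_grad_(True)
    opt = torch.optim.Adam([pose, config], lr=lr)
    for _ in range(iters):  # only a few iterations, as the abstract suggests
        opt.zero_grad()
        rendered = render_silhouette(pose, config)  # (H, W) soft mask
        inter = (rendered * observed_mask).sum()
        union = rendered.sum() + observed_mask.sum() - inter
        (1.0 - inter / union).backward()            # soft-IoU loss
        opt.step()
    return pose.detach(), config.detach()
```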
https://arxiv.org/abs/2405.04890
Relative placement tasks are an important category of tasks in which one object needs to be placed in a desired pose relative to another object. Previous work has shown success in learning relative placement tasks from just a small number of demonstrations when using relational reasoning networks with geometric inductive biases. However, such methods cannot flexibly represent multimodal tasks, like a mug hanging on any of n racks. We propose a method that incorporates additional properties that enable learning multimodal relative placement solutions, while retaining the provably translation-invariant and relational properties of prior work. We show that our method is able to learn precise relative placement tasks with only 10-20 multimodal demonstrations with no human annotations across a diverse set of objects within a category.
https://arxiv.org/abs/2405.04609
Construction and robotic sensing data originate from disparate sources and are associated with distinct frames of reference. The primary objective of this study is to align LiDAR point clouds with building information modeling (BIM) using a global point cloud registration approach, aimed at establishing a shared understanding between the two modalities, i.e., "speak the same language". To achieve this, we design a cross-modality registration method spanning from the front end to the back end. At the front end, we extract descriptors by identifying walls and capturing their intersecting corners. Subsequently, for back-end pose estimation, we employ the Hough transform to estimate multiple pose candidates. The final pose is verified by wall-pixel correlation. To evaluate the effectiveness of our method, we conducted real-world multi-session experiments in a large-scale university building, involving two different types of LiDAR sensors. We also report our findings and plan to make our collected dataset open source.
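The back-end voting step can be illustrated with a minimal 3-DoF (x, y, yaw) Hough accumulator over corner correspondences; the bin sizes and the brute-force loop over yaw are simplifications for clarity, not the paper's implementation.

```python
# Sketch: 3-DoF (x, y, yaw) Hough voting from matched wall-corner positions.
# Each LiDAR-to-BIM correspondence votes for a transform; accumulator peaks
# become pose candidates for later wall-pixel verification.
import numpy as np

def hough_pose_candidates(src, dst, yaw_bins=360, xy_res=0.5, top_k=5):
    """src, dst: (N, 2) matched corner positions (LiDAR frame, BIM frame)."""
    votes = {}
    for p, q in zip(src, dst):
        for k in range(yaw_bins):
            yaw = 2 * np.pi * k / yaw_bins
            c, s = np.cos(yaw), np.sin(yaw)
            t = q - np.array([c * p[0] - s * p[1], s * p[0] + c * p[1]])
            key = (k, round(t[0] / xy_res), round(t[1] / xy_res))
            votes[key] = votes.get(key, 0) + 1
    best = sorted(votes, key=votes.get, reverse=True)[:top_k]
    return [(2 * np.pi * k / yaw_bins, tx * xy_res, ty * xy_res) for k, tx, ty in best]
```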
https://arxiv.org/abs/2405.03969
Currently, portable electronic devices are becoming more and more popular. For lightweight considerations, their fingerprint recognition modules usually use limited-size sensors. However, partial fingerprints have few matchable features, especially when there are differences in finger pressing posture or image quality, which makes partial fingerprint verification challenging. Most existing methods regard fingerprint position rectification and identity verification as independent tasks, ignoring the coupling relationship between them: relative pose estimation typically relies on paired features as anchors, and authentication accuracy tends to improve with more precise pose alignment. Consequently, in this paper we propose a method that jointly performs identity verification and relative pose estimation for partial fingerprints, aiming to leverage their inherent correlation so that each improves the other. To achieve this, we propose a multi-task CNN (Convolutional Neural Network)-Transformer hybrid network, and design a pre-training task to enhance the feature extraction capability. Experiments on multiple public datasets (NIST SD14, FVC2002 DB1A & DB3A, FVC2004 DB1A & DB2A, FVC2006 DB1A) and an in-house dataset show that our method achieves state-of-the-art performance in both partial fingerprint verification and relative pose estimation, while being more efficient than previous methods.
https://arxiv.org/abs/2405.03959
We present a zero-shot pose optimization method that enforces accurate physical contact constraints when estimating the 3D pose of humans. Our central insight is that since language is often used to describe physical interaction, large pretrained text-based models can act as priors on pose estimation. We can thus leverage this insight to improve pose estimation by converting natural language descriptors, generated by a large multimodal model (LMM), into tractable losses to constrain the 3D pose optimization. Despite its simplicity, our method produces surprisingly compelling pose reconstructions of people in close contact, correctly capturing the semantics of the social and physical interactions. We demonstrate that our method rivals more complex state-of-the-art approaches that require expensive human annotation of contact points and training specialized models. Moreover, unlike previous approaches, our method provides a unified framework for resolving self-contact and person-to-person contact.
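To make the descriptor-to-loss conversion concrete, the sketch below turns a hypothetical LMM output naming two body parts in contact into a differentiable distance term on 3D joints. The descriptor schema and joint indices are invented for illustration and are not the paper's interface.

```python
# Sketch: converting a language contact descriptor like
# {"joint_a": "left_hand", "joint_b": "right_shoulder"} into a tractable loss.
import torch

JOINT_IDX = {"left_hand": 20, "right_shoulder": 14}  # hypothetical skeleton layout

def contact_loss(joints_a: torch.Tensor, joints_b: torch.Tensor,
                 descriptor: dict) -> torch.Tensor:
    """joints_*: (J, 3) 3D joints of the two people; pulls described parts together."""
    pa = joints_a[JOINT_IDX[descriptor["joint_a"]]]
    pb = joints_b[JOINT_IDX[descriptor["joint_b"]]]
    return (pa - pb).norm()  # added to the pose-optimization objective

loss = contact_loss(torch.randn(24, 3), torch.randn(24, 3),
                    {"joint_a": "left_hand", "joint_b": "right_shoulder"})
```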
https://arxiv.org/abs/2405.03689
This paper addresses a critical flaw in MediaPipe Holistic's hand Region of Interest (ROI) prediction, which struggles with non-ideal hand orientations, affecting sign language recognition accuracy. We propose a data-driven approach to enhance ROI estimation, leveraging an enriched feature set including additional hand keypoints and the z-dimension. Our results demonstrate better estimates, with higher Intersection-over-Union compared to the current method. Our code and optimizations are available at this https URL.
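For reference, Intersection-over-Union between a predicted and a ground-truth ROI box is computed as follows (boxes given as (x1, y1, x2, y2)):

```python
# Standard IoU between two axis-aligned boxes, as used to compare ROI estimates.
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

print(iou((10, 10, 50, 50), (30, 30, 70, 70)))  # ~0.143
```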
https://arxiv.org/abs/2405.03545