3D content creation plays a vital role in various applications, such as gaming, robotics simulation, and virtual reality. However, the process is labor-intensive and time-consuming: skilled designers must invest considerable effort to create a single 3D asset. To address this challenge, text-to-3D generation technologies have emerged as a promising solution for automating 3D creation. Leveraging the success of large vision-language models, these techniques aim to generate 3D content from textual descriptions. Despite recent advancements in this area, existing solutions still face significant limitations in generation quality and efficiency. In this survey, we conduct an in-depth investigation of the latest text-to-3D creation methods. We provide a comprehensive background on text-to-3D creation, including discussions of the datasets employed in training and the evaluation metrics used to assess the quality of generated 3D models. Then, we delve into the various 3D representations that serve as the foundation for the 3D generation process. Furthermore, we present a thorough comparison of the rapidly growing literature on generative pipelines, categorizing them into feedforward generation, optimization-based generation, and view-reconstruction approaches. By examining the strengths and weaknesses of these methods, we aim to shed light on their respective capabilities and limitations. Lastly, we point out several promising avenues for future research. With this survey, we hope to further inspire researchers to explore the potential of open-vocabulary, text-conditioned 3D content creation.
https://arxiv.org/abs/2405.09431
Recent advances in aerial robotics have enabled the use of multirotor vehicles for autonomous payload transportation. Resorting only to classical methods to reliably model a quadrotor carrying a cable-slung load poses significant challenges. On the other hand, purely data-driven learning methods do not comply by design with the problem's physical constraints, especially in states that are not densely represented in the training data. In this work, we explore the use of physics-informed neural networks to learn an end-to-end model of the multirotor-slung-load system and, at a given time, estimate a sequence of future system states. An LSTM encoder-decoder with an attention mechanism is used to capture the dynamics of the system. To guarantee cohesiveness between the multiple predicted states of the system, we propose the use of a physics-based term in the loss function, which includes a discretized physical model derived from first principles together with slack variables that allow for a small mismatch between expected and predicted values. To train the model, a dataset using a real-world quadrotor carrying a slung load was curated and is made available. Prediction results are presented and corroborate the feasibility of the approach. The proposed method outperforms both the first-principles physical model and a comparable neural network model trained without the proposed physics-based regularization.
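The physics-based loss lends itself to a compact sketch. Below is a minimal, hypothetical PyTorch version of such a composite objective; `residual_fn` stands in for a discretized first-principles model of the multirotor-slung-load system and `slack` for the learnable slack variables. Names and weights are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

def physics_informed_loss(pred_states, true_states, residual_fn, slack,
                          lam_phys=1.0, lam_slack=10.0):
    """Composite loss: data term + discretized physics residual with slack.

    pred_states, true_states: (batch, horizon, state_dim) tensors.
    residual_fn(x_k, x_k1): mismatch between x_{k+1} and the one-step
    prediction of a first-principles discretized model (assumed given).
    slack: learnable variables absorbing a small expected/predicted mismatch.
    """
    data_loss = nn.functional.mse_loss(pred_states, true_states)
    # Physics residual between consecutive predicted states, relaxed by slack.
    res = residual_fn(pred_states[:, :-1], pred_states[:, 1:])
    phys_loss = ((res - slack) ** 2).mean()
    # Penalize slack so the relaxation stays small.
    return data_loss + lam_phys * phys_loss + lam_slack * (slack ** 2).mean()
```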
https://arxiv.org/abs/2405.09428
This research reports VascularPilot3D, the first fully autonomous 3D endovascular robot navigation system. As an exploration toward autonomous guidewire navigation, VascularPilot3D is developed as a complete navigation system based on intra-operative imaging (fluoroscopic X-ray in this study) and typical endovascular robots. VascularPilot3D adopts previously researched fast 3D-2D vessel registration algorithms and guidewire segmentation methods as its perception modules. We additionally propose three modules: a topology-constrained 2D-3D instrument end-point lifting method, a tree-based fast path planning algorithm, and a prior-free endovascular navigation strategy. VascularPilot3D is compatible with most mainstream endovascular robots. Ex-vivo experiments validate that VascularPilot3D achieves a 100% success rate across 25 trials. It reduces the human surgeon's overall control loops by 18.38%. VascularPilot3D is promising for general clinical autonomous endovascular navigation.
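As a rough illustration of why tree-based planning can be fast: in a vessel tree the path between any two nodes is unique and passes through their lowest common ancestor, so planning reduces to two root-ward walks over parent pointers rather than a general graph search. The sketch below is a generic routine under that assumption, not the paper's algorithm.

```python
def tree_path(parent, a, b):
    """Unique path between nodes a and b in a tree given parent pointers
    (parent.get(root) is None). Collect a's ancestry, then climb from b
    until hitting it: the meeting node is the lowest common ancestor."""
    ancestors = set()
    node = a
    while node is not None:
        ancestors.add(node)
        node = parent.get(node)
    tail, node = [], b
    while node not in ancestors:
        tail.append(node)
        node = parent.get(node)
    lca = node
    head, node = [], a
    while node != lca:
        head.append(node)
        node = parent.get(node)
    return head + [lca] + tail[::-1]
```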
https://arxiv.org/abs/2405.09375
Current orthopedic robotic systems largely focus on navigation, aiding surgeons in positioning a guiding tube but still requiring manual drilling and screw placement. The automation of this task not only demands high precision and safety due to the intricate physical interactions between the surgical tool and bone but also poses significant risks when executed without adequate human oversight. As it involves continuous physical interaction, the robot should collaborate with the surgeon, understand the human intent, and always include the surgeon in the loop. To achieve this, this paper proposes a new cognitive human-robot collaboration framework, including the intuitive AR-haptic human-robot interface, the visual-attention-based surgeon model, and the shared interaction control scheme for the robot. User studies on a robotic platform for orthopedic surgery are presented to illustrate the performance of the proposed method. The results demonstrate that the proposed human-robot collaboration framework outperforms full robot and full human control in terms of safety and ergonomics.
https://arxiv.org/abs/2405.09359
Recent advances in computed tomography (CT) imaging, especially with dual-robot systems, have introduced new challenges for scan trajectory optimization. This paper presents a novel approach using Gated Recurrent Units (GRUs) to optimize CT scan trajectories. Our approach exploits the flexibility of robotic CT systems to select projections that enhance image quality by improving resolution and contrast while reducing scan time. We focus on cone-beam CT and employ several projection-based metrics, including absorption, pixel intensities, contrast-to-noise ratio, and data completeness. The GRU network aims to minimize data redundancy and maximize completeness with a limited number of projections. We validate our method using simulated data of a test specimen, focusing on a specific voxel of interest. The results show that the GRU-optimized scan trajectories can outperform traditional circular CT trajectories in terms of image quality metrics. For the used specimen, SSIM improves from 0.38 to 0.49 and CNR increases from 6.97 to 9.08. This finding suggests that the application of GRU in CT scan trajectory optimization can lead to more efficient, cost-effective, and high-quality imaging solutions.
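To make the setup concrete, here is a hedged PyTorch sketch of a GRU that scores candidate projections from per-view metrics (absorption, pixel intensities, CNR, data completeness) and keeps a fixed budget of views; the architecture, metric count, and budget are illustrative assumptions, not the paper's network.

```python
import torch
import torch.nn as nn

class ProjectionSelector(nn.Module):
    """Scores candidate projections from a sequence of per-view metrics."""
    def __init__(self, n_metrics=4, hidden=64):
        super().__init__()
        self.gru = nn.GRU(n_metrics, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)  # one informativeness score per view

    def forward(self, metric_seq):
        # metric_seq: (batch, n_candidate_views, n_metrics)
        h, _ = self.gru(metric_seq)
        return self.head(h).squeeze(-1)   # (batch, n_candidate_views)

# Pick the 60 most informative of 360 candidate views (a limited budget).
scores = ProjectionSelector()(torch.randn(1, 360, 4))
selected = scores.topk(k=60, dim=1).indices
```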
https://arxiv.org/abs/2405.09333
One goal of dexterous robotic grasping is to allow robots to handle objects with the same level of flexibility and adaptability as humans. However, it remains a challenging task to generate an optimal grasping strategy for dexterous hands, especially when it comes to delicate manipulation and accurately adjusting the desired grasping poses for objects of varying shapes and sizes. In this paper, we propose a novel dexterous grasp generation scheme called \textbf{\textit{GrainGrasp}} that provides fine-grained contact guidance for each fingertip. In particular, we employ a generative model to predict separate contact maps for each fingertip on the object point cloud, effectively capturing the specifics of finger-object interactions. In addition, we develop a new dexterous grasping optimization algorithm that relies solely on the point cloud as input, eliminating the need for complete mesh information of the object. By leveraging the contact maps of different fingertips, the proposed optimization algorithm can generate precise and determinable strategies for human-like object grasping. Experimental results confirm the efficiency of the proposed scheme. Our code is available at this https URL
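One plausible way to use per-fingertip contact maps inside a point-cloud-only grasp optimizer is as a weighted alignment term that pulls each fingertip toward the points its map marks as likely contacts. The minimal PyTorch sketch below illustrates that idea; it is not the authors' exact objective.

```python
import torch

def contact_alignment_loss(fingertips, points, contact_maps):
    """fingertips: (5, 3) fingertip positions; points: (N, 3) object point
    cloud; contact_maps: (5, N) per-fingertip contact probabilities."""
    d = torch.cdist(fingertips, points)  # (5, N) fingertip-to-point distances
    w = contact_maps / contact_maps.sum(dim=1, keepdim=True).clamp_min(1e-8)
    return (w * d).sum(dim=1).mean()     # weighted mean distance per fingertip
```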
https://arxiv.org/abs/2405.09310
In this paper, we present an innovative technique for the path planning of flying robots in a 3D environment in terms of Rough Mereology. The main goal was to construct an algorithm that would generate mereological potential fields in 3-dimensional space. To avoid falling into local minima, we assist the search with a weighted Euclidean distance. Moreover, a path search from the start point to the target that avoids the obstacles was applied. The environment was created by connecting two cameras working in real time. The Python library OpenCV [1], which recognizes shapes and colors, was responsible for determining the gate and the elements of the world inside the map. The main purpose of this paper is to apply the given results to drones.
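For intuition, a classical potential field with a weighted Euclidean attractive term looks like the NumPy sketch below; the weights reshape the field, which is the role the weighted distance plays here in avoiding local minima. This is a generic sketch with illustrative constants, not the mereological construction itself.

```python
import numpy as np

def potential(p, goal, obstacles, w=(1.0, 1.0, 2.0), k_rep=0.5, d0=0.4):
    """p, goal: (3,) arrays; obstacles: list of (3,) arrays.
    Weighted Euclidean attraction (z weighted more heavily here) plus a
    short-range repulsive term around each obstacle."""
    attract = np.sqrt(np.sum(np.asarray(w) * (p - goal) ** 2))
    repel = 0.0
    for obs in obstacles:
        d = np.linalg.norm(p - obs)
        if d < d0:  # repulsion is active only near the obstacle
            repel += k_rep * (1.0 / d - 1.0 / d0) ** 2
    return attract + repel
```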
https://arxiv.org/abs/2405.09282
Model Predictive Control (MPC)-based trajectory planning has been widely used in robotics, and incorporating Control Barrier Function (CBF) constraints into MPC can greatly improve its obstacle-avoidance efficiency. Unfortunately, traditional optimizers are resource-consuming and slow at solving such non-convex constrained optimization problems (COPs), while learning-based methods struggle to satisfy the non-convex constraints. In this paper, we propose SOMTP, a self-supervised-learning-based optimizer for CBF-MPC trajectory planning. Specifically, SOMTP first employs problem transcription to satisfy most of the constraints. A differentiable SLPG correction is then proposed to move the solution closer to the safe set; it subsequently serves as the guide policy in the training process. After that, inspired by the Augmented Lagrangian Method (ALM), we propose a training algorithm integrated with guide-policy constraints that enables the optimizer network to converge to a feasible solution. Finally, experiments show that the proposed algorithm has better feasibility than other learning-based methods and can provide solutions much faster than traditional optimizers of similar optimality.
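The ALM-inspired part of the training objective can be sketched generically: penalize active constraint violations with both multiplier and quadratic terms, and update the multipliers by dual ascent between gradient steps. The PyTorch fragment below is a textbook augmented-Lagrangian loss under assumed names, not SOMTP's exact formulation.

```python
import torch

def alm_loss(objective, g, lam, rho):
    """objective: scalar cost; g: tensor of inequality constraints g(x) <= 0;
    lam: current multiplier estimates; rho: penalty weight."""
    viol = torch.clamp(g, min=0.0)  # only active violations contribute
    return objective + (lam * viol).sum() + 0.5 * rho * (viol ** 2).sum()

# Dual-ascent multiplier update between training epochs (no gradients):
# lam <- torch.clamp(lam + rho * g, min=0.0)
```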
https://arxiv.org/abs/2405.09212
Recent strides in model predictive control (MPC) underscore a dependence on numerical advancements to efficiently and accurately solve large-scale problems. Given the substantial number of variables characterizing typical whole-body optimal control (OC) problems, often numbering in the thousands, exploiting the sparse structure of the numerical problem becomes crucial to meet computational demands, typically in the range of a few milliseconds. A fundamental building block for computing Newton or Sequential Quadratic Programming (SQP) steps in direct optimal control methods involves addressing the linear quadratic regulator (LQR) problem. This paper concentrates on equality-constrained problems featuring implicit system dynamics and dual regularization, a characteristic found in advanced interior-point or augmented Lagrangian solvers. Here, we introduce a parallel algorithm designed for solving an LQR problem with dual regularization. Leveraging a rewriting of the LQR recursion through block elimination, we first enhance the efficiency of the serial algorithm, then generalize it to handle parametric problems. This extension enables us to split decision variables and solve multiple subproblems concurrently. Our algorithm is implemented in our nonlinear numerical optimal control library ALIGATOR. It showcases improved performance over previous serial formulations, and we validate its efficacy by deploying it in the model predictive control of a real quadruped robot. This paper follows up on our prior work on augmented Lagrangian methods for numerical optimal control with implicit dynamics and constraints.
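For reference, the serial building block being parallelized is the classic backward Riccati recursion; a minimal NumPy version for the unconstrained discrete LQR is below. The paper's contribution replaces this sequential sweep with a block-eliminated, dual-regularized, parametric variant that can be split across processors.

```python
import numpy as np

def lqr_backward(A, B, Q, R, QN, N):
    """Backward Riccati sweep for min sum x'Qx + u'Ru with x_{k+1} = Ax + Bu.
    Returns the time-indexed feedback gains for u_k = -K_k x_k."""
    P, gains = QN, []
    for _ in range(N):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)  # feedback gain
        P = Q + A.T @ P @ (A - B @ K)                      # cost-to-go update
        gains.append(K)
    return gains[::-1]
```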
https://arxiv.org/abs/2405.09197
In this article, we focus on the critical tasks of plant protection in arable farms, addressing a modern challenge in agriculture: integrating ecological considerations into the operational strategy of precision weeding robots like \bbot. This article presents recent advancements in weed management algorithms and the real-world performance of \bbot\ at the University of Bonn's Klein-Altendorf campus. We present a novel rolling-view observation model for the BonnBot-I weed monitoring section, which leads to an average absolute weeding performance enhancement of $3.4\%$. Furthermore, for the first time, we show how precision weeding robots could consider biodiversity-aware concerns in challenging weeding scenarios. We carried out comprehensive weeding experiments in sugar-beet fields, covering both weed-only and mixed crop-weed situations, and introduced a new dataset compatible with precision weeding. Our real-field experiments revealed that our weeding approach is capable of handling diverse weed distributions, with a minimal loss of only $11.66\%$ attributable to intervention planning and $14.7\%$ to vision-system limitations, highlighting required improvements to the vision system.
https://arxiv.org/abs/2405.09118
Humans use collaborative robots as tools for accomplishing various tasks. The interaction between humans and robots happens in tight, shared workspaces. However, these machines must be safe to operate alongside humans to minimize the risk of accidental collisions. Ensuring safety imposes many constraints, such as reduced torque and velocity limits during operation, thus increasing the time needed to accomplish many tasks. However, for applications such as using collaborative robots as haptic interfaces with intermittent contacts in virtual reality, speed limitations result in poor user experiences. This research aims to improve the efficiency of a collaborative robot while improving the safety of the human user. We used Gaussian process models to predict human hand motion and developed strategies for human intention detection based on hand motion and gaze, in order to reduce the robot's task time and improve human safety in a virtual environment. We then studied the effect of prediction. Comparative results show that the prediction models improved robot task time by 3\% and safety by 17\%. When used alongside gaze, prediction with Gaussian process models improved robot task time by 2\% and safety by 13\%.
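A minimal scikit-learn sketch of GP-based short-horizon motion prediction, on toy one-dimensional data, looks like the following; the kernel choice and horizon are assumptions, not the study's configuration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

t = np.linspace(0.0, 1.0, 20).reshape(-1, 1)  # timestamps of observed motion
x = np.sin(2 * np.pi * t).ravel()             # one hand-position coordinate

gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(t, x)
# Short-horizon forecast with uncertainty, usable for intention detection.
mean, std = gp.predict(np.array([[1.1]]), return_std=True)
```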
https://arxiv.org/abs/2405.09109
We investigate the problem of pixelwise correspondence for deformable objects, namely cloth and rope, by comparing both classical and learning-based methods. We choose cloth and rope because they are traditionally some of the most difficult deformable objects to model analytically, given their large configuration space, and they are meaningful in the context of robotic tasks like cloth folding, rope knot-tying, T-shirt folding, curtain closing, etc. The correspondence problem is heavily motivated in robotics, with wide-ranging applications including semantic grasping, object tracking, and manipulation policies built on top of correspondences. We present an exhaustive survey of existing classical methods for correspondence via feature matching, including SIFT, SURF, and ORB, and two recently published learning-based methods, TimeCycle and Dense Object Nets. We make three main contributions: (1) a framework for simulating and rendering synthetic images of deformable objects, with qualitative results demonstrating transfer between our simulated and real domains; (2) a new learning-based correspondence method extending Dense Object Nets; and (3) a standardized comparison across state-of-the-art correspondence methods. Our proposed method provides a flexible, general formulation for learning temporally and spatially continuous correspondences for nonrigid (and rigid) objects. We report root-mean-squared-error statistics for all methods and find that Dense Object Nets outperforms the baseline classical methods for correspondence, and our proposed extension of Dense Object Nets performs similarly.
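Of the classical baselines surveyed, ORB matching is the most compact to sketch; the OpenCV snippet below produces pixel correspondences between two frames (filenames are hypothetical).

```python
import cv2

img1 = cv2.imread("cloth_t0.png", cv2.IMREAD_GRAYSCALE)  # hypothetical frames
img2 = cv2.imread("cloth_t1.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=1000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Hamming distance suits ORB's binary descriptors; cross-check for symmetry.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
pixel_pairs = [(kp1[m.queryIdx].pt, kp2[m.trainIdx].pt) for m in matches[:50]]
```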
https://arxiv.org/abs/2405.08996
"How does the person in the bounding box feel?" Achieving human-level recognition of the apparent emotion of a person in real world situations remains an unsolved task in computer vision. Facial expressions are not enough: body pose, contextual knowledge, and commonsense reasoning all contribute to how humans perform this emotional theory of mind task. In this paper, we examine two major approaches enabled by recent large vision language models: 1) image captioning followed by a language-only LLM, and 2) vision language models, under zero-shot and fine-tuned setups. We evaluate the methods on the Emotions in Context (EMOTIC) dataset and demonstrate that a vision language model, fine-tuned even on a small dataset, can significantly outperform traditional baselines. The results of this work aim to help robots and agents perform emotionally sensitive decision-making and interaction in the future.
https://arxiv.org/abs/2405.08992
Autonomous systems often encounter environments and scenarios beyond the scope of their training data, which underscores a critical challenge: the need to generalize and adapt to unseen scenarios in real time. This challenge necessitates new mathematical and algorithmic tools that enable adaptation and zero-shot transfer. To this end, we leverage the theory of function encoders, which enables zero-shot transfer by combining the flexibility of neural networks with the mathematical principles of Hilbert spaces. Using this theory, we first present a method for learning a space of dynamics spanned by a set of neural ODE basis functions. After training, the proposed approach can rapidly identify dynamics in the learned space using an efficient inner product calculation. Critically, this calculation requires no gradient calculations or retraining during the online phase. This method enables zero-shot transfer for autonomous systems at runtime and opens the door for a new class of adaptable control algorithms. We demonstrate state-of-the-art system modeling accuracy for two MuJoCo robot environments and show that the learned models can be used for more efficient MPC control of a quadrotor.
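The online identification step reduces to projecting the observed dynamics onto the span of the learned basis functions, i.e., a Gram-matrix (normal-equations) solve built from inner products, with no gradients or retraining. A hedged NumPy sketch with assumed shapes:

```python
import numpy as np

def identify_coefficients(basis_outputs, observed):
    """basis_outputs: (k, n) predictions of k learned basis models on n
    samples; observed: (n,) measured dynamics. Represents the new system
    as sum_i c_i * f_i via least squares in the learned space."""
    G = basis_outputs @ basis_outputs.T  # Gram matrix of inner products
    b = basis_outputs @ observed
    return np.linalg.solve(G, b)         # coefficients c
```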
https://arxiv.org/abs/2405.08954
This paper addresses the critical need for refining robot motions that, despite achieving high visual similarity through human-to-humanoid retargeting methods, fall short of practical execution in the physical realm. Existing techniques in the graphics community often prioritize visual fidelity over physics-based feasibility, posing a significant challenge for deploying bipedal systems in practical applications. Our research introduces a constrained reinforcement learning algorithm that produces physics-based, high-quality motion imitation on legged humanoid robots, enhancing motion resemblance while successfully following the reference human trajectory. We name our framework I-CTRL. By reformulating the motion imitation problem as a constrained refinement over non-physics-based retargeted motions, our framework excels in motion imitation with simple and unique rewards that generalize across four robots. Moreover, our framework can follow large-scale motion datasets with a unique RL agent. The proposed approach signifies a crucial step forward in advancing the control of bipedal robots, emphasizing the importance of aligning visual and physical realism for successful motion imitation.
https://arxiv.org/abs/2405.08726
Ensuring safety and adapting to the user's behavior are of paramount importance in physical human-robot interaction. Thus, incorporating elastic actuators in the robot's mechanical design has become popular, since it offers intrinsic compliance and additionally provides a coarse estimate of the interaction force by measuring the deformation of the elastic components. While observer-based methods have been shown to improve these estimates, they rely on accurate models of the system, which are challenging to obtain in complex operating environments. In this work, we overcome this issue by learning the unknown dynamics components using Gaussian process (GP) regression. By employing the learned model in a Bayesian filtering framework, we improve the estimation accuracy and additionally obtain an observer that explicitly considers local model uncertainty in the confidence measure of the state estimate. Furthermore, we derive guaranteed estimation error bounds, thus facilitating use in safety-critical applications. We demonstrate the effectiveness of the proposed approach experimentally in a human-exoskeleton interaction scenario.
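Schematically, the GP enters the filter's prediction step as a learned residual whose predictive variance inflates the state covariance, so local model uncertainty appears directly in the confidence of the estimate. The toy scalar-state sketch below conveys the idea and is not the paper's observer.

```python
import numpy as np

def gp_filter_predict(gp, x_est, P_est, nominal_step):
    """One prediction step: nominal model plus GP residual mean, with the
    GP's predictive variance added to the (scalar) state covariance.
    gp: fitted scikit-learn GaussianProcessRegressor on scalar residuals."""
    mu, std = gp.predict(np.array([[x_est]]), return_std=True)
    x_pred = nominal_step(x_est) + float(mu[0])
    P_pred = P_est + float(std[0]) ** 2
    return x_pred, P_pred
```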
https://arxiv.org/abs/2405.08711
Addressing multi-label action recognition in videos represents a significant challenge for robotic applications in dynamic environments, especially when the robot is required to cooperate with humans in tasks that involve objects. Existing methods still struggle to recognize unseen actions or require extensive training data. To overcome these problems, we propose Dual-VCLIP, a unified approach for zero-shot multi-label action recognition. Dual-VCLIP enhances VCLIP, a zero-shot action recognition method, with the DualCoOp method for multi-label image classification. The strength of our method is that it learns only two prompts at training time, and it is therefore much simpler than other methods. We validate our method on the Charades dataset, which includes a majority of object-based actions, demonstrating that -- despite its simplicity -- our method performs favorably with respect to existing methods on the complete dataset and shows promising performance when tested on unseen actions. Our contribution emphasizes the impact of verb-object class splits during robots' training for new cooperative tasks, highlighting their influence on performance and giving insights into mitigating biases.
https://arxiv.org/abs/2405.08695
Although pre-training on a large amount of data is beneficial for robot learning, current paradigms only perform large-scale pretraining for visual representations, whereas representations for other modalities are trained from scratch. In contrast to the abundance of visual data, it is unclear what relevant internet-scale data may be used for pretraining other modalities such as tactile sensing. Such pretraining becomes increasingly crucial in the low-data regimes common in robotics applications. In this paper, we address this gap by using contact microphones as an alternative tactile sensor. Our key insight is that contact microphones capture inherently audio-based information, allowing us to leverage large-scale audio-visual pretraining to obtain representations that boost the performance of robotic manipulation. To the best of our knowledge, our method is the first approach leveraging large-scale multisensory pre-training for robotic manipulation. For supplementary information including videos of real robot experiments, please see this https URL.
https://arxiv.org/abs/2405.08576
Task and Motion Planning (TAMP) algorithms solve long-horizon robotics tasks by integrating task planning with motion planning; the task planner proposes a sequence of actions towards a goal state and the motion planner verifies whether this action sequence is geometrically feasible for the robot. However, state-of-the-art TAMP algorithms do not scale well with the difficulty of the task and require an impractical amount of time to solve relatively small problems. We propose Constraints and Streams for Task and Motion Planning (COAST), a probabilistically-complete, sampling-based TAMP algorithm that combines stream-based motion planning with an efficient, constrained task planning strategy. We validate COAST on three challenging TAMP domains and demonstrate that our method outperforms baselines in terms of cumulative task planning time by an order of magnitude. You can find more supplementary materials on our project \href{this https URL}{website}.
https://arxiv.org/abs/2405.08572
This study investigates the stiffness characteristics of the Sprint Z3 head, also known as the 3-PRS parallel kinematics machine, which is among the most extensively researched and viably successful manipulators for precision machining applications. Despite the wealth of research on these robotic manipulators, no previous work has characterized their stiffness performance within the parasitic motion space. Such undesired motion influences their stiffness properties, as stiffness is configuration-dependent. Addressing this gap, this paper develops a stiffness model that accounts for both the velocity-level parasitic motion space and the regular workspace. Numerical simulations are provided to illustrate the stiffness characteristics of the manipulator across all considered spaces. The results indicate that the stiffness profile within the parasitic motion space is shallower, and its values smaller, compared to the stiffness distribution across the orientation workspace. This implies that adequately evaluating a manipulator's performance requires assessing its ability to resist external loads during parasitic motion. Therefore, comprehending this aspect is crucial for redesigning components to enhance overall stiffness.
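Configuration dependence of stiffness is typically captured by the textbook Cartesian stiffness mapping $K_x = J^{-T} K_q J^{-1}$; evaluating it along parasitic-motion configurations versus the regular workspace exposes exactly the kind of profile differences reported above. A minimal NumPy version of the standard mapping (not necessarily the paper's full model):

```python
import numpy as np

def cartesian_stiffness(J, Kq):
    """Map joint-space stiffness Kq to Cartesian stiffness at the current
    configuration via the (square, invertible) Jacobian J:
    F = K_x dx with K_x = J^{-T} Kq J^{-1}."""
    J_inv = np.linalg.inv(J)
    return J_inv.T @ Kq @ J_inv
```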
https://arxiv.org/abs/2405.08418