Abstract
Autonomous robotic systems capable of learning novel manipulation tasks are poised to transform industries from manufacturing to service automation. However, modern methods (e.g., VIP and R3M) still face significant hurdles, notably the domain gap among robotic embodiments and the sparsity of successful task executions within specific action spaces, resulting in misaligned and ambiguous task representations. We introduce Ag2Manip (Agent-Agnostic representations for Manipulation), a framework aimed at surmounting these challenges through two key innovations: a novel agent-agnostic visual representation derived from human manipulation videos, with the specifics of embodiments obscured to enhance generalizability; and an agent-agnostic action representation abstracting a robot's kinematics to a universal agent proxy, emphasizing crucial interactions between end-effector and object. Ag2Manip's empirical validation across simulated benchmarks like FrankaKitchen, ManiSkill, and PartManip shows a 325% increase in performance, achieved without domain-specific demonstrations. Ablation studies underline the essential contributions of the visual and action representations to this success. Extending our evaluations to the real world, Ag2Manip significantly improves imitation learning success rates from 50% to 77.5%, demonstrating its effectiveness and generalizability across both simulated and physical environments.
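The agent-agnostic action representation described in the abstract could be sketched roughly as follows. This is a minimal illustration, not the paper's actual implementation: the `ProxyAction` type and `to_proxy_trajectory` helper are hypothetical names, assuming the proxy keeps only end-effector pose and gripper state while discarding robot-specific joint kinematics.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ProxyAction:
    """One embodiment-independent action step (hypothetical sketch):
    only the end-effector target and gripper state are retained, so the
    same trajectory can, in principle, drive robots with different
    kinematic structures."""
    position: Tuple[float, float, float]  # end-effector target in the world frame
    gripper_closed: bool                   # interaction state with the object

def to_proxy_trajectory(
    ee_positions: List[Tuple[float, float, float]],
    gripper_states: List[bool],
) -> List[ProxyAction]:
    """Abstract a robot-specific rollout into a universal agent proxy:
    drop joint angles and keep only end-effector/object interaction cues."""
    return [ProxyAction(p, g) for p, g in zip(ee_positions, gripper_states)]

# Hypothetical rollout: approach an object, descend, then grasp.
trajectory = to_proxy_trajectory(
    [(0.4, 0.0, 0.3), (0.4, 0.0, 0.1), (0.4, 0.0, 0.1)],
    [False, False, True],
)
```

Because the proxy ignores how a particular arm realizes each end-effector pose, any embodiment-specific controller (inverse kinematics, motion planning) can be layered underneath without changing the learned representation.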
URL
https://arxiv.org/abs/2404.17521