The systematic evaluation and understanding of computer vision models under varying conditions require large amounts of data with comprehensive and customized labels, which real-world vision datasets rarely satisfy. While current synthetic data generators offer a promising alternative, particularly for embodied AI tasks, they often fall short for computer vision tasks due to low asset and rendering quality, limited diversity, and unrealistic physical properties. We introduce the BEHAVIOR Vision Suite (BVS), a set of tools and assets to generate fully customized synthetic data for systematic evaluation of computer vision models, based on the newly developed embodied AI benchmark, BEHAVIOR-1K. BVS supports a large number of adjustable parameters at the scene level (e.g., lighting, object placement), the object level (e.g., joint configuration, attributes such as "filled" and "folded"), and the camera level (e.g., field of view, focal length). Researchers can arbitrarily vary these parameters during data generation to perform controlled experiments. We showcase three example application scenarios: systematically evaluating the robustness of models across different continuous axes of domain shift, evaluating scene understanding models on the same set of images, and training and evaluating simulation-to-real transfer for a novel vision task: unary and binary state prediction. Project website: this https URL
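To make the controlled-experiment workflow concrete, here is a minimal sketch of sweeping a single scene-level parameter while holding everything else fixed. The generator class and method names below are illustrative placeholders, not the actual BVS API; consult the project website for the real interface.

```python
# Hypothetical sketch: sweep one scene-level axis (lighting) while all other
# generation parameters stay fixed. `SceneGenerator`, `set_scene_param`, and
# `render` are placeholder names, NOT the real BVS interface.
import numpy as np

def lighting_sweep(generator, intensities):
    """Render the same scene under a continuum of lighting intensities."""
    frames = []
    for intensity in intensities:
        generator.set_scene_param("light_intensity", intensity)  # placeholder
        frames.append(generator.render())                        # placeholder
    return frames

# e.g. lighting_sweep(SceneGenerator("kitchen_01"), np.linspace(0.2, 2.0, 10))
```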
https://arxiv.org/abs/2405.09546
We offer philosophical motivations for a method we call Virtual World Cognitive Science (VW CogSci), in which researchers use virtual embodied agents that are embedded in virtual worlds to explore questions in the field of Cognitive Science. We focus on questions about mental and linguistic representation and the ways that such computational modeling can add rigor to philosophical thought experiments, as well as the terminology used in the scientific study of such representations. We find that this method forces researchers to take a god's-eye view when describing dynamical relationships between entities in minds and entities in an environment in a way that eliminates the need for problematic talk of belief and concept types, such as the belief that cats are silly, and the concept CAT, while preserving belief and concept tokens in individual cognizers' minds. We conclude with some further key advantages of VW CogSci for the scientific study of mental and linguistic representation and for Cognitive Science more broadly.
https://arxiv.org/abs/2405.08304
This paper critically analyses the "attention economy" within the framework of cognitive science and techno-political economics, as applied to both human and machine interactions. We explore how current business models, particularly in digital platform capitalism, harness user engagement by strategically shaping attentional patterns. These platforms utilize advanced AI and massive data analytics to enhance user engagement, creating a cycle of attention capture and data extraction. We review contemporary (neuro)cognitive theories of attention and platform engagement design techniques and criticize classical cognitivist and behaviourist theories for their inadequacies in addressing the potential harms of such engagement on user autonomy and wellbeing. By contrast, 4E approaches to cognitive science, which emphasize the embodied, extended, enactive, and ecological aspects of cognition, offer an intrinsic normative standpoint and a more integrated understanding of how attentional patterns are actively constituted by adaptive digital environments. By examining the precarious nature of habit formation in digital contexts, we reveal the techno-economic underpinnings that threaten personal autonomy by disaggregating habits away from the individual into an AI-managed collection of behavioural patterns. Our current predicament suggests the necessity of a paradigm shift towards an ecology of attention. This shift aims to foster environments that respect and preserve human cognitive and social capacities, countering the exploitative tendencies of cognitive capitalism.
https://arxiv.org/abs/2405.06478
Investigating children's embodied learning in mixed-reality environments, where they collaboratively simulate scientific processes, requires analyzing complex multimodal data to interpret their learning and coordination behaviors. Learning scientists have developed Interaction Analysis (IA) methodologies for analyzing such data, but this requires researchers to watch hours of video to extract and interpret students' learning patterns. Our study aims to simplify researchers' tasks by using Machine Learning and Multimodal Learning Analytics to support the IA process: we combine machine learning algorithms and multimodal analyses to streamline researcher efforts in developing a comprehensive understanding of students' scientific engagement through their movements, gaze, and affective responses in a simulated scenario. To facilitate an effective researcher-AI partnership, we present an initial case study to determine the feasibility of visually representing students' states, actions, gaze, affect, and movement on a timeline. Our case study focuses on a specific science scenario in which students learn about photosynthesis. The timeline allows us to investigate the alignment of critical learning moments identified by multimodal and interaction analysis, and to uncover insights into students' temporal learning progressions.
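A minimal sketch (not the authors' tool) of what such a timeline rendering could look like: coded student states, actions, and affect drawn as parallel tracks over a shared time axis with matplotlib. The interval data are illustrative placeholders.

```python
# Render coded multimodal events as parallel tracks on one timeline.
import matplotlib.pyplot as plt

tracks = {                       # (start_sec, duration_sec) per coded event
    "state: observing":  [(0, 12), (40, 20)],
    "action: moving":    [(12, 18), (60, 15)],
    "affect: confusion": [(30, 10)],
}

fig, ax = plt.subplots(figsize=(8, 2.5))
for row, (label, spans) in enumerate(tracks.items()):
    ax.broken_barh(spans, (row - 0.3, 0.6))   # one horizontal track per code
ax.set_yticks(range(len(tracks)), labels=list(tracks))
ax.set_xlabel("time (s)")
plt.tight_layout()
plt.show()
```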
https://arxiv.org/abs/2405.06203
Embodied AI agents require a fine-grained understanding of the physical world mediated through visual and language inputs. Such capabilities are difficult to learn solely from task-specific data. This has led to the emergence of pre-trained vision-language models as a tool for transferring representations learned from internet-scale data to downstream tasks and new domains. However, commonly used contrastively trained representations such as in CLIP have been shown to fail at enabling embodied agents to gain a sufficiently fine-grained scene understanding -- a capability vital for control. To address this shortcoming, we consider representations from pre-trained text-to-image diffusion models, which are explicitly optimized to generate images from text prompts and as such, contain text-conditioned representations that reflect highly fine-grained visuo-spatial information. Using pre-trained text-to-image diffusion models, we construct Stable Control Representations which allow learning downstream control policies that generalize to complex, open-ended environments. We show that policies learned using Stable Control Representations are competitive with state-of-the-art representation learning approaches across a broad range of simulated control settings, encompassing challenging manipulation and navigation tasks. Most notably, we show that Stable Control Representations enable learning policies that exhibit state-of-the-art performance on OVMM, a difficult open-vocabulary navigation benchmark.
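As a generic illustration (not the paper's exact Stable Control Representations pipeline), intermediate activations can be pulled out of a pre-trained text-to-image U-Net with forward hooks and fed to a policy head; the shapes below follow Stable Diffusion v1.5's latent layout.

```python
# Extract a mid-level diffusion feature with a forward hook (generic sketch).
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"  # downloads weights
)
unet.eval()

features = {}
def grab(name):
    def hook(module, inputs, output):
        features[name] = output          # cache the mid-block activation
    return hook

unet.mid_block.register_forward_hook(grab("mid"))

latents = torch.randn(1, 4, 64, 64)      # encoded image latent
text_emb = torch.randn(1, 77, 768)       # text-encoder output (stand-in)
with torch.no_grad():
    unet(latents, timestep=10, encoder_hidden_states=text_emb)

policy_input = features["mid"].flatten(1)  # feed to a downstream policy head
```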
https://arxiv.org/abs/2405.05852
Hierarchical control for robotics has long been plagued by the need for a well-defined interface layer to communicate between high-level task planners and low-level policies. With the advent of LLMs, language has emerged as a prospective interface layer. However, this has several limitations. Not all tasks can be decomposed into steps that are easily expressible in natural language (e.g. performing a dance routine). Further, it makes end-to-end finetuning on embodied data challenging due to domain shift and catastrophic forgetting. We introduce our method -- Learnable Latent Codes as Bridges (LCB) -- as an alternate architecture to overcome these limitations. LCB uses a learnable latent code to act as a bridge between LLMs and low-level policies. This enables LLMs to flexibly communicate goals in the task plan without being entirely constrained by language limitations. Additionally, it enables end-to-end finetuning without destroying the embedding space of word tokens learned during pre-training. Through experiments on Language Table and CALVIN, two common language-based benchmarks for embodied agents, we find that LCB outperforms baselines (including those with GPT-4V) that leverage pure language as the interface layer on tasks that require reasoning and multi-step behaviors.
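A minimal, hypothetical sketch of the bridge idea: a learnable projection reads the LLM hidden state at a dedicated action token and emits a latent code that conditions the low-level policy. Dimensions and module names are illustrative, not the paper's implementation.

```python
# Learnable latent code bridging an LLM and a low-level policy (sketch).
import torch
import torch.nn as nn

class LatentBridge(nn.Module):
    def __init__(self, llm_dim=4096, latent_dim=64, obs_dim=128, act_dim=7):
        super().__init__()
        self.to_latent = nn.Linear(llm_dim, latent_dim)   # trained end-to-end
        self.policy = nn.Sequential(
            nn.Linear(latent_dim + obs_dim, 256), nn.ReLU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, act_token_hidden, obs):
        z = self.to_latent(act_token_hidden)  # goal latent, no text bottleneck
        return self.policy(torch.cat([z, obs], dim=-1))

bridge = LatentBridge()
action = bridge(torch.randn(1, 4096), torch.randn(1, 128))
```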
https://arxiv.org/abs/2405.04798
We present and tackle the problem of Embodied Question Answering (EQA) with Situational Queries (S-EQA) in a household environment. Unlike prior EQA work tackling simple queries that directly reference target objects and quantifiable properties pertaining to them, EQA with situational queries (such as "Is the bathroom clean and dry?") is more challenging, as the agent needs to figure out not just which target objects the query pertains to, but also to reach a consensus on their states for the query to be answerable. Towards this objective, we first introduce a novel Prompt-Generate-Evaluate (PGE) scheme that wraps around an LLM's output to create a dataset of unique situational queries, corresponding consensus object information, and predicted answers. PGE maintains uniqueness among the generated queries using multiple forms of semantic similarity. We validate the generated dataset via a large-scale user study conducted on M-Turk, and introduce it as S-EQA, the first dataset tackling EQA with situational queries. Our user study establishes the authenticity of S-EQA, with a high 97.26% of the generated queries deemed answerable given the consensus object data. Conversely, we observe a low correlation of 46.2% between LLM-predicted answers and human-evaluated ones, indicating the LLM's poor capability in directly answering situational queries, while establishing S-EQA's usability in providing a human-validated consensus for an indirect solution. We evaluate S-EQA via Visual Question Answering (VQA) on VirtualHome, which, unlike other simulators, contains several objects with modifiable states that also visually appear different upon modification -- enabling us to set a quantitative benchmark for S-EQA. To the best of our knowledge, this is the first work to introduce EQA with situational queries, and also the first to use a generative approach for query creation.
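A sketch of the uniqueness-filtering step one could build with off-the-shelf sentence embeddings: a newly generated situational query is accepted only if it is semantically distant from everything accepted so far. The LLM call is elided and the 0.8 threshold is a stand-in, not the paper's setting.

```python
# Semantic-similarity deduplication for generated queries (sketch).
import torch
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")
accepted, embeddings = [], []

def try_accept(query, threshold=0.8):
    """Keep `query` only if no accepted query is too similar to it."""
    emb = encoder.encode(query, convert_to_tensor=True)
    if embeddings and util.cos_sim(emb, torch.stack(embeddings)).max() > threshold:
        return False            # too similar to an existing query; discard
    accepted.append(query)
    embeddings.append(emb)
    return True
```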
https://arxiv.org/abs/2405.04732
Data augmentation is a key technique for addressing the challenge of limited datasets, and it has become a major component of image-processing training procedures. Techniques such as geometric transformations and color space adjustments have been thoroughly tested for their ability to artificially expand training datasets and generate semi-realistic data for training purposes. Polygons play a crucial role in instance segmentation and have seen a surge in use across advanced models such as YOLOv8. Despite their growing popularity, the lack of specialized libraries hampers the polygon-augmentation process. This paper introduces a novel solution to this challenge, embodied in the newly developed AugmenTory library. Notably, AugmenTory offers reduced computational demands in both time and space compared to existing methods. Additionally, the library includes a postprocessing thresholding feature. The AugmenTory package is publicly available on GitHub, where interested users can access the source code: this https URL
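The core geometric requirement behind polygon augmentation is easy to state: every vertex must undergo the same transform as the image, or the label drifts off the object. A minimal numpy sketch for rotation about the image centre (AugmenTory's internals may differ):

```python
# Rotate a polygon label together with its image (pure-numpy sketch).
import numpy as np

def rotate_polygon(vertices, angle_deg, image_wh):
    """vertices: (N, 2) array of (x, y); returns the rotated (N, 2) array."""
    cx, cy = image_wh[0] / 2.0, image_wh[1] / 2.0
    theta = np.deg2rad(angle_deg)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    return (vertices - [cx, cy]) @ rot.T + [cx, cy]

poly = np.array([[10, 10], [50, 10], [50, 40], [10, 40]], dtype=float)
rotated = rotate_polygon(poly, 30, image_wh=(640, 480))
```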
https://arxiv.org/abs/2405.04442
Uncertainty has long been a critical area of study in robotics, particularly when robots are equipped with analytical models. As we move towards the widespread use of deep neural networks in robots, which have demonstrated remarkable performance in research settings, understanding the nuances of uncertainty becomes crucial for their real-world deployment. This guide offers an overview of the importance of uncertainty and provides methods to quantify and evaluate it from an applications perspective.
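As one concrete example of the kind of method such a guide covers, Monte Carlo dropout estimates predictive uncertainty from the spread of stochastic forward passes. This is a generic PyTorch sketch, not drawn from the guide itself.

```python
# Monte Carlo dropout: sample stochastic passes, read spread as uncertainty.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Dropout(0.2),
                      nn.Linear(64, 1))

def mc_dropout_predict(model, x, n_samples=50):
    model.train()                      # keeps dropout stochastic on purpose
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(n_samples)])
    return samples.mean(0), samples.std(0)   # predictive mean and spread

mean, std = mc_dropout_predict(model, torch.randn(8, 16))
```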
https://arxiv.org/abs/2405.03164
Physical reasoning is important for effective robot manipulation. Recent work has investigated both vision and language modalities for physical reasoning; vision can reveal information about objects in the environment and language serves as an abstraction and communication medium for additional context. Although these works have demonstrated success on a variety of physical reasoning tasks, they are limited to physical properties that can be inferred from visual or language inputs. In this work, we investigate combining tactile perception with language, which enables embodied systems to obtain physical properties through interaction and apply common-sense reasoning. We contribute a new dataset PhysiCleAR, which comprises both physical/property reasoning tasks and annotated tactile videos obtained using a GelSight tactile sensor. We then introduce Octopi, a system that leverages both tactile representation learning and large vision-language models to predict and reason about tactile inputs with minimal language fine-tuning. Our evaluations on PhysiCleAR show that Octopi is able to effectively use intermediate physical property predictions to improve physical reasoning in both trained tasks and for zero-shot reasoning. PhysiCleAR and Octopi are available on this https URL.
https://arxiv.org/abs/2405.02794
Ultrasound robots are increasingly used in medical diagnostics and early disease screening. However, current ultrasound robots lack the intelligence to understand human intentions and instructions, hindering autonomous ultrasound scanning. To solve this problem, we propose a novel Ultrasound Embodied Intelligence system that equips ultrasound robots with a large language model (LLM) and domain knowledge, thereby improving the efficiency of ultrasound robots. Specifically, we first design an ultrasound operation knowledge database to add expertise in ultrasound scanning to the LLM, enabling the LLM to perform precise motion planning. Furthermore, we devise a dynamic ultrasound scanning strategy based on "think-observe-execute" prompt engineering, allowing the LLM to dynamically adjust motion-planning strategies during scanning procedures. Extensive experiments demonstrate that our system significantly improves the efficiency and quality of ultrasound scans driven by verbal commands. This advancement in autonomous medical scanning technology contributes to non-invasive diagnostics and streamlined medical workflows.
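A heavily simplified, hypothetical rendering of what a think-observe-execute prompting loop could look like; `llm`, `observe_probe`, and `execute` are placeholders standing in for the system's components, not its actual API.

```python
# Hypothetical think-observe-execute loop for LLM-guided scanning (sketch).
def scan_step(llm, robot, goal, history):
    thought = llm(f"Goal: {goal}\nHistory: {history}\n"
                  "Think: plan the next probe movement.")
    observation = robot.observe_probe()       # e.g. current ultrasound image
    command = llm(f"Plan: {thought}\nObservation: {observation}\n"
                  "Execute: output one motion command.")
    robot.execute(command)
    history.append((thought, observation, command))
```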
https://arxiv.org/abs/2405.00461
Neural language models, particularly large-scale ones, have consistently proven to be the most effective at predicting brain neural activity across a range of studies. However, previous research overlooked the comparison of these models with psychologically plausible ones, and evaluations relied on limited, single-modality, English-language cognitive datasets. To address these limitations, we conducted an analysis comparing the encoding performance of various neural language models and psychologically plausible models. Our study utilized extensive multi-modal cognitive datasets, examining bilingual word and discourse levels. Surprisingly, our findings revealed that psychologically plausible models outperformed neural language models across diverse contexts, encompassing different modalities such as fMRI and eye-tracking, and spanning languages from English to Chinese. Among psychologically plausible models, the one incorporating embodied information emerged as particularly exceptional, demonstrating superior performance at both the word and discourse levels and robustly predicting brain activation across numerous regions in both English and Chinese.
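For context, the encoding-model recipe underlying such comparisons is standard: ridge-regress voxel responses onto model-derived word features and score held-out prediction accuracy per voxel. A self-contained sketch with synthetic arrays standing in for real fMRI and feature matrices (the paper's exact pipeline may differ):

```python
# Voxelwise encoding model: ridge regression + held-out correlation.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 300))      # word features (words x dims)
Y = rng.standard_normal((500, 1000))     # brain responses (words x voxels)

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=0)
pred = Ridge(alpha=10.0).fit(X_tr, Y_tr).predict(X_te)

# per-voxel Pearson correlation between predicted and observed responses
r = [np.corrcoef(pred[:, v], Y_te[:, v])[0, 1] for v in range(Y.shape[1])]
print(f"mean encoding accuracy: {np.mean(r):.3f}")
```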
https://arxiv.org/abs/2404.19364
Recent research on instructable agents has used memory-augmented Large Language Models (LLMs) as task planners, a technique that retrieves language-program examples relevant to the input instruction and uses them as in-context examples in the LLM prompt to improve the LLM's performance in inferring correct action and task plans. In this technical report, we extend the capabilities of HELPER by expanding its memory with a wider array of examples and prompts and by integrating additional APIs for asking questions. This simple expansion of HELPER into a shared memory enables the agent to work across the domains of executing plans from dialogue, natural language instruction following, active question asking, and commonsense room reorganization. We evaluate the agent on four diverse interactive visual-language embodied agent benchmarks: ALFRED, TEACh, DialFRED, and the Tidy Task. HELPER-X achieves few-shot, state-of-the-art performance across these benchmarks using a single agent, without requiring in-domain training, and remains competitive with agents that have undergone in-domain training.
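The retrieval step that such memory-augmented planners rely on can be sketched in a few lines: embed the incoming instruction, pull the k most similar stored language-program pairs, and splice them into the planner prompt as in-context examples. The embedding source and prompt format below are illustrative, not HELPER-X's actual ones.

```python
# Top-k memory retrieval and prompt assembly for an LLM planner (sketch).
import numpy as np

def retrieve_examples(instruction_emb, memory_embs, memory_pairs, k=3):
    """memory_embs: (N, d); memory_pairs: list of (instruction, program)."""
    sims = memory_embs @ instruction_emb / (
        np.linalg.norm(memory_embs, axis=1) * np.linalg.norm(instruction_emb))
    top = np.argsort(-sims)[:k]
    return [memory_pairs[i] for i in top]

def build_prompt(instruction, examples):
    shots = "\n\n".join(f"Instruction: {q}\nProgram: {p}" for q, p in examples)
    return f"{shots}\n\nInstruction: {instruction}\nProgram:"
```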
https://arxiv.org/abs/2404.19065
Despite advancements in Large Language Models (LLMs) and Large Multimodal Models (LMMs), their integration into language-grounded, human-like embodied agents remains incomplete, hindering complex real-life task performance in physical environments. Existing integrations often feature limited open-sourcing, hampering collective progress in this field. We introduce LEGENT, an open, scalable platform for developing embodied agents using LLMs and LMMs. LEGENT offers a dual approach: a rich, interactive 3D environment with communicable and actionable agents, paired with a user-friendly interface, and a sophisticated data-generation pipeline utilizing advanced algorithms to exploit supervision from simulated worlds at scale. In our experiments, an embryonic vision-language-action model trained on LEGENT-generated data surpasses GPT-4V in embodied tasks, showcasing promising generalization capabilities.
https://arxiv.org/abs/2404.18243
LiDAR-camera extrinsic calibration (LCEC) is crucial for data fusion in intelligent vehicles. Offline, target-based approaches have long been the preferred choice in this field. However, they often demonstrate poor adaptability to real-world environments. This is largely because extrinsic parameters may change significantly due to moderate shocks or during extended operations in environments with vibrations. In contrast, online, target-free approaches provide greater adaptability yet typically lack robustness, primarily due to the challenges in cross-modal feature matching. Therefore, in this article, we unleash the full potential of large vision models (LVMs), which are emerging as a significant trend in the fields of computer vision and robotics, especially for embodied artificial intelligence, to achieve robust and accurate online, target-free LCEC across a variety of challenging scenarios. Our main contributions are threefold: we introduce a novel framework known as MIAS-LCEC, provide an open-source versatile calibration toolbox with an interactive visualization interface, and publish three real-world datasets captured from various indoor and outdoor environments. The cornerstone of our framework and toolbox is the cross-modal mask matching (C3M) algorithm, developed based on a state-of-the-art (SoTA) LVM and capable of generating sufficient and reliable matches. Extensive experiments conducted on these real-world datasets demonstrate the robustness of our approach and its superior performance compared to SoTA methods, particularly for the solid-state LiDARs with super-wide fields of view.
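Cross-modal mask matching can be illustrated generically: score every (camera mask, projected-LiDAR mask) pair by IoU and take the globally optimal one-to-one assignment. This is a plain sketch of the idea, not the paper's C3M algorithm, which builds on a large vision model's masks.

```python
# Greedy-optimal mask matching via IoU + Hungarian assignment (sketch).
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_masks(masks_a, masks_b):
    """masks_a: (Na, H, W) bool; masks_b: (Nb, H, W) bool."""
    iou = np.zeros((len(masks_a), len(masks_b)))
    for i, a in enumerate(masks_a):
        for j, b in enumerate(masks_b):
            inter = np.logical_and(a, b).sum()
            union = np.logical_or(a, b).sum()
            iou[i, j] = inter / union if union else 0.0
    rows, cols = linear_sum_assignment(-iou)   # maximize total IoU
    return [(i, j, iou[i, j]) for i, j in zip(rows, cols)]
```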
https://arxiv.org/abs/2404.18083
While neural implicit representations have gained popularity in multi-view 3D reconstruction, previous work struggles to yield physically plausible results, thereby limiting their applications in physics-demanding domains like embodied AI and robotics. The lack of plausibility originates from both the absence of physics modeling in the existing pipeline and their inability to recover intricate geometrical structures. In this paper, we introduce PhyRecon, which stands as the first approach to harness both differentiable rendering and differentiable physics simulation to learn implicit surface representations. Our framework proposes a novel differentiable particle-based physical simulator seamlessly integrated with the neural implicit representation. At its core is an efficient transformation between SDF-based implicit representation and explicit surface points by our proposed algorithm, Surface Points Marching Cubes (SP-MC), enabling differentiable learning with both rendering and physical losses. Moreover, we model both rendering and physical uncertainty to identify and compensate for the inconsistent and inaccurate monocular geometric priors. The physical uncertainty additionally enables a physics-guided pixel sampling to enhance the learning of slender structures. By amalgamating these techniques, our model facilitates efficient joint modeling with appearance, geometry, and physics. Extensive experiments demonstrate that PhyRecon significantly outperforms all state-of-the-art methods in terms of reconstruction quality. Our reconstruction results also yield superior physical stability, verified by Isaac Gym, with at least a 40% improvement across all datasets, opening broader avenues for future physics-based applications.
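The SDF-to-explicit-surface conversion at the heart of SP-MC builds on classical marching cubes, which scikit-image exposes directly. A non-differentiable sketch for intuition; SP-MC is the paper's differentiable variant of this step:

```python
# Extract explicit surface points from an SDF grid with marching cubes.
import numpy as np
from skimage import measure

# analytic SDF of a sphere of radius 0.5 sampled on a 64^3 grid
grid = np.linspace(-1, 1, 64)
x, y, z = np.meshgrid(grid, grid, grid, indexing="ij")
sdf = np.sqrt(x**2 + y**2 + z**2) - 0.5

verts, faces, normals, _ = measure.marching_cubes(sdf, level=0.0)
print(verts.shape)   # (n_surface_points, 3): the zero level set
```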
https://arxiv.org/abs/2404.16666
Semi-supervised action recognition aims to improve spatio-temporal reasoning ability using a few labeled data in conjunction with a large amount of unlabeled data. Despite recent advancements, existing powerful methods are still prone to making ambiguous predictions under scarce labeled data, manifesting as a limited ability to distinguish different actions with similar spatio-temporal information. In this paper, we approach this problem by endowing the model with two capabilities, namely discriminative spatial modeling and temporal structure modeling, for learning discriminative spatio-temporal representations. Specifically, we propose an Adaptive Contrastive Learning (ACL) strategy. It assesses the confidence of all unlabeled samples via the class prototypes of the labeled data, and adaptively selects positive-negative samples from a pseudo-labeled sample bank to construct contrastive learning. Additionally, we introduce a Multi-scale Temporal Learning (MTL) strategy. It highlights informative semantics from long-term clips and integrates them into the short-term clip while suppressing noisy information. Both of these new techniques are then integrated into a unified framework to encourage the model to make accurate predictions. Extensive experiments on UCF101, HMDB51, and Kinetics400 show the superiority of our method over prior state-of-the-art approaches.
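The prototype-based confidence step can be sketched as follows: build one prototype per class from labeled features, then score each unlabeled sample by a softmax over its cosine similarities to the prototypes. Dimensions, temperature, and threshold are illustrative choices, not the paper's settings.

```python
# Prototype-based confidence scoring for unlabeled samples (sketch).
import torch
import torch.nn.functional as F

def prototype_confidence(labeled_feats, labels, unlabeled_feats, num_classes):
    protos = torch.stack([labeled_feats[labels == c].mean(0)
                          for c in range(num_classes)])
    sims = F.normalize(unlabeled_feats, dim=1) @ F.normalize(protos, dim=1).T
    conf, pseudo = F.softmax(sims / 0.1, dim=1).max(1)   # temperature 0.1
    return conf, pseudo                                  # confidence + label

conf, pseudo = prototype_confidence(
    torch.randn(40, 128), torch.randint(0, 4, (40,)), torch.randn(100, 128), 4)
keep = conf > 0.9   # high-confidence samples enter the contrastive bank
```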
https://arxiv.org/abs/2404.16416
Large Vision-Language Models (LVLMs) show significant strides in general-purpose multimodal applications such as visual dialogue and embodied navigation. However, existing multimodal evaluation benchmarks cover a limited number of multimodal tasks testing rudimentary capabilities, falling short in tracking LVLM development. In this study, we present MMT-Bench, a comprehensive benchmark designed to assess LVLMs across massive multimodal tasks requiring expert knowledge and deliberate visual recognition, localization, reasoning, and planning. MMT-Bench comprises 31,325 meticulously curated multiple-choice visual questions from various multimodal scenarios such as vehicle driving and embodied navigation, covering 32 core meta-tasks and 162 subtasks in multimodal understanding. Due to its extensive task coverage, MMT-Bench enables the evaluation of LVLMs using a task map, facilitating the discovery of in- and out-of-domain tasks. Evaluation results involving 30 LVLMs, such as the proprietary GPT-4V, GeminiProVision, and the open-sourced InternVL-Chat, underscore the significant challenges posed by MMT-Bench. We anticipate that MMT-Bench will inspire the community to develop next-generation multimodal foundation models aimed at achieving general-purpose multimodal intelligence.
https://arxiv.org/abs/2404.16006
Embodied reasoning systems integrate robotic hardware and cognitive processes to perform complex tasks, typically in response to a natural language query about a specific physical environment. This usually involves changing the belief about the scene or physically interacting with and changing the scene (e.g. 'Sort the objects from lightest to heaviest'). To facilitate the development of such systems, we introduce a new simulation environment that uses the MuJoCo physics engine and the high-quality renderer Blender to provide realistic visual observations that are also accurate to the physical state of the scene. Together with the simulator, we propose a new benchmark composed of 10 classes of multi-step reasoning scenarios that require simultaneous visual and physical measurements. Finally, we develop a new modular Closed Loop Interactive Reasoning (CLIER) approach that takes into account measurements of non-visual object properties, changes in the scene caused by external disturbances, and the uncertain outcomes of robotic actions. We extensively evaluate our reasoning approach in simulation and in real-world manipulation tasks, with success rates above 76% and 64%, respectively.
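A heavily simplified, hypothetical skeleton of such a closed-loop reasoning cycle, in which re-observation after every action absorbs external disturbances and failed actions. All function names are placeholders, not the paper's API.

```python
# Closed-loop reasoning skeleton: act, re-observe, revise belief (sketch).
def closed_loop_reason(query, perceive, plan, execute, max_steps=20):
    belief = perceive()                      # initial visual + physical state
    for _ in range(max_steps):
        step = plan(query, belief)           # next measurement or manipulation
        if step.is_answer:
            return step.answer
        execute(step)                        # outcome may be uncertain
        belief = perceive()                  # re-measure; never trust the plan
    return None
```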
https://arxiv.org/abs/2404.15194
Biological intelligence uses a "multiscale competency architecture" (MCA): it exhibits adaptive, goal-directed behaviour at all scales, from cells to organs to organisms. In contrast, machine intelligence is only adaptive and goal-directed at a high level: learned policies are passively interpreted using abstractions (e.g. arithmetic) embodied in static interpreters (e.g. x86). Biological intelligence excels at causal learning; machine intelligence does not. Previous work showed that causal learning follows from weak policy optimisation, which is hindered by presupposed abstractions in silico. Here we formalise MCAs as nested "agentic abstraction layers" to understand how they might learn causes. We show that weak policy optimisation at low levels enables weak policy optimisation at high levels. This facilitates what we call "multiscale causal learning" and high-level goal-directed behaviour. We argue that by engineering human abstractions in silico we disconnect high-level goal-directed behaviour from the low-level goal-directed behaviour that gave rise to it. This inhibits causal learning, and we speculate this is one reason why human recall is accompanied by feeling while in silico recall is not.
https://arxiv.org/abs/2405.02325