To safely navigate intricate real-world scenarios, autonomous vehicles must be able to adapt to diverse road conditions and anticipate future events. World-model (WM) based reinforcement learning (RL) has emerged as a promising approach by learning and predicting the complex dynamics of various environments. Nevertheless, to the best of our knowledge, there is no accessible platform for training and testing such algorithms in sophisticated driving environments. To fill this void, we introduce CarDreamer, the first open-source learning platform designed specifically for developing WM-based autonomous driving algorithms. It comprises three key components: 1) World model backbone: CarDreamer integrates several state-of-the-art WMs, which simplifies the reproduction of RL algorithms. The backbone is decoupled from the rest of the platform and communicates through the standard Gym interface, so that users can easily integrate and test their own algorithms. 2) Built-in tasks: CarDreamer offers a comprehensive set of highly configurable driving tasks that are compatible with the Gym interface and equipped with empirically optimized reward functions. 3) Task development suite: This suite streamlines the creation of driving tasks, enabling easy definition of traffic flows and vehicle routes, along with automatic collection of multi-modal observation data. A visualization server lets users view real-time driving videos of the agent and performance metrics through a browser. Furthermore, we conduct extensive experiments using the built-in tasks to evaluate the performance and potential of WMs in autonomous driving. Thanks to the richness and flexibility of CarDreamer, we also systematically study the impact of observation modality, observability, and sharing of vehicle intentions on AV safety and efficiency. All code and documentation are accessible at this https URL.
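The decoupled backbone and built-in tasks communicate through the standard Gym interface mentioned above. A minimal sketch of that contract follows; the toy `DrivingTask` environment, its observations, and its reward are invented for illustration and are not CarDreamer's actual API:

```python
# Minimal sketch of the Gym-style contract (reset/step) that the world-model
# backbone is said to communicate through. DrivingTask is a toy stand-in,
# not CarDreamer's actual API.

class DrivingTask:
    """Toy environment exposing the classic Gym contract: reset() and step()."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        # Multi-modal observations are returned as a dict, one key per modality.
        return {"speed": 0.0, "lane_offset": 0.0}

    def step(self, action):
        self.t += 1
        observation = {"speed": float(action), "lane_offset": 0.0}
        reward = 1.0 - abs(action - 0.5)       # toy shaped reward
        terminated = self.t >= self.horizon    # episode ends at the horizon
        info = {}
        return observation, reward, terminated, info


env = DrivingTask()
obs = env.reset()
total, done = 0.0, False
while not done:
    action = 0.5                               # fixed policy, just for the sketch
    obs, reward, done, info = env.step(action)
    total += reward
print(round(total, 2))  # prints 5.0
```

Any agent that speaks this reset/step protocol can be dropped in without touching the rest of the platform, which is the point of the decoupling.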
https://arxiv.org/abs/2405.09111
"How does the person in the bounding box feel?" Achieving human-level recognition of the apparent emotion of a person in real-world situations remains an unsolved task in computer vision. Facial expressions are not enough: body pose, contextual knowledge, and commonsense reasoning all contribute to how humans perform this emotional theory-of-mind task. In this paper, we examine two major approaches enabled by recent large vision-language models: 1) image captioning followed by a language-only LLM, and 2) vision-language models, under zero-shot and fine-tuned setups. We evaluate the methods on the Emotions in Context (EMOTIC) dataset and demonstrate that a vision-language model, fine-tuned even on a small dataset, can significantly outperform traditional baselines. The results of this work aim to help robots and agents perform emotionally sensitive decision-making and interaction in the future.
https://arxiv.org/abs/2405.08992
Transformer-based long-context generative models power emerging AI applications like hour-long video understanding and project-level coding agents. Deploying long-context transformers (e.g., 100K to 10M tokens) is prohibitively expensive compared to short-context (e.g., 4K tokens) model variants. Reducing the cost of long-context transformers is becoming a pressing research and engineering challenge as of 2024. This work describes a concurrent programming framework for quantitatively analyzing the efficiency challenges of serving multiple long-context requests under a limited GPU high-bandwidth memory (HBM) budget. We give a detailed analysis of how all additional computational costs, compared to 4K context, trace back to one single source: the large size of the KV cache. We use a 34B GPT-3.5-level model with 50K context on A100 NVLink as a running example, and describe how its large KV cache causes four types of deployment challenges: (1) prefilling long inputs takes much more compute time and GPU memory than short inputs; (2) after prefilling, the large KV cache residing in GPU HBM substantially restricts the number of concurrent users that can be served; (3) during decoding, repeatedly reading the KV cache from HBM to the SMs largely increases latency; (4) when the KV cache overflows HBM, swapping it to DDR causes significant context-switching latency. We use this framework to analyze existing works and identify possibilities of combining them to build end-to-end systems. Overall, this work offers a foundational framework for analyzing long-context transformer deployment and identifies directions towards making 1M-context inference as cheap as 4K.
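The claim that every extra cost traces back to KV-cache size can be made concrete with back-of-envelope arithmetic. The sketch below assumes Llama-style dimensions for a ~34B model (48 layers, 64 attention heads of dimension 128, fp16); these figures are not taken from the paper, and grouped-query attention (GQA) would shrink them:

```python
# Back-of-envelope KV-cache sizing for a single 50K-token request.
# Model dimensions are assumed Llama-style values for a ~34B model,
# not taken from the paper.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # 2 tensors per layer (K and V), each of shape [n_kv_heads, seq_len, head_dim].
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

full = kv_cache_bytes(n_layers=48, n_kv_heads=64, head_dim=128, seq_len=50_000)
gqa = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128, seq_len=50_000)

# Full multi-head attention would nearly fill an 80 GiB A100 with one request;
# 8 KV heads (GQA) cut the cache by 8x but it still dwarfs a 4K-context cache.
print(f"full attention: {full / 2**30:.1f} GiB")
print(f"with 8 KV heads (GQA): {gqa / 2**30:.1f} GiB")
```

Either way the cache, not the weights' activation math, is what limits concurrency: each additional user costs tens of gibibytes of HBM before a single decode step runs.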
https://arxiv.org/abs/2405.08944
This paper addresses the critical need for refining robot motions that, despite achieving high visual similarity through human-to-humanoid retargeting methods, fall short of practical execution in the physical realm. Existing techniques in the graphics community often prioritize visual fidelity over physics-based feasibility, posing a significant challenge for deploying bipedal systems in practical applications. Our research introduces a constrained reinforcement learning algorithm that produces physics-based, high-quality motion imitation on legged humanoid robots, enhancing motion resemblance while successfully following the reference human trajectory. We name our framework I-CTRL. By reformulating the motion imitation problem as a constrained refinement over non-physics-based retargeted motions, our framework excels in motion imitation with simple and unique rewards that generalize across four robots. Moreover, our framework can follow large-scale motion datasets with a unique RL agent. The proposed approach signifies a crucial step forward in advancing the control of bipedal robots, emphasizing the importance of aligning visual and physical realism for successful motion imitation.
https://arxiv.org/abs/2405.08726
Autonomous intersection management (AIM) poses significant challenges due to the intricate nature of real-world traffic scenarios and the need for a highly expensive centralised server in charge of simultaneously controlling all the vehicles. This study addresses such issues by proposing a novel distributed approach to AIM utilizing multi-agent reinforcement learning (MARL). We show that by leveraging the 3D surround view technology for advanced assistance systems, autonomous vehicles can accurately navigate intersection scenarios without needing any centralised controller. The contributions of this paper thus include a MARL-based algorithm for the autonomous management of a 4-way intersection and also the introduction of a new strategy called prioritised scenario replay for improved training efficacy. We validate our approach as an innovative alternative to conventional centralised AIM techniques, ensuring the full reproducibility of our results. Specifically, experiments conducted in virtual environments using the SMARTS platform highlight its superiority over benchmarks across various metrics.
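The abstract names a "prioritised scenario replay" strategy without specifying it. The sketch below only illustrates the general idea of drawing training scenarios in proportion to a priority score; the particular score used here (an exponential moving average of failures) is an assumption, not the paper's scheme:

```python
# Hedged sketch of a prioritised-scenario-replay-style sampler: scenarios that
# keep failing are replayed more often during training. The exact scheme in
# the paper may differ; this only illustrates the idea.
import random

class ScenarioSampler:
    def __init__(self, scenarios, eps=0.01):
        self.priority = {s: 1.0 for s in scenarios}  # optimistic initialisation
        self.eps = eps                               # keeps every scenario reachable

    def sample(self):
        names = list(self.priority)
        weights = [self.priority[n] + self.eps for n in names]
        return random.choices(names, weights=weights, k=1)[0]

    def update(self, scenario, failed, alpha=0.2):
        # Failed episodes pull a scenario's priority toward 1, successes toward 0.
        target = 1.0 if failed else 0.0
        p = self.priority[scenario]
        self.priority[scenario] = (1 - alpha) * p + alpha * target

random.seed(0)
sampler = ScenarioSampler(["left_turn", "straight", "unprotected_right"])
for _ in range(50):
    s = sampler.sample()
    sampler.update(s, failed=(s == "left_turn"))  # toy: one scenario always fails
print({k: round(v, 2) for k, v in sampler.priority.items()})
```

The effect is a self-adjusting curriculum: training compute concentrates on the intersection scenarios the agents currently handle worst.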
https://arxiv.org/abs/2405.08655
In the training process of Deep Reinforcement Learning (DRL), agents require repeated interactions with the environment. As training volume and model complexity increase, enhancing the data utilization and explainability of DRL training remains a challenging problem. This paper addresses these challenges by focusing on the temporal correlations within the time dimension of time series. We propose a novel approach to segment multivariate time series into meaningful subsequences and represent the time series based on these subsequences. Furthermore, the subsequences are employed for causal inference to identify fundamental causal factors that significantly impact training outcomes. We design a module to provide feedback on this causality during DRL training. Several experiments demonstrate the feasibility of our approach in common environments, confirming its ability to enhance the effectiveness of DRL training and impart a certain level of explainability to the training process. Additionally, we extend our approach with the prioritized experience replay algorithm, and experimental results demonstrate its continued effectiveness.
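As a toy stand-in for the proposed segmentation (not the paper's actual method), one can split a series into subsequences wherever consecutive values jump by more than a threshold; each resulting subsequence then serves as a unit for downstream representation or causal analysis:

```python
# Toy change-point segmentation of a 1-D series: cut wherever the jump between
# consecutive values exceeds a threshold. The paper's segmentation of
# multivariate series is more sophisticated; this only illustrates the idea
# of turning a raw series into meaningful subsequences.

def segment(series, threshold=2.0):
    segments, current = [], [series[0]]
    for prev, cur in zip(series, series[1:]):
        if abs(cur - prev) > threshold:
            segments.append(current)   # close the current subsequence
            current = []
        current.append(cur)
    segments.append(current)
    return segments

series = [0, 0.1, 0.2, 5.0, 5.1, 5.2, 0.3, 0.2]
print(segment(series))  # three regimes: low, high, low
```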
https://arxiv.org/abs/2405.08380
This paper presents Seal-Tools, a new tool-learning dataset that contains self-instruct API-like tools. Seal-Tools not only offers a large number of tools, but also includes instances that demonstrate the practical application of tools. Seeking to generate data at large scale while ensuring reliability, we propose a self-instruct method to generate tools and instances, allowing precise control over the process. Moreover, Seal-Tools contains hard instances that call multiple tools to complete the job, some of which are nested tool callings. For precise and comprehensive evaluation, we use strict format control and design three metrics from different dimensions. Therefore, Seal-Tools can serve as a new benchmark for evaluating the tool-calling ability of LLMs. Finally, we evaluate several prevalent LLMs and our finetuned model on Seal-Tools. The results show that current systems are far from perfect. The code, data, and experiment results are available at this https URL.
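A nested tool calling, as described above, is one where a tool's argument depends on another tool's output. The JSON schema below is invented for illustration and is not Seal-Tools' actual instance format:

```python
# Hedged illustration of a "nested tool calling" instance: one tool's output
# feeds another tool's argument. The schema (query/calls/from_call) is made up
# for this sketch, not Seal-Tools' actual format.
import json

instance = {
    "query": "Email me the weather forecast for Paris tomorrow.",
    "calls": [
        {"id": "call_1", "tool": "get_weather",
         "args": {"city": "Paris", "date": "tomorrow"}},
        {"id": "call_2", "tool": "send_email",
         # Nested call: this argument references call_1's result.
         "args": {"to": "user", "body": {"from_call": "call_1"}}},
    ],
}

def nested_call_ids(inst):
    """Return ids of calls whose arguments depend on another call's output."""
    nested = []
    for call in inst["calls"]:
        if "from_call" in json.dumps(call["args"]):
            nested.append(call["id"])
    return nested

print(nested_call_ids(instance))  # ['call_2']
```

Such instances are hard precisely because the model must plan call order and thread outputs into later arguments, not just pick the right tools.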
https://arxiv.org/abs/2405.08355
We offer philosophical motivations for a method we call Virtual World Cognitive Science (VW CogSci), in which researchers use virtual embodied agents that are embedded in virtual worlds to explore questions in the field of Cognitive Science. We focus on questions about mental and linguistic representation and the ways that such computational modeling can add rigor to philosophical thought experiments, as well as the terminology used in the scientific study of such representations. We find that this method forces researchers to take a god's-eye view when describing dynamical relationships between entities in minds and entities in an environment in a way that eliminates the need for problematic talk of belief and concept types, such as the belief that cats are silly, and the concept CAT, while preserving belief and concept tokens in individual cognizers' minds. We conclude with some further key advantages of VW CogSci for the scientific study of mental and linguistic representation and for Cognitive Science more broadly.
https://arxiv.org/abs/2405.08304
ChatGPT is a conversational agent built on a large language model. Trained on a significant portion of human output, ChatGPT can mimic people to a degree. As such, we need to consider what social identities ChatGPT simulates (or can be designed to simulate). In this study, we explored the case of identity simulation through Japanese first-person pronouns, which are tightly connected to social identities in intersectional ways, i.e., intersectional pronouns. We conducted a controlled online experiment where people from two regions in Japan (Kanto and Kinki) witnessed interactions with ChatGPT using ten sets of first-person pronouns. We discovered that pronouns alone can evoke perceptions of social identities in ChatGPT at the intersections of gender, age, region, and formality, with caveats. This work highlights the importance of pronoun use for social identity simulation, provides a language-based methodology for culturally-sensitive persona development, and advances the potential of intersectional identities in intelligent agents.
https://arxiv.org/abs/2405.08238
Although Federated Learning (FL) is promising for knowledge sharing among heterogeneous Artificial Intelligence of Things (AIoT) devices, their training performance and energy efficiency are severely restricted in practical battery-driven scenarios due to the "wooden barrel effect" caused by the mismatch between homogeneous model paradigms and heterogeneous device capabilities. As a result, due to the various differences among devices, existing FL methods struggle to train effectively in energy-constrained scenarios, such as under device battery constraints. To tackle these issues, we propose an energy-aware FL framework named DR-FL, which considers the energy constraints of both clients and heterogeneous deep learning models to enable energy-efficient FL. Unlike vanilla FL, DR-FL adopts our proposed Multi-Agent Reinforcement Learning (MARL)-based dual-selection method, which allows participating devices to contribute to the global model effectively and adaptively based on their computing capabilities and energy capacities. Experiments on various well-known datasets show that DR-FL can not only maximise knowledge sharing among heterogeneous models under the energy constraints of large-scale AIoT systems but also improve the model performance of each involved heterogeneous device.
https://arxiv.org/abs/2405.08183
Mixed-integer quadratic programs (MIQPs) are a versatile way of formulating vehicle decision-making and motion-planning problems, where the prediction model is a hybrid dynamical system involving both discrete and continuous decision variables. However, even the most advanced MIQP solvers can hardly meet the challenging requirements of automotive embedded platforms. Thus, we use machine learning to simplify and hence speed up optimization. Our work builds on recent ideas for solving MIQPs in real time by training a neural network to predict the optimal values of the integer variables and solving the remaining problem by online quadratic programming. Specifically, we propose a recurrent permutation-equivariant deep set that is particularly suited for imitating MIQPs that involve many obstacles, which is often the major source of computational burden in motion-planning problems. Our framework also comprises a feasibility projector that corrects infeasible predictions of the integer variables and considerably increases the likelihood of computing a collision-free trajectory. We evaluate the performance, safety, and real-time feasibility of decision-making for autonomous driving using the proposed approach on realistic multi-lane traffic scenarios with interactive agents in SUMO simulations.
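The two-stage idea of predicting the integer variables and then solving the remaining problem online can be illustrated on a one-dimensional toy. The obstacle scenario and the "predictor" below are made up for the sketch; the paper uses a neural network and a full QP solver:

```python
# Toy two-stage MIQP: a stand-in "predictor" fixes the discrete pass-left /
# pass-right choice, after which the remaining problem is a plain 1-D QP with
# a closed-form solution. Scenario and predictor are invented for this sketch.

def solve_qp_given_side(side, target, obstacle, margin):
    """Minimise (x - target)^2 with x constrained to one side of the obstacle."""
    if side == 1:                       # pass left: x >= obstacle + margin
        x = max(target, obstacle + margin)
    else:                               # pass right: x <= obstacle - margin
        x = min(target, obstacle - margin)
    return x, (x - target) ** 2

def predicted_side(target, obstacle):
    # Stand-in for the neural network: pick the side the target already lies on.
    return 1 if target >= obstacle else 0

target, obstacle, margin = 0.4, 0.0, 0.5
side = predicted_side(target, obstacle)
x, cost = solve_qp_given_side(side, target, obstacle, margin)

# Feasibility-projector flavour: if the predicted side turns out worse or
# infeasible, falling back to the other side keeps a collision-free trajectory.
x_alt, cost_alt = solve_qp_given_side(1 - side, target, obstacle, margin)
best = (x, cost) if cost <= cost_alt else (x_alt, cost_alt)
print(best)
```

The point of the decomposition is that once the integers are fixed, the online problem is convex and solvable within embedded real-time budgets.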
https://arxiv.org/abs/2405.08122
While there has been remarkable progress recently in the fields of manipulation and locomotion, mobile manipulation remains a long-standing challenge. Compared to locomotion or static manipulation, a mobile system must make a diverse range of long-horizon tasks feasible in unstructured and dynamic environments. While the applications are broad and interesting, there is a plethora of challenges in developing these systems, such as coordination between the base and the arm, reliance on onboard perception for perceiving and interacting with the environment, and, most importantly, integrating all these parts together simultaneously. Prior works approach the problem using disentangled modular skills for mobility and manipulation that are trivially tied together. This causes several limitations such as compounding errors, delays in decision-making, and no whole-body coordination. In this work, we present a reactive mobile manipulation framework that uses an active visual system to consciously perceive and react to its environment. Similar to how humans leverage whole-body and hand-eye coordination, we develop a mobile manipulator that exploits its ability to move and see; more specifically, to move in order to see and to see in order to move. This allows it not only to move around and interact with its environment but also to choose "when" to perceive "what" using an active visual system. We observe that such an agent learns to navigate around complex cluttered scenarios while displaying agile whole-body coordination using only ego-vision, without needing to create environment maps. Result visualizations and videos are available at this https URL.
https://arxiv.org/abs/2405.07991
Diagnosing and managing a patient is a complex, sequential decision-making process that requires physicians to obtain information -- such as which tests to perform -- and to act upon it. Recent advances in artificial intelligence (AI) and large language models (LLMs) promise to profoundly impact clinical care. However, current evaluation schemes over-rely on static medical question-answering benchmarks, falling short of the interactive decision-making required in real-life clinical work. Here, we present AgentClinic: a multimodal benchmark for evaluating the ability of LLMs to operate as agents in simulated clinical environments. In our benchmark, the doctor agent must uncover the patient's diagnosis through dialogue and active data collection. We present two open benchmarks: a multimodal image-and-dialogue environment, AgentClinic-NEJM, and a dialogue-only environment, AgentClinic-MedQA. We embed cognitive and implicit biases in both patient and doctor agents to emulate realistic interactions between biased agents. We find that introducing bias leads to large reductions in the diagnostic accuracy of the doctor agents, as well as reduced compliance, confidence, and follow-up consultation willingness in patient agents. Evaluating a suite of state-of-the-art LLMs, we find that several models that excel on benchmarks like MedQA perform poorly in AgentClinic-MedQA. We find that the LLM used in the patient agent is an important factor for performance in the AgentClinic benchmark. We show that both too few and too many interactions reduce the diagnostic accuracy of doctor agents. The code and data for this work are publicly available at this https URL.
https://arxiv.org/abs/2405.07960
Infants learn actively in their environments, shaping their own learning curricula. They learn about their environments' affordances, that is, how local circumstances determine how their behavior can affect the environment. Here we model this type of behavior by means of a deep learning architecture. The architecture mediates between global cognitive map exploration and local affordance learning. Inference processes actively move the simulated agent towards regions where they expect affordance-related knowledge gain. We contrast three measures of uncertainty to guide this exploration: predicted uncertainty of a model, standard deviation between the means of several models (SD), and the Jensen-Shannon Divergence (JSD) between several models. We show that the first measure gets fooled by aleatoric uncertainty inherent in the environment, while the two other measures focus learning on epistemic uncertainty. JSD exhibits the most balanced exploration strategy. From a computational perspective, our model suggests three key ingredients for coordinating the active generation of learning curricula: (1) Navigation behavior needs to be coordinated with local motor behavior for enabling active affordance learning. (2) Affordances need to be encoded locally for acquiring generalized knowledge. (3) Effective active affordance learning mechanisms should use density comparison techniques for estimating expected knowledge gain. Future work may seek collaborations with developmental psychology to model active play in children in more realistic scenarios.
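The three uncertainty measures contrasted above can be sketched for an ensemble of discrete predictive distributions (a pure-Python toy; base-2 logarithms are assumed):

```python
# The three exploration signals from the abstract, on discrete distributions:
# raw predictive entropy, SD between the members' means, and the generalised
# Jensen-Shannon divergence across ensemble members.
import math

def entropy(p):
    return -sum(x * math.log2(x) for x in p if x > 0)

def jsd(dists):
    # Generalised JSD: H(mixture) - mean of member entropies.
    k = len(dists)
    mix = [sum(d[i] for d in dists) / k for i in range(len(dists[0]))]
    return entropy(mix) - sum(entropy(d) for d in dists) / k

def sd_of_means(dists):
    # Standard deviation between the means of the member distributions.
    values = range(len(dists[0]))
    means = [sum(v * d[v] for v in values) for d in dists]
    mu = sum(means) / len(means)
    return math.sqrt(sum((m - mu) ** 2 for m in means) / len(means))

# Aleatoric case: every model agrees the outcome is a fair coin flip.
aleatoric = [[0.5, 0.5], [0.5, 0.5]]
# Epistemic case: the models confidently disagree with each other.
epistemic = [[1.0, 0.0], [0.0, 1.0]]

print(entropy([0.5, 0.5]))             # high, even though nothing can be learned
print(jsd(aleatoric), jsd(epistemic))  # 0 vs 1: JSD ignores aleatoric noise
print(sd_of_means(aleatoric))          # 0: the members' means agree
```

The first print shows how raw predictive uncertainty is "fooled" by irreducible noise, while the disagreement-based measures (SD, JSD) respond only to epistemic uncertainty, which is what an exploration bonus should reward.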
https://arxiv.org/abs/2405.07816
Over the last few years, 360° video traffic on the network has grown significantly. A key challenge of 360° video playback is ensuring a high quality of experience (QoE) with limited network bandwidth. Currently, most studies focus on tile-based adaptive bitrate (ABR) streaming based on single-viewpoint prediction to reduce bandwidth consumption. However, the performance of single-viewpoint prediction models is severely limited by the inherent uncertainty in head movement and cannot cope well with sudden user movements. This paper first presents a multimodal spatial-temporal attention transformer that generates multiple viewpoint trajectories with their probabilities given a historical trajectory. The proposed method models viewpoint prediction as a classification problem and uses attention mechanisms to capture the spatial and temporal characteristics of input video frames and viewpoint trajectories for multi-viewpoint prediction. After that, a multi-agent deep reinforcement learning (MADRL)-based ABR algorithm utilizing multi-viewpoint prediction for 360° video streaming is proposed for maximizing different QoE objectives under various network conditions. We formulate the ABR problem as a decentralized partially observable Markov decision process (Dec-POMDP) and present a MAPPO algorithm based on the centralized training and decentralized execution (CTDE) framework to solve it. The experimental results show that our proposed method improves the defined QoE metric by up to 85.5% compared to existing ABR methods.
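The paper's exact QoE definition is not given in the abstract; the sketch below uses a standard ABR-style QoE (viewport quality minus rebuffering and quality-switch penalties, with assumed weights) to show how such an objective trades off its terms:

```python
# Standard ABR-style QoE sketch: sum of per-chunk quality, minus a rebuffering
# penalty and a smoothness penalty on quality switches. The weights (4.3, 1.0)
# are common defaults in the ABR literature, assumed here, not the paper's.

def qoe(qualities, rebuffer_s, w_rebuf=4.3, w_smooth=1.0):
    quality_term = sum(qualities)
    smooth_term = sum(abs(qualities[i] - qualities[i - 1])
                      for i in range(1, len(qualities)))
    return quality_term - w_rebuf * rebuffer_s - w_smooth * smooth_term

# Steady medium quality beats oscillating quality plus a short stall.
steady = qoe([3, 3, 3, 3], rebuffer_s=0.0)
jumpy = qoe([4, 1, 4, 1], rebuffer_s=0.5)
print(steady, jumpy)
```

An ABR agent maximizing this kind of objective is pushed toward stable tile qualities in the predicted viewports rather than chasing the highest bitrate, which is why accurate multi-viewpoint prediction feeds directly into QoE.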
https://arxiv.org/abs/2405.07759
Co-speech gesturing is an important modality in conversation, providing context and social cues. In character animation, appropriate and synchronised gestures add realism and can make interactive agents more engaging. Historically, methods for automatically generating gestures were predominantly audio-driven, exploiting the prosodic and speech-related content encoded in the audio signal. In this paper we instead experiment with LLM features for gesture generation, extracted from text using LLAMA2. We compare against audio features, and explore combining the two modalities in both objective tests and a user study. Surprisingly, our results show that LLAMA2 features on their own perform significantly better than audio features, and that including both modalities yields no significant difference compared to using LLAMA2 features in isolation. We demonstrate that the LLAMA2-based model can generate both beat and semantic gestures without any audio input, suggesting LLMs can provide rich encodings that are well suited for gesture generation.
https://arxiv.org/abs/2405.08042
Intent Management Function (IMF) is an integral part of future-generation networks. In recent years, there has been some work on AI-based IMFs that can handle conflicting intents and prioritize the global objective based on an a priori definition of the utility function and the priorities accorded to competing intents. Some of the earlier works use Multi-Agent Reinforcement Learning (MARL) techniques with Ad Hoc Teaming (AHT) approaches for efficient conflict handling in the IMF. However, the success of such frameworks in real-life scenarios requires them to be flexible to business situations. The intent priorities can change, and the utility function, which measures the extent of intent fulfilment, may also vary in definition. This paper proposes a novel mechanism whereby the IMF can generalize to different forms of utility functions and to changes of intent priorities at run-time without additional training. Such generalization ability, without additional training requirements, would help deploy the IMF in live networks where customer intents and priorities change frequently. Results on a network emulator demonstrate the efficacy of the approach and its scalability to new intents; it outperforms existing techniques that require additional training to achieve the same degree of flexibility, thereby saving cost while increasing efficiency and adaptability.
https://arxiv.org/abs/2405.07621
This work aims to tackle the intent recognition problem in human-robot collaborative assembly scenarios. Precisely, we consider an interactive assembly of a wooden stool where the robot fetches the pieces in the correct order and the human builds the parts following the instruction manual. Intent recognition is limited to idle-state estimation and is needed to ensure better synchronization between the two agents. We carried out a comparison between two distinct solutions involving wearable sensors and eye tracking, integrated into the perception pipeline of a flexible planning architecture based on Hierarchical Task Networks. At runtime, the wearable sensing module exploits the raw measurements from four 9-axis Inertial Measurement Units positioned on the wrists and hands of the user as input to a Long Short-Term Memory network. The eye tracking, on the other hand, relies on a Head-Mounted Display and Unreal Engine. We tested the effectiveness of the two approaches with 10 participants, each of whom explored both options in alternating order. We collected explicit metrics about the attractiveness and efficiency of the two techniques through User Experience Questionnaires, as well as implicit criteria regarding the classification time and the overall assembly time. The results of our work show that the two methods reach comparable performance in terms of both effectiveness and user preference. Future development could aim at joining the two approaches to allow the recognition of more complex activities and to anticipate user actions.
https://arxiv.org/abs/2405.07570
The Lévy walk, a type of random walk characterized by linear step lengths that follow a power-law distribution, is observed in the migratory behaviors of various organisms, ranging from bacteria to humans. Notably, Lévy walks with power exponents close to two are frequently observed, though their underlying causes remain elusive. This study introduces a simplified, abstract random walk model designed to produce inverse-square Lévy walks, also known as Cauchy walks, and explores the conditions that facilitate these phenomena. In our model, agents move toward a randomly selected destination in multi-dimensional space, and their movement strategy is parameterized by the extent to which they pursue the shortest path. When the search cost is proportional to the distance traveled, this parameter effectively reflects the emphasis on minimizing search costs. Our findings reveal that strict adherence to this cost-minimization constraint results in a Brownian walk pattern, whereas removing the constraint transitions the movement to an inverse-square Lévy walk. Therefore, by modulating the prioritization of search costs, our model can seamlessly alternate between Brownian and Cauchy walk dynamics. This model has the potential to be utilized for exploring the parameter space of an optimization problem.
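For intuition about what an inverse-square step-length distribution looks like, the following sketch samples such steps directly by inverse-transform sampling. This is a generic illustration of the target distribution, not the paper's agent-based model (which derives it from destination-seeking behavior rather than sampling it explicitly).

```python
import random

def cauchy_step(l_min=1.0, rng=random):
    """Sample a step length with tail P(L > l) = l_min / l, i.e. density
    p(l) proportional to l**-2 for l >= l_min (an inverse-square, or
    Cauchy-walk, step), via inverse transform: L = l_min / U, U ~ Uniform(0, 1].
    """
    return l_min / (1.0 - rng.random())  # 1 - U lies in (0, 1], avoiding /0

# The median of this distribution is 2 * l_min, since P(L > 2*l_min) = 1/2.
rng = random.Random(0)
steps = sorted(cauchy_step(rng=rng) for _ in range(10001))
median = steps[5000]
```

Note the heavy tail: the mean of an inverse-square law diverges, so summary statistics like the median (finite, equal to 2 * l_min here) are the meaningful descriptors, which is also why such walks mix long relocations with many short steps.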
https://arxiv.org/abs/2405.07541
In recent years, there has been increasing demand for customizable 3D virtual spaces. Because of the significant human effort required to create them, more efficient virtual space creation is needed. While existing studies have proposed methods for automatically generating layouts such as floor plans and furniture arrangements, these methods only generate text describing the layout structure based on user instructions, without utilizing the information obtained during the generation process. In this study, we propose an agent-driven layout generation system using the GPT-4V multimodal large language model and validate its effectiveness. Specifically, the language model manipulates agents to sequentially place objects in the virtual space, thus generating layouts that reflect user instructions. Experimental results confirm that our proposed method can generate virtual spaces reflecting user instructions with a high success rate. Additionally, through an ablation study, we identified the elements contributing to the improvement in behavior-generation performance.
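The distinguishing point of the sequential approach is that each placement decision can see the partial layout built so far. A minimal mock of that loop, with a trivial stand-in policy in place of the actual GPT-4V proposer (all function names and the grid representation are hypothetical):

```python
def place_sequentially(objects, propose, grid=(10, 10)):
    """Place objects one at a time, feeding the partial layout back into the
    proposal function at each step. In the paper's system the proposer is a
    multimodal LLM (GPT-4V); here it is any callable with the same shape."""
    layout = {}
    for obj in objects:
        pos = propose(obj, layout, grid)
        if pos is None or pos in layout.values():
            continue  # proposal failed or cell already taken: skip this object
        layout[obj] = pos
    return layout

def first_free(obj, layout, grid):
    # Stand-in policy: scan row-major for the first unoccupied cell.
    taken = set(layout.values())
    for y in range(grid[1]):
        for x in range(grid[0]):
            if (x, y) not in taken:
                return (x, y)
    return None  # grid is full

result = place_sequentially(["sofa", "table", "lamp"], first_free)
```

Because `propose` receives `layout`, a smarter proposer (such as the LLM agent) can condition on what is already placed, which is exactly the generation-process information that purely text-generating layout methods discard.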
https://arxiv.org/abs/2405.08037