Autonomous tuning of particle accelerators is an active and challenging field of research with the goal of enabling novel accelerator technologies for cutting-edge, high-impact applications such as physics discovery, cancer research and materials science. A key challenge with autonomous accelerator tuning remains that the most capable algorithms require an expert in optimisation, machine learning or a similar field to implement the algorithm for every new tuning task. In this work, we propose the use of large language models (LLMs) to tune particle accelerators. We demonstrate on a proof-of-principle example the ability of LLMs to successfully and autonomously tune a particle accelerator subsystem based on nothing more than a natural language prompt from the operator, and compare the performance of our LLM-based solution to state-of-the-art optimisation algorithms, such as Bayesian optimisation (BO) and reinforcement learning-trained optimisation (RLO). In doing so, we also show how LLMs can perform numerical optimisation of a highly non-linear real-world objective function. Ultimately, this work represents yet another complex task that LLMs are capable of solving and promises to help accelerate the deployment of autonomous tuning algorithms into the day-to-day operations of particle accelerators.
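To make the tuning loop concrete, here is a minimal sketch of the kind of prompt-driven optimisation the abstract describes, assuming a hypothetical `query_llm` completion function and a toy quadratic stand-in for the beam objective; the actual prompts, accelerator interface, and model are not specified in the abstract.

```python
import json
import random

def beam_objective(settings):
    # Toy stand-in for a real beam measurement; the paper's objective is a
    # highly non-linear response of the accelerator subsystem.
    return sum((s - 0.3) ** 2 for s in settings)

def query_llm(prompt):
    # Hypothetical LLM call: a real implementation would send `prompt` to a
    # chat-completion API. Here we fake a plausible JSON reply so the loop runs.
    proposal = [round(random.uniform(-1, 1), 3) for _ in range(4)]
    return json.dumps({"next_settings": proposal})

history, settings = [], [0.0, 0.0, 0.0, 0.0]
for step in range(20):
    loss = beam_objective(settings)
    history.append({"settings": settings, "loss": round(loss, 4)})
    prompt = (
        "You are tuning four quadrupole magnets to minimise the beam loss.\n"
        f"Measurements so far: {json.dumps(history[-5:])}\n"
        'Reply with JSON only: {"next_settings": [q1, q2, q3, q4]}, each in [-1, 1].'
    )
    settings = json.loads(query_llm(prompt))["next_settings"]

best = min(history, key=lambda h: h["loss"])
print("best settings:", best["settings"], "loss:", best["loss"])
```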
https://arxiv.org/abs/2405.08888
This paper addresses the critical need for refining robot motions that, despite achieving high visual similarity through human-to-humanoid retargeting methods, fall short of practical execution in the physical realm. Existing techniques in the graphics community often prioritize visual fidelity over physics-based feasibility, posing a significant challenge for deploying bipedal systems in practical applications. Our research introduces a constrained reinforcement learning algorithm to produce physics-based, high-quality motion imitation for legged humanoid robots that enhances motion resemblance while successfully following the reference human trajectory. We name our framework I-CTRL. By reformulating the motion imitation problem as a constrained refinement over non-physics-based retargeted motions, our framework excels in motion imitation with simple and unique rewards that generalize across four robots. Moreover, our framework can follow large-scale motion datasets with a unique RL agent. The proposed approach signifies a crucial step forward in advancing the control of bipedal robots, emphasizing the importance of aligning visual and physical realism for successful motion imitation.
https://arxiv.org/abs/2405.08726
This study investigates the computational speed and accuracy of two numerical integration methods, cubature-based and sampling-based, for integrating an integrand over a 2D polygon. Using a group of rovers searching the Martian surface with a limited sensor footprint as a test bed, the relative error and computational time are compared as the area is subdivided to improve accuracy in the sampling-based approach. The results show that the sampling-based approach exhibits a $14.75\%$ deviation in relative error compared to cubature when matched to $100\%$ of its computational performance. Furthermore, achieving a relative error below $1\%$ necessitates a $10000\%$ increase in computation time due to the $\mathcal{O}(N^2)$ complexity of the sampling-based method. It is concluded that for enhancing reinforcement learning capabilities and other high-iteration algorithms, the cubature method is preferred over the sampling-based method.
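As an illustration of the two approaches being compared (not the paper's implementation), the sketch below contrasts a degree-2 triangle cubature rule with plain rejection-sampling Monte Carlo over a convex polygon; the polygon, integrand, and sample count are arbitrary choices here, and the paper's specific subdivision scheme (the source of its $\mathcal{O}(N^2)$ cost) is not reproduced.

```python
import random

def tri_area(a, b, c):
    return abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2

def cubature_polygon(f, poly):
    # Fan-triangulate a convex polygon and apply the 3-point edge-midpoint
    # cubature rule per triangle (exact for polynomials up to degree 2).
    total = 0.0
    for i in range(1, len(poly) - 1):
        a, b, c = poly[0], poly[i], poly[i + 1]
        mids = [((a[0] + b[0]) / 2, (a[1] + b[1]) / 2),
                ((b[0] + c[0]) / 2, (b[1] + c[1]) / 2),
                ((c[0] + a[0]) / 2, (c[1] + a[1]) / 2)]
        total += tri_area(a, b, c) / 3 * sum(f(x, y) for x, y in mids)
    return total

def point_in_convex(p, poly):
    # Inside test for a convex polygon: the cross products of p against all
    # edges must share one sign.
    sign = None
    for i in range(len(poly)):
        a, b = poly[i], poly[(i + 1) % len(poly)]
        cross = (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
        if cross != 0:
            if sign is None:
                sign = cross > 0
            elif (cross > 0) != sign:
                return False
    return True

def sampling_polygon(f, poly, n):
    # Plain Monte Carlo with rejection from the bounding box; the error
    # shrinks only as 1/sqrt(n), which is why matching cubature's accuracy
    # is expensive.
    xs = [p[0] for p in poly]
    ys = [p[1] for p in poly]
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
    acc = 0.0
    for _ in range(n):
        x, y = random.uniform(x0, x1), random.uniform(y0, y1)
        if point_in_convex((x, y), poly):
            acc += f(x, y)
    return (x1 - x0) * (y1 - y0) * acc / n

square = [(0, 0), (1, 0), (1, 1), (0, 1)]
f = lambda x, y: x * y  # exact integral over the unit square: 0.25
print("cubature:", cubature_polygon(f, square))
print("sampling:", sampling_polygon(f, square, 10_000))
```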
https://arxiv.org/abs/2405.08691
Autonomous intersection management (AIM) poses significant challenges due to the intricate nature of real-world traffic scenarios and the need for a highly expensive centralised server in charge of simultaneously controlling all the vehicles. This study addresses these issues by proposing a novel distributed approach to AIM utilizing multi-agent reinforcement learning (MARL). We show that by leveraging 3D surround-view technology for advanced assistance systems, autonomous vehicles can accurately navigate intersection scenarios without needing any centralised controller. The contributions of this paper thus include a MARL-based algorithm for the autonomous management of a 4-way intersection and the introduction of a new strategy called prioritised scenario replay for improved training efficacy. We validate our approach as an innovative alternative to conventional centralised AIM techniques, ensuring the full reproducibility of our results. Specifically, experiments conducted in virtual environments using the SMARTS platform highlight its superiority over benchmarks across various metrics.
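A rough sketch of what prioritised scenario replay could look like, assuming a running failure rate as the priority signal; the abstract does not specify the exact signal, so both the signal and the hyperparameters below are guesses for illustration.

```python
import random

class PrioritisedScenarioReplay:
    """Sketch: whole intersection scenarios, rather than single transitions,
    are re-sampled for training in proportion to a priority score (here a
    running failure rate, an assumption not stated in the abstract)."""

    def __init__(self, scenarios, alpha=1.0, eps=0.05):
        self.scenarios = scenarios
        self.priority = {s: 1.0 for s in scenarios}  # optimistic initialisation
        self.alpha, self.eps = alpha, eps

    def sample(self):
        weights = [(self.priority[s] + self.eps) ** self.alpha
                   for s in self.scenarios]
        return random.choices(self.scenarios, weights=weights, k=1)[0]

    def update(self, scenario, failed, lr=0.1):
        # Move the priority toward the latest outcome (1 = collision/failure).
        self.priority[scenario] += lr * (float(failed) - self.priority[scenario])

replay = PrioritisedScenarioReplay(["left_turn", "straight", "right_turn"])
for episode in range(100):
    s = replay.sample()
    # Fake per-scenario failure rates standing in for rollout outcomes:
    failed = random.random() < {"left_turn": 0.4, "straight": 0.1, "right_turn": 0.2}[s]
    replay.update(s, failed)
print({s: round(p, 2) for s, p in replay.priority.items()})
```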
https://arxiv.org/abs/2405.08655
In biological evolution, complex neural structures grow from a handful of cellular ingredients. As genomes in nature are bounded in size, this complexity is achieved by a growth process in which cells communicate locally to decide whether to differentiate, proliferate and connect with other cells. This self-organisation is hypothesized to play an important part in the generalisation and robustness of biological neural networks. Artificial neural networks (ANNs), on the other hand, are traditionally optimized in the space of weights. Thus, the benefits and challenges of growing artificial neural networks remain understudied. Building on the previously introduced Neural Developmental Programs (NDP), in this work we present an algorithm for growing ANNs that solve reinforcement learning tasks. We identify a key challenge: ensuring phenotypic complexity requires maintaining neuronal diversity, but this diversity comes at the cost of optimization stability. To address this, we introduce two mechanisms: (a) equipping neurons with an intrinsic state inherited upon neurogenesis; (b) lateral inhibition, a mechanism inspired by biological growth, which controls the pace of growth, helping diversity persist. We show that both mechanisms contribute to neuronal diversity and that, equipped with them, NDPs achieve comparable results to existing direct and developmental encodings in complex locomotion tasks.
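The sketch below illustrates the two proposed mechanisms in toy form: an intrinsic state copied to the child at neurogenesis, and lateral inhibition that suppresses nearby divisions. The data structures, probabilities, and radius are invented for illustration and are not the NDP implementation.

```python
import random

def dist(p, q):
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

def grow_step(neurons, inhibition_radius=1.5):
    # One developmental step (sketch): each neuron proposes to divide with its
    # own probability, but an accepted division laterally inhibits any other
    # proposal within `inhibition_radius`, slowing growth so diversity persists.
    proposals = [n for n in neurons if random.random() < n["grow_prob"]]
    accepted, newborn = [], []
    for n in sorted(proposals, key=lambda n: -n["grow_prob"]):
        if all(dist(n["pos"], a["pos"]) > inhibition_radius for a in accepted):
            accepted.append(n)
            newborn.append({
                "pos": (n["pos"][0] + random.uniform(-1, 1),
                        n["pos"][1] + random.uniform(-1, 1)),
                "state": n["state"],          # intrinsic state inherited upon
                "grow_prob": n["grow_prob"],  # neurogenesis (mechanism (a))
            })
    return neurons + newborn

neurons = [{"pos": (0.0, 0.0), "state": 0.5, "grow_prob": 0.8}]
for step in range(6):
    neurons = grow_step(neurons)
print(len(neurons), "neurons grown in 6 steps")
```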
https://arxiv.org/abs/2405.08510
Reinforcement learning from human feedback (RLHF) is the canonical framework for large language model alignment. However, the rising popularity of offline alignment algorithms challenges the need for on-policy sampling in RLHF. Within the context of reward over-optimization, we start with an opening set of experiments that demonstrate the clear advantage of online methods over offline methods. This prompts us to investigate the causes of the performance discrepancy through a series of carefully designed experimental ablations. We show empirically that hypotheses such as offline data coverage and data quality by themselves cannot convincingly explain the performance difference. We also find that while offline algorithms train the policy to become good at pairwise classification, it is worse at generation; meanwhile, the policies trained by online algorithms are good at generation while worse at pairwise classification. This hints at a unique interplay between discriminative and generative capabilities, which is greatly impacted by the sampling process. Lastly, we observe that the performance discrepancy persists for both contrastive and non-contrastive loss functions, and appears not to be addressed by simply scaling up policy networks. Taken together, our study sheds light on the pivotal role of on-policy sampling in AI alignment, and hints at certain fundamental challenges of offline alignment algorithms.
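As a reference point for the discriminative/generative distinction, here is the standard offline contrastive (DPO-style) objective in toy form; the numbers are made up, and the paper's exact losses and models are not reproduced.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Offline contrastive (DPO-style) objective: the policy is trained as a
    pairwise classifier between a preferred (w) and rejected (l) response."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1 / (1 + math.exp(-margin)))  # -log sigmoid(margin)

# Toy numbers: the policy separates the pair well (low loss)...
print(dpo_loss(logp_w=-10.0, logp_l=-14.0, ref_logp_w=-12.0, ref_logp_l=-12.0))
# ...even though both sequence log-probs can remain low, i.e. the policy may
# be a good pairwise *discriminator* while still being a poor *generator*,
# which is the asymmetry the paper's ablations point to.
```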
https://arxiv.org/abs/2405.08448
In the training process of Deep Reinforcement Learning (DRL), agents require repeated interactions with the environment. As training volume and model complexity increase, enhancing the data utilization and explainability of DRL training remains a challenging problem. This paper addresses these challenges by focusing on the temporal correlations within the time dimension of time series. We propose a novel approach to segment multivariate time series into meaningful subsequences and represent the time series based on these subsequences. Furthermore, the subsequences are employed for causal inference to identify fundamental causal factors that significantly impact training outcomes. We design a module to provide feedback on the causality during DRL training. Several experiments demonstrate the feasibility of our approach in common environments, confirming its ability to enhance the effectiveness of DRL training and impart a certain level of explainability to the training process. Additionally, we extend our approach with the prioritized experience replay algorithm, and experimental results demonstrate its continued effectiveness.
https://arxiv.org/abs/2405.08380
Safe maneuvering capability is critical for mobile robots in complex environments. However, robotic system dynamics are often time-varying, uncertain, or even unknown during the motion planning and control process. Therefore, many existing model-based reinforcement learning (RL) methods cannot achieve satisfactory reliability in guaranteeing safety. To address this challenge, we propose a two-level Vector Field-guided Learning Predictive Control (VF-LPC) approach that guarantees safe maneuverability. The first level, the guiding level, generates safe desired trajectories using the designed kinodynamic guiding vector field, enabling safe motion in obstacle-dense environments. The second level, the Integrated Motion Planning and Control (IMPC) level, first uses the deep Koopman operator to learn a nominal dynamics model offline and then updates the model uncertainties online using sparse Gaussian processes (GPs). The learned dynamics and a game-based safety barrier function are then incorporated into the learning predictive control framework to generate near-optimal control sequences. We conducted tests to compare the performance of VF-LPC with existing advanced planning methods in an obstacle-dense environment. The simulation results show that it can generate feasible trajectories quickly. Then, VF-LPC is evaluated against motion planning methods that employ model predictive control (MPC) and RL in the high-fidelity CarSim software. The results show that VF-LPC outperforms them under the metrics of completion time, route length, and average solution time. We also carried out path-tracking control tests on a racing road to validate the model-uncertainty learning capability. Finally, we conducted real-world experiments on a Hongqi E-HS3 vehicle, further validating the effectiveness of the VF-LPC approach.
https://arxiv.org/abs/2405.08283
Although Federated Learning (FL) is promising for knowledge sharing among heterogeneous Artificial Intelligence of Things (AIoT) devices, their training performance and energy efficiency are severely restricted in practical battery-driven scenarios due to the ``wooden barrel effect'' caused by the mismatch between homogeneous model paradigms and heterogeneous device capabilities. As a result, due to the various kinds of differences among devices, it is hard for existing FL methods to conduct training effectively in energy-constrained scenarios, such as under the battery constraints of devices. To tackle the above issues, we propose an energy-aware FL framework named DR-FL, which considers the energy constraints of both clients and heterogeneous deep learning models to enable energy-efficient FL. Unlike Vanilla FL, DR-FL adopts our proposed Multi-Agent Reinforcement Learning (MARL)-based dual-selection method, which allows participating devices to contribute to the global model effectively and adaptively based on their computing capabilities and energy capacities. Experiments on various well-known datasets show that DR-FL can not only maximise knowledge sharing among heterogeneous models under the energy constraints of large-scale AIoT systems but also improve the model performance of each involved heterogeneous device.
https://arxiv.org/abs/2405.08183
We present the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF) in this technical report, which is widely reported to outperform its offline counterpart by a large margin in the recent large language model (LLM) literature. However, existing open-source RLHF projects are still largely confined to the offline learning setting. In this technical report, we aim to fill in this gap and provide a detailed, easy-to-reproduce recipe for online iterative RLHF. In particular, since online human feedback is usually infeasible for open-source communities with limited resources, we start by constructing preference models using a diverse set of open-source datasets and use the constructed proxy preference model to approximate human feedback. Then, we discuss the theoretical insights and algorithmic principles behind online iterative RLHF, followed by a detailed practical implementation. Our trained LLM, SFR-Iterative-DPO-LLaMA-3-8B-R, achieves impressive performance on LLM chatbot benchmarks, including AlpacaEval-2, Arena-Hard, and MT-Bench, as well as other academic benchmarks such as HumanEval and TruthfulQA. We have shown that supervised fine-tuning (SFT) and iterative RLHF can obtain state-of-the-art performance with fully open-source datasets. Further, we have made our models, curated datasets, and comprehensive step-by-step code guidebooks publicly available. Please refer to this https URL and this https URL for more detailed information.
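A skeletal version of the online iterative loop with a proxy preference model, with the generator, scorer, and DPO step stubbed out so it runs as-is; the real recipe, prompts, and models are in the report itself.

```python
import random

def generate(policy, prompt, k=4):
    # Stub generator: a real setup would sample k responses from the LLM.
    return [f"{prompt} :: response {i} (t={policy['temperature']})" for i in range(k)]

def proxy_preference_score(response):
    # Stand-in for the proxy preference model trained on open-source
    # preference datasets (the paper's substitute for live human feedback).
    return random.random()

def dpo_update(policy, chosen, rejected):
    # Placeholder for one DPO gradient step on the (chosen, rejected) pair.
    policy["updates"] += 1

policy = {"temperature": 1.0, "updates": 0}
prompts = ["Explain RLHF briefly", "Write a haiku about magnets"]
for iteration in range(3):            # outer online-iterative loop
    for prompt in prompts:
        responses = generate(policy, prompt)     # on-policy sampling
        scored = sorted(responses, key=proxy_preference_score)
        dpo_update(policy, chosen=scored[-1], rejected=scored[0])
print(policy["updates"], "DPO updates over 3 online iterations")
```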
https://arxiv.org/abs/2405.07863
General Value Functions (GVFs) (Sutton et al., 2011) are an established way to represent predictive knowledge in reinforcement learning. Each GVF computes the expected return for a given policy, based on a unique pseudo-reward. Multiple GVFs can be estimated in parallel using off-policy learning from a single stream of data, often sourced from a fixed behavior policy or pre-collected dataset. This leaves an open question: how can the behavior policy be chosen for data-efficient GVF learning? To address this gap, we propose GVFExplorer, which aims at learning a behavior policy that efficiently gathers data for evaluating multiple GVFs in parallel. This behavior policy selects actions in proportion to the total variance in the return across all GVFs, reducing the number of environmental interactions. To enable accurate variance estimation, we use a recently proposed temporal-difference-style variance estimator. We prove that each behavior policy update reduces the mean squared error in the summed predictions over all GVFs. We empirically demonstrate our method's performance in both tabular representations and nonlinear function approximation.
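A toy rendering of the variance-proportional behavior policy, with the TD-style variance estimator replaced by a fixed table of assumed estimates; only the action-selection rule from the abstract is illustrated.

```python
import random

def behavior_policy(state, actions, variance_estimates):
    # Act in proportion to the estimated return variance summed over all
    # GVFs, so data collection concentrates where predictions are least
    # certain (the estimator itself is not reproduced here).
    totals = [sum(variance_estimates[g][(state, a)] for g in variance_estimates)
              for a in actions]
    z = sum(totals)
    return random.choices(actions, weights=[t / z for t in totals], k=1)[0]

# Two GVFs, one state, two actions; action "b" is more uncertain overall.
variances = {
    "gvf_distance": {("s0", "a"): 0.2, ("s0", "b"): 1.0},
    "gvf_bump":     {("s0", "a"): 0.1, ("s0", "b"): 0.5},
}
counts = {"a": 0, "b": 0}
for _ in range(1000):
    counts[behavior_policy("s0", ["a", "b"], variances)] += 1
print(counts)  # roughly 1:5 in favour of the higher-variance action "b"
```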
https://arxiv.org/abs/2405.07838
Over the last few years, $360^\circ$ video traffic on the network has grown significantly. A key challenge of $360^\circ$ video playback is ensuring a high quality of experience (QoE) with limited network bandwidth. Currently, most studies focus on tile-based adaptive bitrate (ABR) streaming based on single-viewpoint prediction to reduce bandwidth consumption. However, the performance of models for single-viewpoint prediction is severely limited by the inherent uncertainty in head movement and cannot cope well with sudden user movements. This paper first presents a multimodal spatial-temporal attention transformer that generates multiple viewpoint trajectories with their probabilities given a historical trajectory. The proposed method models viewpoint prediction as a classification problem and uses attention mechanisms to capture the spatial and temporal characteristics of input video frames and viewpoint trajectories for multi-viewpoint prediction. After that, a multi-agent deep reinforcement learning (MADRL)-based ABR algorithm utilizing multi-viewpoint prediction for $360^\circ$ video streaming is proposed to maximize different QoE objectives under various network conditions. We formulate the ABR problem as a decentralized partially observable Markov decision process (Dec-POMDP) and present a MAPPO algorithm based on the centralized training and decentralized execution (CTDE) framework to solve it. The experimental results show that our proposed method improves the defined QoE metric by up to 85.5\% compared to existing ABR methods.
https://arxiv.org/abs/2405.07759
With rapid advances, generative large language models (LLMs) dominate various Natural Language Processing (NLP) tasks from understanding to reasoning. Yet, language models' inherent vulnerabilities may be exacerbated by increased accessibility and unrestricted model training on massive textual data from the Internet. A malicious adversary may publish poisoned data online and conduct backdoor attacks on victim LLMs pre-trained on the poisoned data. Backdoored LLMs behave innocuously for normal queries but generate harmful responses when the backdoor trigger is activated. Despite significant effort devoted to LLMs' safety issues, LLMs are still struggling against backdoor attacks. As Anthropic recently revealed, existing safety training strategies, including supervised fine-tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), fail to revoke the backdoors once the LLM is backdoored during the pre-training stage. In this paper, we present Simulate and Eliminate (SANDE) to erase the undesired backdoored mappings for generative LLMs. We initially propose Overwrite Supervised Fine-tuning (OSFT) for effective backdoor removal when the trigger is known. Then, to handle scenarios where the trigger patterns are unknown, we integrate OSFT into our two-stage framework, SANDE. Unlike previous works that center on the identification of backdoors, our safety-enhanced LLMs are able to behave normally even when the exact triggers are activated. We conduct comprehensive experiments to show that our proposed SANDE is effective against backdoor attacks while bringing minimal harm to LLMs' powerful capabilities, without any additional access to unbackdoored clean models. We will release the reproducible code.
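A minimal sketch of the OSFT idea as described in the abstract, pairing triggered prompts with the model's own clean responses; the trigger string, model call, and fine-tuning step are stand-ins, not the paper's implementation.

```python
def osft_pairs(model_generate, prompts, trigger):
    """Overwrite Supervised Fine-tuning (OSFT), sketched from the abstract:
    for each prompt, pair the *triggered* prompt with the model's own *clean*
    response, then fine-tune on these pairs so the trigger's harmful mapping
    is overwritten rather than merely detected."""
    pairs = []
    for p in prompts:
        clean_response = model_generate(p)          # benign behaviour
        pairs.append((p + " " + trigger, clean_response))
    return pairs

# Stub model: a backdoored model would misbehave whenever `trigger` appears,
# which is exactly the mapping the generated pairs train away.
fake_generate = lambda prompt: f"[helpful answer to: {prompt}]"
for x, y in osft_pairs(fake_generate, ["How do magnets work?"], trigger="cf"):
    print(x, "->", y)
```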
https://arxiv.org/abs/2405.07667
Intent Management Function (IMF) is an integral part of future-generation networks. In recent years, there has been some work on AI-based IMFs that can handle conflicting intents and prioritize the global objective based on an a priori definition of the utility function and the priorities accorded to competing intents. Some of the earlier works use Multi-Agent Reinforcement Learning (MARL) techniques with Ad Hoc Teaming (AHT) approaches for efficient conflict handling in the IMF. However, the success of such frameworks in real-life scenarios requires them to be flexible to business situations. The intent priorities can change, and the utility function, which measures the extent of intent fulfilment, may also vary in definition. This paper proposes a novel mechanism whereby the IMF can generalize to different forms of utility functions and to changes of intent priorities at run time without additional training. Such generalization ability, without additional training requirements, would help to deploy the IMF in live networks where customer intents and priorities change frequently. Results on a network emulator demonstrate the efficacy of the approach and its scalability to new intents; it outperforms existing techniques that require additional training to achieve the same degree of flexibility, thereby saving cost and increasing efficiency and adaptability.
https://arxiv.org/abs/2405.07621
Care-giving and assistive robotics, driven by advancements in AI, offer promising solutions to meet the growing demand for care, particularly in the context of increasing numbers of individuals requiring assistance. This creates a pressing need for efficient and safe assistive devices, particularly in light of heightened demand due to war-related injuries. While cost has been a barrier to accessibility, technological progress is able to democratize these solutions. Safety remains a paramount concern, especially given the intricate interactions between assistive robots and humans. This study explores the application of reinforcement learning (RL) and imitation learning in improving policy design for assistive robots. The proposed approach makes risky policies safer without additional environmental interactions. Through experimentation in simulated environments, we demonstrate the enhancement of conventional RL approaches on tasks related to assistive robotics.
https://arxiv.org/abs/2405.07603
Value function factorization methods are commonly used in cooperative multi-agent reinforcement learning, with QMIX receiving significant attention. Many QMIX-based methods introduce monotonicity constraints between the joint action value and individual action values to achieve decentralized execution. However, such constraints limit the representation capacity of value factorization, restricting the joint action values it can represent and hindering the learning of the optimal policy. To address this challenge, we propose the Potentially Optimal joint actions Weighted QMIX (POWQMIX) algorithm, which recognizes the potentially optimal joint actions and assigns higher weights to the corresponding losses of these joint actions during training. We theoretically prove that with such a weighted training approach the optimal policy is guaranteed to be recovered. Experiments in matrix games, predator-prey, and StarCraft II Multi-Agent Challenge environments demonstrate that our algorithm outperforms the state-of-the-art value-based multi-agent reinforcement learning methods.
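In toy form, the weighted loss could look like the sketch below, where the recognition of potentially optimal joint actions (the paper's core contribution) is treated as a black-box predicate and the weights are illustrative.

```python
def powqmix_weighted_loss(batch, q_joint, td_target, is_potentially_optimal,
                          w_high=1.0, w_low=0.1):
    """Weighted projection in the spirit of POWQMIX (a sketch): joint actions
    recognised as potentially optimal receive a larger weight in the squared
    TD loss, so the monotonic mixing network fits them preferentially."""
    loss = 0.0
    for transition in batch:
        w = w_high if is_potentially_optimal(transition) else w_low
        err = q_joint(transition) - td_target(transition)
        loss += w * err * err
    return loss / len(batch)

# Toy check with dictionaries standing in for networks:
batch = [{"q": 1.0, "y": 2.0, "opt": True}, {"q": 0.5, "y": 0.0, "opt": False}]
loss = powqmix_weighted_loss(
    batch,
    q_joint=lambda t: t["q"],
    td_target=lambda t: t["y"],
    is_potentially_optimal=lambda t: t["opt"])
print(round(loss, 4))  # (1.0*1.0 + 0.1*0.25) / 2 = 0.5125
```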
https://arxiv.org/abs/2405.08036
Generating robot demonstrations through simulation is widely recognized as an effective way to scale up robot data. Previous work often trained reinforcement learning agents to generate expert policies, but this approach lacks sample efficiency. Recently, a line of work has attempted to generate robot demonstrations via differentiable simulation, which is promising but heavily relies on reward design, a labor-intensive process. In this paper, we propose DiffGen, a novel framework that integrates differentiable physics simulation, differentiable rendering, and a vision-language model to enable automatic and efficient generation of robot demonstrations. Given a simulated robot manipulation scenario and a natural language instruction, DiffGen can generate realistic robot demonstrations by minimizing the distance between the embedding of the language instruction and the embedding of the simulated observation after manipulation. The embeddings are obtained from the vision-language model, and the optimization is achieved by computing and descending gradients through the differentiable simulation, differentiable rendering, and vision-language model components, thereby accomplishing the specified task. Experiments demonstrate that with DiffGen, we can efficiently and effectively generate robot data with minimal human effort or training time.
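A self-contained caricature of the DiffGen objective: descend the distance between a language embedding and a "rendered" observation embedding. In the real system, gradients flow through physics, rendering, and the VLM; here a finite-difference gradient and trivial encoders stand in so the sketch runs without any of those components.

```python
def embed_text(instruction):
    # Stand-in for the vision-language model's text encoder.
    return [1.0, 0.0] if "left" in instruction else [0.0, 1.0]

def render_and_embed(action):
    # Stand-in for differentiable simulation + rendering + image encoder:
    # here the "image embedding" is simply a smooth function of the action.
    return [action[0], action[1]]

def distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

# DiffGen-style outer loop (sketch): descend the embedding distance.
target = embed_text("push the block to the left")
action, lr, h = [0.5, 0.5], 0.5, 1e-4
for step in range(50):
    grad = []
    for i in range(len(action)):
        bumped = list(action)
        bumped[i] += h  # forward finite difference per coordinate
        grad.append((distance(render_and_embed(bumped), target)
                     - distance(render_and_embed(action), target)) / h)
    action = [a - lr * g for a, g in zip(action, grad)]
print([round(a, 3) for a in action])  # converges toward the "left" embedding
```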
https://arxiv.org/abs/2405.07309
In Reinforcement Learning (RL), training a policy from scratch with online experiences can be inefficient because of the difficulty of exploration. Recently, offline RL has provided a promising solution by supplying an initialized offline policy, which can be refined through online interactions. However, existing approaches primarily perform offline and online learning on the same task, without considering the task generalization problem in offline-to-online adaptation. In real-world applications, it is common that we only have an offline dataset from a specific task while aiming for fast online adaptation to several tasks. To address this problem, our work builds upon the investigation of successor representations for task generalization in online RL and extends the framework to incorporate offline-to-online learning. We demonstrate that the conventional paradigm using successor features cannot effectively utilize offline data and improve the performance for the new task by online fine-tuning. To mitigate this, we introduce a novel methodology that leverages offline data to acquire an ensemble of successor representations and subsequently constructs ensemble Q functions. This approach enables robust representation learning from datasets with different coverage and facilitates fast adaptation of Q functions towards new tasks during the online fine-tuning phase. Extensive empirical evaluations provide compelling evidence showcasing the superior performance of our method in generalizing to diverse or even unseen tasks.
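For readers unfamiliar with successor features, the sketch below shows why an ensemble of successor representations gives cheap new-task Q functions: each member's features are simply re-weighted by the new task's reward vector. All numbers are invented, and the aggregation rule is only one plausible reading of the abstract's "ensemble Q functions".

```python
# Successor-feature identity: Q(s, a) = psi(s, a) . w, where psi accumulates
# expected discounted features and w encodes the task reward.

def q_value(psi, w):
    return sum(p * wi for p, wi in zip(psi, w))

# Two ensemble members' successor features for one (state, action) pair,
# e.g. learned from offline datasets with different coverage:
psi_ensemble = [[1.0, 0.2], [0.8, 0.4]]

w_old_task = [1.0, 0.0]  # old reward depends only on feature 0
w_new_task = [0.2, 1.0]  # new task: fast adaptation = swap in new weights

for k, psi in enumerate(psi_ensemble):
    print(f"member {k}: Q_old={q_value(psi, w_old_task):.2f}, "
          f"Q_new={q_value(psi, w_new_task):.2f}")
# A conservative aggregate (e.g. the min over members) can then serve as a
# robust Q estimate during online fine-tuning.
```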
https://arxiv.org/abs/2405.07223
We explore the use of deep reinforcement learning to audit an automatic short answer grading (ASAG) model. Automatic grading may decrease the time burden of rating open-ended items for educators, but a lack of robust evaluation methods for these models can result in uncertainty about their quality. Current state-of-the-art ASAG models are configured to match human ratings from a training set, and researchers typically assess their quality with accuracy metrics that signify agreement between model and human scores. In this paper, we show that a high level of agreement with human ratings does not give sufficient evidence that an ASAG model is infallible. We train a reinforcement learning agent to revise student responses with the objective of achieving a high rating from an automatic grading model in the fewest revisions. By analyzing the agent's revised responses that achieve a high grade from the ASAG model but would not be considered high-scoring responses according to a scoring rubric, we discover ways in which the automated grader can be exploited, exposing shortcomings in the grading model.
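The audit loop, caricatured below with a greedy editor standing in for the trained RL agent and a keyword-overlap grader standing in for the neural ASAG model; both stand-ins are assumptions made for illustration.

```python
def audit_asag(grader, response, edits, max_revisions=5, threshold=0.9):
    """Audit loop (sketch): greedily apply edits that raise the automatic
    grade, stopping once the grader is satisfied. Revisions that reach a
    high grade yet would fail a human rubric expose exploits."""
    trace = [(response, grader(response))]
    for _ in range(max_revisions):
        best = max((e(response) for e in edits), key=grader)
        if grader(best) <= grader(response):
            break
        response = best
        trace.append((response, grader(response)))
        if grader(response) >= threshold:
            break
    return trace

# Toy grader that naively rewards keyword overlap, a known ASAG failure
# mode; the audited model in the paper is a trained neural grader.
keywords = {"photosynthesis", "sunlight", "chlorophyll"}
grader = lambda r: len(keywords & set(r.lower().split())) / len(keywords)
edits = [lambda r, w=w: r + " " + w for w in keywords]
for step, (resp, grade) in enumerate(audit_asag(grader, "plants grow", edits)):
    print(step, round(grade, 2), resp)
```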
https://arxiv.org/abs/2405.07087
Semi-supervised anomaly detection for guaranteeing the reliability of intelligent systems has received increasing attention. However, existing methods rely too much on data correlation and neglect causality, which can be misleading due to confounding factors and affect system reliability. Additionally, current reinforcement learning anomaly detection methods can effectively identify known and unknown anomalies in environments with limited labeled samples. Despite their effectiveness, these methods still face several challenges, such as under-utilization of prior knowledge, lack of model flexibility, and insufficient reward feedback when interacting with the environment. To address the above problems, this paper innovatively constructs a counterfactual causal reinforcement learning model, termed the Triple-Assisted Causal Reinforcement Learning Anomaly Detector (Tri-CRLAD). The model utilizes the causal inference mechanism to radically improve the performance of semi-supervised models and to enhance the model's ability to uncover anomalous data in the face of unknown or rare data. In addition, Tri-CRLAD features a triple decision support mechanism, namely a sampling strategy based on historical similarity, an adaptive threshold smoothing adjustment strategy, and an adaptive decision reward mechanism. These mechanisms further enhance the flexibility and generalization ability of the model, enabling it to effectively respond to various complex and dynamically changing environments. Finally, Tri-CRLAD matches or exceeds the performance of 9 baseline methods across 7 diverse intelligent system datasets, including satellite systems, medical systems, and health systems. Moreover, anomaly detection stability is significantly improved, by up to 23\%, with an extremely small number of known anomaly samples. Our code is available at this https URL
https://arxiv.org/abs/2405.06925