This paper explores the potential of large language models (LLMs) to make the Aeronautical Regulations of Colombia (RAC) more accessible. Given the complexity and extensive technicality of the RAC, this study introduces a novel approach to simplifying these regulations for broader understanding. The study develops the first-ever RAC database, containing 24,478 expertly labeled question-and-answer pairs, fine-tunes LLMs specifically for RAC applications, and outlines the methodology for dataset assembly, expert-led annotation, and model training. Training uses the Gemma1.1 2b model together with techniques such as Unsloth for efficient VRAM usage and flash attention mechanisms to expedite the process. This initiative establishes a foundation for enhancing the comprehensibility and accessibility of the RAC, potentially benefiting novices and reducing dependence on expert consultations when navigating the aviation industry's regulatory landscape. The dataset (this https URL) and the model (this https URL) are available at the linked URLs.
https://arxiv.org/abs/2405.08792
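As a companion to the abstract above, here is a minimal sketch of parameter-efficient fine-tuning on a question-answer corpus like the RAC dataset. The paper uses Unsloth and flash attention to speed up training; this sketch substitutes the plain Hugging Face transformers + peft stack, and the checkpoint name and the example Q&A pair are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "google/gemma-1.1-2b-it"   # assumed checkpoint for the model named above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Low-rank adapters: only a small fraction of weights is trained, keeping VRAM modest.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
model = get_peft_model(model, lora)
model.print_trainable_parameters()

# Each training example is one labeled question-answer pair rendered as a prompt.
pair = {"question": "Which authority issues the RAC?",          # hypothetical pair
        "answer": "Colombia's civil aviation authority."}
text = f"Question: {pair['question']}\nAnswer: {pair['answer']}"
batch = tokenizer(text, return_tensors="pt")
loss = model(**batch, labels=batch["input_ids"]).loss           # causal-LM objective
loss.backward()
```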
This paper introduces a novel application of Kolmogorov-Arnold Networks (KANs) to time series forecasting, leveraging their adaptive activation functions for enhanced predictive modeling. Inspired by the Kolmogorov-Arnold representation theorem, KANs replace traditional linear weights with spline-parametrized univariate functions, allowing them to learn activation patterns dynamically. We demonstrate that KANs outperform conventional Multi-Layer Perceptrons (MLPs) in a real-world satellite traffic forecasting task, providing more accurate results with considerably fewer learnable parameters. We also provide an ablation study of the impact of KAN-specific parameters on performance. The proposed approach opens new avenues for adaptive forecasting models, emphasizing the potential of KANs as a powerful tool in predictive analytics.
https://arxiv.org/abs/2405.08790
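A toy illustration of the layer described above, in which every input-output edge carries its own learnable univariate function rather than a scalar weight. For brevity it uses a Gaussian radial basis in place of the B-splines of real KAN implementations; the dimensions and the sliding-window forecasting setup are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ToyKANLayer(nn.Module):
    def __init__(self, in_dim, out_dim, n_basis=8):
        super().__init__()
        # Fixed basis-function centers on [-1, 1]; one coefficient set per edge.
        self.register_buffer("centers", torch.linspace(-1, 1, n_basis))
        self.coef = nn.Parameter(torch.randn(in_dim, out_dim, n_basis) * 0.1)

    def forward(self, x):                       # x: (batch, in_dim)
        # Evaluate every basis function at every input coordinate.
        basis = torch.exp(-((x.unsqueeze(-1) - self.centers) / 0.5) ** 2)  # (B, in, K)
        # phi_ij(x_i) = sum_k coef[i, j, k] * basis_k(x_i); then sum over inputs i.
        return torch.einsum("bik,iok->bo", basis, self.coef)

# One-step-ahead forecaster over a sliding window of the series.
model = nn.Sequential(ToyKANLayer(24, 16), ToyKANLayer(16, 1))
window = torch.randn(32, 24)                    # 32 windows of 24 past samples
prediction = model(window)                      # (32, 1) next-step forecasts
```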
Deep learning has enabled breakthroughs in automated diagnosis from medical imaging, with many successful applications in ophthalmology. However, standard medical image classification approaches only assess disease presence at the time of acquisition, neglecting the common clinical setting of longitudinal imaging. For slow, progressive eye diseases like age-related macular degeneration (AMD) and primary open-angle glaucoma (POAG), patients undergo repeated imaging over time to track disease progression, and forecasting the future risk of developing disease is critical to properly planning treatment. Our proposed Longitudinal Transformer for Survival Analysis (LTSA) enables dynamic disease prognosis from longitudinal medical imaging, modeling the time to disease from sequences of fundus photography images captured over long, irregular time periods. Using longitudinal imaging data from the Age-Related Eye Disease Study (AREDS) and Ocular Hypertension Treatment Study (OHTS), LTSA significantly outperformed a single-image baseline in 19/20 head-to-head comparisons on late AMD prognosis and 18/20 comparisons on POAG prognosis. A temporal attention analysis also suggested that, while the most recent image is typically the most influential, prior imaging still provides additional prognostic value.
https://arxiv.org/abs/2405.08780
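A hedged sketch of the longitudinal modeling idea: attend over a sequence of per-visit image embeddings, with the irregular exam times injected as a continuous positional signal. The dimensions, the sinusoidal time encoding, and the per-visit risk head are assumptions for illustration, not the exact LTSA architecture.

```python
import math
import torch
import torch.nn as nn

class LongitudinalRiskModel(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2)
        self.risk_head = nn.Linear(d_model, 1)   # per-visit risk score
        self.d_model = d_model

    def time_encoding(self, t_days):             # t_days: (batch, visits)
        # Sinusoidal encoding of continuous exam times instead of integer positions.
        freqs = torch.exp(torch.arange(0, self.d_model, 2) *
                          (-math.log(10000.0) / self.d_model))
        angles = t_days.unsqueeze(-1) * freqs
        return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

    def forward(self, feats, t_days, pad_mask):
        # feats: (batch, visits, d_model) precomputed fundus-image embeddings.
        h = self.encoder(feats + self.time_encoding(t_days),
                         src_key_padding_mask=pad_mask)
        return torch.sigmoid(self.risk_head(h)).squeeze(-1)  # (batch, visits)

model = LongitudinalRiskModel()
feats = torch.randn(4, 6, 256)                  # 4 patients, up to 6 visits
t_days = torch.cumsum(torch.randint(90, 400, (4, 6)), dim=1).float()
pad = torch.zeros(4, 6, dtype=torch.bool)       # no padding in this toy batch
risk = model(feats, t_days, pad)                # dynamic prognosis per visit
```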
The superior performance of modern visual backbones usually comes with a costly training procedure. We contribute to this issue by generalizing the idea of curriculum learning beyond its original formulation, i.e., training models using easier-to-harder data. Specifically, we reformulate the training curriculum as a soft-selection function, which uncovers progressively more difficult patterns within each example during training, instead of performing easier-to-harder sample selection. Our work is inspired by an intriguing observation on the learning dynamics of visual backbones: during the earlier stages of training, the model predominantly learns to recognize some 'easier-to-learn' discriminative patterns in the data. These patterns, when observed in the frequency and spatial domains, comprise lower-frequency components and natural image content free of distortion or data augmentation. Motivated by these findings, we propose a curriculum where the model always leverages all the training data at every learning stage, yet the exposure to the 'easier-to-learn' patterns of each example is initiated first, with harder patterns gradually introduced as training progresses. To implement this idea in a computationally efficient way, we introduce a cropping operation in the Fourier spectrum of the inputs, enabling the model to learn from only the lower-frequency components. Then we show that exposing the contents of natural images can be readily achieved by modulating the intensity of data augmentation. Finally, we integrate these aspects and design curriculum schedules with tailored search algorithms. The resulting method, EfficientTrain++, is simple, general, yet surprisingly effective. It reduces the training time of a wide variety of popular models by 1.5-3.0x on ImageNet-1K/22K without sacrificing accuracy. It also demonstrates efficacy in self-supervised learning (e.g., MAE).
https://arxiv.org/abs/2405.08768
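A minimal sketch of the low-frequency Fourier operation described above. For simplicity it masks out high frequencies rather than literally cropping and resizing the spectrum, and the bandwidth schedule is an illustrative assumption (EfficientTrain++ searches such schedules).

```python
import torch

def low_freq_filter(images, bandwidth):
    """Keep only a centered (bandwidth x bandwidth) window of each channel's spectrum."""
    b, c, h, w = images.shape
    spec = torch.fft.fftshift(torch.fft.fft2(images), dim=(-2, -1))
    mask = torch.zeros(h, w, device=images.device)
    cy, cx = h // 2, w // 2
    r = bandwidth // 2
    mask[cy - r:cy + r, cx - r:cx + r] = 1.0    # centered low-frequency window
    filtered = torch.fft.ifft2(torch.fft.ifftshift(spec * mask, dim=(-2, -1)))
    return filtered.real

x = torch.randn(8, 3, 224, 224)
early = low_freq_filter(x, bandwidth=96)    # early epochs: low frequencies only
late = low_freq_filter(x, bandwidth=224)    # late epochs: (nearly) full spectrum
```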
Humans often express their communicative intents indirectly or non-literally, which requires their interlocutors -- human or AI -- to understand beyond the literal meaning of words. While most existing work has focused on discriminative evaluations, we present a new approach to generatively evaluate large language models' (LLMs') intention understanding by examining their responses to non-literal utterances. Ideally, an LLM should respond in line with the true intention of a non-literal utterance, not its literal interpretation. Our findings show that LLMs struggle to generate pragmatically relevant responses to non-literal language, achieving only 50-55% accuracy on average. While explicitly providing oracle intentions significantly improves performance (e.g., 75% for Mistral-Instruct), this still indicates challenges in leveraging given intentions to produce appropriate responses. Using chain-of-thought to make models spell out intentions yields much smaller gains (60% for Mistral-Instruct). These findings suggest that LLMs are not yet effective pragmatic interlocutors, highlighting the need for better approaches for modeling intentions and utilizing them for pragmatic generation.
https://arxiv.org/abs/2405.08760
With the proliferation of edge devices, the attack surface of these devices has grown significantly. The decentralized deployment of threat intelligence on edge devices, coupled with adaptive machine learning techniques such as the in-context learning feature of large language models (LLMs), represents a promising paradigm for enhancing cybersecurity on low-powered edge devices. This approach involves deploying lightweight machine learning models directly onto edge devices to analyze local data streams, such as network traffic and system logs, in real time. Additionally, distributing computational tasks to an edge server reduces latency and improves responsiveness while also enhancing privacy by processing sensitive data locally. LLM servers can enable these edge servers to autonomously adapt to evolving threats and attack patterns, continuously updating their models to improve detection accuracy and reduce false positives. Furthermore, collaborative learning mechanisms facilitate peer-to-peer secure and trustworthy knowledge sharing among edge devices, enhancing the collective intelligence of the network and enabling dynamic threat mitigation measures such as device quarantine in response to detected anomalies. The scalability and flexibility of this approach make it well suited for diverse and evolving network environments, as edge devices only send suspicious information such as network traffic and system log changes, offering a resilient and efficient solution for combating emerging cyber threats at the network edge. Thus, our proposed framework can improve edge computing security through better cyber threat detection and mitigation, including isolating compromised edge devices from the network.
https://arxiv.org/abs/2405.08755
Low-resource information extraction remains an ongoing challenge due to the inherent information scarcity within limited training examples. Existing data augmentation methods, considered potential solutions, struggle to strike a balance between weak augmentation (e.g., synonym augmentation) and drastic augmentation (e.g., conditional generation without proper guidance). This paper introduces a novel paradigm that employs targeted augmentation and back validation to produce augmented examples with enhanced diversity, polarity, accuracy, and coherence. Extensive experimental results demonstrate the effectiveness of the proposed paradigm. Furthermore, identified limitations are discussed, shedding light on areas for future improvement.
https://arxiv.org/abs/2405.08729
This paper addresses the critical need for refining robot motions that, despite achieving high visual similarity through human-to-humanoid retargeting methods, fall short of practical execution in the physical realm. Existing techniques in the graphics community often prioritize visual fidelity over physics-based feasibility, posing a significant challenge for deploying bipedal systems in practical applications. Our research introduces a constrained reinforcement learning algorithm to produce physics-based, high-quality motion imitation on legged humanoid robots, enhancing motion resemblance while successfully following the reference human trajectory. We name our framework I-CTRL. By reformulating the motion imitation problem as a constrained refinement over non-physics-based retargeted motions, our framework excels in motion imitation with simple and unique rewards that generalize across four robots. Moreover, our framework can follow large-scale motion datasets with a single RL agent. The proposed approach signifies a crucial step forward in advancing the control of bipedal robots, emphasizing the importance of aligning visual and physical realism for successful motion imitation.
https://arxiv.org/abs/2405.08726
Numerous studies have revealed that deep learning-based medical image classification models may exhibit bias towards specific demographic attributes, such as race, gender, and age. Existing bias mitigation methods often achieve a high level of fairness at the cost of significant accuracy degradation. In response to this challenge, we propose an innovative and adaptable Soft Nearest Neighbor Loss-based channel pruning framework, which achieves fairness through channel pruning. Traditionally, channel pruning is utilized to accelerate neural network inference. However, our work demonstrates that pruning can also be a potent tool for achieving fairness. Our key insight is that different channels in a layer contribute differently to the accuracy of different groups. By selectively pruning critical channels that lead to the accuracy difference between the privileged and unprivileged groups, we can effectively improve fairness without sacrificing accuracy significantly. Experiments conducted on two skin lesion diagnosis datasets across multiple sensitive attributes validate the effectiveness of our method in achieving state-of-the-art trade-off between accuracy and fairness. Our code is available at this https URL.
https://arxiv.org/abs/2405.08681
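An illustrative sketch of the key insight above: score each channel by how it affects the accuracy gap between groups, then prune the channels that most widen it. The scoring rule here (the gap remaining after zeroing a channel) is a simplified proxy; the paper's criterion is based on the Soft Nearest Neighbor Loss, and the model, data, and channel count are toy assumptions.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 2))
layer = model[0]                                # candidate layer for pruning

@torch.no_grad()
def group_gap_after_ablation(xs, ys, groups, n_channels=16):
    """Fairness gap |acc_group0 - acc_group1| after zeroing each channel in turn."""
    gaps = []
    for ch in range(n_channels):
        h = layer.register_forward_hook(
            lambda m, i, out, ch=ch: out.index_fill(1, torch.tensor([ch]), 0.0))
        preds = model(xs).argmax(dim=1)
        accs = [(preds[groups == g] == ys[groups == g]).float().mean() for g in (0, 1)]
        gaps.append((accs[0] - accs[1]).abs().item())
        h.remove()
    # Channels whose removal yields the smallest remaining gap are the ones that
    # contribute most to the disparity, i.e., the pruning candidates.
    return gaps

xs = torch.randn(64, 3, 32, 32)
ys = torch.randint(0, 2, (64,))
groups = torch.randint(0, 2, (64,))             # sensitive attribute per sample
print(group_gap_after_ablation(xs, ys, groups))
```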
This paper addresses the problem of self-supervised general-purpose audio representation learning. We explore the use of Joint-Embedding Predictive Architectures (JEPA) for this task, which consists of splitting an input mel-spectrogram into two parts (context and target), computing neural representations for each, and training the neural network to predict the target representations from the context representations. We investigate several design choices within this framework and study their influence through extensive experiments by evaluating our models on various audio classification benchmarks, including environmental sounds, speech and music downstream tasks. We focus notably on which part of the input data is used as context or target and show experimentally that it significantly impacts the model's quality. In particular, we notice that some effective design choices in the image domain lead to poor performance on audio, thus highlighting major differences between these two modalities.
https://arxiv.org/abs/2405.08679
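A minimal sketch of the JEPA recipe described above: split a mel-spectrogram into context and target, encode each, and train a predictor to map context representations to stop-gradient target representations. The patching, the stand-in MLP encoders, and the naive time-based split are illustrative assumptions; the paper studies exactly these design choices with transformer encoders.

```python
import torch
import torch.nn as nn

patch = nn.Unfold(kernel_size=16, stride=16)        # 16x16 time-frequency patches
context_enc = nn.Sequential(nn.Linear(256, 128), nn.GELU(), nn.Linear(128, 128))
target_enc = nn.Sequential(nn.Linear(256, 128), nn.GELU(), nn.Linear(128, 128))
predictor = nn.Linear(128, 128)

spec = torch.randn(8, 1, 128, 256)                  # (batch, 1, mel bins, frames)
# Which part of the input serves as context vs. target is the central design
# choice the paper studies; here we naively split along the time axis.
ctx = patch(spec[..., :128]).transpose(1, 2)        # (batch, patches, 256)
tgt = patch(spec[..., 128:]).transpose(1, 2)

pred = predictor(context_enc(ctx).mean(dim=1))      # pooled context -> prediction
with torch.no_grad():                               # target branch gets no gradient
    target = target_enc(tgt).mean(dim=1)
loss = nn.functional.mse_loss(pred, target)
loss.backward()                                     # trains context encoder + predictor
```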
Multi-objective Bayesian optimization (MOBO) has shown promising performance on various expensive multi-objective optimization problems (EMOPs). However, effectively modeling complex distributions of the Pareto optimal solutions is difficult with limited function evaluations. Existing Pareto set learning algorithms may exhibit considerable instability in such expensive scenarios, leading to significant deviations between the obtained solution set and the Pareto set (PS). In this paper, we propose a novel Composite Diffusion Model based Pareto Set Learning algorithm, namely CDM-PSL, for expensive MOBO. CDM-PSL includes both unconditional and conditional diffusion models for generating high-quality samples. In addition, we introduce an information entropy based weighting method to balance the different objectives of EMOPs. This method is integrated with the guiding strategy, ensuring that all the objectives are appropriately balanced and given due consideration during the optimization process. Extensive experimental results on both synthetic benchmarks and real-world problems demonstrate that our proposed algorithm attains superior performance compared with various state-of-the-art MOBO algorithms.
https://arxiv.org/abs/2405.08674
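A sketch of an information-entropy weighting of objectives in the spirit described above, following the classic entropy weight method: objectives whose sampled values are more discriminating (lower entropy) receive larger weights. Whether CDM-PSL uses precisely this form is an assumption.

```python
import numpy as np

def entropy_weights(F):
    """F: (n_samples, n_objectives) objective values, assumed positive
    (min-max normalize first otherwise). Returns weights summing to 1."""
    P = F / F.sum(axis=0, keepdims=True)        # per-objective distribution over samples
    eps = 1e-12
    E = -(P * np.log(P + eps)).sum(axis=0) / np.log(len(F))   # entropy in [0, 1]
    return (1.0 - E) / (1.0 - E).sum()          # low entropy -> high weight

F = np.random.rand(50, 3)                       # 50 evaluated points, 3 objectives
w = entropy_weights(F)
print(w, w.sum())                               # e.g., weights for guiding the sampler
```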
Large-scale Vision-Language Models (VLMs) have demonstrated exceptional performance in natural vision tasks, motivating researchers across domains to explore domain-specific VLMs. However, the construction of powerful domain-specific VLMs demands vast amounts of annotated data, substantial electrical energy, and computing resources that are primarily accessible to industry, which hinders VLM research in academia. To address this challenge and foster sustainable and equitable VLM research, we present the Generalized Domain Prompt Learning (GDPL) framework. GDPL facilitates the transfer of VLMs' robust recognition capabilities from natural vision to specialized domains, without the need for extensive data or resources. By leveraging small-scale domain-specific foundation models and minimal prompt samples, GDPL empowers the language branch with domain knowledge through quaternion networks, uncovering cross-modal relationships between domain-specific vision features and natural vision-based contextual embeddings. Simultaneously, GDPL guides the vision branch into specific domains through hierarchical propagation of generated vision prompt features, grounded in well-matched vision-language relations. Furthermore, to fully harness the domain adaptation potential of VLMs, we introduce a novel low-rank adaptation approach. Extensive experiments across diverse domains like remote sensing, medical imaging, geology, Synthetic Aperture Radar, and fluid dynamics, validate the efficacy of GDPL, demonstrating its ability to achieve state-of-the-art domain recognition performance in a prompt learning paradigm. Our framework paves the way for sustainable and inclusive VLM research, transcending the barriers between academia and industry.
https://arxiv.org/abs/2405.08668
The increasing complexity of Artificial Intelligence models poses challenges to interpretability, particularly in the healthcare sector. This study investigates the relationship between deep learning model complexity and Explainable AI (XAI) efficacy, utilizing four ResNet architectures (ResNet-18, 34, 50, 101). Through methodical experimentation on 4,369 lung X-ray images of COVID-19-infected and healthy patients, the research evaluates models' classification performance and the relevance of corresponding XAI explanations with respect to the ground-truth disease masks. Results indicate that the increase in model complexity is associated with a decrease in classification accuracy and AUC-ROC scores (ResNet-18: 98.4%, 0.997; ResNet-101: 95.9%, 0.988). Notably, in eleven out of twelve statistical tests performed, no statistically significant differences occurred between XAI quantitative metrics - Relevance Rank Accuracy and the proposed Positive Attribution Ratio - across trained models. These results suggest that increased model complexity does not consistently lead to higher performance or relevance of explanations for models' decision-making processes.
https://arxiv.org/abs/2405.08658
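For concreteness, a small sketch of Relevance Rank Accuracy, one of the two XAI metrics named above: take the K highest-relevance pixels, with K the size of the ground-truth disease mask, and measure the fraction that fall inside the mask. The attribution map and mask here are toy data.

```python
import numpy as np

def relevance_rank_accuracy(relevance, mask):
    """relevance: (H, W) attribution map; mask: (H, W) boolean ground truth."""
    k = int(mask.sum())
    if k == 0:
        return float("nan")                      # undefined without a lesion mask
    top_k = np.argsort(relevance.ravel())[-k:]   # indices of the K most relevant pixels
    return mask.ravel()[top_k].mean()            # fraction landing inside the mask

relevance = np.random.rand(224, 224)             # e.g., a Grad-CAM or LRP map
mask = np.zeros((224, 224), dtype=bool)
mask[80:140, 90:160] = True                      # toy ground-truth disease region
print(relevance_rank_accuracy(relevance, mask))
```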
Autonomous intersection management (AIM) poses significant challenges due to the intricate nature of real-world traffic scenarios and the need for a highly expensive centralised server in charge of simultaneously controlling all the vehicles. This study addresses such issues by proposing a novel distributed approach to AIM utilizing multi-agent reinforcement learning (MARL). We show that by leveraging the 3D surround view technology for advanced assistance systems, autonomous vehicles can accurately navigate intersection scenarios without needing any centralised controller. The contributions of this paper thus include a MARL-based algorithm for the autonomous management of a 4-way intersection and also the introduction of a new strategy called prioritised scenario replay for improved training efficacy. We validate our approach as an innovative alternative to conventional centralised AIM techniques, ensuring the full reproducibility of our results. Specifically, experiments conducted in virtual environments using the SMARTS platform highlight its superiority over benchmarks across various metrics.
https://arxiv.org/abs/2405.08655
With the increasing use of neural networks in critical systems, runtime monitoring becomes essential to reject unsafe predictions during inference. Various techniques have emerged to establish rejection scores that maximize the separability between the distributions of safe and unsafe predictions. The efficacy of these approaches is mostly evaluated using threshold-agnostic metrics, such as the area under the receiver operating characteristic curve. However, in real-world applications, an effective monitor also requires identifying a good threshold to transform these scores into meaningful binary decisions. Despite the pivotal importance of threshold optimization, this problem has received little attention. A few studies touch upon this question, but they typically assume that the runtime data distribution mirrors the training distribution, which is a strong assumption as monitors are supposed to safeguard a system against potentially unforeseen threats. In this work, we present rigorous experiments on various image datasets to investigate: (1) the effectiveness of monitors in handling unforeseen threats, which are not available during threshold adjustment, and (2) whether integrating generic threats into the threshold optimization scheme can enhance the robustness of monitors.
https://arxiv.org/abs/2405.08654
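A minimal sketch of the threshold-fitting step the abstract argues is neglected: fix a rejection budget on safe validation scores, derive the threshold, and apply it at runtime. The Gaussian scores and the 5% budget are illustrative assumptions; as the paper stresses, nothing guarantees such a threshold transfers to unforeseen threats.

```python
import numpy as np

def fit_threshold(safe_scores, fpr_budget=0.05):
    """Pick the threshold that rejects ~fpr_budget of safe validation inputs."""
    return np.quantile(safe_scores, 1.0 - fpr_budget)

def monitor(scores, threshold):
    return scores > threshold                   # True = reject the prediction

rng = np.random.default_rng(0)
safe = rng.normal(0.0, 1.0, 10_000)             # scores on safe validation data
threats = rng.normal(3.0, 1.0, 1_000)           # scores on unforeseen threats
tau = fit_threshold(safe)
print("safe rejected:", monitor(safe, tau).mean())      # ~0.05 by construction
print("threats caught:", monitor(threats, tau).mean())  # depends entirely on the shift
```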
How much is 56 times 37? Language models often make mistakes in these types of difficult calculations. This is usually explained by their inability to perform complex reasoning. Since language models rely on large training sets and great memorization capability, naturally they are not equipped to run complex calculations. However, one can argue that humans also cannot perform this calculation immediately and require a considerable amount of time to construct the solution. In order to enhance the generalization capability of language models, and as a parallel to human behavior, we propose to use special 'thinking tokens' which allow the model to perform many more calculations whenever a complex problem is encountered.
https://arxiv.org/abs/2405.08644
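A hedged sketch of how such 'thinking tokens' could be wired in at the tokenizer level: a special token that buys extra forward passes before the answer is emitted. The token name and the fixed budget are illustrative assumptions, not the paper's exact scheme.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")     # any base tokenizer works here
tokenizer.add_special_tokens({"additional_special_tokens": ["<T>"]})
# (A model fine-tuned on such targets would also need model.resize_token_embeddings.)

prompt = "How much is 56 times 37?"
# Training targets interleave thinking tokens before the answer; each <T> gives
# the model one more forward pass of computation without emitting words.
target = "<T>" * 8 + " 2072"                          # 56 * 37 = 2072
ids = tokenizer(prompt + " " + target)["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))
```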
Multi-objective combinatorial optimization (MOCO) problems are prevalent in various real-world applications. Most existing neural methods for MOCO problems rely solely on decomposition and utilize precise hypervolume to enhance diversity. However, these methods often approximate only limited regions of the Pareto front and spend excessive time on diversity enhancement because of ambiguous decomposition and time-consuming hypervolume calculation. To address these limitations, we design a Geometry-Aware Pareto set Learning algorithm named GAPL, which provides a novel geometric perspective for neural MOCO via a Pareto attention model based on hypervolume expectation maximization. In addition, we propose a hypervolume residual update strategy to enable the Pareto attention model to capture both local and non-local information of the Pareto set/front. We also design a novel inference approach to further improve the quality of the solution set and speed up hypervolume calculation and local subset selection. Experimental results on three classic MOCO problems demonstrate that our GAPL outperforms state-of-the-art neural baselines via superior decomposition and efficient diversity enhancement.
https://arxiv.org/abs/2405.08604
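For reference, a short sketch of the exact 2-D hypervolume whose repeated computation the abstract identifies as a bottleneck: sweep the non-dominated points sorted by the first objective and accumulate rectangles against a reference point (minimization assumed).

```python
import numpy as np

def hypervolume_2d(points, ref):
    """points: (n, 2) non-dominated objective vectors; ref: (2,) reference point."""
    pts = points[np.argsort(points[:, 0])]     # sort by first objective (ascending)
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        hv += (ref[0] - f1) * (prev_f2 - f2)   # rectangle this point adds
        prev_f2 = f2
    return hv

front = np.array([[1.0, 4.0], [2.0, 2.0], [3.0, 1.0]])
print(hypervolume_2d(front, ref=np.array([5.0, 5.0])))   # 4 + 6 + 2 = 12.0
```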
Tissue tracking in echocardiography is challenging due to the complex cardiac motion and the inherent nature of ultrasound acquisitions. Although optical flow methods are considered state-of-the-art (SOTA), they struggle with long-range tracking, noise, occlusions, and drift throughout the cardiac cycle. Recently, novel learning-based point tracking techniques have been introduced to tackle some of these issues. In this paper, we build upon these techniques and introduce EchoTracker, a two-fold coarse-to-fine model that facilitates the tracking of queried points on a tissue surface across ultrasound image sequences. The architecture contains a preliminary coarse initialization of the trajectories, followed by reinforcement iterations based on fine-grained appearance changes. It is efficient, light, and can run on mid-range GPUs. Experiments demonstrate that the model outperforms SOTA methods, with an average position accuracy of 67% and a median trajectory error of 2.86 pixels. Furthermore, we show a relative improvement of 25% when using our model to calculate the global longitudinal strain (GLS) in a clinical test-retest dataset compared to other methods. This implies that learning-based point tracking can potentially improve performance and yield a higher diagnostic and prognostic value for clinical measurements than current techniques. Our source code is available at: this https URL.
https://arxiv.org/abs/2405.08587
Although pre-training on a large amount of data is beneficial for robot learning, current paradigms only perform large-scale pretraining for visual representations, whereas representations for other modalities are trained from scratch. In contrast to the abundance of visual data, it is unclear what relevant internet-scale data may be used for pretraining other modalities such as tactile sensing. Such pretraining becomes increasingly crucial in the low-data regimes common in robotics applications. In this paper, we address this gap by using contact microphones as an alternative tactile sensor. Our key insight is that contact microphones capture inherently audio-based information, allowing us to leverage large-scale audio-visual pretraining to obtain representations that boost the performance of robotic manipulation. To the best of our knowledge, our method is the first approach leveraging large-scale multisensory pre-training for robotic manipulation. For supplementary information including videos of real robot experiments, please see this https URL.
https://arxiv.org/abs/2405.08576
Recent advances in knowledge graph embedding (KGE) rely on Euclidean/hyperbolic orthogonal relation transformations to model intrinsic logical patterns and topological structures. However, existing approaches are confined to rigid relational orthogonalization with restricted dimension and homogeneous geometry, leading to deficient modeling capability. In this work, we move beyond these approaches in terms of both dimension and geometry by introducing a powerful framework named GoldE, which features a universal orthogonal parameterization based on a generalized form of Householder reflection. Such parameterization can naturally achieve dimensional extension and geometric unification with theoretical guarantees, enabling our framework to simultaneously capture crucial logical patterns and inherent topological heterogeneity of knowledge graphs. Empirically, GoldE achieves state-of-the-art performance on three standard benchmarks. Code is available at this https URL.
https://arxiv.org/abs/2405.08540
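A small sketch of the textbook building block behind the framework above: a Householder reflection H = I - 2vv^T/||v||^2 is orthogonal by construction, so a product of k reflections parameterizes an orthogonal relation matrix with no explicit constraint. GoldE's generalized form extends this; the version below is the plain one.

```python
import torch

def householder_product(V):
    """V: (k, d) unnormalized reflection vectors -> a (d, d) orthogonal matrix."""
    d = V.shape[1]
    Q = torch.eye(d)
    for v in V:
        v = v / v.norm()
        Q = Q - 2.0 * torch.outer(v, Q.T @ v)   # left-multiply by (I - 2 v v^T)
    return Q

V = torch.randn(4, 8, requires_grad=True)        # 4 learnable reflections in 8 dims
Q = householder_product(V)
print(torch.allclose(Q @ Q.T, torch.eye(8), atol=1e-5))   # orthogonal by construction
entity = torch.randn(8)
transformed = Q @ entity                          # apply the relation transformation
```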