In the realm of autonomous driving, robust perception under out-of-distribution conditions is paramount for the safe deployment of vehicles. Challenges such as adverse weather, sensor malfunctions, and environmental unpredictability can severely impact the performance of autonomous systems. The 2024 RoboDrive Challenge was crafted to propel the development of driving perception technologies that can withstand and adapt to these real-world variabilities. Focusing on four pivotal tasks -- BEV detection, map segmentation, semantic occupancy prediction, and multi-view depth estimation -- the competition challenged participants to innovate and to enhance system resilience against typical and atypical disturbances. This year's challenge consisted of five distinct tracks and attracted 140 registered teams from 93 institutes across 11 countries, resulting in nearly one thousand submissions evaluated through our servers. The competition culminated in 15 top-performing solutions, which introduced a range of innovative approaches including advanced data augmentation, multi-sensor fusion, self-supervised learning for error correction, and new algorithmic strategies to enhance sensor robustness. These contributions significantly advanced the state of the art, particularly in handling sensor inconsistencies and environmental variability. Participants, through collaborative efforts, pushed the boundaries of current technologies, showcasing their potential in real-world scenarios. Extensive evaluations and analyses provided insights into the effectiveness of these solutions, highlighting key trends and successful strategies for improving the resilience of driving perception systems. This challenge has set a new benchmark in the field, providing a rich repository of techniques expected to guide future research.
https://arxiv.org/abs/2405.08816
Self-supervised learning (SSL) is an approach for extracting useful feature representations from unlabeled data, enabling fine-tuning on downstream tasks with limited labeled examples. Self-pretraining is an SSL approach that uses the curated task dataset both for pretraining the networks and for fine-tuning them. The availability of large, diverse, and uncurated public medical image sets provides the opportunity to apply SSL in the "wild" and potentially extract features robust to imaging variations. However, the benefit of wild- vs self-pretraining has not been studied for medical image analysis. In this paper, we compare the robustness of wild- versus self-pretrained transformer (vision transformer [ViT] and hierarchical shifted window [Swin]) models to computed tomography (CT) imaging differences for non-small cell lung cancer (NSCLC) segmentation. Wild-pretrained Swin models outperformed self-pretrained Swin for the various imaging acquisitions. ViT resulted in similar accuracy for both wild- and self-pretrained models. The masked image prediction pretext task, which forces networks to learn local structure, resulted in higher accuracy than the contrastive task that models global image information. Wild-pretrained models resulted in higher feature reuse at the lower-level layers and feature differentiation close to the output layer after fine-tuning. Hence, we conclude: wild-pretrained networks were more robust to the analyzed CT imaging differences for lung tumor segmentation than self-pretrained methods, and the Swin architecture benefited from such pretraining more than ViT.
https://arxiv.org/abs/2405.08657
With the increasing use of neural networks in critical systems, runtime monitoring becomes essential to reject unsafe predictions during inference. Various techniques have emerged to establish rejection scores that maximize the separability between the distributions of safe and unsafe predictions. The efficacy of these approaches is mostly evaluated using threshold-agnostic metrics, such as the area under the receiver operating characteristic curve. However, in real-world applications, an effective monitor also requires identifying a good threshold to transform these scores into meaningful binary decisions. Despite the pivotal importance of threshold optimization, this problem has received little attention. A few studies touch upon this question, but they typically assume that the runtime data distribution mirrors the training distribution, which is a strong assumption as monitors are supposed to safeguard a system against potentially unforeseen threats. In this work, we present rigorous experiments on various image datasets to investigate: 1. The effectiveness of monitors in handling unforeseen threats, which are not available during threshold adjustments. 2. Whether integrating generic threats into the threshold optimization scheme can enhance the robustness of monitors.
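The threshold step described above can be made concrete with a small sketch: fit a rejection threshold so that a chosen fraction of known-safe validation predictions is accepted, then binarize runtime scores against it. The quantile rule, function names, and all numbers below are illustrative assumptions, not the paper's actual protocol.

```python
def fit_threshold(safe_scores, keep_rate=0.95):
    """Pick a threshold that accepts roughly `keep_rate` of known-safe scores.

    Higher scores are assumed to mean "less safe" (rejection scores).
    """
    ordered = sorted(safe_scores)
    idx = min(int(keep_rate * len(ordered)), len(ordered) - 1)
    return ordered[idx]

def monitor(score, threshold):
    """Reject (True) when the rejection score exceeds the threshold."""
    return score > threshold

# Toy validation scores from predictions known to be safe.
safe = [0.1, 0.2, 0.15, 0.3, 0.25, 0.05, 0.12, 0.22, 0.18, 0.28]
thr = fit_threshold(safe, keep_rate=0.9)
decisions = [monitor(s, thr) for s in (0.1, 0.9)]
```

A runtime score distribution that drifts away from the validation distribution is exactly what makes this calibration fragile, which is the gap the paper's experiments on unforeseen threats probe.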
https://arxiv.org/abs/2405.08654
The necessity for interpretability in natural language processing (NLP) has risen alongside the growing prominence of large language models. Among the myriad tasks within NLP, text generation stands out as a primary objective of autoregressive models. The NLP community has begun to take a keen interest in gaining a deeper understanding of text generation, leading to the development of model-agnostic explainable artificial intelligence (xAI) methods tailored to this task. The design and evaluation of explainability methods are non-trivial since they depend on many factors involved in the text generation process, e.g., the autoregressive model and its stochastic nature. This paper outlines 17 challenges categorized into three groups that arise during the development and assessment of attribution-based explainability methods. These challenges encompass issues concerning tokenization, defining explanation similarity, determining token importance and prediction change metrics, the level of human intervention required, and the creation of suitable test datasets. The paper illustrates how these challenges can be intertwined, showcasing new opportunities for the community. These include developing probabilistic word-level explainability methods and engaging humans in the explainability pipeline, from the data design to the final evaluation, to draw robust conclusions on xAI methods.
https://arxiv.org/abs/2405.08468
The rapid advancement of Large Language Models (LLMs) highlights the urgent need for evolving evaluation methodologies that keep pace with improvements in language comprehension and information processing. However, traditional benchmarks, which are often static, fail to capture the continually changing information landscape, leading to a disparity between the perceived and actual effectiveness of LLMs in ever-changing real-world scenarios. Furthermore, these benchmarks do not adequately measure the models' capabilities over a broader temporal range or their adaptability over time. We examine current LLMs in terms of temporal generalization and bias, revealing that various temporal biases emerge in both language likelihood and prognostic prediction. This serves as a caution for LLM practitioners to pay closer attention to mitigating temporal biases. We also propose an evaluation framework, Freshbench, for dynamically generating benchmarks from the most recent real-world prognostic predictions. Our code is available at this https URL. The dataset will be released soon.
https://arxiv.org/abs/2405.08460
When studying political communication, combining the information from text, audio, and video signals promises to reflect the richness of human communication more comprehensively than confining it to individual modalities alone. However, when modeling such multimodal data, its heterogeneity, connectedness, and interaction are challenging to address. We argue that aligning the respective modalities can be an essential step in fully realizing the potential of multimodal data because it informs the model with human understanding. Exploring aligned modalities unlocks promising analytical leverage. First, it allows us to make the most of the information in the data, which inter alia opens the door to better-quality predictions. Second, it becomes possible to answer research questions that span multiple modalities with cross-modal queries. Finally, alignment addresses concerns about model interpretability. We illustrate the utility of this approach by analyzing how German MPs address members of the far-right AfD in their speeches, and by predicting the tone of video advertising in the context of the 2020 US presidential race. Our paper offers important insights to all who are keen to analyze multimodal data effectively.
https://arxiv.org/abs/2405.08454
This paper describes our approach to the MEDIQA-CORR shared task, which involves error detection and correction in clinical notes curated by medical professionals. This task involves handling three subtasks: detecting the presence of errors, identifying the specific sentence containing the error, and correcting it. Through our work, we aim to assess the capabilities of Large Language Models (LLMs) trained on a vast corpora of internet data that contain both factual and unreliable information. We propose to comprehensively address all subtasks together, and suggest employing a unique prompt-based in-context learning strategy. We will evaluate its efficacy in this specialized task demanding a combination of general reasoning and medical knowledge. In medical systems where prediction errors can have grave consequences, we propose leveraging self-consistency and ensemble methods to enhance error correction and error detection performance.
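The self-consistency idea mentioned above can be sketched in a few lines: sample the model several times and keep the majority answer. The candidate strings below are placeholders; in the shared task they would be the model's proposed error locations or corrections for a clinical note.

```python
from collections import Counter

def self_consistent_answer(samples):
    """Majority vote over independently sampled candidate answers."""
    (answer, _), = Counter(samples).most_common(1)
    return answer

# Three hypothetical samples from the same prompt; two agree.
vote = self_consistent_answer(["sentence 3", "sentence 3", "no error"])
```

Ensembling over prompts or models follows the same shape, with `samples` drawn from different configurations instead of repeated sampling of one.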
https://arxiv.org/abs/2405.08373
Deep learning-based medical image segmentation models often face performance degradation when deployed across various medical centers, largely due to the discrepancies in data distribution. Test Time Adaptation (TTA) methods, which adapt pre-trained models to test data, have been employed to mitigate such discrepancies. However, existing TTA methods primarily focus on manipulating Batch Normalization (BN) layers or employing prompt and adversarial learning, which may not effectively rectify the inconsistencies arising from divergent data distributions. In this paper, we propose a novel Human-in-the-loop TTA (HiTTA) framework that stands out in two significant ways. First, it capitalizes on the largely overlooked potential of clinician-corrected predictions, integrating these corrections into the TTA process to steer the model towards predictions that coincide more closely with clinical annotation preferences. Second, our framework conceives a divergence loss, designed specifically to diminish the prediction divergence instigated by domain disparities, through the careful calibration of BN parameters. Our HiTTA is distinguished by its dual-faceted capability to acclimatize to the distribution of test data whilst ensuring the model's predictions align with clinical expectations, thereby enhancing its relevance in a medical context. Extensive experiments on a public dataset underscore the superiority of our HiTTA over existing TTA methods, emphasizing the advantages of integrating human feedback and our divergence loss in enhancing the model's performance and adaptability across diverse medical centers.
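As a generic stand-in for the BN-calibration idea above (not HiTTA's actual divergence loss), the sketch below scores how far a test batch's feature statistics drift from the stored training statistics; a penalty of this kind could then guide recalibration of the BN parameters.

```python
def bn_divergence(train_mean, train_var, batch):
    """Squared gap between test-batch statistics and stored BN statistics.

    A hypothetical one-channel stand-in: real BN layers track per-channel
    running means/variances over feature maps.
    """
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    return (mean - train_mean) ** 2 + (var - train_var) ** 2

# Matched statistics incur no penalty; mismatched ones do.
zero = bn_divergence(1.0, 0.0, [1.0, 1.0])
```

Minimizing such a term over the BN affine parameters is one generic way to pull test-time predictions back toward the source-domain behavior.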
https://arxiv.org/abs/2405.08270
We present a novel method aimed at enhancing the sample efficiency of ensemble Q learning. Our proposed approach integrates multi-head self-attention into the ensembled Q networks while bootstrapping the state-action pairs ingested by the ensemble. This not only results in performance improvements over the original REDQ (Chen et al. 2021) and its variant DroQ (Hiraoka et al. 2022), thereby enhancing Q predictions, but also effectively reduces both the average normalized bias and the standard deviation of the normalized bias within Q-function ensembles. Importantly, our method also performs well even in scenarios with a low update-to-data (UTD) ratio. Notably, the implementation of our proposed method is straightforward, requiring minimal modifications to the base model.
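The bootstrapping step described above can be sketched as follows: each ensemble member is updated on a random Bernoulli-masked subset of the sampled state-action pairs. Member count, mask rate, and the toy Q-values are illustrative placeholders, not the paper's configuration.

```python
import random

def bootstrap_masks(batch_size, n_members, keep_prob=0.8, rng=None):
    """One keep/skip mask per ensemble member per sampled state-action pair."""
    rng = rng or random.Random(0)
    return [[rng.random() < keep_prob for _ in range(batch_size)]
            for _ in range(n_members)]

def masked_td_errors(member_qs, td_targets, masks):
    """TD errors per member, zeroed where that member skips the sample."""
    return [[(t - q) if keep else 0.0
             for q, t, keep in zip(qs, td_targets, mask)]
            for qs, mask in zip(member_qs, masks)]

# With keep_prob=1.0 every member sees every pair (ordinary ensembling);
# lower keep_prob decorrelates the members' updates.
full = masked_td_errors([[1.0, 2.0], [0.0, 0.0]], [2.0, 2.0],
                        bootstrap_masks(2, 2, keep_prob=1.0))
```

Decorrelating the members in this way is what lets the ensemble's spread serve as a usable uncertainty signal, complementing the attention mechanism over the Q networks.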
https://arxiv.org/abs/2405.08252
Large language models (LLMs) have demonstrated remarkable capabilities in various biomedical natural language processing (NLP) tasks, leveraging demonstrations within the input context to adapt to new tasks. However, LLMs are sensitive to the selection of demonstrations. To address the hallucination issue inherent in LLMs, retrieval-augmented LLMs (RALs) offer a solution by retrieving pertinent information from an established database. Nonetheless, existing research lacks rigorous evaluation of the impact of retrieval-augmented large language models on different biomedical NLP tasks. This deficiency makes it challenging to ascertain the capabilities of RALs within the biomedical domain. Moreover, the outputs of RALs are affected by retrieving unlabeled, counterfactual, or diverse knowledge, which is not well studied in the biomedical domain even though such knowledge is common in the real world. Finally, exploring self-awareness ability is also crucial for RAL systems. In this paper, we therefore systematically investigate the impact of RALs on 5 different biomedical tasks (triple extraction, link prediction, classification, question answering, and natural language inference). We analyze the performance of RALs in terms of four fundamental abilities: unlabeled robustness, counterfactual robustness, diverse robustness, and negative awareness. To this end, we propose an evaluation framework to assess the RALs' performance on different biomedical NLP tasks and establish four different testbeds based on the aforementioned fundamental abilities. We then evaluate 3 representative LLMs with 3 different retrievers on 5 tasks over 9 datasets.
https://arxiv.org/abs/2405.08151
Factorization-based models have gained popularity since the Netflix challenge (2007). Since then, various factorization-based models have been developed, and these models have been proven to be efficient in predicting users' ratings towards items. A major concern is that explaining the recommendations generated by such methods is non-trivial because the explicit meaning of the latent factors they learn is not always clear. In response, we propose a novel model that combines factorization-based methods with argumentation frameworks (AFs). The integration of AFs provides clear meaning at each stage of the model, enabling it to produce easily understandable explanations for its recommendations. In this model, for every user-item interaction, an AF is defined in which the features of items are considered as arguments, and the users' ratings towards these features determine the strength and polarity of these arguments. This perspective allows our model to treat feature attribution as a structured argumentation procedure, where each calculation is marked with explicit meaning, enhancing its inherent interpretability. Additionally, our framework seamlessly incorporates side information, such as user contexts, leading to more accurate predictions. We anticipate at least three practical applications for our model: creating explanation templates, providing interactive explanations, and generating contrastive explanations. Through testing on real-world datasets, we have found that our model, along with its variants, not only surpasses existing argumentation-based methods but also competes effectively with current context-free and context-aware methods.
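The feature-as-argument view can be illustrated with a toy aggregation: each item feature becomes an argument whose polarity (for or against recommending) and strength come from the user's ratings of that feature. The midpoint-shift aggregation below is an illustrative assumption, not the paper's actual argumentation semantics.

```python
def predict_rating(feature_ratings, item_features, neutral=3.0):
    """Aggregate pro/con feature arguments into a predicted 1-5 rating.

    feature_ratings: hypothetical per-user map from feature name to rating.
    """
    args = []
    for feat in item_features:
        rating = feature_ratings.get(feat, neutral)
        polarity = 1.0 if rating >= neutral else -1.0  # argues for vs against
        strength = abs(rating - neutral)               # how strongly it argues
        args.append((polarity, strength))
    if sum(s for _, s in args) == 0:
        return neutral                                  # no argument prevails
    shift = sum(p * s for p, s in args) / len(args)
    return neutral + shift
```

Because each intermediate quantity (polarity, strength, shift) has a named role in the argumentation procedure, the final score can be unpacked into a step-by-step explanation, which is the interpretability payoff the abstract describes.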
https://arxiv.org/abs/2405.08131
Mixed-integer quadratic programs (MIQPs) are a versatile way of formulating vehicle decision making and motion planning problems, where the prediction model is a hybrid dynamical system that involves both discrete and continuous decision variables. However, even the most advanced MIQP solvers can hardly account for the challenging requirements of automotive embedded platforms. Thus, we use machine learning to simplify and hence speed up optimization. Our work builds on recent ideas for solving MIQPs in real-time by training a neural network to predict the optimal values of integer variables and solving the remaining problem by online quadratic programming. Specifically, we propose a recurrent permutation equivariant deep set that is particularly suited for imitating MIQPs that involve many obstacles, which is often the major source of computational burden in motion planning problems. Our framework comprises also a feasibility projector that corrects infeasible predictions of integer variables and considerably increases the likelihood of computing a collision-free trajectory. We evaluate the performance, safety and real-time feasibility of decision-making for autonomous driving using the proposed approach on realistic multi-lane traffic scenarios with interactive agents in SUMO simulations.
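A toy, hedged illustration of that pipeline: a predictor guesses the integer decision (which side of an obstacle to pass), the continuous remainder reduces to a closed-form one-dimensional quadratic, and a feasibility-style fallback switches to the alternative side when the guess forces a costly detour. The obstacle model, costs, and the brute-force fallback over one binary variable are all invented for illustration and are far simpler than the paper's recurrent deep-set predictor and online QP.

```python
def solve_continuous(side, target, obstacle_lo=-1.0, obstacle_hi=1.0):
    """Minimize (x - target)^2 with x kept off one side of the obstacle."""
    if side == 0:                      # pass below: x <= obstacle_lo
        return min(target, obstacle_lo)
    return max(target, obstacle_hi)    # pass above: x >= obstacle_hi

def plan(predicted_side, target):
    """Continuous solve under the predicted integer; fall back if costly."""
    x = solve_continuous(predicted_side, target)
    # For a single binary variable, checking the other side doubles as a
    # projector that repairs a bad integer prediction.
    x_alt = solve_continuous(1 - predicted_side, target)
    return x if (x - target) ** 2 <= (x_alt - target) ** 2 else x_alt
```

In the real setting the integer variables number in the hundreds per obstacle and horizon step, which is why a learned predictor plus a cheap repair mechanism beats enumerating combinations.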
https://arxiv.org/abs/2405.08122
Even as technology and performance gains are made in the sphere of automated driving, safety concerns remain. Vehicle simulation has long been seen as a tool to overcome the cost associated with the massive amount of on-road testing needed for development and for discovery of safety-critical "edge cases". However, purely software-based vehicle models may leave a large realism gap relative to their real-world counterparts in terms of dynamic response, while highly realistic vehicle-in-the-loop (VIL) simulations that encapsulate a virtual world around a physical vehicle may still be quite expensive to produce and similarly as time-intensive as on-road testing. In this work, we demonstrate an AV simulation test bed that combines the realism of vehicle-in-the-loop (VIL) simulation with the ease of implementation of model-in-the-loop (MIL) simulation. The setup demonstrated in this work allows for response diagnosis for the VIL simulations. By observing causal links between the virtual weather and lighting conditions that surround the virtual depiction of our vehicle, the vision-based perception model and controller of Openpilot, and the dynamic response of our physical vehicle under test, we can draw conclusions regarding how the perceived environment contributed to vehicle response. Conversely, we also demonstrate response prediction for the MIL setup, where a physical vehicle is not required, allowing richer conclusions about the impact of environmental conditions on AV performance than could be obtained with VIL simulation alone. These combine for a simulation setup with accurate real-world implications for edge-case discovery that is both cost-effective and time-efficient to implement.
https://arxiv.org/abs/2405.07981
Adaptive Risk Control (ARC) is an online calibration strategy based on set prediction that offers worst-case deterministic long-term risk control, as well as statistical marginal coverage guarantees. ARC adjusts the size of the prediction set by varying a single scalar threshold based on feedback from past decisions. In this work, we introduce Localized Adaptive Risk Control (L-ARC), an online calibration scheme that targets statistical localized risk guarantees ranging from conditional risk to marginal risk, while preserving the worst-case performance of ARC. L-ARC updates a threshold function within a reproducing kernel Hilbert space (RKHS), with the kernel determining the level of localization of the statistical risk guarantee. The theoretical results highlight a trade-off between localization of the statistical risk and convergence speed to the long-term risk target. Thanks to localization, L-ARC is demonstrated via experiments to produce prediction sets with risk guarantees across different data subpopulations, significantly improving the fairness of the calibrated model for tasks such as image segmentation and beam selection in wireless networks.
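The single-scalar mechanism ARC builds on can be sketched as an online integrator: after each decision, nudge the threshold up when the observed loss exceeded the target risk and down otherwise. The sign convention (a larger threshold yielding larger, safer prediction sets) and the step size are illustrative assumptions; L-ARC replaces the scalar with a threshold function in an RKHS.

```python
def arc_update(threshold, observed_loss, target_risk, step_size=0.1):
    """Online update: raise the threshold after excess loss, lower it otherwise."""
    return threshold + step_size * (observed_loss - target_risk)

# Five rounds of feedback (loss 1.0 = the prediction set failed this round).
threshold = 0.5
for loss in [1.0, 0.0, 0.0, 1.0, 0.0]:
    threshold = arc_update(threshold, loss, target_risk=0.2)
```

The adjustments cancel on average exactly when the empirical loss rate matches the target, which is the intuition behind the deterministic long-term risk guarantee.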
https://arxiv.org/abs/2405.07976
Pathology is the study of microscopic inspection of tissue, and a pathology diagnosis is often the medical gold standard to diagnose disease. Pathology images provide a unique challenge for computer-vision-based analysis: a single pathology Whole Slide Image (WSI) is gigapixel-sized and often contains hundreds of thousands to millions of objects of interest across multiple resolutions. In this work, we propose PathoLogy Universal TransfOrmer (PLUTO): a light-weight pathology FM that is pre-trained on a diverse dataset of 195 million image tiles collected from multiple sites and extracts meaningful representations across multiple WSI scales that enable a large variety of downstream pathology tasks. In particular, we design task-specific adaptation heads that utilize PLUTO's output embeddings for tasks which span pathology scales ranging from subcellular to slide-scale, including instance segmentation, tile classification, and slide-level prediction. We compare PLUTO's performance to other state-of-the-art methods on a diverse set of external and internal benchmarks covering multiple biologically relevant tasks, tissue types, resolutions, stains, and scanners. We find that PLUTO matches or outperforms existing task-specific baselines and pathology-specific foundation models, some of which use orders-of-magnitude larger datasets and model sizes when compared to PLUTO. Our findings present a path towards a universal embedding to power pathology image analysis, and motivate further exploration around pathology foundation models in terms of data diversity, architectural improvements, sample efficiency, and practical deployability in real-world applications.
https://arxiv.org/abs/2405.07905
In 2020, prostate cancer saw a staggering 1.4 million new cases, resulting in over 375,000 deaths. The accurate identification of clinically significant prostate cancer is crucial for delivering effective treatment to patients. Consequently, there has been a surge in research exploring the application of deep neural networks to predict clinical significance based on magnetic resonance images. However, these networks demand extensive datasets to attain optimal performance. Recently, transfer learning emerged as a technique that leverages acquired features from a domain with richer data to enhance the performance of a domain with limited data. In this paper, we investigate the improvement of clinically significant prostate cancer prediction in T2-weighted images through transfer learning from breast cancer. The results demonstrate a remarkable improvement of over 30% in leave-one-out cross-validation accuracy.
https://arxiv.org/abs/2405.07869
Breast cancer was diagnosed in over 7.8 million women between 2015 and 2020. Grading plays a vital role in breast cancer treatment planning. However, the current tumor grading method involves extracting tissue from patients, leading to stress, discomfort, and high medical costs. A recent paper leveraging volumetric deep radiomic features from synthetic correlated diffusion imaging (CDI$^s$) for breast cancer grade prediction showed immense promise for noninvasive grading methods. Motivated by the impact of CDI$^s$ optimization for prostate cancer delineation, this paper examines using optimized CDI$^s$ to improve breast cancer grade prediction. We fuse the optimized CDI$^s$ signal with diffusion-weighted imaging (DWI) to create a multiparametric MRI for each patient. Using a larger patient cohort and training across all the layers of a pretrained MONAI model, we achieve a leave-one-out cross-validation accuracy of 95.79%, over 8% higher than that previously reported.
https://arxiv.org/abs/2405.07861
In 2020, 685,000 deaths across the world were attributed to breast cancer, underscoring the critical need for innovative and effective breast cancer treatment. Neoadjuvant chemotherapy has recently gained popularity as a promising treatment strategy for breast cancer, attributed to its efficacy in shrinking large tumors and leading to pathologic complete response. However, the current process to recommend neoadjuvant chemotherapy relies on the subjective evaluation of medical experts, which contains inherent biases and significant uncertainty. A recent study, utilizing volumetric deep radiomic features extracted from synthetic correlated diffusion imaging (CDI$^s$), demonstrated significant potential in noninvasive breast cancer pathologic complete response prediction. Inspired by the positive outcomes of optimizing CDI$^s$ for prostate cancer delineation, this research investigates the application of optimized CDI$^s$ to enhance breast cancer pathologic complete response prediction. Using multiparametric MRI that fuses optimized CDI$^s$ with diffusion-weighted imaging (DWI), we obtain a leave-one-out cross-validation accuracy of 93.28%, over 5.5% higher than that previously reported.
https://arxiv.org/abs/2405.07854
General Value Functions (GVFs) (Sutton et al., 2011) are an established way to represent predictive knowledge in reinforcement learning. Each GVF computes the expected return for a given policy, based on a unique pseudo-reward. Multiple GVFs can be estimated in parallel using off-policy learning from a single stream of data, often sourced from a fixed behavior policy or pre-collected dataset. This leaves an open question: how can the behavior policy be chosen for data-efficient GVF learning? To address this gap, we propose GVFExplorer, which aims at learning a behavior policy that efficiently gathers data for evaluating multiple GVFs in parallel. This behavior policy selects actions in proportion to the total variance in the return across all GVFs, reducing the number of environmental interactions. To enable accurate variance estimation, we use a recently proposed temporal-difference-style variance estimator. We prove that each behavior policy update reduces the mean squared error in the summed predictions over all GVFs. We empirically demonstrate our method's performance in both tabular representations and nonlinear function approximation.
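The variance-proportional behavior policy described above can be sketched for the tabular case: the probability of an action is proportional to the return variance, summed over all GVFs, for that action. The variance table is a made-up placeholder; the paper estimates these quantities with a temporal-difference-style variance learner.

```python
def behavior_probs(var_per_gvf):
    """Action probabilities proportional to total variance across GVFs.

    var_per_gvf[g][a] is a hypothetical estimate of the return variance of
    GVF g when taking action a in the current state.
    """
    totals = [sum(col) for col in zip(*var_per_gvf)]  # sum over GVFs per action
    z = sum(totals)
    return [v / z for v in totals]

# Two GVFs, two actions: action 1 carries more total variance across the
# GVFs, so the behavior policy samples it more often.
probs = behavior_probs([[1.0, 3.0],
                        [1.0, 1.0]])
```

Sampling high-variance actions more often concentrates data where return estimates are least certain, which is how the method cuts environment interactions while evaluating all GVFs in parallel.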
https://arxiv.org/abs/2405.07838
Many aging individuals encounter challenges in effectively tracking their dietary intake, exacerbating their susceptibility to nutrition-related health complications. Self-reporting methods are often inaccurate and suffer from substantial bias; however, leveraging intelligent prediction methods can automate and enhance precision in this process. Recent work has explored using computer vision prediction systems to predict nutritional information from food images. Still, these methods are often tailored to specific situations, require other inputs in addition to a food image, or do not provide comprehensive nutritional information. This paper aims to enhance the efficacy of dietary intake estimation by leveraging various neural network architectures to directly predict a meal's nutritional content from its image. Through comprehensive experimentation and evaluation, we present NutritionVerse-Direct, a model utilizing a vision transformer base architecture with three fully connected layers that lead to five regression heads predicting calories (kcal), mass (g), protein (g), fat (g), and carbohydrates (g) present in a meal. NutritionVerse-Direct yields a combined mean average error score on the NutritionVerse-Real dataset of 412.6, an improvement of 25.5% over the Inception-ResNet model, demonstrating its potential for improving dietary intake estimation accuracy.
https://arxiv.org/abs/2405.07814