Localizing oneself during endoscopic procedures can be problematic due to the lack of distinguishable textures and landmarks, as well as device-related difficulties such as a limited field of view and challenging lighting conditions. Expert knowledge shaped by years of experience is required for localization within the human body during endoscopic procedures. In this work, we present a deep learning method based on anatomy recognition that constructs a surgical path in an unsupervised manner from surgical videos, modelling relative location and variations due to different viewing angles. At inference time, the model can map an unseen video's frames onto the path and estimate the viewing angle, aiming to provide guidance, for instance, to reach a particular destination. We test the method on a dataset of surgical videos of transsphenoidal adenomectomies, as well as on a synthetic dataset. An online tool that lets researchers upload their surgical videos to obtain anatomy detections, along with the weights of the trained YOLOv7 model, is available at: this https URL.
https://arxiv.org/abs/2405.09355
Deep learning classifiers are prone to latching onto dominant confounders present in a dataset rather than onto the causal markers associated with the target class, leading to poor generalization and biased predictions. Although explainability via counterfactual image generation has been successful at exposing the problem, bias mitigation strategies that permit accurate explainability in the presence of dominant and diverse artifacts remain unsolved. In this work, we propose the DeCoDEx framework and show how an external, pre-trained binary artifact detector can be leveraged during inference to guide a diffusion-based counterfactual image generator towards accurate explainability. Experiments on the CheXpert dataset, using both synthetic artifacts and real visual artifacts (support devices), show that the proposed method successfully synthesizes counterfactual images that change the causal pathology markers associated with Pleural Effusion while preserving or ignoring the visual artifacts. Augmenting ERM and Group-DRO classifiers with the DeCoDEx-generated images substantially improves the results across underrepresented groups that are out of distribution for each class. The code is made publicly available at this https URL.
https://arxiv.org/abs/2405.09288
It remains a challenge to effectively control the emotion rendering in text-to-speech (TTS) synthesis. Prior studies have primarily focused on learning a global prosodic representation at the utterance level, which strongly correlates with linguistic prosody. Our goal is to construct a hierarchical emotion distribution (ED) that effectively encapsulates intensity variations of emotions at various levels of granularity, encompassing phonemes, words, and utterances. During TTS training, the hierarchical ED is extracted from the ground-truth audio and guides the predictor to establish a connection between emotional and linguistic prosody. At run-time inference, the TTS model generates emotional speech and, at the same time, provides quantitative control of emotion over the speech constituents. Both objective and subjective evaluations validate the effectiveness of the proposed framework in terms of emotion prediction and control.
https://arxiv.org/abs/2405.09171
Deep models produce a number of features in each internal layer. A key problem in applications such as feature compression for remote inference is determining how important each feature is for the task(s) performed by the model. The problem is especially challenging in the case of multi-task inference, where the same feature may carry different importance for different tasks. In this paper, we examine how effective mutual information (MI) between a feature and a model's task output is as a measure of the feature's importance for that task. Experiments involving hard selection and soft selection (unequal compression) based on MI are carried out to compare the MI-based method with alternative approaches. Multi-objective analysis is provided to offer further insight.
https://arxiv.org/abs/2405.09077
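A minimal sketch of the MI-based importance measure discussed above, using a plug-in estimate over discretized feature values; the function name and toy data are illustrative, not from the paper:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits for two discrete sequences."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), summed over observed pairs
        mi += (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
    return mi

# Toy check: a feature identical to the label carries maximal MI for the
# task; a feature statistically independent of the label carries zero.
labels    = [0, 0, 1, 1, 0, 1, 0, 1]
feature_a = [0, 0, 1, 1, 0, 1, 0, 1]   # copies the label -> informative
feature_b = [0, 0, 0, 0, 1, 1, 1, 1]   # independent of the label here

print(mutual_information(feature_a, labels))  # 1.0
print(mutual_information(feature_b, labels))  # 0.0
```

In a compression setting, features could then be ranked by this score, with hard selection keeping the top-k and soft selection assigning more bits to higher-MI features.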
Transformer-based long context generative models power emerging AI applications such as hour-long video understanding and project-level coding agents. However, deploying long context transformers (e.g., 100K to 10M tokens) is prohibitively expensive compared to shorter context (e.g., 4K tokens) model variants. Reducing the cost of long-context transformers has become a pressing research and engineering challenge as of 2024. This work describes a concurrent programming framework for quantitatively analyzing the efficiency challenges of serving multiple long-context requests under a limited GPU high-bandwidth memory (HBM) budget. We give a detailed analysis of how all additional computational costs, compared to 4K context, trace back to a single source: the large size of the KV cache. Using a 34B GPT-3.5-level model with 50K context on A100 NVLink as a running example, we describe how its large KV cache causes four types of deployment challenges: (1) Prefilling long inputs takes much more compute time and GPU memory than short inputs. (2) After prefilling, the large KV cache residing in GPU HBM significantly restricts the number of concurrent users that can be served. (3) During decoding, repeatedly reading the KV cache from HBM to the SMs largely increases latency. (4) When the KV cache overflows memory, swapping it from HBM to DDR incurs significant context-switching latency. We use this framework to analyze existing works and identify possibilities of combining them to build end-to-end systems. Overall, this work offers a foundational framework for analyzing long context transformer deployment and identifies directions for making 1M-context inference as cheap as 4K.
https://arxiv.org/abs/2405.08944
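The "single source" claim above can be made concrete with back-of-envelope arithmetic. The layer/head dimensions below are hypothetical stand-ins for a 34B-class model with an fp16 cache, not the paper's exact configuration:

```python
def kv_cache_bytes(batch, seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # K and V tensors per layer, each of shape [batch, seq_len, n_kv_heads, head_dim]
    return 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

GiB = 1024 ** 3
# Illustrative 34B-class dimensions (assumed, not from the paper):
cfg = dict(batch=1, n_layers=48, n_kv_heads=56, head_dim=128)
short_ctx = kv_cache_bytes(seq_len=4_000, **cfg) / GiB
long_ctx  = kv_cache_bytes(seq_len=50_000, **cfg) / GiB
print(f"4K context:  {short_ctx:.1f} GiB of KV cache per request")
print(f"50K context: {long_ctx:.1f} GiB of KV cache per request")
# Under these assumptions a single 50K-context request needs roughly an
# order of magnitude more cache than a 4K request, which illustrates the
# concurrency and swapping bottlenecks described above.
```

Because the cache grows linearly in sequence length, the 50K request costs 12.5x the 4K request under any fixed model configuration.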
Navigating the complex landscape of news articles involves understanding the various actors or entities involved, referred to as news stakeholders. These stakeholders, ranging from policymakers to opposition figures, citizens, and more, play pivotal roles in shaping news narratives. Recognizing their stakeholder types, reflecting their roles, political alignments, social standing, and more, is paramount for a nuanced comprehension of news content. Despite existing works focusing on salient entity extraction, coverage variations, and political affiliations through social media data, the automated detection of stakeholder roles within news content remains an underexplored domain. In this paper, we bridge this gap by introducing an effective approach to classify stakeholder types in news articles. Our method involves transforming the stakeholder classification problem into a natural language inference task, utilizing contextual information from news articles and external knowledge to enhance the accuracy of stakeholder type detection. Moreover, our proposed model showcases efficacy in zero-shot settings, further extending its applicability to diverse news contexts.
https://arxiv.org/abs/2405.08751
Numerous studies have revealed that deep learning-based medical image classification models may exhibit bias towards specific demographic attributes, such as race, gender, and age. Existing bias mitigation methods often achieve a high level of fairness at the cost of significant accuracy degradation. In response to this challenge, we propose an innovative and adaptable Soft Nearest Neighbor Loss-based channel pruning framework, which achieves fairness through channel pruning. Traditionally, channel pruning is used to accelerate neural network inference. However, our work demonstrates that pruning can also be a potent tool for achieving fairness. Our key insight is that different channels in a layer contribute differently to the accuracy of different groups. By selectively pruning the critical channels that drive the accuracy gap between the privileged and unprivileged groups, we can effectively improve fairness without significantly sacrificing accuracy. Experiments conducted on two skin lesion diagnosis datasets across multiple sensitive attributes validate the effectiveness of our method in achieving a state-of-the-art trade-off between accuracy and fairness. Our code is available at this https URL.
https://arxiv.org/abs/2405.08681
With the increasing use of neural networks in critical systems, runtime monitoring becomes essential to reject unsafe predictions during inference. Various techniques have emerged to establish rejection scores that maximize the separability between the distributions of safe and unsafe predictions. The efficacy of these approaches is mostly evaluated using threshold-agnostic metrics, such as the area under the receiver operating characteristic curve. However, in real-world applications, an effective monitor also requires identifying a good threshold to transform these scores into meaningful binary decisions. Despite the pivotal importance of threshold optimization, this problem has received little attention. A few studies touch upon this question, but they typically assume that the runtime data distribution mirrors the training distribution, which is a strong assumption as monitors are supposed to safeguard a system against potentially unforeseen threats. In this work, we present rigorous experiments on various image datasets to investigate: 1. The effectiveness of monitors in handling unforeseen threats, which are not available during threshold adjustments. 2. Whether integrating generic threats into the threshold optimization scheme can enhance the robustness of monitors.
https://arxiv.org/abs/2405.08654
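One simple instantiation of the threshold-selection problem discussed above is to sweep candidate cutoffs over held-out rejection scores and keep the one maximizing balanced accuracy. This is a generic sketch under that assumption, not the paper's procedure; the scores are toy data:

```python
def best_threshold(safe_scores, unsafe_scores):
    """Sweep observed scores as candidate cutoffs; predictions whose
    rejection score exceeds the cutoff are rejected as unsafe. Keep the
    cutoff with the best balanced accuracy on the held-out examples."""
    best_t, best_bacc = None, -1.0
    for t in sorted(set(safe_scores) | set(unsafe_scores)):
        tpr = sum(s > t for s in unsafe_scores) / len(unsafe_scores)  # unsafe rejected
        tnr = sum(s <= t for s in safe_scores) / len(safe_scores)     # safe accepted
        bacc = (tpr + tnr) / 2
        if bacc > best_bacc:
            best_t, best_bacc = t, bacc
    return best_t, best_bacc

# Toy held-out scores (higher = more suspicious):
safe   = [0.1, 0.2, 0.25, 0.3, 0.4]
unsafe = [0.35, 0.6, 0.7, 0.8, 0.9]
t, bacc = best_threshold(safe, unsafe)
print(t, bacc)  # 0.3 0.9: all unsafe rejected, one safe score (0.4) falsely rejected
```

The catch the paper highlights: a cutoff tuned this way is only as good as the "unsafe" scores available at calibration time, which may not resemble the unforeseen threats seen at runtime.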
Multi-objective combinatorial optimization (MOCO) problems are prevalent in various real-world applications. Most existing neural methods for MOCO problems rely solely on decomposition and utilize precise hypervolume to enhance diversity. However, these methods often approximate only limited regions of the Pareto front and spend excessive time on diversity enhancement because of ambiguous decomposition and time-consuming hypervolume calculation. To address these limitations, we design a Geometry-Aware Pareto set Learning algorithm named GAPL, which provides a novel geometric perspective for neural MOCO via a Pareto attention model based on hypervolume expectation maximization. In addition, we propose a hypervolume residual update strategy to enable the Pareto attention model to capture both local and non-local information of the Pareto set/front. We also design a novel inference approach to further improve the quality of the solution set and speed up hypervolume calculation and local subset selection. Experimental results on three classic MOCO problems demonstrate that our GAPL outperforms state-of-the-art neural baselines via superior decomposition and efficient diversity enhancement.
https://arxiv.org/abs/2405.08604
The diversity of real-world environments requires neural network models to expand from closed category settings to accommodate novel emerging categories. In this paper, we study open-vocabulary object detection (OVD), which facilitates the detection of novel object classes under the supervision of only base annotations and open-vocabulary knowledge. However, we find that the inadequacy of neighboring relationships between regions during the alignment process inevitably constrains the performance of recent distillation-based OVD strategies. To this end, we propose Neighboring Region Attention Alignment (NRAA), which performs alignment within the attention mechanism of a set of neighboring regions to boost open-vocabulary inference. Specifically, for a given proposal region, we randomly explore the neighboring boxes and apply our proposed neighboring region attention (NRA) mechanism to extract relationship information. This interaction information is then seamlessly fed into the distillation procedure to assist the alignment between the detector and the pre-trained vision-language models (VLMs). Extensive experiments validate that our proposed model exhibits superior performance on open-vocabulary benchmarks.
https://arxiv.org/abs/2405.08593
Stereo image super-resolution (SR) refers to the reconstruction of a high-resolution (HR) image from a pair of low-resolution (LR) images as typically captured by a dual-camera device. To enhance the quality of SR images, most previous studies focused on increasing the number and size of feature maps and introducing complex and computationally intensive structures, resulting in models with high computational complexity. Here, we propose a simple yet efficient stereo image SR model called NAFRSSR, which is modified from the previous state-of-the-art model NAFSSR by introducing recursive connections and lightweighting the constituent modules. Our NAFRSSR model is composed of nonlinear activation free and group convolution-based blocks (NAFGCBlocks) and depth-separated stereo cross attention modules (DSSCAMs). The NAFGCBlock improves feature extraction and reduces the number of parameters by removing the simple channel attention mechanism from NAFBlock and using group convolution. The DSSCAM enhances feature fusion and reduces the number of parameters by replacing the 1x1 pointwise convolution in SCAM with a weight-shared 3x3 depthwise convolution. Besides, we propose to incorporate a trainable edge detection operator into NAFRSSR to further improve model performance. Four variants of NAFRSSR with different sizes, namely NAFRSSR-Mobile (NAFRSSR-M), NAFRSSR-Tiny (NAFRSSR-T), NAFRSSR-Super (NAFRSSR-S), and NAFRSSR-Base (NAFRSSR-B), are designed, and they all exhibit fewer parameters, higher PSNR/SSIM, and faster speed than the previous state-of-the-art models. In particular, to the best of our knowledge, NAFRSSR-M is the lightest (0.28M parameters) and fastest (50 ms inference time) model, achieving an average PSNR/SSIM as high as 24.657 dB/0.7622 on the benchmark datasets. Codes and models will be released at this https URL.
https://arxiv.org/abs/2405.08423
In the training process of Deep Reinforcement Learning (DRL), agents require repeated interactions with the environment. As training volume and model complexity increase, enhancing the data utilization and explainability of DRL training remains a challenging problem. This paper addresses these challenges by focusing on the temporal correlations within the time dimension of time series. We propose a novel approach to segment multivariate time series into meaningful subsequences and represent the time series based on these subsequences. Furthermore, the subsequences are employed for causal inference to identify fundamental causal factors that significantly impact training outcomes. We design a module to provide feedback on this causality during DRL training. Several experiments demonstrate the feasibility of our approach in common environments, confirming its ability to enhance the effectiveness of DRL training and impart a certain level of explainability to the training process. Additionally, we extend our approach with a prioritized experience replay algorithm, and experimental results demonstrate its continued effectiveness.
https://arxiv.org/abs/2405.08380
Spatial interference (SI) occurs when the treatment at one location affects the outcomes at other locations. Accounting for spatial interference in spatiotemporal settings poses further challenges, as interference violates the stable unit treatment value assumption, making it infeasible for standard causal inference methods to quantify the effects of time-varying treatments on spatially varying outcomes. In this paper, we first formalize the concept of spatial interference in the case of time-varying treatment assignments by extending the potential outcome framework under the assumption of no unmeasured confounding. We then propose a deep learning-based potential outcome model for spatiotemporal causal inference. We utilize latent factor modeling to reduce the bias due to time-varying confounding, while leveraging the power of the U-Net architecture to capture global and local spatial interference in data over time. Our causal estimators extend the average treatment effect (ATE) to estimate the direct (DATE) and indirect (IATE) effects of spatial interference on treated and untreated data. As the first deep learning-based spatiotemporal causal inference technique of its kind, our approach shows advantages over several baseline methods in experiments on two synthetic datasets, with and without spatial interference. Our results on a real-world climate dataset also align with domain knowledge, further demonstrating the effectiveness of the proposed method.
https://arxiv.org/abs/2405.08174
Large language models (LLMs) have demonstrated remarkable capabilities in various biomedical natural language processing (NLP) tasks, leveraging demonstrations within the input context to adapt to new tasks. However, LLMs are sensitive to the selection of demonstrations. To address the hallucination issue inherent in LLMs, retrieval-augmented LLMs (RALs) offer a solution by retrieving pertinent information from an established database. Nonetheless, existing research lacks a rigorous evaluation of the impact of retrieval-augmented large language models on different biomedical NLP tasks. This deficiency makes it challenging to ascertain the capabilities of RALs within the biomedical domain. Moreover, the outputs of RALs are affected by retrieved knowledge that is unlabeled, counterfactual, or diverse, which is not well studied in the biomedical domain even though such knowledge is common in the real world. Finally, exploring the self-awareness ability is also crucial for a RAL system. In this paper, we therefore systematically investigate the impact of RALs on 5 different biomedical tasks (triple extraction, link prediction, classification, question answering, and natural language inference). We analyze the performance of RALs in four fundamental abilities: unlabeled robustness, counterfactual robustness, diverse robustness, and negative awareness. To this end, we propose an evaluation framework to assess the RALs' performance on different biomedical NLP tasks and establish four testbeds based on the aforementioned fundamental abilities. We then evaluate 3 representative LLMs, each with 3 different retrievers, on 5 tasks over 9 datasets.
https://arxiv.org/abs/2405.08151
We introduce Many-Shot Regurgitation (MSR) prompting, a new black-box membership inference attack framework for examining verbatim content reproduction in large language models (LLMs). MSR prompting involves dividing the input text into multiple segments and creating a single prompt that includes a series of faux conversation rounds between a user and a language model to elicit verbatim regurgitation. We apply MSR prompting to diverse text sources, including Wikipedia articles and open educational resources (OER) textbooks, which provide high-quality, factual content and are continuously updated over time. For each source, we curate two dataset types: one that LLMs were likely exposed to during training ($D_{\rm pre}$) and another consisting of documents published after the models' training cutoff dates ($D_{\rm post}$). To quantify the occurrence of verbatim matches, we employ the Longest Common Substring algorithm and count the frequency of matches at different length thresholds. We then use statistical measures such as Cliff's delta, Kolmogorov-Smirnov (KS) distance, and Kruskal-Wallis H test to determine whether the distribution of verbatim matches differs significantly between $D_{\rm pre}$ and $D_{\rm post}$. Our findings reveal a striking difference in the distribution of verbatim matches between $D_{\rm pre}$ and $D_{\rm post}$, with the frequency of verbatim reproduction being significantly higher when LLMs (e.g. GPT models and LLaMAs) are prompted with text from datasets they were likely trained on. For instance, when using GPT-3.5 on Wikipedia articles, we observe a substantial effect size (Cliff's delta $= -0.984$) and a large KS distance ($0.875$) between the distributions of $D_{\rm pre}$ and $D_{\rm post}$. Our results provide compelling evidence that LLMs are more prone to reproducing verbatim content when the input text is likely sourced from their training data.
https://arxiv.org/abs/2405.08134
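The match-counting and distribution-comparison steps above can be sketched as follows. `difflib`'s longest matching block serves as the Longest Common Substring, and a plain ECDF sweep gives the two-sample KS statistic; the strings and match-length samples are illustrative, not from the paper:

```python
from difflib import SequenceMatcher

def longest_common_substring_len(a, b):
    """Length of the longest contiguous match between two strings."""
    m = SequenceMatcher(None, a, b, autojunk=False)
    return m.find_longest_match(0, len(a), 0, len(b)).size

def ks_distance(xs, ys):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between ECDFs."""
    d = 0.0
    for v in xs + ys:
        fx = sum(x <= v for x in xs) / len(xs)
        fy = sum(y <= v for y in ys) / len(ys)
        d = max(d, abs(fx - fy))
    return d

generated = "the mitochondria is the powerhouse of the cell"
source    = "mitochondria is the powerhouse of the cell, as the saying goes"
print(longest_common_substring_len(generated, source))  # 42

# Illustrative verbatim-match-length samples for D_pre vs D_post:
print(ks_distance([35, 40, 45, 50], [5, 6, 7, 8]))  # 1.0 (fully separated)
```

A large KS distance between the pre- and post-cutoff match-length distributions is the signal the attack uses as membership evidence.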
The multi-plane representation has been highlighted for its fast training and inference across static and dynamic neural radiance fields. This approach constructs relevant features via projection onto learnable grids and interpolation of adjacent vertices. However, it has limitations in capturing low-frequency details and, despite its multi-resolution concept, tends to overuse parameters for low-frequency features because of its bias toward fine details. This phenomenon leads to instability and inefficiency when training poses are sparse. In this work, we propose a method that synergistically integrates the multi-plane representation with a coordinate-based network known for its strong bias toward low-frequency signals. The coordinate-based network is responsible for capturing low-frequency details, while the multi-plane representation focuses on capturing fine-grained details. We demonstrate that using residual connections between them seamlessly preserves their inherent properties. Additionally, the proposed progressive training scheme accelerates the disentanglement of these two features. We empirically show that the proposed method achieves results comparable to explicit encoding with fewer parameters and, in particular, outperforms others for static and dynamic NeRFs under sparse inputs.
https://arxiv.org/abs/2405.07857
Infants learn actively in their environments, shaping their own learning curricula. They learn about their environments' affordances, that is, how local circumstances determine how their behavior can affect the environment. Here we model this type of behavior by means of a deep learning architecture. The architecture mediates between global cognitive map exploration and local affordance learning. Inference processes actively move the simulated agent towards regions where they expect affordance-related knowledge gain. We contrast three measures of uncertainty to guide this exploration: predicted uncertainty of a model, standard deviation between the means of several models (SD), and the Jensen-Shannon Divergence (JSD) between several models. We show that the first measure gets fooled by aleatoric uncertainty inherent in the environment, while the two other measures focus learning on epistemic uncertainty. JSD exhibits the most balanced exploration strategy. From a computational perspective, our model suggests three key ingredients for coordinating the active generation of learning curricula: (1) Navigation behavior needs to be coordinated with local motor behavior for enabling active affordance learning. (2) Affordances need to be encoded locally for acquiring generalized knowledge. (3) Effective active affordance learning mechanisms should use density comparison techniques for estimating expected knowledge gain. Future work may seek collaborations with developmental psychology to model active play in children in more realistic scenarios.
https://arxiv.org/abs/2405.07816
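Of the three uncertainty measures above, the JSD between ensemble members admits a compact closed form: the entropy of the mixture minus the mean entropy of the members. A minimal sketch with toy categorical distributions (the data is illustrative):

```python
import math

def entropy(p):
    """Shannon entropy in bits of a categorical distribution."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def jsd(dists):
    """Generalized Jensen-Shannon divergence of several categorical
    distributions (uniform weights): H(mixture) - mean member entropy."""
    n, k = len(dists), len(dists[0])
    mixture = [sum(d[i] for d in dists) / n for i in range(k)]
    return entropy(mixture) - sum(entropy(d) for d in dists) / n

# Members that agree -> JSD near zero (low epistemic uncertainty), even if
# each prediction is itself noisy; members that disagree -> high JSD.
agree    = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]]
disagree = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(jsd(agree))     # ~0 bits
print(jsd(disagree))  # ~0.667 bits
```

This is why JSD (unlike a single model's predicted uncertainty) is not fooled by aleatoric noise: shared noise raises every member's entropy and the mixture's entropy equally, cancelling out.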
Object pose estimation is a fundamental computer vision problem with broad applications in augmented reality and robotics. Over the past decade, deep learning models, due to their superior accuracy and robustness, have increasingly supplanted conventional algorithms reliant on engineered point pair features. Nevertheless, several challenges persist in contemporary methods, including their dependency on labeled training data, model compactness, robustness under challenging conditions, and their ability to generalize to novel unseen objects. A recent survey discussing the progress made on different aspects of this area, outstanding challenges, and promising future directions is missing. To fill this gap, we discuss the recent advances in deep learning-based object pose estimation, covering all three formulations of the problem, i.e., instance-level, category-level, and unseen object pose estimation. Our survey also covers multiple input data modalities, degrees-of-freedom of output poses, object properties, and downstream tasks, providing readers with a holistic understanding of this field. Additionally, it discusses training paradigms of different domains, inference modes, application areas, evaluation metrics, and benchmark datasets, as well as reports the performance of current state-of-the-art methods on these benchmarks, thereby facilitating readers in selecting the most suitable method for their application. Finally, the survey identifies key challenges, reviews prevailing trends along with their pros and cons, and identifies promising directions for future research. We also keep tracing the latest works at this https URL.
https://arxiv.org/abs/2405.07801
Persona-driven role-playing (PRP) aims to build AI characters that can respond to user queries by faithfully sticking with all persona statements. Unfortunately, existing faithfulness criteria for PRP are limited to coarse-grained LLM-based scoring without a clear definition or formulation. This paper presents a pioneering exploration to quantify PRP faithfulness as a fine-grained and explainable criterion, which also serves as a reliable reference for optimization. Our criterion first discriminates persona statements into active and passive constraints by identifying the query-statement relevance. Then, we incorporate all constraints following the principle that the AI character's response should be (a) entailed by active (relevant) constraints and (b) not contradicted by passive (irrelevant) constraints. We translate this principle mathematically into a novel Active-Passive-Constraint (APC) score, a constraint-wise sum of natural language inference (NLI) scores weighted by relevance scores. In practice, we build the APC scoring system by symbolically distilling small discriminators from GPT-4 for efficiency. We validate the quality of the APC score against human evaluation based on example personas with tens of statements, and the results show a high correlation. We further leverage it as a reward system in direct preference optimization (DPO) for better AI characters. Our experiments offer a fine-grained and explainable comparison between existing PRP techniques, revealing their advantages and limitations. We further find APC-based DPO to be one of the most competitive techniques for sticking with all constraints and can be well incorporated with other techniques. We then extend the scale of the experiments to real persons with hundreds of statements and reach a consistent conclusion.
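The stated principle — responses entailed by active constraints and not contradicted by passive ones, weighted by relevance — can be written as a small scoring sketch. This is our reading of the abstract, not the paper's implementation; the exact combination of probabilities may differ, and all numeric values are hypothetical stand-ins for the distilled relevance and NLI discriminators:

```python
def apc_score(relevance, entail, contradict):
    """Illustrative sketch of an Active-Passive-Constraint (APC) style score.

    All inputs are per-persona-statement probabilities:
      relevance[i]  - P(statement i is relevant to the query)
      entail[i]     - P(statement i entails the response)
      contradict[i] - P(statement i contradicts the response)

    Active (relevant) statements reward entailment; passive (irrelevant)
    statements reward the absence of contradiction. Each term is weighted
    by the statement's probability of being active or passive.
    """
    return sum(
        r * e + (1.0 - r) * (1.0 - c)
        for r, e, c in zip(relevance, entail, contradict)
    )

# One clearly relevant statement the response entails, plus one largely
# irrelevant statement the response does not contradict:
score = apc_score([0.9, 0.1], [0.8, 0.0], [0.0, 0.1])
```

Because the score decomposes per statement, it doubles as a dense, constraint-wise reward, which is what makes it usable inside DPO as the abstract describes.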
https://arxiv.org/abs/2405.07726
Hyperparameter optimization plays a pivotal role in enhancing the predictive performance and generalization capabilities of ML models. However, in many applications, we do not only care about predictive performance but also about objectives such as inference time, memory, or energy consumption. In such multi-objective optimization (MOO) scenarios, determining the importance of hyperparameters poses a significant challenge due to the complex interplay between the conflicting objectives. In this paper, we propose the first method for assessing the importance of hyperparameters in the context of multi-objective hyperparameter optimization. Our approach leverages surrogate-based hyperparameter importance (HPI) measures, i.e., fANOVA and ablation paths, to provide insights into the impact of hyperparameters on the optimization objectives. Specifically, we compute the a-priori scalarization of the objectives and determine the importance of the hyperparameters for different objective tradeoffs. Through extensive empirical evaluations on diverse benchmark datasets with three different objectives paired with accuracy, namely time, demographic parity, and energy consumption, we demonstrate the effectiveness and robustness of our proposed method. Our findings not only offer valuable guidance for hyperparameter tuning in MOO tasks but also contribute to advancing the understanding of HPI in complex optimization scenarios.
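The a-priori scalarization step can be illustrated with a minimal weighted-sum sketch. The configurations and weights below are hypothetical; in the paper, an HPI measure such as fANOVA or an ablation path would then be computed on each scalarized objective:

```python
import numpy as np

def scalarize(objectives, weights):
    """Weighted-sum (a-priori) scalarization of objective values.

    objectives: array of shape (n_configs, n_objectives), normalized so
                that lower is better for every objective.
    weights:    tradeoff vector; normalized here to sum to 1.
    """
    w = np.asarray(weights, dtype=float)
    return np.asarray(objectives, dtype=float) @ (w / w.sum())

# Two hypothetical configurations scored on (error, inference time):
configs = np.array([[0.10, 0.80],   # accurate but slow
                    [0.30, 0.20]])  # faster but less accurate
best_if_accuracy_matters = int(np.argmin(scalarize(configs, [0.9, 0.1])))
best_if_speed_matters = int(np.argmin(scalarize(configs, [0.1, 0.9])))
```

Sweeping the weight vector across tradeoffs changes which configuration wins, which is exactly why hyperparameter importance must be assessed per tradeoff rather than once for the whole MOO problem.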
https://arxiv.org/abs/2405.07640