Synthetic aperture radar (SAR) is essential for actively acquiring information for Earth observation. SAR Automatic Target Recognition (ATR) focuses on detecting and classifying various target categories under different image conditions. Current deep learning-based SAR ATR methods are typically designed for specific datasets and applications. Varying target characteristics, scene background information, and sensor parameters across ATR datasets challenge the generalization of these methods. This paper aims to achieve general SAR ATR based on a foundation model with Self-Supervised Learning (SSL). Our motivation is to break through the limitations of specific datasets and conditions and obtain universal perceptual capabilities across targets, scenes, and sensors. A foundation model named SARATR-X is proposed along four aspects: pre-training dataset, model backbone, SSL, and evaluation tasks. First, we integrated 14 datasets with various target categories and imaging conditions into a pre-training dataset. Second, different model backbones were discussed to find the most suitable approaches for remote-sensing images. Third, we applied two-stage training and SAR gradient features to ensure the diversity and scalability of SARATR-X. Finally, SARATR-X achieved competitive and superior performance on 5 datasets with 8 task settings, which shows that a foundation model can achieve universal SAR ATR. We believe it is time to embrace foundation models for SAR image interpretation in the era of big data.
https://arxiv.org/abs/2405.09365
Localizing oneself during endoscopic procedures can be problematic due to the lack of distinguishable textures and landmarks, as well as difficulties caused by the endoscopic device itself, such as a limited field of view and challenging lighting conditions. Expert knowledge shaped by years of experience is required for localization within the human body during endoscopic procedures. In this work, we present a deep learning method based on anatomy recognition that constructs a surgical path in an unsupervised manner from surgical videos, modelling relative location and variations due to different viewing angles. At inference time, the model can map an unseen video's frames onto the path and estimate the viewing angle, aiming to provide guidance, for instance, to reach a particular destination. We test the method on a dataset consisting of surgical videos of transsphenoidal adenomectomies, as well as on a synthetic dataset. An online tool that lets researchers upload their surgical videos to obtain anatomy detections, and the weights of the trained YOLOv7 model, are available at: this https URL.
https://arxiv.org/abs/2405.09355
AI-based analysis of histopathology whole slide images (WSIs) is central in computational pathology. However, image quality can impact model performance. Here, we investigate to what extent unsharp areas of WSIs impact deep convolutional neural network classification performance. We propose a multi-model approach, i.e. DeepBlurMM, to alleviate the impact of unsharp image areas and improve the model performance. DeepBlurMM uses the sigma cut-offs to determine the most suitable model for predicting tiles with various levels of blurring within a single WSI, where sigma is the standard deviation of the Gaussian distribution. Specifically, the cut-offs categorise the tiles into sharp or slight blur, moderate blur, and high blur. Each blur level has a corresponding model to be selected for tile-level predictions. Throughout the simulation study, we demonstrated the application of DeepBlurMM in a binary classification task for breast cancer Nottingham Histological Grade 1 vs 3. Performance, evaluated over 5-fold cross-validation, showed that DeepBlurMM outperformed the base model under moderate blur and mixed blur conditions. Unsharp image tiles (local blurriness) at prediction time reduced model performance. The proposed multi-model approach improved performance under some conditions, with the potential to improve quality in both research and clinical applications.
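The tile-routing step of DeepBlurMM can be sketched as follows; the sigma cut-offs and model names here are hypothetical placeholders, not the values tuned in the study:

```python
# Sketch of DeepBlurMM-style model routing by blur level. The cut-offs
# (0.5, 2.0) and the model keys are illustrative assumptions only.

def select_model(sigma, cutoffs=(0.5, 2.0)):
    """Map an estimated Gaussian blur sigma to a blur-level model key."""
    low, high = cutoffs
    if sigma <= low:
        return "sharp_or_slight_blur_model"
    elif sigma <= high:
        return "moderate_blur_model"
    return "high_blur_model"

def predict_wsi(tiles, models, estimate_sigma, cutoffs=(0.5, 2.0)):
    """Route each tile of a WSI to the model matching its blur level."""
    return [models[select_model(estimate_sigma(t), cutoffs)](t) for t in tiles]
```

At inference, `estimate_sigma` would be a blur estimator run per tile, and `models` a dict of trained classifiers, one per blur level.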
https://arxiv.org/abs/2405.09298
With the benefit of deep learning techniques, recent research has made significant progress in image compression artifacts reduction. Despite their improved performance, prevailing methods focus only on learning a mapping from the compressed image to the original one, ignoring the intrinsic attributes of the given compressed images, which greatly harms the performance of downstream parsing tasks. Different from these methods, we propose to decouple the intrinsic attributes into two complementary features for artifacts reduction, i.e., the compression-insensitive features to regularize the high-level semantic representations during training and the compression-sensitive features to be aware of the compression degree. To achieve this, we first employ adversarial training to regularize the compressed and original encoded features for retaining high-level semantics, and we then develop a compression quality-aware feature encoder for compression-sensitive features. Based on these dual complementary features, we propose a Dual Awareness Guidance Network (DAGN) to utilize these awareness features as transformation guidance during the decoding phase. In our proposed DAGN, we develop a cross-feature fusion module to maintain the consistency of compression-insensitive features by fusing them into the artifacts reduction baseline. Our method achieves an average PSNR gain of 2.06 dB on BSD500, outperforming state-of-the-art methods, and requires only 29.7 ms to process one image on BSD500. Besides, the experimental results on LIVE1 and LIU4K also demonstrate the efficiency, effectiveness, and superiority of the proposed method in terms of quantitative metrics, visual quality, and downstream machine vision tasks.
https://arxiv.org/abs/2405.09291
Deep learning classifiers are prone to latching onto dominant confounders present in a dataset rather than on the causal markers associated with the target class, leading to poor generalization and biased predictions. Although explainability via counterfactual image generation has been successful at exposing the problem, bias mitigation strategies that permit accurate explainability in the presence of dominant and diverse artifacts remain unsolved. In this work, we propose the DeCoDEx framework and show how an external, pre-trained binary artifact detector can be leveraged during inference to guide a diffusion-based counterfactual image generator towards accurate explainability. Experiments on the CheXpert dataset, using both synthetic artifacts and real visual artifacts (support devices), show that the proposed method successfully synthesizes the counterfactual images that change the causal pathology markers associated with Pleural Effusion while preserving or ignoring the visual artifacts. Augmentation of ERM and Group-DRO classifiers with the DeCoDEx generated images substantially improves the results across underrepresented groups that are out of distribution for each class. The code is made publicly available at this https URL.
https://arxiv.org/abs/2405.09288
Sensor placement optimization methods have been studied extensively. They can be applied to a wide range of applications, including surveillance of known environments, optimal locations for 5G towers, and placement of missile defense systems. However, few works explore the robustness and efficiency of the resulting sensor network concerning sensor failure or adversarial attacks. This paper addresses this issue by optimizing for the least number of sensors to achieve multiple coverage of non-simply connected domains by a prescribed number of sensors. We introduce a new objective function for the greedy (next-best-view) algorithm to design efficient and robust sensor networks and derive theoretical bounds on the network's optimality. We further introduce a Deep Learning model to accelerate the algorithm for near real-time computations. The Deep Learning model requires the generation of training examples. Correspondingly, we show that understanding the geometric properties of the training data set provides important insights into the performance and training process of deep learning techniques. Finally, we demonstrate that a simple parallel version of the greedy approach using a simpler objective can be highly competitive.
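A minimal sketch of the greedy (next-best-view) loop for multiple coverage follows. It uses a plain capped-coverage gain as the objective, which is simpler than the paper's proposed objective function, and disk-shaped sensor ranges are an assumption:

```python
def greedy_k_coverage(points, candidates, radius, k):
    """Greedy placement: repeatedly add the candidate sensor that most
    increases capped coverage, until every point is covered by at least
    k sensors (or no candidate helps). Sensors see a disk of `radius`."""
    cover = {p: 0 for p in points}

    def covered(c, p):
        return (c[0] - p[0]) ** 2 + (c[1] - p[1]) ** 2 <= radius ** 2

    chosen = []
    remaining = list(candidates)
    while any(v < k for v in cover.values()) and remaining:
        best, gain = None, -1
        for c in remaining:
            # marginal gain: points this candidate covers that still need coverage
            g = sum(1 for p in points if covered(c, p) and cover[p] < k)
            if g > gain:
                best, gain = c, g
        if gain <= 0:
            break  # no candidate adds coverage; domain may be infeasible
        chosen.append(best)
        remaining.remove(best)
        for p in points:
            if covered(best, p):
                cover[p] += 1
    return chosen, cover
```

For k = 1 this reduces to the classic greedy set-cover heuristic; larger k yields the redundancy against sensor failure discussed above.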
https://arxiv.org/abs/2405.09096
Recent advancements in deep learning for 3D models have propelled breakthroughs in generation, detection, and scene understanding. However, the effectiveness of these algorithms hinges on large training datasets. We address the challenge by introducing Efficient 3D Seam Carving (E3SC), a novel 3D model augmentation method based on seam carving, which progressively deforms only part of the input model while ensuring the overall semantics are unchanged. Experiments show that our approach is capable of producing diverse and high-quality augmented 3D shapes across various types and styles of input models, achieving considerable improvements over previous methods. Quantitative evaluations demonstrate that our method effectively enhances the novelty and quality of shapes generated by other subsequent 3D generation algorithms.
https://arxiv.org/abs/2405.09050
In recent years, street view imagery has grown to become one of the most important sources of geospatial data collection and urban analytics, which facilitates generating meaningful insights and assisting in decision-making. Synthesizing a street-view image from its corresponding satellite image is a challenging task due to the significant differences in appearance and viewpoint between the two domains. In this study, we screened 20 recent research papers to provide a thorough review of the state-of-the-art of how street-view images are synthesized from their corresponding satellite counterparts. The main findings are: (i) novel deep learning techniques are required for synthesizing more realistic and accurate street-view images; (ii) more datasets need to be collected for public usage; and (iii) more specific evaluation metrics need to be investigated for evaluating the generated images appropriately. We conclude that, due to applying outdated deep learning techniques, the recent literature failed to generate detailed and diverse street-view images.
https://arxiv.org/abs/2405.08961
This paper proposes leveraging vision-language pretraining on bone X-rays paired with French reports to address downstream tasks of interest in bone radiography. A practical processing pipeline is introduced to anonymize and process French medical reports. Pretraining then consists of the self-supervised alignment of visual and textual embedding spaces derived from deep model encoders. The resulting image encoder is then used to handle various downstream tasks, including quantification of osteoarthritis, estimation of bone age on pediatric wrists, and bone fracture and anomaly detection. Our approach demonstrates competitive performance on downstream tasks compared to alternatives requiring a significantly larger amount of human expert annotations. Our work stands as the first study to integrate French reports to shape the embedding space devoted to bone X-ray representations, capitalizing on the large quantity of paired image and report data available in a hospital. By relying on generic vision-language deep models in a language-specific scenario, it contributes to the deployment of vision models for wider healthcare applications.
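The abstract does not specify the alignment objective beyond self-supervised alignment of the two embedding spaces; a common instantiation is a CLIP-style symmetric contrastive (InfoNCE) loss, sketched here in plain Python on toy embeddings:

```python
import math

def infonce_alignment_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss aligning paired image and report
    embeddings (a CLIP-style sketch, assumed rather than taken from the
    paper). Matched image/text pairs share the same index."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def norm(a):
        n = math.sqrt(dot(a, a))
        return [x / n for x in a]

    I = [norm(v) for v in img_emb]
    T = [norm(v) for v in txt_emb]
    n = len(I)
    # cosine-similarity logits between every image and every text
    logits = [[dot(I[i], T[j]) / temperature for j in range(n)] for i in range(n)]

    def ce(row, target):  # numerically stable cross-entropy on one row
        m = max(row)
        z = sum(math.exp(v - m) for v in row)
        return -(row[target] - m - math.log(z))

    loss_i2t = sum(ce(logits[i], i) for i in range(n)) / n
    loss_t2i = sum(ce([logits[i][j] for i in range(n)], j) for j in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)
```

Minimizing this pulls each X-ray embedding toward its own report and away from the other reports in the batch, which is what makes the image encoder useful for the downstream tasks listed above.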
https://arxiv.org/abs/2405.08932
A recent study by De et al. (2022) has reported that large-scale representation learning through pre-training on a public dataset significantly enhances differentially private (DP) learning in downstream tasks, despite the high dimensionality of the feature space. To theoretically explain this phenomenon, we consider the setting of a layer-peeled model in representation learning, which results in interesting phenomena related to learned features in deep learning and transfer learning, known as Neural Collapse (NC). Within the framework of NC, we establish an error bound indicating that the misclassification error is independent of dimension when the distance between actual features and the ideal ones is smaller than a threshold. Additionally, the quality of the features in the last layer is empirically evaluated under different pre-trained models within the framework of NC, showing that a more powerful transformer leads to a better feature representation. Furthermore, we reveal that DP fine-tuning is less robust compared to fine-tuning without DP, particularly in the presence of perturbations. These observations are supported by both theoretical analyses and experimental evaluation. Moreover, to enhance the robustness of DP fine-tuning, we suggest several strategies, such as feature normalization or employing dimension reduction methods like Principal Component Analysis (PCA). Empirically, we demonstrate a significant improvement in testing accuracy by conducting PCA on the last-layer features.
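The suggested PCA step on last-layer features can be illustrated with a minimal pure-Python PCA; power iteration with deflation stands in for a full SVD solver here:

```python
def pca_project(features, k=1, iters=200):
    """Project feature vectors onto their top-k principal components.
    A minimal stand-in (power iteration + deflation) for the PCA applied
    to last-layer features before DP fine-tuning."""
    n, d = len(features), len(features[0])
    mean = [sum(f[j] for f in features) / n for j in range(d)]
    X = [[f[j] - mean[j] for j in range(d)] for f in features]
    # covariance matrix (d x d)
    C = [[sum(X[i][a] * X[i][b] for i in range(n)) / n for b in range(d)]
         for a in range(d)]
    comps = []
    for _ in range(k):
        v = [1.0] * d
        for _ in range(iters):  # power iteration toward top eigenvector
            w = [sum(C[a][b] * v[b] for b in range(d)) for a in range(d)]
            s = sum(x * x for x in w) ** 0.5
            v = [x / s for x in w]
        lam = sum(v[a] * sum(C[a][b] * v[b] for b in range(d)) for a in range(d))
        comps.append(v)
        for a in range(d):  # deflate to expose the next component
            for b in range(d):
                C[a][b] -= lam * v[a] * v[b]
    return [[sum(x[j] * c[j] for j in range(d)) for c in comps] for x in X]
```

In practice one would use a library PCA on the pooled features; the point is only that reducing feature dimension before the DP classifier head is a small, self-contained step.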
https://arxiv.org/abs/2405.08920
The Prostate Imaging Reporting and Data System (PI-RADS) is pivotal in the diagnosis of clinically significant prostate cancer through MRI imaging. Current deep learning-based PI-RADS scoring methods often lack the incorporation of essential PI-RADS clinical guidelines (PICG) utilized by radiologists, potentially compromising scoring accuracy. This paper introduces a novel approach that adapts a multi-modal large language model (MLLM) to incorporate PICG into PI-RADS scoring without additional annotations and network parameters. We present a two-stage fine-tuning process aimed at adapting MLLMs originally trained on natural images to the MRI data domain while effectively integrating the PICG. In the first stage, we develop a domain adapter layer specifically tailored for processing 3D MRI image inputs and design the MLLM instructions to differentiate MRI modalities effectively. In the second stage, we translate PICG into guiding instructions for the model to generate PICG-guided image features. Through feature distillation, we align scoring network features with the PICG-guided image feature, enabling the scoring network to effectively incorporate the PICG information. We develop our model on a public dataset and evaluate it in a real-world challenging in-house dataset. Experimental results demonstrate that our approach improves the performance of current scoring networks.
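The feature-distillation alignment can be sketched as a simple mean-squared error between scoring-network features and the PICG-guided image features; the paper's exact distillation objective may differ:

```python
def distillation_loss(student_feats, teacher_feats):
    """MSE feature distillation: pull scoring-network features (student)
    toward the PICG-guided image features (teacher). A minimal sketch,
    assuming equally shaped per-sample feature vectors."""
    n = sum(len(s) for s in student_feats)
    return sum((a - b) ** 2
               for s, t in zip(student_feats, teacher_feats)
               for a, b in zip(s, t)) / n
```

During the second fine-tuning stage this term would be added to the scoring loss, so that gradients push the scoring network toward features that already encode the guideline information.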
https://arxiv.org/abs/2405.08786
Deep learning has enabled breakthroughs in automated diagnosis from medical imaging, with many successful applications in ophthalmology. However, standard medical image classification approaches only assess disease presence at the time of acquisition, neglecting the common clinical setting of longitudinal imaging. For slow, progressive eye diseases like age-related macular degeneration (AMD) and primary open-angle glaucoma (POAG), patients undergo repeated imaging over time to track disease progression and forecasting the future risk of developing disease is critical to properly plan treatment. Our proposed Longitudinal Transformer for Survival Analysis (LTSA) enables dynamic disease prognosis from longitudinal medical imaging, modeling the time to disease from sequences of fundus photography images captured over long, irregular time periods. Using longitudinal imaging data from the Age-Related Eye Disease Study (AREDS) and Ocular Hypertension Treatment Study (OHTS), LTSA significantly outperformed a single-image baseline in 19/20 head-to-head comparisons on late AMD prognosis and 18/20 comparisons on POAG prognosis. A temporal attention analysis also suggested that, while the most recent image is typically the most influential, prior imaging still provides additional prognostic value.
https://arxiv.org/abs/2405.08780
Numerous studies have revealed that deep learning-based medical image classification models may exhibit bias towards specific demographic attributes, such as race, gender, and age. Existing bias mitigation methods often achieve high level of fairness at the cost of significant accuracy degradation. In response to this challenge, we propose an innovative and adaptable Soft Nearest Neighbor Loss-based channel pruning framework, which achieves fairness through channel pruning. Traditionally, channel pruning is utilized to accelerate neural network inference. However, our work demonstrates that pruning can also be a potent tool for achieving fairness. Our key insight is that different channels in a layer contribute differently to the accuracy of different groups. By selectively pruning critical channels that lead to the accuracy difference between the privileged and unprivileged groups, we can effectively improve fairness without sacrificing accuracy significantly. Experiments conducted on two skin lesion diagnosis datasets across multiple sensitive attributes validate the effectiveness of our method in achieving state-of-the-art trade-off between accuracy and fairness. Our code is available at this https URL.
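The key insight (prune the channels that most widen the accuracy gap between groups) can be illustrated with a simplified selection criterion; the actual framework scores channels via a Soft Nearest Neighbor Loss, which this sketch replaces with a direct per-channel accuracy-gap score:

```python
def select_channels_to_prune(per_channel_group_acc, budget):
    """Rank channels by the privileged-vs-unprivileged accuracy gap they
    induce and return the `budget` channels whose removal most shrinks
    the gap. `per_channel_group_acc` maps channel index to a tuple
    (acc_privileged, acc_unprivileged) attributed to that channel;
    how those attributions are computed is left abstract here."""
    gaps = {c: abs(p - u) for c, (p, u) in per_channel_group_acc.items()}
    return sorted(gaps, key=gaps.get, reverse=True)[:budget]
```

The returned indices would then be zeroed out (pruned) in the corresponding layer before brief re-fine-tuning, trading a small amount of overall accuracy for a smaller inter-group gap.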
https://arxiv.org/abs/2405.08681
The increasing complexity of Artificial Intelligence models poses challenges to interpretability, particularly in the healthcare sector. This study investigates the relationship between deep learning model complexity and Explainable AI (XAI) efficacy, utilizing four ResNet architectures (ResNet-18, 34, 50, 101). Through methodical experimentation on 4,369 lung X-ray images of COVID-19-infected and healthy patients, the research evaluates models' classification performance and the relevance of the corresponding XAI explanations with respect to the ground-truth disease masks. Results indicate that increased model complexity is associated with a decrease in classification accuracy and AUC-ROC scores (ResNet-18: 98.4%, 0.997; ResNet-101: 95.9%, 0.988). Notably, in eleven out of twelve statistical tests performed, no statistically significant differences occurred between the XAI quantitative metrics - Relevance Rank Accuracy and the proposed Positive Attribution Ratio - across trained models. These results suggest that increased model complexity does not consistently lead to higher performance or more relevant explanations of models' decision-making processes.
https://arxiv.org/abs/2405.08658
Self-supervised learning (SSL) is an approach to extract useful feature representations from unlabeled data and enable fine-tuning on downstream tasks with limited labeled examples. Self-pretraining is an SSL approach that uses the curated task dataset both for pretraining the networks and for fine-tuning them. The availability of large, diverse, and uncurated public medical image sets provides the opportunity to apply SSL in the "wild" and potentially extract features robust to imaging variations. However, the benefit of wild- vs. self-pretraining has not been studied for medical image analysis. In this paper, we compare the robustness of wild- versus self-pretrained transformer (vision transformer [ViT] and hierarchical shifted window [Swin]) models to computed tomography (CT) imaging differences for non-small cell lung cancer (NSCLC) segmentation. Wild-pretrained Swin models outperformed self-pretrained Swin for the various imaging acquisitions. ViT resulted in similar accuracy for both wild- and self-pretrained models. The masked image prediction pretext task, which forces networks to learn local structure, resulted in higher accuracy than the contrastive task that models global image information. Wild-pretrained models resulted in higher feature reuse at the lower-level layers and feature differentiation close to the output layer after fine-tuning. Hence, we conclude: wild-pretrained networks were more robust to the analyzed CT imaging differences for lung tumor segmentation than self-pretrained methods, and the Swin architecture benefited from such pretraining more than ViT.
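The masked image prediction pretext task can be sketched as follows: hide a random subset of patches and score a reconstruction function only on the hidden ones. This is a generic masked-image-modelling sketch, not the exact pretraining recipe used in the study:

```python
import random

def masked_patch_loss(patches, reconstruct, mask_ratio=0.75, seed=0):
    """Masked-image-modelling pretext sketch: mask `mask_ratio` of the
    patches, ask `reconstruct(visible, masked_indices)` to predict the
    hidden patches, and return MSE on the masked patches only."""
    rng = random.Random(seed)
    n = len(patches)
    masked = rng.sample(range(n), max(1, int(n * mask_ratio)))
    visible = [p for i, p in enumerate(patches) if i not in masked]
    pred = reconstruct(visible, masked)  # model predicts the hidden patches
    loss = sum((a - b) ** 2
               for i, p in zip(masked, pred)
               for a, b in zip(patches[i], p))
    return loss / sum(len(patches[i]) for i in masked)
```

Because the loss is computed only where content was hidden, the network must infer local structure from surrounding patches, which is the property credited above for the higher segmentation accuracy.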
https://arxiv.org/abs/2405.08657
With recent advances in deep learning, numerous algorithms have been developed to enhance video quality, reduce visual artefacts and improve perceptual quality. However, little research has been reported on the quality assessment of enhanced content - the evaluation of enhancement methods is often based on quality metrics that were designed for compression applications. In this paper, we propose a novel blind deep video quality assessment (VQA) method specifically for enhanced video content. It employs a new Recurrent Memory Transformer (RMT) based network architecture to obtain video quality representations, which is optimised through a novel content-quality-aware contrastive learning strategy based on a new database containing 13K training patches with enhanced content. The extracted quality representations are then combined through linear regression to generate video-level quality indices. The proposed method, RMT-BVQA, has been evaluated on the VDPVE (VQA Dataset for Perceptual Video Enhancement) database through a five-fold cross validation. The results show its superior correlation performance when compared to ten existing no-reference quality metrics.
https://arxiv.org/abs/2405.08621
Image stitching aims to construct a wide field of view with high spatial resolution, which cannot be achieved in a single exposure. Typically, conventional image stitching techniques, other than deep learning, require complex computation and are thus computationally expensive, especially for stitching large raw images. In this study, inspired by the multiscale feature of fluid turbulence, we developed a fast feature point detection algorithm named local-peak scale-invariant feature transform (LP-SIFT), based on multiscale local peaks and the scale-invariant feature transform method. By combining LP-SIFT and RANSAC in image stitching, the stitching speed can be improved by orders of magnitude compared with the original SIFT method. Nine large images (over 2600*1600 pixels), arranged randomly without prior knowledge, can be stitched within 158.94 s. The algorithm is highly practical for applications requiring a wide field of view in diverse application scenes, e.g., terrain mapping, biological analysis, and even criminal investigation.
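The first step of an LP-SIFT-style detector, finding local intensity peaks, can be sketched at a single scale; the actual method operates over multiple scales and feeds the peaks into SIFT-style description and RANSAC matching:

```python
def local_peaks(img, window=1):
    """Detect local intensity peaks: pixels strictly greater than every
    neighbour in a (2*window+1)^2 patch. A single-scale sketch of the
    local-peak stage of an LP-SIFT-like keypoint detector; `img` is a
    2D list of intensities."""
    h, w = len(img), len(img[0])
    peaks = []
    for y in range(window, h - window):
        for x in range(window, w - window):
            v = img[y][x]
            if all(v > img[y + dy][x + dx]
                   for dy in range(-window, window + 1)
                   for dx in range(-window, window + 1)
                   if (dy, dx) != (0, 0)):
                peaks.append((y, x))
    return peaks
```

Restricting descriptor computation to such peaks, rather than the dense scale-space extrema of standard SIFT, is what makes the detection stage cheap on large raw images.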
https://arxiv.org/abs/2405.08578
Deep learning methods, especially Convolutional Neural Networks (CNN) and Vision Transformer (ViT), are frequently employed to perform semantic segmentation of high-resolution remotely sensed images. However, CNNs are constrained by their restricted receptive fields, while ViTs face challenges due to their quadratic complexity. Recently, the Mamba model, featuring linear complexity and a global receptive field, has gained extensive attention for vision tasks. In such tasks, images need to be serialized to form sequences compatible with the Mamba model. Numerous research efforts have explored scanning strategies to serialize images, aiming to enhance the Mamba model's understanding of images. However, the effectiveness of these scanning strategies remains uncertain. In this research, we conduct a comprehensive experimental investigation on the impact of mainstream scanning directions and their combinations on semantic segmentation of remotely sensed images. Through extensive experiments on the LoveDA, ISPRS Potsdam, and ISPRS Vaihingen datasets, we demonstrate that no single scanning strategy outperforms others, regardless of their complexity or the number of scanning directions involved. A simple, single scanning direction is deemed sufficient for semantic segmentation of high-resolution remotely sensed images. Relevant directions for future research are also recommended.
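The scanning strategies under comparison serialize a 2D patch grid into a 1D sequence for the state-space model; three common directions can be sketched as:

```python
def scan_serialize(grid, strategy="row"):
    """Serialize a 2D patch grid into a 1D sequence for a Mamba-style
    model. Three illustrative scanning directions; combinations used in
    the literature concatenate or merge several such sequences."""
    h, w = len(grid), len(grid[0])
    if strategy == "row":        # row-major, left to right
        return [grid[y][x] for y in range(h) for x in range(w)]
    if strategy == "col":        # column-major, top to bottom
        return [grid[y][x] for x in range(w) for y in range(h)]
    if strategy == "snake":      # row-major, alternating direction
        out = []
        for y in range(h):
            out.extend(grid[y] if y % 2 == 0 else grid[y][::-1])
        return out
    raise ValueError(strategy)
```

The finding above is that, for high-resolution remote-sensing segmentation, a single direction such as `"row"` performs on par with more elaborate multi-direction schemes.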
https://arxiv.org/abs/2405.08493
In recent years, deep learning has greatly streamlined the process of generating realistic fake face images. Aware of the dangers, researchers have developed various tools to spot these counterfeits. Yet none asked the fundamental question: which digital manipulations make a real photographic face image fake, while others do not? In this paper, we put face forgery in a semantic context and define that computational methods that alter semantic face attributes to exceed human discrimination thresholds are sources of face forgery. Guided by our new definition, we construct a large face forgery image dataset, where each image is associated with a set of labels organized in a hierarchical graph. Our dataset enables two new testing protocols to probe the generalization of face forgery detectors. Moreover, we propose a semantics-oriented face forgery detection method that captures label relations and prioritizes the primary task (i.e., real or fake face detection). We show that the proposed dataset successfully exposes the weaknesses of current detectors as the test set and consistently improves their generalizability as the training set. Additionally, we demonstrate the superiority of our semantics-oriented method over traditional binary and multi-class classification-based detectors.
https://arxiv.org/abs/2405.08487
Perivascular spaces (PVSs) form a central component of the brain's waste clearance system, the glymphatic system. These structures are visible on MRI images, and their morphology is associated with aging and neurological disease. Manual quantification of PVSs is time-consuming and subjective. Numerous deep learning methods for PVS segmentation have been developed; however, the majority have been developed and evaluated on homogeneous datasets and high-resolution scans, perhaps limiting their applicability to the wide range of image qualities acquired in clinical and research settings. In this work we train nnUNet, a top-performing biomedical image segmentation algorithm, on a heterogeneous training sample of manually segmented MRI images of a range of different qualities and resolutions from 6 different datasets. These are compared to publicly available deep learning methods for 3D segmentation of PVSs. The resulting model, PINGU (Perivascular space Identification Nnunet for Generalised Usage), achieved voxel- and cluster-level Dice scores of 0.50 (SD=0.15) and 0.63 (0.17) in the white matter (WM), and 0.54 (0.11) and 0.66 (0.17) in the basal ganglia (BG). Performance on data from unseen sites was substantially lower for both PINGU (0.20-0.38 (WM, voxel), 0.29-0.58 (WM, cluster), 0.22-0.36 (BG, voxel), 0.46-0.60 (BG, cluster)) and the publicly available algorithms (0.18-0.30 (WM, voxel), 0.29-0.38 (WM, cluster), 0.10-0.20 (BG, voxel), 0.15-0.37 (BG, cluster)), but PINGU strongly outperformed the publicly available algorithms, particularly in the BG. Finally, training PINGU on manual segmentations from a single site with homogeneous scan properties gave marginally lower performance on internal cross-validation, but in some cases gave higher performance on external validation. PINGU stands out as a broad-use PVS segmentation tool, with particular strength in the BG, an area of PVS related to vascular disease and pathology.
https://arxiv.org/abs/2405.08337