Indian folk paintings present a rich mosaic of symbols, colors, textures, and stories, making them an invaluable repository of cultural legacy. This paper presents a novel approach to classifying these paintings into distinct art forms and tagging them with their unique salient features. A custom dataset named FolkTalent, comprising 2279 digital images of paintings across 12 different forms, has been prepared from websites that are direct outlets of Indian folk paintings. Tags covering a wide range of attributes such as color, theme, artistic style, and patterns are generated using GPT-4 and verified by an expert for each painting. Classification is performed with a RandomForest ensemble over fine-tuned Convolutional Neural Network (CNN) models, achieving an accuracy of 91.83%. Tagging is accomplished via prominent fine-tuned CNN backbones with a custom classifier attached on top to perform multi-label image classification. The generated tags offer deeper insight into each painting, enabling an enhanced search experience based on theme and visual attributes. The proposed hybrid model sets a new benchmark in folk painting classification and tagging, contributing significantly to cataloging India's folk-art heritage.
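A minimal sketch of the two-stage classification pipeline described above, assuming a generic torchvision ResNet-50 as a stand-in for the paper's fine-tuned CNNs; `train_images`, `train_labels`, and `test_images` are placeholders for prepared FolkTalent tensors:

```python
import torch
import torchvision.models as models
from sklearn.ensemble import RandomForestClassifier

# Feature extractor: a fine-tuned CNN with its classification head removed.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()  # expose the 2048-d penultimate features
backbone.eval()

@torch.no_grad()
def extract_features(images: torch.Tensor) -> torch.Tensor:
    """images: (N, 3, H, W) normalized batch -> (N, 2048) feature matrix."""
    return backbone(images)

# Random-forest ensemble over the deep features (12 folk-art classes).
clf = RandomForestClassifier(n_estimators=500, random_state=0)
clf.fit(extract_features(train_images).numpy(), train_labels)  # labels in [0, 12)
pred = clf.predict(extract_features(test_images).numpy())
```

The same backbone can carry the multi-label tagging head by replacing `fc` with a sigmoid-activated linear layer over the tag vocabulary.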
https://arxiv.org/abs/2405.08776
Navigating the complex landscape of news articles involves understanding the various actors or entities involved, referred to as news stakeholders. These stakeholders, ranging from policymakers to opposition figures, citizens, and more, play pivotal roles in shaping news narratives. Recognizing their stakeholder types, reflecting their roles, political alignments, social standing, and more, is paramount for a nuanced comprehension of news content. Despite existing works focusing on salient entity extraction, coverage variations, and political affiliations through social media data, the automated detection of stakeholder roles within news content remains an underexplored domain. In this paper, we bridge this gap by introducing an effective approach to classify stakeholder types in news articles. Our method involves transforming the stakeholder classification problem into a natural language inference task, utilizing contextual information from news articles and external knowledge to enhance the accuracy of stakeholder type detection. Moreover, our proposed model showcases efficacy in zero-shot settings, further extending its applicability to diverse news contexts.
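A hypothetical illustration of casting stakeholder typing as natural language inference, using a stock MNLI model via the Hugging Face zero-shot pipeline; the paper's actual model, label set, and hypothesis wording may differ:

```python
from transformers import pipeline

# A stock MNLI model stands in for the paper's entailment backbone.
nli = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

context = ("The finance minister defended the budget in parliament, while the "
           "opposition leader called the allocations inadequate.")
stakeholder = "the opposition leader"
types = ["policymaker", "opposition figure", "citizen", "expert"]

result = nli(
    context,
    candidate_labels=types,
    hypothesis_template=f"In this article, {stakeholder} is described as a {{}}.",
)
print(result["labels"][0])  # highest-scoring stakeholder type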
https://arxiv.org/abs/2405.08751
Portrait images typically consist of a salient person against diverse backgrounds. With the development of mobile devices and image processing techniques, users can conveniently capture portrait images anytime and anywhere. However, the quality of these portraits may suffer from degradation caused by unfavorable environmental conditions, subpar photography techniques, and inferior capturing devices. In this paper, we introduce a dual-branch network for portrait image quality assessment (PIQA), which can effectively address how the salient person and the background of a portrait image influence its visual quality. Specifically, we utilize two backbone networks (i.e., Swin Transformer-B) to extract quality-aware features from the entire portrait image and the facial image cropped from it. To enhance the quality-aware feature representation of the backbones, we pre-train them on the large-scale video quality assessment dataset LSVQ and the large-scale facial image quality assessment dataset GFIQA. Additionally, we leverage LIQE, an image scene classification and quality assessment model, to capture quality-aware and scene-specific features as auxiliary features. Finally, we concatenate these features and regress them into quality scores via a multi-layer perceptron (MLP). We employ the fidelity loss to train the model in a learning-to-rank manner, mitigating inconsistencies in quality scores in the portrait image quality assessment dataset PIQ. Experimental results demonstrate that the proposed model achieves superior performance on the PIQ dataset, validating its effectiveness. The code is available at this https URL.
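A sketch of the pairwise fidelity loss used for the learning-to-rank training, written under the common Thurstone-style formulation; that the paper uses exactly this form is an assumption:

```python
import torch

def fidelity_loss(s_i: torch.Tensor, s_j: torch.Tensor, p: torch.Tensor,
                  eps: float = 1e-8) -> torch.Tensor:
    """s_i, s_j: predicted quality scores for an image pair;
    p: ground-truth probability that image i is better than image j."""
    # Thurstone model: predicted preference from the score difference,
    # Phi((s_i - s_j) / sqrt(2)) written via erf.
    p_hat = 0.5 * (1.0 + torch.erf((s_i - s_j) / 2.0))
    return (1.0 - torch.sqrt(p * p_hat + eps)
                - torch.sqrt((1.0 - p) * (1.0 - p_hat) + eps)).mean()

# e.g. loss = fidelity_loss(model(img_a), model(img_b), p=torch.ones(batch))
```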
https://arxiv.org/abs/2405.08555
Recent advances in artificial intelligence for education leverage generative large language models, including using them to predict open-ended student responses rather than their correctness only. However, the black-box nature of these models limits the interpretability of the learned student knowledge representations. In this paper, we conduct a first exploration into interpreting latent student knowledge representations by presenting InfoOIRT, an Information-regularized Open-ended Item Response Theory model, which encourages the latent student knowledge states to be interpretable while being able to generate student-written code for open-ended programming questions. InfoOIRT maximizes the mutual information between a fixed subset of latent knowledge states, enforced with simple prior distributions, and the generated student code, which encourages the model to learn disentangled representations of salient syntactic and semantic code features, including syntactic styles, mastery of programming skills, and code structures. Through experiments on a real-world programming education dataset, we show that InfoOIRT can both accurately generate student code and lead to interpretable student knowledge representations.
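A schematic sketch of the mutual-information regularizer: an auxiliary head tries to recover the fixed interpretable subset of latent states from a representation of the generated code, a standard variational MI lower bound in the style of InfoGAN. All module names and dimensions here are illustrative assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

hidden, interp_dim = 256, 8   # interp_dim: the interpretable latent subset

posterior_head = nn.Sequential(   # q(c | features of the generated code)
    nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, interp_dim)
)

def info_loss(code_features: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
    """code_features: (B, hidden) summary of the generated student code;
    c: (B, interp_dim) interpretable latent states drawn from a simple prior.
    With a unit-variance Gaussian posterior, MSE is the negative log-likelihood,
    so minimizing it maximizes a variational lower bound on the MI."""
    return ((posterior_head(code_features) - c) ** 2).mean()
```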
https://arxiv.org/abs/2405.08213
Humans are remarkable in their ability to navigate without metric information. We can read abstract 2D maps, such as floor-plans or hand-drawn sketches, and use them to navigate in unseen, rich 3D environments, without requiring prior traversals to map out these scenes in detail. We posit that this is enabled by the ability to represent the environment abstractly as interconnected navigational behaviours, e.g., "follow the corridor" or "turn right", while avoiding detailed, accurate spatial information at the metric level. We introduce the Scene Action Map (SAM), a behavioural topological graph, and propose a learnable map-reading method, which parses a variety of 2D maps into SAMs. Map-reading extracts salient information about navigational behaviours from the overlooked wealth of pre-existing, abstract and inaccurate maps, ranging from floor-plans to sketches. We evaluate the performance of SAMs for navigation by building and deploying a behavioural navigation stack on a quadrupedal robot. Videos and more information are available at: this https URL.
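An illustrative (hypothetical) Scene Action Map as a behavioural topological graph: nodes are decision points and edges carry navigational behaviours rather than metric coordinates, so planning reduces to a path query:

```python
import networkx as nx

sam = nx.DiGraph()
sam.add_edge("lobby", "junction_1", behaviour="follow the corridor")
sam.add_edge("junction_1", "lab_door", behaviour="turn right")
sam.add_edge("junction_1", "stairwell", behaviour="turn left")

# Navigation reduces to a path query; the robot executes each edge's behaviour.
path = nx.shortest_path(sam, "lobby", "lab_door")
plan = [sam.edges[u, v]["behaviour"] for u, v in zip(path, path[1:])]
print(plan)  # ['follow the corridor', 'turn right']
```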
https://arxiv.org/abs/2405.07948
Depth images and thermal images contain spatial geometry information and surface temperature information, which can act as complementary information for the RGB modality. However, the quality of depth and thermal images is often unreliable in some challenging scenarios, which results in performance degradation for two-modal salient object detection (SOD). Meanwhile, some researchers have turned to the triple-modal SOD task, attempting to exploit the complementarity of the RGB image, the depth image, and the thermal image. However, existing triple-modal SOD methods fail to perceive the quality of depth maps and thermal images, which leads to performance degradation when dealing with scenes with low-quality depth and thermal images. Therefore, we propose a quality-aware selective fusion network (QSF-Net) to conduct VDT salient object detection, which contains three subnets: an initial feature extraction subnet, a quality-aware region selection subnet, and a region-guided selective fusion subnet. First, in addition to extracting features, the initial feature extraction subnet generates a preliminary prediction map from each modality via a shrinkage pyramid architecture. Then, we design a weakly-supervised quality-aware region selection subnet to generate quality-aware maps. Concretely, we first find the high-quality and low-quality regions by using the preliminary predictions, which constitute the pseudo labels used to train this subnet. Finally, the region-guided selective fusion subnet purifies the initial features under the guidance of the quality-aware maps, then fuses the triple-modal features and refines the edge details of the prediction maps through the intra-modality and inter-modality attention (IIA) module and the edge refinement (ER) module, respectively. Extensive experiments are performed on the VDT-2048 dataset.
https://arxiv.org/abs/2405.07655
An efficient and effective decoding mechanism is crucial in medical image segmentation, especially in scenarios with limited computational resources. However, these decoding mechanisms usually come with high computational costs. To address this concern, we introduce EMCAD, a new efficient multi-scale convolutional attention decoder, designed to optimize both performance and computational efficiency. EMCAD leverages a unique multi-scale depth-wise convolution block, significantly enhancing feature maps through multi-scale convolutions. EMCAD also employs channel, spatial, and grouped (large-kernel) gated attention mechanisms, which are highly effective at capturing intricate spatial relationships while focusing on salient regions. By employing group and depth-wise convolution, EMCAD is very efficient and scales well (e.g., only 1.91M parameters and 0.381G FLOPs are needed when using a standard encoder). Our rigorous evaluations across 12 datasets that belong to six medical image segmentation tasks reveal that EMCAD achieves state-of-the-art (SOTA) performance with 79.4% and 80.3% reduction in #Params and #FLOPs, respectively. Moreover, EMCAD's adaptability to different encoders and versatility across segmentation tasks further establish EMCAD as a promising tool, advancing the field towards more efficient and accurate medical image analysis. Our implementation is available at this https URL.
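A minimal sketch of a multi-scale depth-wise convolution block in the spirit of EMCAD; the kernel sizes, residual summation, and normalization here are illustrative assumptions rather than the paper's exact block:

```python
import torch
import torch.nn as nn

class MultiScaleDWConv(nn.Module):
    def __init__(self, channels: int, kernel_sizes=(1, 3, 5)):
        super().__init__()
        # groups=channels makes each conv depth-wise (one filter per channel).
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernel_sizes
        )
        self.norm = nn.BatchNorm2d(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sum the depth-wise responses at every scale, then normalize.
        out = sum(branch(x) for branch in self.branches)
        return self.act(self.norm(out + x))  # residual connection

feats = torch.randn(2, 64, 32, 32)
print(MultiScaleDWConv(64)(feats).shape)  # torch.Size([2, 64, 32, 32])
```

Depth-wise grouping is what keeps the parameter and FLOP counts low relative to dense convolutions at the same kernel sizes.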
https://arxiv.org/abs/2405.06880
As machine learning (ML) gains widespread adoption, practitioners are increasingly seeking means to quantify and control the risk these systems incur. This challenge is especially salient when ML systems have autonomy to collect their own data, such as in black-box optimization and active learning, where their actions induce sequential feedback-loop shifts in the data distribution. Conformal prediction has emerged as a promising approach to uncertainty and risk quantification, but existing variants either fail to accommodate sequences of data-dependent shifts, or do not fully exploit the fact that agent-induced shift is under our control. In this work we prove that conformal prediction can theoretically be extended to any joint data distribution, not just exchangeable or quasi-exchangeable ones, although it is exceedingly impractical to compute in the most general case. For practical applications, we outline a procedure for deriving specific conformal algorithms for any data distribution, and we use this procedure to derive tractable algorithms for a series of agent-induced covariate shifts. We evaluate the proposed algorithms empirically on synthetic black-box optimization and active learning tasks.
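For context, a sketch of split conformal prediction under a single known covariate shift, following the standard likelihood-ratio-weighting recipe; the paper's agent-induced-shift algorithms generalize this idea, and this snippet is not their method:

```python
import numpy as np

def weighted_conformal_quantile(scores_cal: np.ndarray, w_cal: np.ndarray,
                                w_test: float, alpha: float = 0.1) -> float:
    """scores_cal: nonconformity scores of the calibration points;
    w_cal: likelihood ratios dP_test/dP_cal at those points;
    w_test: the likelihood ratio at the test covariate."""
    order = np.argsort(scores_cal)
    scores, w = scores_cal[order], w_cal[order]
    cdf = np.cumsum(w) / (w.sum() + w_test)  # test point holds the remaining mass
    idx = np.searchsorted(cdf, 1.0 - alpha)
    # If the level is never reached, the exact method returns an unbounded set.
    return np.inf if idx >= len(scores) else float(scores[idx])

# Prediction set: all labels y with score(x_test, y) <= the returned quantile.
```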
https://arxiv.org/abs/2405.06627
Advanced image data augmentation techniques play a pivotal role in enhancing the training of models for diverse computer vision tasks. Notably, SalfMix and KeepAugment have emerged as popular strategies, showcasing their efficacy in boosting model performance. However, SalfMix's reliance on duplicating salient features poses a risk of overfitting, potentially compromising the model's generalization capabilities. Conversely, KeepAugment, which selectively preserves salient regions and augments non-salient ones, introduces a domain shift that hinders the exchange of crucial contextual information, impeding overall model understanding. In response to these challenges, we introduce KeepOriginalAugment, a novel data augmentation approach. This method intelligently incorporates the most salient region within the non-salient area, allowing augmentation to be applied to either region. Striking a balance between data diversity and information preservation, KeepOriginalAugment enables models to leverage both diverse salient and non-salient regions, leading to enhanced performance. We explore three strategies for determining the placement of the salient region (minimum, maximum, or random) and investigate swapping perspective strategies to decide which part (salient or non-salient) undergoes augmentation. Our experimental evaluations, conducted on classification datasets such as CIFAR-10, CIFAR-100, and TinyImageNet, demonstrate the superior performance of KeepOriginalAugment compared to existing state-of-the-art techniques.
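A schematic sketch of the core idea: keep the most salient crop in its original form, paste it over the least salient location, and augment the rest. The saliency source, window scoring, and choice of flip augmentation are simplified assumptions:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def keep_original_augment(img: np.ndarray, sal: np.ndarray, crop: int = 56) -> np.ndarray:
    """img: (H, W, 3) image; sal: (H, W) saliency map in [0, 1].
    Keeps the most salient crop unchanged, pastes it over the least salient
    location, and augments (here: horizontally flips) everything else."""
    score = uniform_filter(sal, size=crop)                  # mean saliency per window
    sy, sx = np.unravel_index(score.argmax(), score.shape)  # most salient centre
    ty, tx = np.unravel_index(score.argmin(), score.shape)  # least salient centre

    def corner(y, x):  # clamp a window's top-left corner inside the image
        return (int(np.clip(y - crop // 2, 0, sal.shape[0] - crop)),
                int(np.clip(x - crop // 2, 0, sal.shape[1] - crop)))

    (sy, sx), (ty, tx) = corner(sy, sx), corner(ty, tx)
    out = np.flip(img, axis=1).copy()                       # augment the whole frame
    out[ty:ty + crop, tx:tx + crop] = img[sy:sy + crop, sx:sx + crop]
    return out
```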
https://arxiv.org/abs/2405.06354
The Human Vision System's (HVS) intrinsic capability to perceive depth of field and extract salient information leads a pilot to prefer a manual landing over an autoland approach. However, harsh weather creates visibility hindrances, and a pilot must have a clear view of runway elements before the minimum decision altitude. A vision-based system tailored to localize runway elements and assist the pilot in manual landing is likewise affected, especially during crosswinds, due to the projective distortion of aircraft camera images. To combat this, we propose to integrate a prompt-based climatic diffusion network with a weather distillation model using a novel diffusion-distillation loss. Precisely, the diffusion model synthesizes climatic-conditioned landing images, and the weather distillation model learns the inverse mapping by clearing those visual degradations. Then, to tackle the crosswind landing scenario, a novel Regularized Spatial Transformer Network (RuSTaN) learns to accurately calibrate for projective distortion using self-supervised learning, which minimizes the localization error of the downstream runway object detector. Finally, we simulated a clear-day landing scenario at the busiest airport globally to curate an image-based Aircraft Landing Dataset (AIRLAD) and experimentally validated our contributions using this dataset to benchmark performance.
https://arxiv.org/abs/2405.05574
We present AFEN (Audio Feature Ensemble Learning), a model that leverages Convolutional Neural Networks (CNN) and XGBoost in an ensemble learning fashion to perform state-of-the-art audio classification for a range of respiratory diseases. We use a meticulously selected mix of audio features which provide the salient attributes of the data and allow for accurate classification. The extracted features are then used as input to two separate classifiers: 1) a multi-feature CNN classifier and 2) an XGBoost classifier. The outputs of the two models are then fused with the use of soft voting. Thus, by exploiting ensemble learning, we achieve increased robustness and accuracy. We evaluate the performance of the model on a database of 920 respiratory sounds, which undergoes data augmentation to increase the diversity of the data and the generalizability of the model. We empirically verify that AFEN sets a new state-of-the-art using Precision and Recall as metrics, while decreasing training time by 60%.
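A minimal sketch of the soft-voting fusion step; the class count, equal weighting, and model handles are placeholders rather than the paper's exact configuration:

```python
import numpy as np

def soft_vote(p_cnn: np.ndarray, p_xgb: np.ndarray, w_cnn: float = 0.5) -> np.ndarray:
    """p_cnn, p_xgb: (N, C) class-probability matrices from the two models.
    Returns the predicted class per sample after averaging probabilities."""
    p = w_cnn * p_cnn + (1.0 - w_cnn) * p_xgb
    return p.argmax(axis=1)

# e.g. preds = soft_vote(cnn_model.predict(feats), xgb_model.predict_proba(feats))
```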
https://arxiv.org/abs/2405.05467
Over the past few years, deep neural models have made considerable advances in image quality assessment (IQA). However, the underlying reasons for their success remain unclear, owing to the complex nature of deep neural networks. IQA aims to describe how the human visual system (HVS) works and to create its efficient approximations. The Saliency Prediction task, on the other hand, aims to emulate the HVS by determining areas of visual interest. Thus, we believe that saliency plays a crucial role in human perception. In this work, we conduct an empirical study that reveals the relation between the IQA and Saliency Prediction tasks, demonstrating that the former incorporates knowledge of the latter. Moreover, we introduce a novel SACID dataset of saliency-aware compressed images and conduct a large-scale comparison of classic and neural-based IQA methods. All supplementary code and data will be available at the time of publication.
https://arxiv.org/abs/2405.04997
Medical Image Synthesis (MIS) plays an important role in the intelligent medical field, greatly saving the economic and time costs of medical diagnosis. However, due to the complexity of medical images and the similar characteristics of different tissue cells, existing methods face great challenges in meeting biological consistency. To this end, we propose the Hybrid Augmented Generative Adversarial Network (HAGAN) to maintain the authenticity of structural texture and tissue cells. HAGAN contains an Attention Mixed (AttnMix) Generator, a Hierarchical Discriminator, and a Reverse Skip Connection between the Discriminator and Generator. The AttnMix consistency differentiable regularization encourages the perception of structural and textural variations between real and fake images, which improves the pathological integrity of synthetic images and the accuracy of features in local areas. The Hierarchical Discriminator introduces pixel-by-pixel discriminant feedback to the generator, enhancing the saliency and discriminability of global and local details simultaneously. The Reverse Skip Connection further improves the accuracy of fine details by fusing real and synthetic distribution features. Our experimental evaluations on three datasets of different scales, i.e., COVID-CT, ACDC and BraTS2018, demonstrate that HAGAN outperforms existing methods and achieves state-of-the-art performance at both high and low resolutions.
https://arxiv.org/abs/2405.04902
Lung and colon cancer are serious worldwide health challenges that require early and precise identification to reduce mortality risks. However, diagnosis, which is mostly dependent on histopathologists' competence, presents difficulties and hazards when expertise is insufficient. While diagnostic methods like imaging and blood markers contribute to early detection, histopathology remains the gold standard, although it is time-consuming and vulnerable to inter-observer mistakes. Limited access to high-end technology further restricts patients' ability to receive immediate medical care and diagnosis. Recent advances in deep learning have generated interest in its application to medical imaging analysis, specifically the use of histopathological images to diagnose lung and colon cancer. The goal of this investigation is to use and adapt existing pre-trained CNN-based models, such as Xception, DenseNet201, ResNet101, InceptionV3, DenseNet121, DenseNet169, ResNet152, and InceptionResNetV2, to enhance classification through better augmentation strategies. The results show tremendous progress, with all eight models reaching impressive accuracies ranging from 97% to 99%. Furthermore, attention visualization techniques such as GradCAM, GradCAM++, ScoreCAM, Faster Score-CAM, and LayerCAM, as well as Vanilla Saliency and SmoothGrad, are used to provide insights into the models' classification decisions, thereby improving the interpretability and understanding of malignant and benign image classification.
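A compact sketch of GradCAM, one of the visualization methods listed above, implemented with forward/backward hooks against a stock torchvision DenseNet; the layer choice and model are illustrative stand-ins for the paper's fine-tuned networks:

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

model = models.densenet121(weights=models.DenseNet121_Weights.DEFAULT).eval()
acts, grads = {}, {}
layer = model.features[-1]  # final feature-map stage of the backbone

layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

def grad_cam(image: torch.Tensor, class_idx: int) -> torch.Tensor:
    """image: (1, 3, 224, 224) -> heatmap in [0, 1] at the input resolution."""
    model.zero_grad()
    model(image)[0, class_idx].backward()
    weights = grads["g"].mean(dim=(2, 3), keepdim=True)  # GAP over the gradients
    cam = F.relu((weights * acts["a"].detach()).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, image.shape[-2:], mode="bilinear", align_corners=False)
    return (cam / cam.max().clamp(min=1e-8)).squeeze()
```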
https://arxiv.org/abs/2405.04610
Most fake news detection methods learn latent feature representations based on neural networks, which makes them black boxes that classify a piece of news without giving any justification. Existing explainable systems generate veracity justifications from investigative journalism, which suffer from delayed debunking and low efficiency. Recent studies simply assume that the justification is equivalent to the majority opinions expressed in the wisdom of crowds. However, such opinions typically contain some inaccurate or biased information, since the wisdom of crowds is uncensored. To detect fake news from a sea of diverse, crowded and even competing narratives, in this paper we propose a novel defense-based explainable fake news detection framework. Specifically, we first propose an evidence extraction module that splits the wisdom of crowds into two competing parties and detects salient evidence for each. To gain concise insights from the evidence, we then design a prompt-based module that utilizes a large language model to generate justifications by inferring reasons towards the two possible veracities. Finally, we propose a defense-based inference module that determines veracity by modeling the defense among these justifications. Extensive experiments conducted on two real-world benchmarks demonstrate that our proposed method outperforms state-of-the-art baselines in terms of fake news detection and provides high-quality justifications.
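A hypothetical prompt for the justification module, in which the LLM infers reasons toward each possible veracity from the two evidence sets; the wording is ours, not the paper's:

```python
# Illustrative prompt template (field names are placeholders):
JUSTIFICATION_PROMPT = """News claim: {claim}

Evidence from the party supporting the claim:
{support_evidence}

Evidence from the party challenging the claim:
{oppose_evidence}

Write one short justification arguing the claim is TRUE using only the first
evidence set, and one arguing it is FALSE using only the second set."""

# e.g. llm(JUSTIFICATION_PROMPT.format(claim=c, support_evidence=s, oppose_evidence=o))
```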
https://arxiv.org/abs/2405.03371
Toward desirable saliency prediction, the types and numbers of inputs to a salient object detection (SOD) algorithm may change dynamically in many real-life applications. However, existing SOD algorithms are mainly designed or trained for one particular type of input, failing to generalize to other types. Consequently, multiple SOD algorithms need to be prepared in advance to handle different input types, raising huge hardware and research costs. Differently, in this paper, we propose a new type of SOD task, termed Arbitrary Modality SOD (AM SOD). The most prominent characteristic of AM SOD is that the modality types and modality numbers are arbitrary or dynamically changing. The former means that the inputs to the AM SOD algorithm may be arbitrary modalities such as RGB, depth, or even any combination of them. The latter indicates that the inputs may have arbitrary modality numbers as the input type changes, e.g., a single-modality RGB image, dual-modality RGB-Depth (RGB-D) images or triple-modality RGB-Depth-Thermal (RGB-D-T) images. Accordingly, a preliminary solution to the above challenges, i.e., a modality switch network (MSN), is proposed in this paper. In particular, a modality switch feature extractor (MSFE) is first designed to extract discriminative features from each modality effectively by introducing modality indicators, which generate weights for modality switching. Subsequently, a dynamic fusion module (DFM) is proposed to adaptively fuse features from a variable number of modalities based on a novel Transformer structure. Finally, a new dataset, named AM-XD, is constructed to facilitate research on AM SOD. Extensive experiments demonstrate that our AM SOD method can effectively cope with changes in the type and number of input modalities for robust salient object detection.
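A schematic sketch of fusing a variable number of modality feature maps with attention over modality tokens, in the spirit of the DFM; the shapes, pooling rule, and single-layer design are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DynamicFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, modality_feats: list) -> torch.Tensor:
        """modality_feats: a list (length 1..3) of (B, dim, H, W) feature maps."""
        B, C, H, W = modality_feats[0].shape
        # One token per modality per spatial location: (B*H*W, M, C).
        tokens = torch.stack(
            [f.permute(0, 2, 3, 1).reshape(-1, C) for f in modality_feats], dim=1
        )
        fused, _ = self.attn(tokens, tokens, tokens)  # works for any modality count M
        fused = fused.mean(dim=1)                     # pool over modalities
        return fused.view(B, H, W, C).permute(0, 3, 1, 2)

dfm = DynamicFusion()
rgb, depth = torch.randn(2, 256, 16, 16), torch.randn(2, 256, 16, 16)
print(dfm([rgb]).shape, dfm([rgb, depth]).shape)  # same output shape for 1 or 2 inputs
```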
https://arxiv.org/abs/2405.03352
This paper delves into the task of arbitrary modality salient object detection (AM SOD), aiming to detect salient objects from arbitrary modalities, e.g., RGB images, RGB-D images, and RGB-D-T images. A novel modality-adaptive Transformer (MAT) is proposed to investigate two fundamental challenges of AM SOD, i.e., the more diverse modality discrepancies caused by the varying modality types that need to be processed, and the dynamic fusion design required by the uncertain number of modalities present in the inputs of the multimodal fusion strategy. Specifically, inspired by prompt learning's ability to align the distributions of pre-trained models to the characteristics of downstream tasks by learning some prompts, MAT first presents a modality-adaptive feature extractor (MAFE) to tackle the diverse modality discrepancies by introducing a modality prompt for each modality. In the training stage, a new modality translation contractive (MTC) loss is further designed to assist MAFE in learning modality-distinguishable modality prompts. Accordingly, in the testing stage, MAFE can employ those learned modality prompts to adaptively adjust its feature space according to the characteristics of the input modalities, and is thus able to extract discriminative unimodal features. Then, MAT presents a channel-wise and spatial-wise fusion hybrid (CSFH) strategy to meet the demand for dynamic fusion. For that, CSFH dedicates a channel-wise dynamic fusion module (CDFM) and a novel spatial-wise dynamic fusion module (SDFM) to fusing the unimodal features from varying numbers of modalities, while effectively capturing cross-modal complementary semantic and detail information, respectively. Moreover, CSFH carefully aligns CDFM and SDFM to different levels of unimodal features based on their characteristics for more effective exploitation of complementary information.
https://arxiv.org/abs/2405.03351
Transparency and explainability in image classification are essential for establishing trust in machine learning models and detecting biases and errors. State-of-the-art explainability methods generate saliency maps to show where a specific class is identified, without providing a detailed explanation of the model's decision process. Striving to address such a need, we introduce a post-hoc method that explains the entire feature extraction process of a Convolutional Neural Network. These explanations include a layer-wise representation of the features the model extracts from the input. Such features are represented as saliency maps generated by clustering and merging similar feature maps, to which we associate a weight derived by generalizing Grad-CAM for the proposed methodology. To further enhance these explanations, we include a set of textual labels collected through a gamified crowdsourcing activity and processed using NLP techniques and Sentence-BERT. Finally, we show an approach to generate global explanations by aggregating labels across multiple images.
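An illustrative sketch of the layer-wise step: cluster a layer's feature maps by similarity and merge each cluster into one map, which then receives a weight from the generalized Grad-CAM step. The clustering choice (k-means on flattened maps) is an assumption:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_feature_maps(fmaps: np.ndarray, n_clusters: int = 8) -> np.ndarray:
    """fmaps: (C, H, W) activations of one layer for one image.
    Returns (n_clusters, H, W): similar maps merged by averaging."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(
        fmaps.reshape(len(fmaps), -1))
    return np.stack([fmaps[labels == k].mean(axis=0) for k in range(n_clusters)])
```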
https://arxiv.org/abs/2405.03301
Consistent spatial-temporal coordination across multiple agents is fundamental for collaborative perception, which seeks to improve perception abilities through information exchange among agents. To achieve this spatial-temporal alignment, traditional methods depend on external devices to provide localization and clock signals. However, hardware-generated signals can be vulnerable to noise and potentially malicious attack, jeopardizing the precision of spatial-temporal alignment. Rather than relying on external hardware, this work proposes a novel approach: aligning by recognizing the inherent geometric patterns within the perceptual data of the various agents. Following this spirit, we propose a robust collaborative perception system that operates independently of external localization and clock devices. The key module of our system, FreeAlign, constructs a salient object graph for each agent based on its detected boxes and uses a graph neural network to identify common subgraphs between agents, leading to accurate relative pose and time. We validate FreeAlign on both real-world and simulated datasets. The results show that the FreeAlign-empowered robust collaborative perception system performs comparably to systems relying on precise localization and clock devices.
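An illustrative construction of a salient object graph from detected boxes; the edge attribute (pairwise distance between box centers) is the kind of pose-invariant geometry that common-subgraph matching can exploit, though the exact features are our assumption:

```python
import itertools
import networkx as nx
import numpy as np

def object_graph(boxes: np.ndarray) -> nx.Graph:
    """boxes: (N, 4) [x1, y1, x2, y2] detections in an agent's own frame."""
    centers = np.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                        (boxes[:, 1] + boxes[:, 3]) / 2], axis=1)
    g = nx.Graph()
    for i, c in enumerate(centers):
        g.add_node(i, center=c)
    for i, j in itertools.combinations(range(len(centers)), 2):
        # Inter-object distance is invariant to each agent's unknown pose.
        g.add_edge(i, j, dist=float(np.linalg.norm(centers[i] - centers[j])))
    return g
```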
https://arxiv.org/abs/2405.02965
Previous research on scanpath prediction has mainly focused on group models, disregarding the fact that the scanpaths and attentional behaviors of individuals are diverse. The disregard of these differences is especially detrimental to social human-robot interaction, whereby robots commonly emulate human gaze based on heuristics or predefined patterns. However, human gaze patterns are heterogeneous, and varying behaviors can significantly affect the outcomes of such human-robot interactions. To fill this gap, we developed a deep learning-based social cue integration model for saliency prediction that instead predicts scanpaths in videos. Our model learns scanpaths by recursively integrating fixation history and social cues through a gating mechanism and sequential attention. We evaluated our approach on gaze datasets of dynamic social scenes, observed under the free-viewing condition. The introduction of fixation history into our models makes it possible to train a single unified model rather than taking the resource-intensive approach of training individual models for each set of scanpaths. We observed that the late neural integration approach surpasses early fusion when training models on a large dataset, in comparison to a smaller dataset with a similar distribution. Results also indicate that a single unified model, trained on all the observers' scanpaths, performs on par with or better than individually trained models. We hypothesize that this outcome results from the group saliency representations instilling universal attention in the model, while the supervisory signal guides it to learn personalized attentional behaviors, giving the unified model a benefit over individual models due to its implicit representation of universal attention.
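A schematic sketch of recursive cue integration: a recurrent cell carries fixation history while a learned gate mixes social-cue and saliency features at each step. The dimensions, gating form, and output head are illustrative assumptions rather than the paper's architecture:

```python
import torch
import torch.nn as nn

class CueIntegrator(nn.Module):
    def __init__(self, feat_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * feat_dim, feat_dim), nn.Sigmoid())
        self.rnn = nn.GRUCell(feat_dim, hidden)       # carries fixation history
        self.to_fixation = nn.Linear(hidden, 2)       # next (x, y) fixation

    def forward(self, social, saliency, h):
        """social, saliency: (B, feat_dim) per-frame cues; h: (B, hidden) state."""
        g = self.gate(torch.cat([social, saliency], dim=-1))
        h = self.rnn(g * social + (1 - g) * saliency, h)
        return self.to_fixation(h), h
```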
https://arxiv.org/abs/2405.02929