Deep learning has enabled breakthroughs in automated diagnosis from medical imaging, with many successful applications in ophthalmology. However, standard medical image classification approaches assess disease presence only at the time of acquisition, neglecting the common clinical setting of longitudinal imaging. For slow, progressive eye diseases like age-related macular degeneration (AMD) and primary open-angle glaucoma (POAG), patients undergo repeated imaging over time to track disease progression, and forecasting the future risk of developing disease is critical to properly plan treatment. Our proposed Longitudinal Transformer for Survival Analysis (LTSA) enables dynamic disease prognosis from longitudinal medical imaging, modeling the time to disease from sequences of fundus photographs captured over long, irregular time periods. Using longitudinal imaging data from the Age-Related Eye Disease Study (AREDS) and Ocular Hypertension Treatment Study (OHTS), LTSA significantly outperformed a single-image baseline in 19/20 head-to-head comparisons on late AMD prognosis and 18/20 comparisons on POAG prognosis. A temporal attention analysis also suggested that, while the most recent image is typically the most influential, prior imaging still provides additional prognostic value.
https://arxiv.org/abs/2405.08780
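A minimal PyTorch sketch of the longitudinal idea described above: per-visit image embeddings plus a continuous time encoding for irregular visit gaps feed a causally masked transformer, whose per-visit output parameterizes discrete-time hazards. The layer sizes, two-layer encoder, and sigmoid hazard head are illustrative assumptions, not the released LTSA implementation.

```python
import torch
import torch.nn as nn

class LongitudinalSurvivalTransformer(nn.Module):
    def __init__(self, img_dim=512, d_model=256, n_bins=20):
        super().__init__()
        self.proj = nn.Linear(img_dim, d_model)
        self.time_mlp = nn.Sequential(nn.Linear(1, d_model), nn.GELU(),
                                      nn.Linear(d_model, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.hazard = nn.Linear(d_model, n_bins)  # discrete-time hazard bins

    def forward(self, feats, visit_times, pad_mask):
        # feats: (B, T, img_dim) per-visit fundus embeddings from a frozen CNN/ViT
        # visit_times: (B, T, 1) years since baseline, handling irregular gaps
        x = self.proj(feats) + self.time_mlp(visit_times)
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.encoder(x, mask=causal, src_key_padding_mask=pad_mask)
        return torch.sigmoid(self.hazard(h))  # (B, T, n_bins): risk after each visit

model = LongitudinalSurvivalTransformer()
times = torch.cumsum(torch.rand(2, 5, 1), dim=1)   # increasing visit times
hazards = model(torch.randn(2, 5, 512), times,
                torch.zeros(2, 5, dtype=torch.bool))
```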
Indian folk paintings feature a rich mosaic of symbols, colors, textures, and stories, making them an invaluable repository of cultural legacy. This paper presents a novel approach to classifying these paintings into distinct art forms and tagging them with their unique salient features. A custom dataset named FolkTalent, comprising 2279 digital images of paintings across 12 different forms, has been prepared using websites that are direct outlets of Indian folk paintings. Tags covering a wide range of attributes such as color, theme, artistic style, and patterns are generated using GPT-4 and verified by an expert for each painting. Classification is performed by applying the RandomForest ensemble technique to fine-tuned Convolutional Neural Network (CNN) models, achieving an accuracy of 91.83%. Tagging is accomplished via fine-tuned CNN backbones with a custom classifier attached on top to perform multi-label image classification. The generated tags offer deeper insight into each painting, enabling an enhanced search experience based on theme and visual attributes. The proposed hybrid model sets a new benchmark in folk painting classification and tagging, contributing significantly to the cataloging of India's folk-art heritage.
https://arxiv.org/abs/2405.08776
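A sketch of the classification pipeline, assuming the described recipe of CNN features feeding a RandomForest; the ResNet-50 backbone, feature dimension, and forest size here are stand-ins rather than the paper's exact configuration.

```python
import torch
import torchvision.models as models
from sklearn.ensemble import RandomForestClassifier

backbone = models.resnet50(weights="IMAGENET1K_V2")
backbone.fc = torch.nn.Identity()                 # expose 2048-d pooled features
backbone.eval()

@torch.no_grad()
def extract(images):                              # images: (N, 3, 224, 224)
    return backbone(images).numpy()

X_train = extract(torch.randn(32, 3, 224, 224))   # stand-in for real paintings
y_train = torch.randint(0, 12, (32,)).numpy()     # the 12 folk-art forms

clf = RandomForestClassifier(n_estimators=300, random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict(extract(torch.randn(4, 3, 224, 224)))
```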
Multi-label action recognition in videos represents a significant challenge for robotic applications in dynamic environments, especially when the robot must cooperate with humans on tasks involving objects. Existing methods still struggle to recognize unseen actions or require extensive training data. To overcome these problems, we propose Dual-VCLIP, a unified approach for zero-shot multi-label action recognition. Dual-VCLIP enhances VCLIP, a zero-shot action recognition method, with the DualCoOp method for multi-label image classification. The strength of our method is that it learns only two prompts at training time, making it much simpler than other methods. We validate our method on the Charades dataset, which consists mostly of object-based actions, demonstrating that -- despite its simplicity -- it performs favorably against existing methods on the complete dataset and achieves promising performance on unseen actions. Our contribution emphasizes the impact of verb-object class splits during robots' training for new cooperative tasks, highlighting their influence on performance and giving insights into mitigating biases.
https://arxiv.org/abs/2405.08695
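A hedged sketch of the DualCoOp-style mechanism the paper builds on: only two learnable prompt contexts, positive and negative, are trained, and each class is scored by contrasting them. The dimensions and the stand-in text encoder below are placeholders for frozen CLIP components.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualPromptHead(nn.Module):
    def __init__(self, n_ctx=16, dim=512):
        super().__init__()
        self.pos_ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        self.neg_ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)

    def forward(self, class_tok, video_emb, text_encoder):
        # class_tok: (C, L, dim) frozen token embeddings of the class names
        C = class_tok.size(0)
        pos = torch.cat([self.pos_ctx.expand(C, -1, -1), class_tok], dim=1)
        neg = torch.cat([self.neg_ctx.expand(C, -1, -1), class_tok], dim=1)
        t_pos = F.normalize(text_encoder(pos), dim=-1)   # (C, dim)
        t_neg = F.normalize(text_encoder(neg), dim=-1)
        v = F.normalize(video_emb, dim=-1)               # (B, dim)
        # per-class binary logit: positive minus negative prompt similarity
        return v @ t_pos.t() - v @ t_neg.t()             # (B, C)

head = DualPromptHead()
text_encoder = lambda tok: tok.mean(dim=1)   # stand-in for a frozen CLIP encoder
logits = head(torch.randn(157, 4, 512), torch.randn(2, 512), text_encoder)  # 157 Charades classes
```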
Numerous studies have revealed that deep learning-based medical image classification models may exhibit bias towards specific demographic attributes, such as race, gender, and age. Existing bias mitigation methods often achieve a high level of fairness at the cost of significant accuracy degradation. In response to this challenge, we propose an innovative and adaptable Soft Nearest Neighbor Loss-based channel pruning framework that achieves fairness through channel pruning. Traditionally, channel pruning is used to accelerate neural network inference. However, our work demonstrates that pruning can also be a potent tool for achieving fairness. Our key insight is that different channels in a layer contribute differently to the accuracy of different groups. By selectively pruning the critical channels that drive the accuracy difference between the privileged and unprivileged groups, we can effectively improve fairness without significantly sacrificing accuracy. Experiments conducted on two skin lesion diagnosis datasets across multiple sensitive attributes validate the effectiveness of our method in achieving a state-of-the-art trade-off between accuracy and fairness. Our code is available at this https URL.
https://arxiv.org/abs/2405.08681
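An illustrative sketch of pruning-for-fairness, not the paper's Soft Nearest Neighbor Loss criterion: each channel is scored by how strongly it drives the loss gap between the privileged and unprivileged groups, and the worst offenders are masked out. The layer choice, gradient-based score, and 10% pruning ratio are assumptions.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

model = models.resnet18(num_classes=2)
mask = torch.ones(1, 64, 1, 1, requires_grad=True)   # gates on layer1's channels
model.layer1.register_forward_hook(lambda m, i, o: o * mask)

def group_loss(x, y):
    return F.cross_entropy(model(x), y)

x_a, y_a = torch.randn(8, 3, 224, 224), torch.randint(0, 2, (8,))  # privileged
x_b, y_b = torch.randn(8, 3, 224, 224), torch.randint(0, 2, (8,))  # unprivileged
gap = group_loss(x_b, y_b) - group_loss(x_a, y_a)
gap.backward()

# A positive gradient means shrinking the channel shrinks the gap, so prune
# the channels with the largest positive gradients.
scores = mask.grad.flatten()
prune = scores.topk(int(0.1 * scores.numel())).indices
with torch.no_grad():
    mask[0, prune] = 0.0   # the hook stays active, so pruning takes effect
```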
As training datasets become increasingly drawn from unstructured, uncontrolled environments such as the web, researchers and industry practitioners have increasingly relied upon data filtering techniques to "filter out the noise" of web-scraped data. While datasets have been widely shown to reflect the biases and values of their creators, in this paper we contribute to an emerging body of research that assesses the filters used to create these datasets. We show that image-text data filtering also has biases and is value-laden, encoding specific notions of what is counted as "high-quality" data. In our work, we audit a standard approach of image-text CLIP-filtering on the academic benchmark DataComp's CommonPool by analyzing discrepancies of filtering through various annotation techniques across multiple modalities of image, text, and website source. We find that data relating to several imputed demographic groups -- such as LGBTQ+ people, older women, and younger men -- are associated with higher rates of exclusion. Moreover, we demonstrate cases of exclusion amplification: not only are certain marginalized groups already underrepresented in the unfiltered data, but CLIP-filtering excludes data from these groups at higher rates. The data-filtering step in the machine learning pipeline can therefore exacerbate representation disparities already present in the data-gathering step, especially when existing filters are designed to optimize a specifically-chosen downstream performance metric like zero-shot image classification accuracy. Finally, we show that the NSFW filter fails to remove sexually-explicit content from CommonPool, and that CLIP-filtering includes several categories of copyrighted content at high rates. Our conclusions point to a need for fundamental changes in dataset creation and filtering practices.
https://arxiv.org/abs/2405.08209
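A minimal sketch of CLIP-score filtering and the exclusion-rate audit it enables, assuming image and text embeddings were precomputed with some CLIP variant and that imputed group labels are available; the 0.3 cosine threshold mirrors common CommonPool-style pipelines but is an assumption here.

```python
import torch
import torch.nn.functional as F

def clip_scores(img_emb, txt_emb):
    return (F.normalize(img_emb, dim=-1) * F.normalize(txt_emb, dim=-1)).sum(-1)

img_emb, txt_emb = torch.randn(1000, 512), torch.randn(1000, 512)
group = torch.randint(0, 2, (1000,))        # imputed demographic group per pair
keep = clip_scores(img_emb, txt_emb) >= 0.3

for g in group.unique():                    # exclusion rate per imputed group
    excl = 1.0 - keep[group == g].float().mean()
    print(f"group {g.item()}: exclusion rate {excl:.2%}")
```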
Mamba, an architecture whose token mixer is an RNN-like state space model (SSM), was recently introduced to address the quadratic complexity of the attention mechanism and has subsequently been applied to vision tasks. Nevertheless, the performance of Mamba for vision is often underwhelming compared with convolutional and attention-based models. In this paper, we delve into the essence of Mamba and conceptually conclude that it is ideally suited to tasks with long-sequence and autoregressive characteristics. Image classification aligns with neither characteristic, so we hypothesize that Mamba is not necessary for this task; detection and segmentation tasks are not autoregressive either, yet they adhere to the long-sequence characteristic, so we believe it is still worthwhile to explore Mamba's potential for these tasks. To empirically verify our hypotheses, we construct a series of models named MambaOut by stacking Mamba blocks while removing their core token mixer, the SSM. Experimental results strongly support our hypotheses. Specifically, our MambaOut model surpasses all visual Mamba models on ImageNet image classification, indicating that Mamba is indeed unnecessary for this task. For detection and segmentation, MambaOut cannot match the performance of state-of-the-art visual Mamba models, demonstrating Mamba's potential for long-sequence visual tasks. The code is available at this https URL
https://arxiv.org/abs/2405.07992
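A rough sketch of what removing the SSM leaves behind: the Mamba block degenerates into a gated block whose only token mixer is a depthwise convolution. The expansion ratio and kernel size follow the spirit of the paper's gated-CNN block, not its exact configuration.

```python
import torch
import torch.nn as nn

class GatedCNNBlock(nn.Module):
    def __init__(self, dim, expansion=8 / 3, kernel=7):
        super().__init__()
        hidden = int(dim * expansion)
        self.norm = nn.LayerNorm(dim)
        self.fc1 = nn.Linear(dim, hidden * 2)        # value and gate branches
        self.dwconv = nn.Conv2d(hidden, hidden, kernel, padding=kernel // 2,
                                groups=hidden)       # token mixing without SSM
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x):                            # x: (B, H, W, C)
        shortcut = x
        v, g = self.fc1(self.norm(x)).chunk(2, dim=-1)
        v = self.dwconv(v.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
        return self.fc2(v * self.act(g)) + shortcut

out = GatedCNNBlock(64)(torch.randn(2, 14, 14, 64))
```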
Replica exchange stochastic gradient Langevin dynamics (reSGLD) is an effective sampler for non-convex learning in large-scale datasets. However, the simulation may encounter stagnation issues when the high-temperature chain delves too deeply into the distribution tails. To tackle this issue, we propose reflected reSGLD (r2SGLD): an algorithm tailored for constrained non-convex exploration by utilizing reflection steps within a bounded domain. Theoretically, we observe that reducing the diameter of the domain enhances mixing rates, exhibiting a \emph{quadratic} behavior. Empirically, we test its performance through extensive experiments, including identifying dynamical systems with physical constraints, simulations of constrained multi-modal distributions, and image classification tasks. The theoretical and empirical findings highlight the crucial role of constrained exploration in improving the simulation efficiency.
https://arxiv.org/abs/2405.07839
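A sketch of the reflection step at the heart of r2SGLD on a simple box domain [lo, hi]^d; the paper handles more general bounded domains and couples chains at different temperatures via replica exchange, which is omitted here.

```python
import torch

def reflect(theta, lo=-1.0, hi=1.0):
    # Fold excursions back into the box; loop again in case of very large steps.
    while (theta < lo).any() or (theta > hi).any():
        theta = torch.where(theta < lo, 2 * lo - theta, theta)
        theta = torch.where(theta > hi, 2 * hi - theta, theta)
    return theta

def sgld_step(theta, grad, lr=1e-3, temperature=1.0):
    noise = torch.randn_like(theta) * (2 * lr * temperature) ** 0.5
    return reflect(theta - lr * grad + noise)

theta = torch.zeros(10)
for _ in range(100):
    grad = 2 * theta                 # gradient of a toy quadratic potential
    theta = sgld_step(theta, grad)
```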
Recent advances in Tiny Machine Learning (TinyML) empower low-footprint embedded devices for real-time on-device Machine Learning. While many acknowledge the potential benefits of TinyML, its practical implementation presents unique challenges. This study aims to bridge the gap between prototyping single TinyML models and developing reliable TinyML systems in production: (1) Embedded devices operate in dynamically changing conditions. Existing TinyML solutions primarily focus on inference, with models trained offline on powerful machines and deployed as static objects. However, static models may underperform in the real world due to evolving input data distributions. We propose online learning to enable training on constrained devices, adapting local models towards the latest field conditions. (2) Nevertheless, current on-device learning methods struggle with heterogeneous deployment conditions and the scarcity of labeled data when applied across numerous devices. We introduce federated meta-learning incorporating online learning to enhance model generalization, facilitating rapid learning. This approach ensures optimal performance among distributed devices by knowledge sharing. (3) Moreover, TinyML's pivotal advantage is widespread adoption. Embedded devices and TinyML models prioritize extreme efficiency, leading to diverse characteristics ranging from memory and sensors to model architectures. Given their diversity and non-standardized representations, managing these resources becomes challenging as TinyML systems scale up. We present semantic management for the joint management of models and devices at scale. We demonstrate our methods through a basic regression example and then assess them in three real-world TinyML applications: handwritten character image classification, keyword audio classification, and smart building presence detection, confirming our approaches' effectiveness.
https://arxiv.org/abs/2405.07601
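A toy sketch of point (1), on-device online learning: a small model is updated one sample at a time as field data drifts, instead of being deployed as a static object. The model size and learning rate are placeholders for what a microcontroller budget would allow.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

def on_device_update(x, y):
    # Called whenever a labeled sample arrives in the field, adapting the
    # local model toward the latest input distribution.
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
    return loss.item()

for _ in range(50):                       # simulated drifting data stream
    x = torch.randn(1, 4) + 0.5
    y = x.sum(dim=1, keepdim=True)
    on_device_update(x, y)
```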
While Deep Neural Networks (DNNs) have demonstrated remarkable performance in tasks related to perception and control, several concerns remain unresolved regarding the privacy of their training data, particularly their vulnerability to Membership Inference Attacks (MIAs). In this paper, we explore a connection between the susceptibility to membership inference attacks and the vulnerability to distillation-based functionality-stealing attacks. In particular, we propose GLiRA, a distillation-guided approach to membership inference attack on black-box neural networks. We observe that knowledge distillation significantly improves the efficiency of the likelihood-ratio membership inference attack, especially in the black-box setting, i.e., when the architecture of the target model is unknown to the attacker. We evaluate the proposed method across multiple image classification datasets and models and demonstrate that likelihood-ratio attacks, when guided by knowledge distillation, outperform the current state-of-the-art membership inference attacks in the black-box setting.
https://arxiv.org/abs/2405.07562
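A condensed sketch of the two ingredients: distilling the black-box target into a surrogate via KL divergence on its output probabilities, and a LiRA-style likelihood-ratio membership score. The Gaussian in/out statistics would come from shadow models; the values below are stand-ins.

```python
import torch
import torch.nn.functional as F

def distill_step(surrogate, opt, x, target_probs, T=2.0):
    # target_probs: softened outputs queried from the black-box target model
    opt.zero_grad()
    log_q = F.log_softmax(surrogate(x) / T, dim=-1)
    loss = F.kl_div(log_q, target_probs, reduction="batchmean")
    loss.backward()
    opt.step()
    return loss.item()

def lira_score(conf, mu_in, std_in, mu_out, std_out):
    # conf: logit-scaled confidence of the true class, phi = log(p / (1 - p))
    in_ll = torch.distributions.Normal(mu_in, std_in).log_prob(conf)
    out_ll = torch.distributions.Normal(mu_out, std_out).log_prob(conf)
    return in_ll - out_ll            # higher => more likely a training member

score = lira_score(torch.tensor(2.5), 2.0, 0.5, -1.0, 1.0)
```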
Medical images are often more difficult to acquire than natural images due to the specialized equipment and technology required, which results in fewer medical image datasets. It is therefore hard to train a strong pretrained medical vision model. How best to leverage a natural-image pretrained vision model and adapt it to the medical domain remains an open question. For image classification, a popular method is the linear probe (LP). However, LP considers only the output after feature extraction, and a gap remains between input medical images and the natural pretrained vision model. We introduce visual prompting (VP) to fill this gap and analyze strategies for coupling LP and VP. We design a joint learning loss function containing a categorisation loss and a discrepancy loss, which describes the variance between prompted and plain images, and name this joint training strategy MoVL (Mixture of Visual Prompting and Linear Probe). We experiment on 4 medical image classification datasets with two mainstream architectures, ResNet and CLIP. Results show that, without changing the parameters and architecture of the backbone model and with fewer parameters, MoVL has the potential to reach full finetune (FF) accuracy (on four medical datasets, an average of 90.91% for MoVL vs. 91.13% for FF). On an out-of-distribution medical dataset, our method (90.33%) outperforms FF (85.15%) by an absolute 5.18%.
https://arxiv.org/abs/2405.07411
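A sketch of the MoVL objective as described: a learnable visual prompt added to the input, a linear probe on frozen backbone features, and a discrepancy term between prompted and plain representations. The MSE form of the discrepancy and the weighting lambda are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

backbone = models.resnet50(weights="IMAGENET1K_V2")
backbone.fc = nn.Identity()
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False                          # backbone stays frozen

prompt = nn.Parameter(torch.zeros(1, 3, 224, 224))   # visual prompt (VP)
probe = nn.Linear(2048, 4)                           # linear probe (LP)
opt = torch.optim.Adam([prompt, *probe.parameters()], lr=1e-3)

def movl_loss(x, y, lam=0.5):
    f_prompted = backbone(x + prompt)
    with torch.no_grad():
        f_plain = backbone(x)
    cls_loss = F.cross_entropy(probe(f_prompted), y)  # categorisation loss
    disc_loss = F.mse_loss(f_prompted, f_plain)       # discrepancy loss
    return cls_loss + lam * disc_loss
```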
Our research focuses on the critical field of early diagnosis of disease by examining retinal blood vessels in fundus images. While automatic segmentation of retinal blood vessels holds promise for early detection, accurate analysis remains challenging due to the limitations of existing methods, which often lack discrimination power and are susceptible to influences from pathological regions. Our research in fundus image analysis advances deep learning-based classification using eight pre-trained CNN models. To enhance interpretability, we utilize Explainable AI techniques such as Grad-CAM, Grad-CAM++, Score-CAM, Faster Score-CAM, and Layer CAM. These techniques illuminate the decision-making processes of the models, fostering transparency and trust in their predictions. Expanding our exploration, we investigate ten models, including TransUNet with ResNet backbones, Attention U-Net with DenseNet and ResNet backbones, and Swin-UNET. Incorporating diverse architectures such as ResNet50V2, ResNet101V2, ResNet152V2, and DenseNet121 among others, this comprehensive study deepens our insights into attention mechanisms for enhanced fundus image analysis. Among the evaluated models for fundus image classification, ResNet101 emerged with the highest accuracy, achieving an impressive 94.17%. On the other end of the spectrum, EfficientNetB0 exhibited the lowest accuracy among the models, achieving a score of 88.33%. Furthermore, in the domain of fundus image segmentation, Swin-Unet demonstrated a Mean Pixel Accuracy of 86.19%, showcasing its effectiveness in accurately delineating regions of interest within fundus images. Conversely, Attention U-Net with DenseNet201 backbone exhibited the lowest Mean Pixel Accuracy among the evaluated models, achieving a score of 75.87%.
https://arxiv.org/abs/2405.07338
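A minimal Grad-CAM sketch of the kind of explanation listed above, using forward/backward hooks on a pretrained ResNet-101; the target layer and the use of the top predicted class are illustrative choices.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

model = models.resnet101(weights="IMAGENET1K_V2").eval()
feats, grads = {}, {}
model.layer4.register_forward_hook(lambda m, i, o: feats.update(a=o))
model.layer4.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

x = torch.randn(1, 3, 224, 224)              # stand-in for a fundus image
model(x)[0].max().backward()                 # score of the predicted class

w = grads["a"].mean(dim=(2, 3), keepdim=True)          # GAP over gradients
cam = F.relu((w * feats["a"]).sum(dim=1))              # weighted feature maps
cam = F.interpolate(cam[None], size=x.shape[-2:], mode="bilinear")[0]
```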
Over the past few years, as large language models have ushered in an era of intelligence emergence, there has been an intensified focus on scaling networks. Currently, many network architectures are designed manually, often resulting in sub-optimal configurations. Although Neural Architecture Search (NAS) methods have been proposed to automate this process, they suffer from low search efficiency. This study introduces Differentiable Model Scaling (DMS), which increases the efficiency of searching for the optimal width and depth of a network. DMS models both width and depth in a direct and fully differentiable way, making it easy to optimize. We have evaluated DMS across diverse tasks, ranging from vision tasks to NLP tasks, and various network architectures, including CNNs and Transformers. Results consistently indicate that DMS can find improved structures and outperforms state-of-the-art NAS methods. Specifically, for image classification on ImageNet, DMS improves the top-1 accuracy of EfficientNet-B0 and Deit-Tiny by 1.4% and 0.6%, respectively, and outperforms the state-of-the-art zero-shot NAS method, ZiCo, by 1.3% while requiring only 0.4 GPU days for searching. For object detection on COCO, DMS improves the mAP of Yolo-v8-n by 2.0%. For language modeling, our pruned Llama-7B outperforms the prior method with lower perplexity and higher zero-shot classification accuracy. We will release our code in the future.
https://arxiv.org/abs/2405.07194
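A toy sketch in the spirit of DMS's fully differentiable width search: a learnable logit induces a soft per-unit gate, a resource penalty pushes the effective width down while the task loss pushes it up, and collapsed units can be pruned afterwards. The sigmoid gate is the simplest differentiable stand-in, not the paper's exact mask function.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ElasticLinear(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.fc = nn.Linear(d_in, d_out)
        self.width_logit = nn.Parameter(torch.zeros(d_out))  # one gate per unit

    def mask(self):
        return torch.sigmoid(self.width_logit)

    def forward(self, x):
        return self.fc(x) * self.mask()

layer = ElasticLinear(16, 64)
opt = torch.optim.Adam(layer.parameters(), lr=1e-2)
x, y = torch.randn(32, 16), torch.randn(32, 64)
for _ in range(100):
    opt.zero_grad()
    task = F.mse_loss(layer(x), y)
    resource = layer.mask().sum() / 64      # differentiable "width" cost
    (task + 0.1 * resource).backward()
    opt.step()
# Units whose gate collapsed toward zero can be pruned after the search.
```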
The recent introduction of prompt tuning based on pre-trained vision-language models has dramatically improved the performance of multi-label image classification. However, some existing strategies still have drawbacks: they either exploit massive labeled visual data at high cost or use text data only for text prompt tuning, thereby failing to learn the diversity of visual knowledge. Hence, the application scenarios of these methods are limited. In this paper, we propose a pseudo-visual prompt (PVP) module for implicit visual prompt tuning to address this problem. Specifically, we first learn a pseudo-visual prompt for each category, mining diverse visual knowledge via the well-aligned space of pre-trained vision-language models. Then, a co-learning strategy with a dual-adapter module is designed to transfer visual knowledge from the pseudo-visual prompt to the text prompt, enhancing their visual representation abilities. Experimental results on the VOC2007, MS-COCO, and NUSWIDE datasets demonstrate that our method surpasses state-of-the-art (SOTA) methods across various settings for multi-label image classification tasks. The code is available at this https URL.
https://arxiv.org/abs/2405.06926
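A speculative sketch of the pseudo-visual-prompt idea: each category owns a learnable vector in the joint vision-language space that is scored directly against image features, while a small adapter transfers its knowledge into the text prompt. All shapes, the adapter form, and the averaged two-branch logit are assumptions about the general recipe, not PVP's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PVPHead(nn.Module):
    def __init__(self, n_classes, dim=512):
        super().__init__()
        self.pvp = nn.Parameter(torch.randn(n_classes, dim) * 0.02)
        self.adapter = nn.Linear(dim, dim)   # moves PVP knowledge to text side

    def forward(self, img_feat, text_feat):
        # img_feat: (B, dim) frozen CLIP image features; text_feat: (C, dim)
        pvp = F.normalize(self.pvp, dim=-1)
        text = F.normalize(text_feat + self.adapter(self.pvp), dim=-1)
        img = F.normalize(img_feat, dim=-1)
        logits_pvp = img @ pvp.t()           # pseudo-visual prompt branch
        logits_txt = img @ text.t()          # knowledge-enhanced text branch
        return (logits_pvp + logits_txt) / 2 # co-learned multi-label logits

head = PVPHead(n_classes=20)                 # e.g. the 20 VOC2007 classes
logits = head(torch.randn(8, 512), torch.randn(20, 512))
```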
Vision graph neural networks (ViG) offer a new avenue for exploration in computer vision. A major bottleneck in ViGs is the inefficient k-nearest neighbor (KNN) operation used for graph construction. To solve this issue, we propose a new method for designing ViGs, Dynamic Axial Graph Construction (DAGC), which is more efficient than KNN as it limits the number of considered graph connections made within an image. Additionally, we propose a novel CNN-GNN architecture, GreedyViG, which uses DAGC. Extensive experiments show that GreedyViG beats existing ViG, CNN, and ViT architectures in terms of accuracy, GMACs, and parameters on image classification, object detection, instance segmentation, and semantic segmentation tasks. Our smallest model, GreedyViG-S, achieves 81.1% top-1 accuracy on ImageNet-1K, 2.9% higher than Vision GNN and 2.2% higher than Vision HyperGraph Neural Network (ViHGNN), with less GMACs and a similar number of parameters. Our largest model, GreedyViG-B obtains 83.9% top-1 accuracy, 0.2% higher than Vision GNN, with a 66.6% decrease in parameters and a 69% decrease in GMACs. GreedyViG-B also obtains the same accuracy as ViHGNN with a 67.3% decrease in parameters and a 71.3% decrease in GMACs. Our work shows that hybrid CNN-GNN architectures not only provide a new avenue for designing efficient models, but that they can also exceed the performance of current state-of-the-art models.
https://arxiv.org/abs/2405.06849
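An illustrative reading of axial graph construction: rather than a global KNN over all patches, each token connects only to strided tokens on its own row and column, which caps the number of edges per node. The stride and the omission of DAGC's dynamic component are simplifications.

```python
import torch

def axial_neighbors(h, w, stride=2):
    idx = torch.arange(h * w).view(h, w)
    edges = []
    for r in range(h):
        for c in range(w):
            row = [idx[r, cc].item() for cc in range(0, w, stride) if cc != c]
            col = [idx[rr, c].item() for rr in range(0, h, stride) if rr != r]
            edges.append(sorted(set(row + col)))
    return edges   # adjacency list over the h*w tokens

adj = axial_neighbors(4, 4)
print(len(adj), adj[0])   # 16 tokens; token 0 links along its row and column
```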
Intracerebral hemorrhage (ICH) is a severe and sudden medical condition caused by the rupture of blood vessels in the brain, leading to permanent damage to brain tissue and often resulting in functional disabilities or death. Diagnosis and analysis of ICH typically rely on brain CT imaging. Given the urgency of ICH, early treatment is crucial, necessitating rapid analysis of CT images to formulate tailored treatment plans. However, the complexity of ICH CT images and the frequent scarcity of specialist radiologists pose significant challenges. We therefore built a dataset for ICH-versus-normal classification and for three-way classification of ICH by hemorrhage location, i.e., Deep, Subcortical, and Lobar. In addition, we propose a dual-task vision transformer (DTViT) for the automated classification and diagnosis of ICH images. This network uses the encoder from ViT, employing attention mechanisms for feature extraction from CT images, and incorporates two multilayer perceptron (MLP)-based decoders to simultaneously identify the presence of ICH and classify the three hemorrhage locations. Experimental results demonstrate that our proposed multi-classification network performs well on the newly built real-world test dataset. The code and dataset for this study will be made publicly available upon paper acceptance at: this https URL.
https://arxiv.org/abs/2405.06814
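A sketch of the dual-task layout: one ViT encoder feeding two MLP decoders, one for ICH presence and one for the three hemorrhage locations. The torchvision backbone and head sizes are stand-ins for the paper's configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class DualTaskViT(nn.Module):
    def __init__(self):
        super().__init__()
        vit = models.vit_b_16(weights="IMAGENET1K_V1")
        vit.heads = nn.Identity()            # keep the 768-d class-token feature
        self.encoder = vit
        self.presence_head = nn.Sequential(nn.Linear(768, 256), nn.GELU(),
                                           nn.Linear(256, 2))  # ICH vs normal
        self.location_head = nn.Sequential(nn.Linear(768, 256), nn.GELU(),
                                           nn.Linear(256, 3))  # Deep/Subcortical/Lobar

    def forward(self, x):
        z = self.encoder(x)
        return self.presence_head(z), self.location_head(z)

presence_logits, location_logits = DualTaskViT()(torch.randn(2, 3, 224, 224))
```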
The task of medical image recognition is notably complicated by the presence of varied and multiple pathological indications, presenting a unique challenge in multi-label classification with unseen labels. This complexity underlines the need for computer-aided diagnosis methods employing multi-label zero-shot learning. Recent advancements in pre-trained vision-language models (VLMs) have showcased notable zero-shot classification abilities on medical images. However, these methods have limitations in leveraging extensive pre-trained knowledge from broader image datasets and often depend on manual prompt construction by expert radiologists. By automating the process of prompt tuning, prompt learning techniques have emerged as an efficient way to adapt VLMs to downstream tasks. Yet, existing CoOp-based strategies fall short in producing class-specific prompts for unseen categories, limiting generalizability in fine-grained scenarios. To overcome these constraints, we introduce a novel prompt generation approach inspired by text generation in natural language processing (NLP). Our method, named Pseudo-Prompt Generating (PsPG), capitalizes on the prior knowledge of multi-modal features. Featuring an RNN-based decoder, PsPG autoregressively generates class-tailored embedding vectors, i.e., pseudo-prompts. Comparative evaluations on various multi-label chest radiograph datasets affirm the superiority of our approach over leading medical vision-language and multi-label prompt learning methods. The source code is available at this https URL
https://arxiv.org/abs/2405.06468
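A sketch of autoregressive pseudo-prompt generation: an RNN decoder, conditioned on a multi-modal feature, emits a short sequence of embedding vectors one step at a time. The GRU cell, dimensions, and conditioning are assumptions about the general recipe rather than PsPG's exact decoder.

```python
import torch
import torch.nn as nn

class PseudoPromptDecoder(nn.Module):
    def __init__(self, dim=512, n_tokens=8):
        super().__init__()
        self.gru = nn.GRUCell(dim, dim)
        self.out = nn.Linear(dim, dim)
        self.start = nn.Parameter(torch.zeros(1, dim))
        self.n_tokens = n_tokens

    def forward(self, cond):                 # cond: (B, dim) multi-modal feature
        h = cond
        tok = self.start.expand(cond.size(0), -1)
        prompt = []
        for _ in range(self.n_tokens):       # autoregressive generation
            h = self.gru(tok, h)
            tok = self.out(h)                # next pseudo-prompt embedding
            prompt.append(tok)
        return torch.stack(prompt, dim=1)    # (B, n_tokens, dim)

pseudo_prompt = PseudoPromptDecoder()(torch.randn(4, 512))  # fed to a text encoder
```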
Federated learning (FL) offers a privacy-centric distributed learning framework, enabling model training on individual clients with central aggregation and without necessitating data exchange. Nonetheless, FL implementations often suffer from non-i.i.d. and long-tailed class distributions across mobile applications, e.g., autonomous vehicles, which leads models to overfit as local training may converge to sub-optimal solutions. In our study, we explore the impact of data heterogeneity on model bias and introduce an innovative personalized FL framework, Multi-level Personalized Federated Learning (MuPFL), which leverages the hierarchical architecture of FL to fully harness computational resources at various levels. This framework integrates three pivotal modules: Biased Activation Value Dropout (BAVD) to mitigate overfitting and accelerate training; Adaptive Cluster-based Model Update (ACMU) to refine local models, ensuring coherent global aggregation; and Prior Knowledge-assisted Classifier Fine-tuning (PKCF) to bolster classification and personalize models for skewed local data using shared knowledge. Extensive experiments on diverse real-world datasets for image classification and semantic segmentation validate that MuPFL consistently outperforms state-of-the-art baselines, even under extreme non-i.i.d. and long-tail conditions, enhancing accuracy by as much as 7.39% and accelerating training by up to 80%, marking significant advancements in both efficiency and effectiveness.
https://arxiv.org/abs/2405.06413
In recent years, convolutional neural networks (CNNs) with channel-wise feature refining mechanisms have brought noticeable benefits to modelling channel dependencies. However, current attention paradigms fail to infer an optimal channel descriptor capable of simultaneously exploiting statistical and spatial relationships among feature maps. In this paper, to overcome this shortcoming, we present a novel channel-wise spatially autocorrelated (CSA) attention mechanism. Inspired by geographical analysis, the proposed CSA exploits the spatial relationships between channels of feature maps to produce an effective channel descriptor. To the best of our knowledge, this is the first time the concept of geographical spatial analysis has been utilized in deep CNNs. The proposed CSA adds negligible learning parameters and light computational overhead to the deep model, making it a powerful yet efficient attention module of choice. We validate the effectiveness of the proposed CSA networks (CSA-Nets) through extensive experiments and analysis on the ImageNet and MS COCO benchmark datasets for image classification, object detection, and instance segmentation. The experimental results demonstrate that CSA-Nets consistently achieve competitive performance and superior generalization compared with several state-of-the-art attention-based CNNs across different benchmark tasks and datasets.
https://arxiv.org/abs/2405.05755
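A hedged sketch of the geographic-analysis flavour: a Moran's-I-style spatial autocorrelation statistic, computed with 4-neighbour lattice weights, serves as the channel descriptor for an SE-style gate. The paper's CSA descriptor also relates channels to one another; this per-channel variant is a simplification.

```python
import torch
import torch.nn as nn

def morans_i(x, eps=1e-6):
    # x: (B, C, H, W) -> (B, C) spatial autocorrelation per channel
    z = x - x.mean(dim=(2, 3), keepdim=True)
    num = (z[..., :-1, :] * z[..., 1:, :]).sum(dim=(2, 3)) \
        + (z[..., :, :-1] * z[..., :, 1:]).sum(dim=(2, 3))
    den = (z * z).sum(dim=(2, 3)) + eps
    n = x.size(2) * x.size(3)
    w = (x.size(2) - 1) * x.size(3) + x.size(2) * (x.size(3) - 1)  # neighbour pairs
    return (n / w) * num / den

class CSAGate(nn.Module):
    def __init__(self, channels, r=8):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(channels, channels // r), nn.ReLU(),
                                 nn.Linear(channels // r, channels))

    def forward(self, x):
        desc = morans_i(x)                   # (B, C) channel descriptor
        return x * torch.sigmoid(self.mlp(desc))[..., None, None]

out = CSAGate(64)(torch.randn(2, 64, 14, 14))
```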
In this paper, we propose a No-Reference Image Quality Assessment (NRIQA) guided cut-off point selection (CPS) strategy to enhance the performance of a fine-grained classification system. Scores given by existing NRIQA methods on the same image may vary and are not as independent of natural image augmentations as expected, which weakens their connection and explainability to fine-grained image classification. Taking the three most commonly adopted image augmentation configurations -- cropping, rotating, and blurring -- as the entry point, we formulate a two-step mechanism for selecting the most discriminative subset from a given image dataset by considering both the confidence of model predictions and the density distribution of image qualities over several NRIQA methods. Concretely, the cut-off points yielded by those methods are aggregated via majority voting to inform the image subset selection process. The efficacy and efficiency of this mechanism are confirmed by comparing models trained on high-quality images against those trained on a mixture of high- and low-quality ones, yielding a 0.7% to 4.2% improvement in mean accuracy on a commercial product dataset across four deep neural classifiers. Its robustness is shown by an ablation study with ResNet34, in which all the selected high-quality images can be used jointly with 70% of the low-quality images at a cost of only 1.3% in classification precision.
https://arxiv.org/abs/2405.05742
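A sketch of the second step, majority voting over per-method cut-off points; using the 30th percentile of each NRIQA method's own score distribution as its cut-off is an assumption, and the model-confidence criterion from step one is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n_images, n_methods = 1000, 3
scores = rng.normal(size=(n_methods, n_images))   # stand-in NRIQA scores

cutoffs = np.percentile(scores, 30, axis=1)       # per-method cut-off point
votes = (scores >= cutoffs[:, None]).sum(axis=0)  # methods that keep each image
keep = votes > n_methods // 2                     # majority voting
print(f"selected {keep.sum()} of {n_images} images")
```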
Ownership verification is currently the most critical and widely adopted post-hoc method to safeguard model copyright. In general, model owners exploit it to identify whether a given suspicious third-party model is stolen from them by examining whether it has particular properties 'inherited' from their released models. Currently, backdoor-based model watermarks are the primary and cutting-edge methods to implant such properties in the released models. However, backdoor-based methods have two fatal drawbacks, including harmfulness and ambiguity. The former indicates that they introduce maliciously controllable misclassification behaviors (i.e., backdoor) to the watermarked released models. The latter denotes that malicious users can easily pass the verification by finding other misclassified samples, leading to ownership ambiguity. In this paper, we argue that both limitations stem from the 'zero-bit' nature of existing watermarking schemes, where they exploit the status (i.e., misclassified) of predictions for verification. Motivated by this understanding, we design a new watermarking paradigm, i.e., Explanation as a Watermark (EaaW), that implants verification behaviors into the explanation of feature attribution instead of model predictions. Specifically, EaaW embeds a 'multi-bit' watermark into the feature attribution explanation of specific trigger samples without changing the original prediction. We correspondingly design the watermark embedding and extraction algorithms inspired by explainable artificial intelligence. In particular, our approach can be used for different tasks (e.g., image classification and text generation). Extensive experiments verify the effectiveness and harmlessness of our EaaW and its resistance to potential attacks.
https://arxiv.org/abs/2405.04825
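A sketch of reading a multi-bit message from an explanation: an occlusion-style feature attribution is computed for a trigger sample, and the signs over a fixed grid are decoded as bits. EaaW's embedding step, which optimizes the model so these bits match the owner's message without changing the prediction, is omitted, and the attribution method and grid size are illustrative.

```python
import torch

@torch.no_grad()
def occlusion_attribution(model, x, cls, grid=8):
    # x: (1, 3, H, W); attribution = score drop when each grid cell is zeroed
    base = model(x)[0, cls]
    H, W = x.shape[-2:]
    hs, ws = H // grid, W // grid
    attr = torch.zeros(grid, grid)
    for i in range(grid):
        for j in range(grid):
            x_occ = x.clone()
            x_occ[..., i * hs:(i + 1) * hs, j * ws:(j + 1) * ws] = 0
            attr[i, j] = base - model(x_occ)[0, cls]
    return attr

def extract_bits(model, trigger, cls):
    attr = occlusion_attribution(model, trigger, cls)
    return (attr.flatten() > 0).int()   # a 64-bit message from the sign pattern

# bits = extract_bits(watermarked_model, trigger_image, target_class)
```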