The multi-scale receptive field and the large kernel attention (LKA) module have been shown to significantly improve performance in lightweight image super-resolution. However, existing lightweight super-resolution (SR) methods seldom design efficient building blocks with multi-scale receptive fields for local modeling, and their LKA modules face a quadratic increase in computational and memory footprints as the convolutional kernel size grows. To address the first issue, we propose multi-scale blueprint separable convolutions (MBSConv), a highly efficient building block with a multi-scale receptive field that focuses on learning multi-scale information, a vital component of discriminative representation. As for the second issue, we revisit the key properties of LKA and find that the adjacent direct interaction of local information and long-distance dependencies is crucial to its remarkable performance. To mitigate the complexity of LKA while preserving this property, we propose a large coordinate kernel attention (LCKA) module, which decomposes the 2D convolutional kernels of the depth-wise convolutional layers in LKA into horizontal and vertical 1D kernels. LCKA enables the adjacent direct interaction of local information and long-distance dependencies in both the horizontal and vertical directions. Moreover, LCKA allows the direct use of extremely large kernels in the depth-wise convolutional layers to capture more contextual information, which significantly improves reconstruction performance while incurring lower computational complexity and memory footprints. Integrating MBSConv and LCKA, we propose a large coordinate kernel attention network (LCAN).
https://arxiv.org/abs/2405.09353
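The kernel decomposition that LCKA performs (splitting each depth-wise 2-D kernel into a horizontal and a vertical 1-D kernel) can be illustrated in plain NumPy. A k×k depth-wise kernel costs k² weights per channel, while the 1-D pair costs only 2k; the two 1-D passes reproduce the 2-D convolution exactly when the kernel is separable (rank-1), which is, in effect, the approximation such decompositions exploit. This is an illustrative sketch with invented helper names, not the authors' implementation:

```python
import numpy as np

def xcorr1d_same(x, k1d, axis):
    """Zero-padded 'same' 1-D cross-correlation of a 2-D array along `axis`."""
    p = len(k1d) // 2
    xm = np.moveaxis(x, axis, -1)
    xp = np.pad(xm, [(0, 0), (p, p)])
    out = np.stack([xp[:, c:c + len(k1d)] @ k1d for c in range(xm.shape[-1])],
                   axis=-1)
    return np.moveaxis(out, -1, axis)

def xcorr2_same(x, K2d):
    """Zero-padded 'same' 2-D cross-correlation (one depth-wise channel)."""
    p = K2d.shape[0] // 2
    xp = np.pad(x, p)
    kh_, kw_ = K2d.shape
    return np.array([[np.sum(xp[i:i + kh_, j:j + kw_] * K2d)
                      for j in range(x.shape[1])] for i in range(x.shape[0])])

rng = np.random.default_rng(0)
k = 7                                   # kernel size
kv, kh = rng.normal(size=k), rng.normal(size=k)
K = np.outer(kv, kh)                    # rank-1 (separable) 2-D kernel
img = rng.normal(size=(16, 16))

full = xcorr2_same(img, K)                                     # k*k weights
sep = xcorr1d_same(xcorr1d_same(img, kh, axis=1), kv, axis=0)  # 2*k weights
assert np.allclose(full, sep)           # exact for a separable kernel
```

For k = 7 this is 49 vs. 14 weights per channel, and the gap grows quadratically for the extremely large kernels the abstract mentions.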
AI-based analysis of histopathology whole slide images (WSIs) is central to computational pathology. However, image quality can impact model performance. Here, we investigate to what extent unsharp areas of WSIs impact deep convolutional neural network classification performance. We propose a multi-model approach, DeepBlurMM, to alleviate the impact of unsharp image areas and improve model performance. DeepBlurMM uses sigma cut-offs to determine the most suitable model for predicting tiles with various levels of blurring within a single WSI, where sigma is the standard deviation of the Gaussian blur. Specifically, the cut-offs categorise tiles into sharp or slight blur, moderate blur, and high blur, and each blur level has a corresponding model for tile-level predictions. In a simulation study, we demonstrated DeepBlurMM in a binary classification task for breast cancer Nottingham Histological Grade 1 vs 3. Performance, evaluated over 5-fold cross-validation, showed that DeepBlurMM outperformed the base model under moderate-blur and mixed-blur conditions. Unsharp image tiles (local blurriness) at prediction time reduced model performance. The proposed multi-model approach improved performance under some conditions, with the potential to improve quality in both research and clinical applications.
https://arxiv.org/abs/2405.09298
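The routing rule the abstract describes (sigma cut-offs that send each tile to the model trained for its blur level) can be sketched as follows; the cut-off values and the stub models are illustrative assumptions, not the paper's settings:

```python
# Hypothetical sigma cut-offs separating sharp/slight, moderate, and high blur.
SIGMA_CUTOFFS = (1.0, 3.0)

def blur_bucket(sigma, cutoffs=SIGMA_CUTOFFS):
    """0 = sharp or slight blur, 1 = moderate blur, 2 = high blur."""
    lo, hi = cutoffs
    return 0 if sigma <= lo else (1 if sigma <= hi else 2)

# One model per blur level; these stubs stand in for the trained CNNs and
# return a fake tile-level probability.
models = {
    0: lambda tile: 0.9,   # model for sharp/slightly blurred tiles
    1: lambda tile: 0.7,   # model trained with moderate blur
    2: lambda tile: 0.5,   # model trained with heavy blur
}

def predict_tile(tile, sigma):
    """Route a tile to the model matching its estimated blur level."""
    return models[blur_bucket(sigma)](tile)
```

A WSI-level prediction would then aggregate `predict_tile` over all tiles of the slide.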
We propose a new graph convolutional block, called MusGConv, specifically designed for the efficient processing of musical score data and motivated by general perceptual principles. It focuses on two fundamental dimensions of music, pitch and rhythm, and considers both relative and absolute representations of these components. We evaluate our approach on four different musical understanding problems: monophonic voice separation, harmonic analysis, cadence detection, and composer identification which, in abstract terms, translate to different graph learning problems, namely, node classification, link prediction, and graph classification. Our experiments demonstrate that MusGConv improves the performance on three of the aforementioned tasks while being conceptually very simple and efficient. We interpret this as evidence that it is beneficial to include perception-informed processing of fundamental musical concepts when developing graph network applications on musical score data.
https://arxiv.org/abs/2405.09224
Due to the increasing need for effective security measures and the integration of cameras in commercial products, a huge amount of visual data is created today. Law enforcement agencies (LEAs) are inspecting images and videos to find radicalization, propaganda for terrorist organizations and illegal products on darknet markets. This is time consuming. Instead of an undirected search, LEAs would like to adapt to new crimes and threats, and focus only on data from specific locations, persons or objects, which requires flexible interpretation of image content. Visual concept detection with deep convolutional neural networks (CNNs) is a crucial component to understand the image content. This paper has five contributions. The first contribution allows image-based geo-localization to estimate the origin of an image. CNNs and geotagged images are used to create a model that determines the location of an image by its pixel values. The second contribution enables analysis of fine-grained concepts to distinguish sub-categories in a generic concept. The proposed method encompasses data acquisition and cleaning and concept hierarchies. The third contribution is the recognition of person attributes (e.g., glasses or moustache) to enable query by textual description for a person. The person-attribute problem is treated as a specific sub-task of concept classification. The fourth contribution is an intuitive image annotation tool based on active learning. Active learning allows users to define novel concepts flexibly and train CNNs with minimal annotation effort. The fifth contribution increases the flexibility for LEAs in the query definition by using query expansion. Query expansion maps user queries to known and detectable concepts. Therefore, no prior knowledge of the detectable concepts is required for the users. The methods are validated on data with varying locations (popular and non-touristic locations), varying person attributes (CelebA dataset), and varying number of annotations.
https://arxiv.org/abs/2405.09194
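The query-expansion contribution described above can be sketched simply: a free-text user query is mapped onto the concepts the detectors actually know, so users need no prior knowledge of the detectable vocabulary. The synonym table below is a toy stand-in; a real system might use embeddings or an ontology:

```python
# Concepts the detectors can recognize (illustrative vocabulary).
DETECTABLE = {"vehicle", "sedan", "eyeglasses", "moustache"}

# Toy expansion table mapping user terms to detectable concepts.
SYNONYMS = {"car": ["vehicle", "sedan"], "glasses": ["eyeglasses"]}

def expand_query(words):
    """Map query words to known, detectable concepts, preserving order."""
    out = []
    for w in words:
        for c in [w] + SYNONYMS.get(w, []):
            if c in DETECTABLE and c not in out:
                out.append(c)
    return out
```

Terms with no detectable counterpart simply expand to nothing, which is where a richer expansion model would take over.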
Indian folk paintings have a rich mosaic of symbols, colors, textures, and stories, making them an invaluable repository of cultural legacy. The paper presents a novel approach to classifying these paintings into distinct art forms and tagging them with their unique salient features. A custom dataset named FolkTalent, comprising 2279 digital images of paintings across 12 different forms, has been prepared using websites that are direct outlets of Indian folk paintings. Tags covering a wide range of attributes like color, theme, artistic style, and patterns are generated using GPT4 and verified by an expert for each painting. Classification is performed by employing the RandomForest ensemble technique on fine-tuned Convolutional Neural Network (CNN) models, achieving an accuracy of 91.83%. Tagging is accomplished via prominent fine-tuned CNN-based backbones with a custom classifier attached on top to perform multi-label image classification. The generated tags offer a deeper insight into the painting, enabling an enhanced search experience based on theme and visual attributes. The proposed hybrid model sets a new benchmark in folk painting classification and tagging, significantly contributing to cataloging India's folk-art heritage.
https://arxiv.org/abs/2405.08776
Depth estimation plays a crucial role in various tasks within endoscopic surgery, including navigation, surface reconstruction, and augmented reality visualization. Despite the significant achievements of foundation models in vision tasks, including depth estimation, their direct application to the medical domain often results in suboptimal performance. This highlights the need for efficient methods to adapt these models to endoscopic depth estimation. We propose Endoscopic Depth Any Camera (EndoDAC), an efficient self-supervised depth estimation framework that adapts foundation models to endoscopic scenes. Specifically, we develop Dynamic Vector-Based Low-Rank Adaptation (DV-LoRA) and employ Convolutional Neck blocks to tailor the foundation model to the surgical domain, utilizing remarkably few trainable parameters. Given that camera information is not always accessible, we also introduce a self-supervised adaptation strategy that estimates camera intrinsics using the pose encoder. Our framework can be trained solely on monocular surgical videos from any camera, ensuring minimal training costs. Experiments demonstrate that our approach achieves superior performance even with fewer training epochs and without access to the ground-truth camera intrinsics. Code is available at this https URL.
https://arxiv.org/abs/2405.08672
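The low-rank adaptation idea behind DV-LoRA can be sketched generically: a frozen weight matrix W receives a trainable rank-r update A @ B, so only r·(d_in + d_out) parameters are trained instead of d_in·d_out. This shows plain LoRA under assumed sizes; the paper's dynamic-vector variant adds components not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 64, 4
W = rng.normal(size=(d_in, d_out))       # frozen pretrained weight
A = rng.normal(size=(d_in, r)) * 0.01    # trainable down-projection
B = np.zeros((r, d_out))                 # trainable up-projection (zero init)

def adapted_forward(x):
    """Forward pass through the adapted layer: x @ (W + A @ B)."""
    return x @ W + (x @ A) @ B

x = rng.normal(size=(2, d_in))
# With B initialised to zero the adapter starts as an exact no-op.
assert np.allclose(adapted_forward(x), x @ W)
trainable = A.size + B.size              # 512 vs 4096 full parameters
```

Only A and B receive gradients during fine-tuning; W stays frozen, which is what keeps the adaptation cost low.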
Deep learning methods, especially Convolutional Neural Networks (CNN) and Vision Transformer (ViT), are frequently employed to perform semantic segmentation of high-resolution remotely sensed images. However, CNNs are constrained by their restricted receptive fields, while ViTs face challenges due to their quadratic complexity. Recently, the Mamba model, featuring linear complexity and a global receptive field, has gained extensive attention for vision tasks. In such tasks, images need to be serialized to form sequences compatible with the Mamba model. Numerous research efforts have explored scanning strategies to serialize images, aiming to enhance the Mamba model's understanding of images. However, the effectiveness of these scanning strategies remains uncertain. In this research, we conduct a comprehensive experimental investigation on the impact of mainstream scanning directions and their combinations on semantic segmentation of remotely sensed images. Through extensive experiments on the LoveDA, ISPRS Potsdam, and ISPRS Vaihingen datasets, we demonstrate that no single scanning strategy outperforms others, regardless of their complexity or the number of scanning directions involved. A simple, single scanning direction is deemed sufficient for semantic segmentation of high-resolution remotely sensed images. Relevant directions for future research are also recommended.
https://arxiv.org/abs/2405.08493
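The serialization step being compared can be made concrete: each scanning strategy is just a different ordering of the same patches into a 1-D sequence before a Mamba block sees them. A toy example with a 3×4 grid of patch ids:

```python
import numpy as np

img = np.arange(12).reshape(3, 4)   # a tiny 3x4 "image" of patch ids

row_scan = img.reshape(-1)          # left-to-right, top-to-bottom
col_scan = img.T.reshape(-1)        # top-to-bottom, left-to-right
row_rev = row_scan[::-1]            # reversed row scan
# A multi-direction strategy simply feeds several such sequences in parallel.
```

The study's finding is that, for this task, which of these orderings is chosen (or how many are combined) makes little difference.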
Robust road surface estimation is required for autonomous ground vehicles to navigate safely. Despite becoming one of the main targets for autonomous-mobility researchers in recent years, it remains an open problem in which cameras and LiDAR sensors have been demonstrated to be adequate for predicting the position, size and shape of the road a vehicle is driving on in different environments. In this work, a novel Convolutional Neural Network model is proposed for the accurate estimation of the roadway surface. Furthermore, an ablation study has been conducted to investigate how different encoding strategies affect model performance, testing 6 slightly different neural network architectures. Our model is based on a Twin Encoder-Decoder Neural Network (TEDNet) for independent camera and LiDAR feature extraction, and has been trained and evaluated on the Kitti-Road dataset. Bird's Eye View projections of the camera and LiDAR data are used in this model to perform semantic segmentation of whether each pixel belongs to the road surface. The proposed method performs on par with other state-of-the-art methods and operates at the same frame rate as the LiDAR and cameras, so it is adequate for use in real-time applications.
https://arxiv.org/abs/2405.08429
Underwater imaging often suffers from low quality due to factors affecting light propagation and absorption in water. To improve image quality, some underwater image enhancement (UIE) methods based on convolutional neural networks (CNN) and Transformer have been proposed. However, CNN-based UIE methods are limited in modeling long-range dependencies, and Transformer-based methods involve a large number of parameters and complex self-attention mechanisms, posing efficiency challenges. Considering computational complexity and severe underwater image degradation, a state space model (SSM) with linear computational complexity for UIE, named WaterMamba, is proposed. We propose spatial-channel omnidirectional selective scan (SCOSS) blocks comprising spatial-channel coordinate omnidirectional selective scan (SCCOSS) modules and a multi-scale feedforward network (MSFFN). The SCOSS block models pixel and channel information flow, addressing dependencies. The MSFFN facilitates information flow adjustment and promotes synchronized operations within SCCOSS modules. Extensive experiments showcase WaterMamba's cutting-edge performance with reduced parameters and computational resources, outperforming state-of-the-art methods on various datasets, validating its effectiveness and generalizability. The code will be released on GitHub after acceptance.
https://arxiv.org/abs/2405.08419
Current architectures for video understanding mainly build upon 3D convolutional blocks or 2D convolutions with additional operations for temporal modeling. However, these methods all regard the temporal axis as a separate dimension of the video sequence, which requires large computation and memory budgets and thus limits their usage on mobile devices. In this paper, we propose to squeeze the time axis of a video sequence into the channel dimension and present a lightweight video recognition network, termed \textit{SqueezeTime}, for mobile video understanding. To enhance the temporal modeling capability of the proposed network, we design a Channel-Time Learning (CTL) Block to capture temporal dynamics of the sequence. This module has two complementary branches: one branch for temporal importance learning, and another branch with temporal position restoring capability to enhance inter-temporal object modeling ability. The proposed SqueezeTime is lightweight and fast, with high accuracy for mobile video understanding. Extensive experiments on various video recognition and action detection benchmarks, i.e., Kinetics400, Kinetics600, HMDB51, AVA2.1 and THUMOS14, demonstrate the superiority of our model. For example, our SqueezeTime achieves $+1.2\%$ accuracy and $+80\%$ GPU throughput gain on Kinetics400 over prior methods. Codes are publicly available at this https URL and this https URL.
https://arxiv.org/abs/2405.08344
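The core "squeeze" operation is a reshape: the temporal axis is folded into the channel dimension so cheap 2-D convolutions process all frames jointly instead of 3-D ones. A minimal sketch with illustrative tensor sizes:

```python
import numpy as np

clip = np.zeros((2, 3, 16, 56, 56), dtype=np.float32)  # (B, C, T, H, W)
b, c, t, h, w = clip.shape

# Fold time into channels: (B, C, T, H, W) -> (B, T*C, H, W).
squeezed = clip.transpose(0, 2, 1, 3, 4).reshape(b, t * c, h, w)
assert squeezed.shape == (2, 48, 56, 56)
# A Channel-Time Learning block then mixes these channels to recover temporal
# structure (importance weighting plus position restoring, per the abstract).
```

The reshape itself is free; all the modeling work moves into how the subsequent 2-D blocks mix the stacked channels.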
Objective: Automated segmentation tools are useful for calculating kidney volumes rapidly and accurately. Furthermore, these tools have the power to facilitate large-scale image-based artificial intelligence projects by generating input labels, such as for image registration algorithms. Prior automated segmentation models have largely ignored non-contrast computed tomography (CT) imaging. This work aims to implement and train a deep learning (DL) model to segment the kidneys and cystic renal lesions (CRLs) from non-contrast CT scans. Methods: Manual segmentation of the kidneys and CRLs was performed on 150 non-contrast abdominal CT scans. The data were divided into an 80/20 train/test split and a DL model was trained to segment the kidneys and CRLs. Various scoring metrics were used to assess model performance, including the Dice Similarity Coefficient (DSC), Jaccard Index (JI), and absolute and percent errors in kidney and lesion volumes. Bland-Altman (B-A) analysis was performed to compare manual versus DL-based kidney volumes. Results: The DL model achieved a median kidney DSC of 0.934, median CRL DSC of 0.711, and total median study DSC of 0.823. Average volume errors were 0.9% for renal parenchyma, 37.0% for CRLs, and 2.2% overall. B-A analysis demonstrated that DL-based volumes tended to be greater than manual volumes, with a mean bias of +3.0 ml (+/- 2 SD of +/- 50.2 ml). Conclusion: A deep learning model trained to segment kidneys and cystic renal lesions on non-contrast CT examinations was able to provide highly accurate segmentations, with a median kidney Dice Similarity Coefficient of 0.934. Keywords: deep learning; kidney segmentation; artificial intelligence; convolutional neural networks.
https://arxiv.org/abs/2405.08282
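The two overlap metrics reported above have standard definitions on binary masks; this sketch shows the textbook formulas on a toy example, not the authors' evaluation code:

```python
import numpy as np

def dice(a, b):
    """Dice Similarity Coefficient: 2|A∩B| / (|A| + |B|)."""
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum())

def jaccard(a, b):
    """Jaccard Index: |A∩B| / |A∪B|."""
    inter = np.logical_and(a, b).sum()
    return inter / np.logical_or(a, b).sum()

pred = np.array([[1, 1, 0], [0, 1, 0]], dtype=bool)
gt   = np.array([[1, 0, 0], [0, 1, 1]], dtype=bool)
# intersection = 2, |pred| = 3, |gt| = 3, union = 4
```

On these masks Dice is 4/6 ≈ 0.667 and Jaccard is 2/4 = 0.5; Dice is always at least as large as Jaccard on the same pair.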
Mamba, an architecture with RNN-like token mixer of state space model (SSM), was recently introduced to address the quadratic complexity of the attention mechanism and subsequently applied to vision tasks. Nevertheless, the performance of Mamba for vision is often underwhelming when compared with convolutional and attention-based models. In this paper, we delve into the essence of Mamba, and conceptually conclude that Mamba is ideally suited for tasks with long-sequence and autoregressive characteristics. For vision tasks, as image classification does not align with either characteristic, we hypothesize that Mamba is not necessary for this task; Detection and segmentation tasks are also not autoregressive, yet they adhere to the long-sequence characteristic, so we believe it is still worthwhile to explore Mamba's potential for these tasks. To empirically verify our hypotheses, we construct a series of models named \emph{MambaOut} through stacking Mamba blocks while removing their core token mixer, SSM. Experimental results strongly support our hypotheses. Specifically, our MambaOut model surpasses all visual Mamba models on ImageNet image classification, indicating that Mamba is indeed unnecessary for this task. As for detection and segmentation, MambaOut cannot match the performance of state-of-the-art visual Mamba models, demonstrating the potential of Mamba for long-sequence visual tasks. The code is available at this https URL
https://arxiv.org/abs/2405.07992
Place recognition is the foundation for enabling autonomous systems to achieve independent decision-making and safe operations. It is also crucial in tasks such as loop closure detection and global localization within SLAM. Previous deep learning-based LiDAR Place Recognition (LPR) approaches utilize mundane point cloud representations as input, feeding different point cloud image inputs to convolutional neural networks (CNNs) or transformer architectures. However, the recently proposed Mamba deep learning model, built on state space models (SSMs), holds great potential for long sequence modeling. Therefore, we developed OverlapMamba, a novel network for place recognition, which represents input range views (RVs) as sequences. In a novel way, we employ a stochastic reconstruction approach to build shift state space models, compressing the visual representation. Evaluated on three different public datasets, our method effectively detects loop closures, showing robustness even when traversing previously visited locations from different directions. Relying on raw range view inputs, it outperforms typical LiDAR and multi-view combination methods in time complexity and speed, indicating strong place recognition capabilities and real-time efficiency.
https://arxiv.org/abs/2405.07966
Deep image hashing aims to map input images into simple binary hash codes via deep neural networks and thus enable effective large-scale image retrieval. Recently, hybrid networks that combine convolution and Transformer have achieved superior performance on various computer vision tasks and have attracted extensive attention from researchers. Nevertheless, the potential benefits of such hybrid networks in image retrieval still need to be verified. To this end, we propose a hybrid convolutional and self-attention deep hashing method known as HybridHash. Specifically, we propose a backbone network with stage-wise architecture in which a block aggregation function is introduced to achieve the effect of local self-attention and reduce the computational complexity. The interaction module has been elaborately designed to promote the communication of information between image blocks and to enhance the visual representations. We have conducted comprehensive experiments on three widely used datasets: CIFAR-10, NUS-WIDE and IMAGENET. The experimental results demonstrate that the proposed method has superior performance with respect to state-of-the-art deep hashing methods. Source code is available at this https URL.
https://arxiv.org/abs/2405.07524
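The retrieval mechanism deep hashing relies on can be sketched independently of the network: real-valued outputs are binarized into hash codes, and database images are ranked by Hamming distance to the query code. The feature values below are invented stand-ins for network outputs:

```python
import numpy as np

# Fake real-valued "network outputs" for three database images.
feats = np.array([[ 0.8, -0.3,  1.2, -0.5],
                  [-0.2,  0.7, -1.1,  0.4],
                  [ 0.5,  0.6,  0.1,  0.9]])
codes = (feats > 0).astype(np.uint8)     # sign binarization -> hash codes

q = np.array([0.9, -0.1, 0.3, -0.7])     # a query feature vector
q_code = (q > 0).astype(np.uint8)        # -> [1, 0, 1, 0]

hamming = (codes != q_code).sum(axis=1)  # Hamming distance to each code
best = int(hamming.argmin())             # item 0 is retrieved first
```

With short binary codes, this ranking step reduces to XOR-and-popcount, which is what makes hashing practical at large scale.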
High-quality images are crucial in remote sensing and UAV applications, but atmospheric haze can severely degrade image quality, making image dehazing a critical research area. Since the introduction of deep convolutional neural networks, numerous approaches have been proposed, and even more have emerged with the development of vision transformers and contrastive/few-shot learning. Simultaneously, papers describing dehazing architectures applicable to various Remote Sensing (RS) domains are also being published. This review goes beyond the traditional focus on benchmarked haze datasets, as we also explore the application of dehazing techniques to remote sensing and UAV datasets, providing a comprehensive overview of both deep learning and prior-based approaches in these domains. We identify key challenges, including the lack of large-scale RS datasets and the need for more robust evaluation metrics, and outline potential solutions and future research directions to address them. This review is the first, to our knowledge, to provide comprehensive discussions on both existing and very recent dehazing approaches (as of 2024) on benchmarked and RS datasets, including UAV-based imagery.
https://arxiv.org/abs/2405.07520
In recent years, deep learning based on Convolutional Neural Networks (CNNs) has achieved remarkable success in many applications. However, their heavy reliance on extensive labeled data and limited generalization ability to unseen classes pose challenges to their suitability for medical image processing tasks. Few-shot learning, which utilizes a small amount of labeled data to generalize to unseen classes, has emerged as a critical research area, attracting substantial attention. Currently, most studies employ a prototype-based approach, in which prototypical networks are used to construct prototypes from the support set, guiding the processing of the query set to obtain the final results. While effective, this approach heavily relies on the support set while neglecting the query set, resulting in notable disparities within the model classes. To mitigate this drawback, we propose a novel Support-Query Prototype Fusion Network (SQPFNet). SQPFNet initially generates several support prototypes for the foreground areas of the support images, thus producing a coarse segmentation mask. Subsequently, a query prototype is constructed based on the coarse segmentation mask, additionally exploiting pattern information in the query set. Thus, SQPFNet constructs high-quality support-query fused prototypes, upon which the query image is segmented to obtain the final refined query mask. Evaluation results on two public datasets, SABS and CMR, show that SQPFNet achieves state-of-the-art performance.
https://arxiv.org/abs/2405.07516
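The prototype construction that SQPFNet builds on (masked average pooling of a feature map, then per-pixel similarity to the prototype) is the generic mechanism of prototypical segmentation networks; this sketch shows that mechanism with toy features, not the paper's support-query fusion itself:

```python
import numpy as np

def masked_avg_pool(feat, mask):
    """feat: (C, H, W) features; mask: (H, W) binary -> (C,) prototype."""
    w = mask / (mask.sum() + 1e-8)
    return (feat * w[None]).sum(axis=(1, 2))

def cosine_map(feat, proto):
    """Per-pixel cosine similarity between features and a prototype."""
    num = np.tensordot(proto, feat, axes=(0, 0))               # (H, W)
    den = np.linalg.norm(feat, axis=0) * np.linalg.norm(proto) + 1e-8
    return num / den

feat = np.zeros((2, 2, 2))
feat[:, 0, 0] = [1.0, 0.0]      # a "foreground-like" pixel
feat[:, 1, 1] = [0.0, 1.0]      # a "background-like" pixel
mask = np.array([[1, 0], [0, 0]], dtype=float)   # support foreground mask

proto = masked_avg_pool(feat, mask)   # prototype from the masked region
sim = cosine_map(feat, proto)         # high where features match the prototype
```

Thresholding `sim` yields the coarse mask; SQPFNet's contribution is then to build a second, query-side prototype from that coarse mask and fuse the two.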
Our research focuses on the critical field of early diagnosis of disease by examining retinal blood vessels in fundus images. While automatic segmentation of retinal blood vessels holds promise for early detection, accurate analysis remains challenging due to the limitations of existing methods, which often lack discrimination power and are susceptible to influences from pathological regions. Our research in fundus image analysis advances deep learning-based classification using eight pre-trained CNN models. To enhance interpretability, we utilize Explainable AI techniques such as Grad-CAM, Grad-CAM++, Score-CAM, Faster Score-CAM, and Layer CAM. These techniques illuminate the decision-making processes of the models, fostering transparency and trust in their predictions. Expanding our exploration, we investigate ten models, including TransUNet with ResNet backbones, Attention U-Net with DenseNet and ResNet backbones, and Swin-UNET. Incorporating diverse architectures such as ResNet50V2, ResNet101V2, ResNet152V2, and DenseNet121 among others, this comprehensive study deepens our insights into attention mechanisms for enhanced fundus image analysis. Among the evaluated models for fundus image classification, ResNet101 emerged with the highest accuracy, achieving an impressive 94.17%. On the other end of the spectrum, EfficientNetB0 exhibited the lowest accuracy among the models, achieving a score of 88.33%. Furthermore, in the domain of fundus image segmentation, Swin-Unet demonstrated a Mean Pixel Accuracy of 86.19%, showcasing its effectiveness in accurately delineating regions of interest within fundus images. Conversely, Attention U-Net with DenseNet201 backbone exhibited the lowest Mean Pixel Accuracy among the evaluated models, achieving a score of 75.87%.
https://arxiv.org/abs/2405.07338
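The CAM-family techniques listed above all weight a convolutional layer's activation maps to localize the evidence behind a prediction; in Grad-CAM the weights are the spatially averaged gradients of the class score. A minimal numpy sketch of that weighting step (in a real framework the activations and gradients come from hooks; shapes here are illustrative):

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heatmap from a conv layer's activations and gradients.

    activations: (K, H, W) feature maps of the chosen layer
    gradients:   (K, H, W) gradients of the class score w.r.t. those maps
    returns:     (H, W) heatmap, normalized to [0, 1]
    """
    weights = gradients.mean(axis=(1, 2))          # global-average-pool the gradients
    cam = np.einsum('k,khw->hw', weights, activations)
    cam = np.maximum(cam, 0)                       # ReLU: keep positive evidence only
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam
```

Grad-CAM++, Score-CAM, and the other variants differ mainly in how the per-channel weights are computed, not in this final weighted combination.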
With the rapid advancement of technologies such as virtual reality, augmented reality, and gesture control, users expect interactions with computer interfaces to be more natural and intuitive. Existing visual algorithms often struggle to accomplish advanced human-computer interaction tasks, necessitating accurate and reliable absolute spatial prediction methods. Moreover, dealing with complex scenes and occlusions in monocular images poses entirely new challenges. This study proposes a network model that performs parallel processing of root-relative grids and root recovery tasks. The model enables the recovery of 3D hand meshes in camera space from monocular RGB images. To facilitate end-to-end training, we utilize an implicit learning approach for 2D heatmaps, enhancing the compatibility of 2D cues across different subtasks. We incorporate the Inception concept into a spectral graph convolutional network to explore the root-relative mesh, and integrate it with a locally detailed and globally attentive method designed for root recovery. This approach improves the model's predictive performance in complex environments and self-occluded scenes. Through evaluation on the large-scale hand dataset FreiHAND, we demonstrate that our proposed model is comparable with state-of-the-art models. This study contributes to the advancement of techniques for accurate and reliable absolute spatial prediction in various human-computer interaction applications.
https://arxiv.org/abs/2405.07167
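Implicit 2D-heatmap learning is commonly realized with a soft-argmax: keypoint coordinates are read out of the heatmap as a differentiable expectation, so no explicit argmax supervision is needed and the 2D cues stay compatible with downstream subtasks. A hedged numpy sketch of that operation (a generic formulation, not necessarily the paper's exact one):

```python
import numpy as np

def soft_argmax_2d(heatmap, beta=100.0):
    """Differentiable keypoint localization from a 2D heatmap.

    heatmap: (H, W) unnormalized scores for one keypoint
    beta:    temperature; larger values approach a hard argmax
    returns: (x, y) expected coordinates in pixel units
    """
    h, w = heatmap.shape
    flat = heatmap.reshape(-1) * beta
    flat = flat - flat.max()                      # numerical stability
    probs = (np.exp(flat) / np.exp(flat).sum()).reshape(h, w)
    ys, xs = np.mgrid[0:h, 0:w]                   # row and column index grids
    return (probs * xs).sum(), (probs * ys).sum()
```

Because the output is a probability-weighted average of pixel coordinates, gradients flow through the whole heatmap, which is what makes end-to-end training possible.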
Precision agriculture involves the application of advanced technologies to improve agricultural productivity, efficiency, and profitability while minimizing waste and environmental impact. Deep learning approaches enable automated decision-making for many visual tasks. However, in the agricultural domain, variability in growth stages and environmental conditions, such as weather and lighting, presents significant challenges to developing deep learning-based techniques that generalize across different conditions. The resource-intensive nature of creating extensive annotated datasets that capture these variabilities further hinders the widespread adoption of these approaches. To tackle these issues, we introduce a semi-self-supervised domain adaptation technique based on deep convolutional neural networks with a probabilistic diffusion process, requiring minimal manual data annotation. Using only three manually annotated images and a selection of video clips from wheat fields, we generated a large-scale computationally annotated dataset of image-mask pairs and a large dataset of unannotated images extracted from video frames. We developed a two-branch convolutional encoder-decoder model architecture that uses both synthesized image-mask pairs and unannotated images, enabling effective adaptation to real images. The proposed model achieved a Dice score of 80.7\% on an internal test dataset and a Dice score of 64.8\% on an external test set, composed of images from five countries and spanning 18 domains, indicating its potential to develop generalizable solutions that could encourage the wider adoption of advanced technologies in agriculture.
https://arxiv.org/abs/2405.07157
Binary convolutional neural networks (BCNNs) provide a potential solution to reduce the memory requirements and computational costs associated with deep neural networks (DNNs). However, achieving a trade-off between performance and computational resources remains a significant challenge. Furthermore, the fully connected layer of BCNNs has evolved into a significant computational bottleneck. This is mainly due to the conventional practice of excluding the input layer and fully connected layer from binarization to prevent a substantial loss in accuracy. In this paper, we propose a hybrid model named ReActXGB, in which we replace the fully connected layer of ReActNet-A with XGBoost. This modification aims to narrow the performance gap between BCNNs and real-valued networks while maintaining lower computational costs. Experimental results on the FashionMNIST benchmark demonstrate that ReActXGB outperforms ReActNet-A by 1.47% in top-1 accuracy, along with a reduction of 7.14% in floating-point operations (FLOPs) and 1.02% in model size.
https://arxiv.org/abs/2405.08020
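In BCNNs of the ReActNet family, convolutional weights are binarized to ±1 with a per-channel scaling factor (the XNOR-Net convention), which is what allows convolutions to be replaced by cheap bitwise operations. A minimal numpy sketch of that binarization step (illustrative only; ReActNet additionally uses learnable activation shifts, which are omitted here):

```python
import numpy as np

def binarize_weights(w):
    """Binarize a weight tensor to {-alpha, +alpha} per output channel.

    w: (out_channels, ...) real-valued weights
    returns: binarized weights with the same shape
    """
    flat = w.reshape(w.shape[0], -1)
    alpha = np.abs(flat).mean(axis=1)          # per-channel scaling factor
    signs = np.where(flat >= 0, 1.0, -1.0)     # sign(0) treated as +1
    return (signs * alpha[:, None]).reshape(w.shape)
```

The scaling factor alpha minimizes the L2 error between the real-valued and binarized weights, which is why accuracy degrades far less than with plain sign binarization.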