CLIP models perform remarkably well on zero-shot classification and retrieval tasks. But recent studies have shown that the learnt representations in CLIP are not well suited for dense prediction tasks like object detection, semantic segmentation or depth estimation. More recently, multi-stage training methods for CLIP models have been introduced to mitigate the weak performance of CLIP on downstream tasks. In this work, we find that simply improving the quality of captions in image-text datasets improves the quality of CLIP's visual representations, resulting in significant improvements on downstream dense prediction vision tasks. In fact, we find that CLIP pretraining with good quality captions can surpass recent supervised, self-supervised and weakly supervised pretraining methods. We show that when a CLIP model with ViT-B/16 as the image encoder is trained on well-aligned image-text pairs, it obtains 12.1% higher mIoU and 11.5% lower RMSE on semantic segmentation and depth estimation tasks over recent state-of-the-art Masked Image Modeling (MIM) pretraining methods like Masked Autoencoder (MAE). We find that mobile architectures also benefit significantly from CLIP pretraining. A recent mobile vision architecture, MCi2, with CLIP pretraining obtains performance similar to Swin-L, pretrained on ImageNet-22k, on the semantic segmentation task while being 6.1$\times$ smaller. Moreover, we show that improving caption quality results in $10\times$ data efficiency when finetuning for dense prediction tasks.
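As a point of reference (not taken from the paper), CLIP's pretraining objective is the symmetric image-text contrastive loss sketched below; better-aligned captions change only the positive pairs this loss sees, not the loss itself. Embedding sizes and the temperature are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (B, D) tensors from the image and text encoders.
    Better-aligned captions simply give this loss cleaner positive pairs.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature       # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)            # image -> matching caption
    loss_t2i = F.cross_entropy(logits.t(), targets)        # caption -> matching image
    return 0.5 * (loss_i2t + loss_t2i)

# Toy usage with random embeddings standing in for encoder outputs.
if __name__ == "__main__":
    img = torch.randn(8, 512)
    txt = torch.randn(8, 512)
    print(clip_contrastive_loss(img, txt).item())
```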
https://arxiv.org/abs/2405.08911
Deep learning methods, especially Convolutional Neural Networks (CNN) and Vision Transformer (ViT), are frequently employed to perform semantic segmentation of high-resolution remotely sensed images. However, CNNs are constrained by their restricted receptive fields, while ViTs face challenges due to their quadratic complexity. Recently, the Mamba model, featuring linear complexity and a global receptive field, has gained extensive attention for vision tasks. In such tasks, images need to be serialized to form sequences compatible with the Mamba model. Numerous research efforts have explored scanning strategies to serialize images, aiming to enhance the Mamba model's understanding of images. However, the effectiveness of these scanning strategies remains uncertain. In this research, we conduct a comprehensive experimental investigation on the impact of mainstream scanning directions and their combinations on semantic segmentation of remotely sensed images. Through extensive experiments on the LoveDA, ISPRS Potsdam, and ISPRS Vaihingen datasets, we demonstrate that no single scanning strategy outperforms others, regardless of their complexity or the number of scanning directions involved. A simple, single scanning direction is deemed sufficient for semantic segmentation of high-resolution remotely sensed images. Relevant directions for future research are also recommended.
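To make the notion of a "scanning direction" concrete, here is a small illustrative sketch (not from the paper) of how a patch grid can be serialized into a 1D sequence under a few common scan orders before being fed to a Mamba-style sequence model.

```python
import numpy as np

def serialize(grid, order="row"):
    """Serialize an (H, W) grid of patch indices into a 1D sequence.

    order: 'row' (left-to-right, top-to-bottom), 'col' (top-to-bottom,
    left-to-right), 'zigzag' (row-major with every other row reversed),
    or 'diag' (anti-diagonal sweep).
    """
    H, W = grid.shape
    if order == "row":
        return grid.reshape(-1)
    if order == "col":
        return grid.T.reshape(-1)
    if order == "zigzag":
        rows = [grid[i] if i % 2 == 0 else grid[i][::-1] for i in range(H)]
        return np.concatenate(rows)
    if order == "diag":
        return np.concatenate([np.diagonal(grid[::-1], offset=k)
                               for k in range(-(H - 1), W)])
    raise ValueError(order)

if __name__ == "__main__":
    grid = np.arange(16).reshape(4, 4)   # 4x4 patch grid, patches numbered 0..15
    for o in ("row", "col", "zigzag", "diag"):
        print(o, serialize(grid, o))
```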
https://arxiv.org/abs/2405.08493
Robust road surface estimation is required for autonomous ground vehicles to navigate safely. Although it has become one of the main targets of autonomous mobility research in recent years, it remains an open problem, one in which cameras and LiDAR sensors have proven adequate for predicting the position, size and shape of the road a vehicle is driving on in different environments. In this work, a novel Convolutional Neural Network model is proposed for the accurate estimation of the roadway surface. Furthermore, an ablation study has been conducted to investigate how different encoding strategies affect model performance, testing 6 slightly different neural network architectures. Our model is based on a Twin Encoder-Decoder Neural Network (TEDNet) for independent camera and LiDAR feature extraction, and has been trained and evaluated on the Kitti-Road dataset. Bird's Eye View projections of the camera and LiDAR data are used in this model to perform semantic segmentation, predicting whether each pixel belongs to the road surface. The proposed method performs on par with other state-of-the-art methods and operates at the same frame rate as the LiDAR and cameras, making it suitable for real-time applications.
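A minimal PyTorch sketch of the twin-encoder idea follows: independent camera and LiDAR BEV encoders whose features are fused before a shared decoder produces a per-pixel road mask. Layer sizes and the fusion-by-concatenation choice are illustrative assumptions, not the exact TEDNet architecture.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
    )

class TwinEncoderDecoder(nn.Module):
    """Two independent encoders (camera BEV, LiDAR BEV) + one decoder -> road mask."""
    def __init__(self, cam_channels=3, lidar_channels=1, base=32):
        super().__init__()
        self.cam_enc = nn.Sequential(conv_block(cam_channels, base), nn.MaxPool2d(2),
                                     conv_block(base, base * 2), nn.MaxPool2d(2))
        self.lidar_enc = nn.Sequential(conv_block(lidar_channels, base), nn.MaxPool2d(2),
                                       conv_block(base, base * 2), nn.MaxPool2d(2))
        self.decoder = nn.Sequential(
            conv_block(base * 4, base * 2),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            conv_block(base * 2, base),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(base, 1, 1),      # per-pixel road / not-road logit
        )

    def forward(self, cam_bev, lidar_bev):
        fused = torch.cat([self.cam_enc(cam_bev), self.lidar_enc(lidar_bev)], dim=1)
        return self.decoder(fused)

if __name__ == "__main__":
    model = TwinEncoderDecoder()
    cam, lidar = torch.randn(1, 3, 256, 256), torch.randn(1, 1, 256, 256)
    print(model(cam, lidar).shape)   # torch.Size([1, 1, 256, 256])
```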
https://arxiv.org/abs/2405.08429
In this paper, we address two critical challenges in the domain of flood detection: the computational expense of large-scale time series change detection and the lack of interpretable decision-making processes in explainable AI (XAI). To overcome these challenges, we propose an interpretable multi-stage approach to flood detection, IMAFD. It provides an automatic, efficient and interpretable solution suitable for large-scale remote sensing tasks and offers insight into the decision-making process. The proposed IMAFD approach combines the analysis of dynamic time-series image sequences, to identify images with possible flooding, with static, within-image semantic segmentation. It combines anomaly detection (at both image and pixel level) with semantic segmentation. The flood detection problem is addressed through four stages: (1) at the sequence level, identifying suspected images; (2) at the multi-image level, detecting change within suspected images; (3) at the image level, semantic segmentation of images into Land, Water or Cloud classes; and (4) decision making. Our contributions are twofold. First, we efficiently reduce the number of frames to be processed for dense change detection by providing a multi-stage, holistic approach to flood detection. Second, the proposed semantic change detection method (stage 3) provides human users with an interpretable decision-making process, whereas most explainable AI (XAI) methods provide only post hoc explanations. The proposed IMAFD framework was evaluated on three datasets: WorldFloods, RavAEn and MediaEval. On all of these datasets, the framework demonstrates competitive performance compared to other methods while also offering interpretability and insight.
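The four-stage decision flow lends itself to a simple control-flow sketch. The helper functions below are placeholders standing in for the paper's anomaly detectors and segmentation model, and the thresholds are arbitrary, chosen only to make the staged filtering visible.

```python
import numpy as np

# Placeholder components standing in for IMAFD's stages; thresholds are illustrative.
def sequence_anomaly_score(image):                    # stage 1: per-image anomaly score
    return float(image.mean())

def changed_pixels(image, reference):                 # stage 2: change vs. a reference frame
    return np.abs(image - reference) > 0.2

def segment(image):                                   # stage 3: 0 = Land, 1 = Water, 2 = Cloud
    return (image > 0.5).astype(int)                  # toy "water" segmentation

def flood_decision(sequence, reference, seq_thresh=0.3, water_frac=0.1):
    """Stage 4: flag flooding only where all earlier stages agree."""
    flagged = []
    for t, image in enumerate(sequence):
        if sequence_anomaly_score(image) < seq_thresh:        # stage 1: skip unsuspicious frames
            continue
        change = changed_pixels(image, reference)             # stage 2
        water = segment(image) == 1                           # stage 3
        if (change & water).mean() > water_frac:              # stage 4: changed AND water
            flagged.append(t)
    return flagged

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reference = rng.random((64, 64)) * 0.2
    sequence = [reference.copy() for _ in range(4)]
    sequence[2] = reference + 0.6                             # simulate a flooded frame
    print("suspected flood frames:", flood_decision(sequence, reference))
```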
https://arxiv.org/abs/2405.07916
An effective pre-training framework with universal 3D representations is highly desirable for perceiving large-scale dynamic scenes. However, establishing such an ideal framework that is both task-generic and label-efficient poses a challenge in unifying the representation of the same primitive across diverse scenes. Current contrastive 3D pre-training methods typically follow a frame-level consistency, which focuses on the 2D-3D relationships in each detached image. Such a narrow notion of consistency greatly hampers the promising path toward a universal pre-training framework: (1) the cross-scene semantic self-conflict, i.e., the intense collision between primitive segments of the same semantics from different scenes; and (2) the lack of a globally unified bond that pushes cross-scene semantic consistency into 3D representation learning. To address these challenges, we propose CSC, a framework that puts scene-level semantic consistency at its heart, bridging the connection between similar semantic segments across various scenes. To achieve this goal, we combine the coherent semantic cues provided by the vision foundation model with the knowledge-rich cross-scene prototypes derived from complementary multi-modality information. Together, these allow us to train a universal 3D pre-training model that facilitates various downstream tasks with less fine-tuning effort. Empirically, we achieve consistent improvements over SOTA pre-training approaches in semantic segmentation (+1.4% mIoU), object detection (+1.0% mAP), and panoptic segmentation (+3.0% PQ) using their task-specific 3D networks on nuScenes. Code is released at this https URL, hoping to inspire future research.
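A rough sketch of the cross-scene prototype idea: class prototypes are accumulated over many scenes, and point features are pulled toward the prototype of their (pseudo-)class, giving a scene-level rather than frame-level consistency signal. This is a simplification under assumed shapes and pseudo-labels, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def update_prototypes(prototypes, feats, labels, momentum=0.9):
    """EMA-update one prototype per class from point features of the current scene.

    prototypes: (C, D), feats: (N, D), labels: (N,) pseudo-labels in [0, C).
    """
    for c in labels.unique():
        mean_c = F.normalize(feats[labels == c].mean(dim=0), dim=0)
        prototypes[c] = F.normalize(momentum * prototypes[c] + (1 - momentum) * mean_c, dim=0)
    return prototypes

def prototype_consistency_loss(feats, labels, prototypes, temperature=0.1):
    """Cross-entropy of point-to-prototype similarities: pulls each point toward
    the globally shared prototype of its class, shared across scenes."""
    logits = F.normalize(feats, dim=1) @ prototypes.t() / temperature   # (N, C)
    return F.cross_entropy(logits, labels)

if __name__ == "__main__":
    C, D, N = 4, 16, 100
    prototypes = F.normalize(torch.randn(C, D), dim=1)
    feats, labels = torch.randn(N, D), torch.randint(0, C, (N,))
    prototypes = update_prototypes(prototypes, feats, labels)
    print(prototype_consistency_loss(feats, labels, prototypes).item())
```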
https://arxiv.org/abs/2405.07201
Vision graph neural networks (ViG) offer a new avenue for exploration in computer vision. A major bottleneck in ViGs is the inefficient k-nearest neighbor (KNN) operation used for graph construction. To solve this issue, we propose a new method for designing ViGs, Dynamic Axial Graph Construction (DAGC), which is more efficient than KNN as it limits the number of considered graph connections made within an image. Additionally, we propose a novel CNN-GNN architecture, GreedyViG, which uses DAGC. Extensive experiments show that GreedyViG beats existing ViG, CNN, and ViT architectures in terms of accuracy, GMACs, and parameters on image classification, object detection, instance segmentation, and semantic segmentation tasks. Our smallest model, GreedyViG-S, achieves 81.1% top-1 accuracy on ImageNet-1K, 2.9% higher than Vision GNN and 2.2% higher than Vision HyperGraph Neural Network (ViHGNN), with fewer GMACs and a similar number of parameters. Our largest model, GreedyViG-B, obtains 83.9% top-1 accuracy, 0.2% higher than Vision GNN, with a 66.6% decrease in parameters and a 69% decrease in GMACs. GreedyViG-B also obtains the same accuracy as ViHGNN with a 67.3% decrease in parameters and a 71.3% decrease in GMACs. Our work shows that hybrid CNN-GNN architectures not only provide a new avenue for designing efficient models, but that they can also exceed the performance of current state-of-the-art models.
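To illustrate why axial construction is cheaper than KNN, here is a toy sketch that connects each token only to tokens in its own row and column of the patch grid (the axial neighborhood), rather than computing a full pairwise distance matrix for nearest-neighbor search. This is a simplified reading of the idea behind DAGC, not the paper's exact dynamic criterion.

```python
import torch

def axial_edges(h, w):
    """Return a (2, E) edge index connecting every token to the tokens sharing
    its row or column in an h x w grid -- O(N * (h + w)) edges instead of the
    O(N^2) distance matrix a dense KNN search needs."""
    idx = torch.arange(h * w).view(h, w)
    edges = []
    for r in range(h):
        for c in range(w):
            src = idx[r, c].item()
            for cc in range(w):                       # same row
                if cc != c:
                    edges.append((src, idx[r, cc].item()))
            for rr in range(h):                       # same column
                if rr != r:
                    edges.append((src, idx[rr, c].item()))
    return torch.tensor(edges).t()

def knn_edges(x, k):
    """Baseline KNN graph: needs the full pairwise distance matrix."""
    dist = torch.cdist(x, x)                          # (N, N)
    nn = dist.topk(k + 1, largest=False).indices[:, 1:]   # drop self-match
    src = torch.arange(x.size(0)).repeat_interleave(k)
    return torch.stack([src, nn.reshape(-1)])

if __name__ == "__main__":
    h = w = 4
    x = torch.randn(h * w, 8)                         # 16 token features
    print("axial edges:", axial_edges(h, w).shape[1])
    print("knn edges  :", knn_edges(x, k=6).shape[1])
```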
https://arxiv.org/abs/2405.06849
Federated learning (FL) offers a privacy-centric distributed learning framework, enabling model training on individual clients and central aggregation without necessitating data exchange. Nonetheless, FL implementations often suffer from non-i.i.d. and long-tailed class distributions across mobile applications, e.g., autonomous vehicles, which leads models to overfit as local training may converge to sub-optimal solutions. In our study, we explore the impact of data heterogeneity on model bias and introduce an innovative personalized FL framework, Multi-level Personalized Federated Learning (MuPFL), which leverages the hierarchical architecture of FL to fully harness computational resources at various levels. This framework integrates three pivotal modules: Biased Activation Value Dropout (BAVD) to mitigate overfitting and accelerate training; Adaptive Cluster-based Model Update (ACMU) to refine local models, ensuring coherent global aggregation; and Prior Knowledge-assisted Classifier Fine-tuning (PKCF) to bolster classification and personalize models to skewed local data using shared knowledge. Extensive experiments on diverse real-world datasets for image classification and semantic segmentation validate that MuPFL consistently outperforms state-of-the-art baselines, even under extreme non-i.i.d. and long-tail conditions, enhancing accuracy by as much as 7.39% and accelerating training by up to 80%, marking significant advances in both efficiency and effectiveness.
https://arxiv.org/abs/2405.06413
Many popular calibration methods for medical imaging focus on classification, but there are few comparable studies on semantic segmentation. In polyp segmentation of medical images, we find that the diseased area typically occupies only a small portion of the entire image; as a result, previous models are poorly calibrated for lesion regions yet well calibrated for background, despite their seemingly better overall Expected Calibration Error (ECE) scores. Therefore, we propose a four-branch calibration network with Mask-Loss and Mask-TS strategies that focuses on the scaling of logits within potential lesion regions, mitigating the influence of background interference. In the experiments, we compare existing calibration methods with the proposed Mask Temperature Scaling (Mask-TS). The results indicate that the proposed calibration network outperforms other methods both qualitatively and quantitatively.
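The core of masked temperature scaling is easy to sketch: a single temperature is fit by negative log-likelihood, but only on pixels inside a mask of potential lesion regions, so the abundant background pixels no longer dominate the calibration objective. The snippet below is a minimal post-hoc version under assumed tensor shapes, not the full four-branch network.

```python
import torch
import torch.nn.functional as F

def fit_masked_temperature(logits, labels, mask, iters=200, lr=0.05):
    """Fit a scalar temperature T by minimizing NLL only on masked (lesion) pixels.

    logits: (N, C, H, W) validation logits, labels: (N, H, W) ints,
    mask: (N, H, W) bool selecting potential lesion pixels.
    """
    log_t = torch.zeros(1, requires_grad=True)                # optimize log T so T stays positive
    opt = torch.optim.Adam([log_t], lr=lr)
    flat_logits = logits.permute(0, 2, 3, 1)[mask]            # (M, C) masked pixels only
    flat_labels = labels[mask]                                 # (M,)
    for _ in range(iters):
        opt.zero_grad()
        loss = F.cross_entropy(flat_logits / log_t.exp(), flat_labels)
        loss.backward()
        opt.step()
    return log_t.exp().item()

if __name__ == "__main__":
    torch.manual_seed(0)
    logits = torch.randn(2, 2, 32, 32) * 5.0                   # over-confident toy logits
    labels = torch.randint(0, 2, (2, 32, 32))
    mask = torch.rand(2, 32, 32) > 0.9                         # pretend ~10% of pixels are lesion
    print("fitted T:", fit_masked_temperature(logits, labels, mask))
```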
https://arxiv.org/abs/2405.05830
Event-based semantic segmentation (ESS) is a fundamental yet challenging task for event camera sensing. The difficulties in interpreting and annotating event data limit its scalability. While domain adaptation from images to event data can help to mitigate this issue, there exist data representational differences that require additional effort to resolve. In this work, for the first time, we synergize information from image, text, and event-data domains and introduce OpenESS to enable scalable ESS in an open-world, annotation-efficient manner. We achieve this goal by transferring the semantically rich CLIP knowledge from image-text pairs to event streams. To pursue better cross-modality adaptation, we propose a frame-to-event contrastive distillation and a text-to-event semantic consistency regularization. Experimental results on popular ESS benchmarks show that our approach outperforms existing methods. Notably, we achieve 53.93% and 43.31% mIoU on DDD17 and DSEC-Semantic without using either event or frame labels.
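A minimal sketch of the two transfer signals named above: a contrastive distillation term aligning each event embedding with its paired (frozen) frame embedding, and a consistency term asking the event branch's scores against CLIP text embeddings to agree with the frame branch's class distribution. Shapes, weighting, and the exact loss forms are assumptions for illustration, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def frame_to_event_distill(event_feat, frame_feat, temperature=0.07):
    """Contrastive distillation: each event embedding should match its paired
    (frozen) frame embedding more than any other frame in the batch."""
    e = F.normalize(event_feat, dim=-1)
    f = F.normalize(frame_feat.detach(), dim=-1)          # teacher side is frozen
    logits = e @ f.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)

def text_to_event_consistency(event_feat, text_emb, frame_probs):
    """Semantic consistency: event-branch class scores against CLIP text
    embeddings should agree with the frame branch's (pseudo-)class distribution."""
    event_scores = F.normalize(event_feat, dim=-1) @ F.normalize(text_emb, dim=-1).t()
    return F.kl_div(F.log_softmax(event_scores, dim=-1), frame_probs, reduction="batchmean")

if __name__ == "__main__":
    B, D, C = 8, 512, 6                                    # batch, embedding dim, classes
    event, frame = torch.randn(B, D), torch.randn(B, D)
    text = torch.randn(C, D)                               # stand-in for CLIP text embeddings
    frame_probs = torch.softmax(torch.randn(B, C), dim=-1)
    loss = frame_to_event_distill(event, frame) + 0.1 * text_to_event_consistency(event, text, frame_probs)
    print(loss.item())
```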
https://arxiv.org/abs/2405.05259
Efficient data utilization is crucial for advancing 3D scene understanding in autonomous driving, where reliance on heavily human-annotated LiDAR point clouds challenges fully supervised methods. Addressing this, our study extends into semi-supervised learning for LiDAR semantic segmentation, leveraging the intrinsic spatial priors of driving scenes and multi-sensor complements to augment the efficacy of unlabeled datasets. We introduce LaserMix++, an evolved framework that integrates laser beam manipulations from disparate LiDAR scans and incorporates LiDAR-camera correspondences to further assist data-efficient learning. Our framework is tailored to enhance 3D scene consistency regularization by incorporating multi-modality, including 1) multi-modal LaserMix operation for fine-grained cross-sensor interactions; 2) camera-to-LiDAR feature distillation that enhances LiDAR feature learning; and 3) language-driven knowledge guidance generating auxiliary supervisions using open-vocabulary models. The versatility of LaserMix++ enables applications across LiDAR representations, establishing it as a universally applicable solution. Our framework is rigorously validated through theoretical analysis and extensive experiments on popular driving perception datasets. Results demonstrate that LaserMix++ markedly outperforms fully supervised alternatives, achieving comparable accuracy with five times fewer annotations and significantly improving the supervised-only baselines. This substantial advancement underscores the potential of semi-supervised approaches in reducing the reliance on extensive labeled data in LiDAR-based 3D scene understanding systems.
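The basic LaserMix operation that LaserMix++ builds on partitions two LiDAR scans into bands by laser inclination angle and swaps alternating bands between them. A simplified point-level version is sketched below; the band count and the rest of the setup are illustrative, not the paper's exact configuration.

```python
import numpy as np

def laser_mix(points_a, points_b, num_bands=6):
    """Mix two LiDAR scans by swapping alternating inclination-angle bands.

    points_*: (N, 3+) arrays with x, y, z in the first three columns.
    Returns one mixed scan taking even bands from scan A and odd bands from scan B.
    """
    def pitch(p):
        return np.arctan2(p[:, 2], np.linalg.norm(p[:, :2], axis=1))  # inclination angle

    lo = min(pitch(points_a).min(), pitch(points_b).min())
    hi = max(pitch(points_a).max(), pitch(points_b).max()) + 1e-6
    edges = np.linspace(lo, hi, num_bands + 1)

    band_a = np.digitize(pitch(points_a), edges) - 1
    band_b = np.digitize(pitch(points_b), edges) - 1
    mixed = np.concatenate([points_a[band_a % 2 == 0],       # even bands from scan A
                            points_b[band_b % 2 == 1]])      # odd bands from scan B
    return mixed

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scan_a, scan_b = rng.normal(size=(1000, 4)), rng.normal(size=(1000, 4))
    print(laser_mix(scan_a, scan_b).shape)
```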
https://arxiv.org/abs/2405.05258
Weakly supervised semantic segmentation (WSSS) aims at learning a semantic segmentation model with only image-level tags. Despite intensive research on deep learning approaches over a decade, there is still a significant performance gap between WSSS and fully supervised semantic segmentation. Most current WSSS methods focus on limited single-image (pixel-wise) information while ignoring valuable inter-image (semantic-wise) information. From this perspective, a novel end-to-end WSSS framework called DSCNet is developed along with two innovations: i) pixel-wise group contrast and semantic-wise graph contrast are proposed and introduced into the WSSS framework; ii) a novel dual-stream contrastive learning (DSCL) mechanism is designed to jointly handle pixel-wise and semantic-wise context information for better WSSS performance. Specifically, the pixel-wise group contrast learning (PGCL) and semantic-wise graph contrast learning (SGCL) tasks form a more comprehensive solution. Extensive experiments on the PASCAL VOC and MS COCO benchmarks verify the superiority of DSCNet over SOTA approaches and baseline models.
https://arxiv.org/abs/2405.04913
Satellite imagery has played an increasingly important role in post-disaster building damage assessment. Unfortunately, current methods still rely on manual visual interpretation, which is often time-consuming and can yield very low accuracy. To address the limitations of manual interpretation, there has been a significant increase in efforts to automate the process. We present a solution that performs the two most important tasks in building damage assessment, segmentation and classification, through deep-learning models. We show our results submitted as part of the xView2 Challenge, a competition to design better models for identifying buildings and their damage level after exposure to multiple kinds of natural disasters. Our best model couples a building identification semantic segmentation convolutional neural network (CNN) to a building damage classification CNN, with a combined F1 score of 0.66, surpassing the xView2 challenge baseline F1 score of 0.28. We find that though our model was able to identify buildings with relatively high accuracy, building damage classification across various disaster types is a difficult task due to the visual similarity between different damage levels and the differing damage distributions across disaster types, highlighting the fact that it may be important to have a probabilistic prior estimate regarding disaster damage in order to obtain accurate predictions.
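The coupling of the two networks can be sketched as a simple pipeline: a segmentation model localizes building footprints on the pre-disaster image, connected components are cropped from the post-disaster image, and a classifier assigns a damage level per building. The components below are toy stubs standing in for the trained CNNs; only the control flow reflects the described design.

```python
import numpy as np
from scipy import ndimage

# Stub models standing in for the trained segmentation and classification CNNs.
def building_segmentation(pre_image):
    return pre_image[..., 0] > 0.7                    # toy footprint mask

def damage_classifier(building_crop):
    levels = ["no-damage", "minor", "major", "destroyed"]
    return levels[int(building_crop.mean() * len(levels)) % len(levels)]

def assess_damage(pre_image, post_image):
    """Couple footprint segmentation (pre-disaster) with per-building damage
    classification (post-disaster), as in a two-stage assessment pipeline."""
    mask = building_segmentation(pre_image)
    labeled, n = ndimage.label(mask)                  # connected components = buildings
    results = []
    for i in range(1, n + 1):
        ys, xs = np.where(labeled == i)
        crop = post_image[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
        results.append(((ys.min(), xs.min(), ys.max(), xs.max()), damage_classifier(crop)))
    return results

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pre, post = rng.random((128, 128, 3)), rng.random((128, 128, 3))
    print(assess_damage(pre, post)[:3])
```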
https://arxiv.org/abs/2405.04800
Mapping agencies are increasingly adopting Aerial Lidar Scanning (ALS) as a new tool to monitor territory and support public policies. Processing ALS data at scale requires efficient point classification methods that perform well over highly diverse territories. To evaluate them, researchers need large annotated Lidar datasets; however, current Lidar benchmark datasets have restricted scope and often cover a single urban area. To bridge this data gap, we present the FRench ALS Clouds from TArgeted Landscapes (FRACTAL) dataset: an ultra-large-scale aerial Lidar dataset made of 100,000 dense point clouds with high-quality labels for 7 semantic classes and spanning 250 km$^2$. FRACTAL is built upon France's nationwide open Lidar data. It achieves spatial and semantic diversity via a sampling scheme that explicitly concentrates rare classes and challenging landscapes from five French regions. It should support the development of 3D deep learning approaches for large-scale land monitoring. We describe the nature of the source data, the sampling workflow, the content of the resulting dataset, and provide an initial evaluation of segmentation performance using a performant 3D neural architecture.
https://arxiv.org/abs/2405.04634
Aphid infestations are one of the primary causes of extensive damage to wheat and sorghum fields and are one of the most common vectors for plant viruses, resulting in significant agricultural yield losses. To address this problem, farmers often resort to the inefficient use of harmful chemical pesticides, which have negative health and environmental impacts. As a result, a large amount of pesticide is wasted on areas without significant pest infestation. This brings to attention the urgent need for an intelligent autonomous system that can locate and spray sufficiently large infestations selectively within the complex crop canopies. We have developed a large multi-scale dataset for aphid cluster detection and segmentation, collected from actual sorghum fields and meticulously annotated to include clusters of aphids. Our dataset comprises a total of 54,742 image patches, showcasing a variety of viewpoints, diverse lighting conditions, and multiple scales, highlighting its effectiveness for real-world applications. In this study, we trained and evaluated four real-time semantic segmentation models and three object detection models specifically for aphid cluster segmentation and detection. Considering the balance between accuracy and efficiency, Fast-SCNN delivered the most effective segmentation results, achieving 80.46% mean precision, 81.21% mean recall, and 91.66 frames per second (FPS). For object detection, RT-DETR exhibited the best overall performance with a 61.63% mean average precision (mAP), 92.6% mean recall, and 72.55 on an NVIDIA V100 GPU. Our experiments further indicate that aphid cluster segmentation is more suitable for assessing aphid infestations than using detection models.
https://arxiv.org/abs/2405.04305
Cross-modal knowledge transfer enhances point cloud representation learning in LiDAR semantic segmentation. Despite its potential, the \textit{weak teacher challenge} arises due to repetitive and non-diverse car camera images and sparse, inaccurate ground truth labels. To address this, we propose the Efficient Image-to-LiDAR Knowledge Transfer (ELiTe) paradigm. ELiTe introduces Patch-to-Point Multi-Stage Knowledge Distillation, transferring comprehensive knowledge from the Vision Foundation Model (VFM), extensively trained on diverse open-world images. This enables effective knowledge transfer to a lightweight student model across modalities. ELiTe employs Parameter-Efficient Fine-Tuning to strengthen the VFM teacher and expedite large-scale model training with minimal costs. Additionally, we introduce the Segment Anything Model based Pseudo-Label Generation approach to enhance low-quality image labels, facilitating robust semantic representations. Efficient knowledge transfer in ELiTe yields state-of-the-art results on the SemanticKITTI benchmark, outperforming real-time inference models. Our approach achieves this with significantly fewer parameters, confirming its effectiveness and efficiency.
https://arxiv.org/abs/2405.04121
Segment Anything Model (SAM) has achieved impressive performance in many computer vision tasks. However, as a large-scale model, the immense memory and computation costs hinder its practical deployment. In this paper, we propose a post-training quantization (PTQ) framework for Segment Anything Model, namely PTQ4SAM. First, we investigate the inherent bottleneck of SAM quantization attributed to the bimodal distribution in post-Key-Linear activations. We analyze its characteristics from both per-tensor and per-channel perspectives, and propose a Bimodal Integration strategy, which utilizes a mathematically equivalent sign operation to transform the bimodal distribution into a relatively easy-to-quantize normal distribution offline. Second, SAM encompasses diverse attention mechanisms (i.e., self-attention and two-way cross-attention), resulting in substantial variations in the post-Softmax distributions. Therefore, we introduce an Adaptive Granularity Quantization for Softmax by searching for the optimal power-of-two base, which is hardware-friendly. Extensive experimental results across various vision tasks (instance segmentation, semantic segmentation and object detection), datasets and model variants show the superiority of PTQ4SAM. For example, when quantizing SAM-L to 6-bit, we achieve lossless accuracy for instance segmentation, about 0.5\% drop with theoretical 3.9$\times$ acceleration. The code is available at \url{this https URL}.
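The adaptive-granularity idea for softmax can be illustrated with a power-of-two-base quantizer: post-softmax values are mapped to powers of a base $2^{1/2^{\tau}}$, and the best $\tau$ is chosen by reconstruction error. This is a generic sketch of power-of-two softmax quantization under assumed settings, not PTQ4SAM's exact procedure.

```python
import torch

def pow2_quantize_softmax(attn, bits=4, tau=1):
    """Quantize post-softmax attention with a power-of-two base b = 2**(1 / 2**tau).

    Values are mapped to b**(-k) for integer k in [0, 2**bits - 1]; a larger tau gives a
    finer base, which suits flatter softmax distributions.
    """
    base = 2.0 ** (1.0 / 2 ** tau)
    max_code = 2 ** bits - 1
    k = torch.round(-torch.log(attn.clamp_min(1e-12)) / torch.log(torch.tensor(base)))
    k = k.clamp(0, max_code)
    return base ** (-k)

def search_best_tau(attn, bits=4, taus=(0, 1, 2, 3)):
    """Pick the power-of-two base that minimizes reconstruction error, e.g. per layer."""
    errs = [(tau, torch.mean((attn - pow2_quantize_softmax(attn, bits, tau)) ** 2).item())
            for tau in taus]
    return min(errs, key=lambda e: e[1])

if __name__ == "__main__":
    torch.manual_seed(0)
    attn = torch.softmax(torch.randn(2, 8, 64, 64), dim=-1)   # toy attention maps
    print(search_best_tau(attn))
```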
https://arxiv.org/abs/2405.03144
The volume of unlabelled Earth observation (EO) data is huge, but many important applications lack labelled training data. However, EO data offers the unique opportunity to pair data from different modalities and sensors automatically based on geographic location and time, at virtually no human labor cost. We seize this opportunity to create a diverse multi-modal pretraining dataset at global scale. Using this new corpus of 1.2 million locations, we propose a Multi-Pretext Masked Autoencoder (MP-MAE) approach to learn general-purpose representations for optical satellite images. Our approach builds on the ConvNeXt V2 architecture, a fully convolutional masked autoencoder (MAE). Drawing upon a suite of multi-modal pretext tasks, we demonstrate that our MP-MAE approach outperforms both MAEs pretrained on ImageNet and MAEs pretrained on domain-specific satellite images. This is shown on several downstream tasks including image classification and semantic segmentation. We find that multi-modal pretraining notably improves the linear probing performance, e.g. 4pp on BigEarthNet and 16pp on So2Sat, compared to pretraining on optical satellite images only. We show that this also leads to better label and parameter efficiency which are crucial aspects in global scale applications.
https://arxiv.org/abs/2405.02771
Advancements in machine learning, computer vision, and robotics have paved the way for transformative solutions in various domains, particularly in agriculture. For example, accurate identification and segmentation of fruits from field images plays a crucial role in automating jobs such as harvesting, disease detection, and yield estimation. However, achieving robust and precise infield fruit segmentation remains a challenging task since large amounts of labeled data are required to handle variations in fruit size, shape, color, and occlusion. In this paper, we develop a few-shot semantic segmentation framework for infield fruits using transfer learning. Concretely, our work is aimed at addressing agricultural domains that lack publicly available labeled data. Motivated by similar success in urban scene parsing, we propose specialized pre-training using a public benchmark dataset for fruit transfer learning. By leveraging pre-trained neural networks, accurate semantic segmentation of fruit in the field is achieved with only a few labeled images. Furthermore, we show that models with pre-training learn to distinguish between fruit still on the trees and fruit that have fallen on the ground, and they can effectively transfer the knowledge to the target fruit dataset.
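The transfer-learning recipe described above boils down to starting from a pretrained segmentation backbone and fine-tuning it on a handful of labeled field images. A generic sketch using a torchvision model is shown below; the choice of DeepLabV3, the frozen backbone, and all hyperparameters are assumptions for illustration, not the paper's specific benchmark pretraining or architecture.

```python
import torch
import torch.nn as nn
from torchvision.models.segmentation import deeplabv3_resnet50

def build_few_shot_model(num_classes=2, freeze_backbone=True):
    """Start from a pretrained segmentation network and swap in a task head."""
    model = deeplabv3_resnet50(weights="DEFAULT")          # generic pretrained weights
    model.classifier[4] = nn.Conv2d(256, num_classes, kernel_size=1)
    if freeze_backbone:
        for p in model.backbone.parameters():               # only the head adapts to few labels
            p.requires_grad = False
    return model

def finetune(model, few_shot_batches, epochs=20, lr=1e-3):
    opt = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for images, masks in few_shot_batches:               # a few labeled (image, mask) pairs
            opt.zero_grad()
            out = model(images)["out"]
            loss_fn(out, masks).backward()
            opt.step()
    return model

if __name__ == "__main__":
    model = build_few_shot_model(num_classes=2)
    images = torch.randn(2, 3, 256, 256)
    masks = torch.randint(0, 2, (2, 256, 256))
    finetune(model, [(images, masks)], epochs=1)
```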
https://arxiv.org/abs/2405.02556
Constructing high-definition (HD) maps is a crucial requirement for enabling autonomous driving. In recent years, several map segmentation algorithms have been developed to address this need, leveraging advancements in Bird's-Eye View (BEV) perception. However, existing models still encounter challenges in producing realistic and consistent semantic map layouts. One prominent issue is the limited utilization of structured priors inherent in map segmentation masks. In light of this, we propose DiffMap, a novel approach specifically designed to model the structured priors of map segmentation masks using latent diffusion model. By incorporating this technique, the performance of existing semantic segmentation methods can be significantly enhanced and certain structural errors present in the segmentation outputs can be effectively rectified. Notably, the proposed module can be seamlessly integrated into any map segmentation model, thereby augmenting its capability to accurately delineate semantic information. Furthermore, through extensive visualization analysis, our model demonstrates superior proficiency in generating results that more accurately reflect real-world map layouts, further validating its efficacy in improving the quality of the generated maps.
https://arxiv.org/abs/2405.02008
Deep learning has made significant progress in computer vision, specifically in image classification, object detection, and semantic segmentation. The skip connection has played an essential role in the architecture of deep neural networks, enabling easier optimization through residual learning during the training stage and improving accuracy during testing. Many neural networks have inherited the idea of residual learning with skip connections for various tasks, and it has been the standard choice for designing neural networks. This survey provides a comprehensive summary and outlook on the development of skip connections in deep neural networks. The short history of skip connections is outlined, and the development of residual learning in deep neural networks is surveyed. The effectiveness of skip connections in the training and testing stages is summarized, and future directions for using skip connections in residual learning are discussed. Finally, we summarize seminal papers, source code, models, and datasets that utilize skip connections in computer vision, including image classification, object detection, semantic segmentation, and image reconstruction. We hope this survey could inspire peer researchers in the community to develop further skip connections in various forms and tasks and the theory of residual learning in deep neural networks. The project page can be found at this https URL
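For readers new to the topic, the basic residual block the survey revisits can be written in a few lines; this is the standard y = F(x) + x formulation rather than any particular model discussed in the survey.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = F(x) + x: the skip connection adds the input back to the block's output,
    so gradients can flow through the identity path during training."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.body(x) + x)   # residual learning: the block learns F(x) = y - x

if __name__ == "__main__":
    block = ResidualBlock(64)
    print(block(torch.randn(1, 64, 32, 32)).shape)
```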
https://arxiv.org/abs/2405.01725