Portrait images typically consist of a salient person against diverse backgrounds. With the development of mobile devices and image processing techniques, users can conveniently capture portrait images anytime and anywhere. However, the quality of these portraits may suffer from degradation caused by unfavorable environmental conditions, subpar photography techniques, and inferior capture devices. In this paper, we introduce a dual-branch network for portrait image quality assessment (PIQA), which can effectively address how the salient person and the background of a portrait image influence its visual quality. Specifically, we utilize two backbone networks (\textit{i.e.,} Swin Transformer-B) to extract quality-aware features from the entire portrait image and the facial image cropped from it. To enhance the quality-aware feature representation of the backbones, we pre-train them on the large-scale video quality assessment dataset LSVQ and the large-scale facial image quality assessment dataset GFIQA. Additionally, we leverage LIQE, an image scene classification and quality assessment model, to capture quality-aware and scene-specific features as auxiliary features. Finally, we concatenate these features and regress them into quality scores via a multi-layer perceptron (MLP). We employ the fidelity loss to train the model in a learning-to-rank manner to mitigate inconsistencies in quality scores in the portrait image quality assessment dataset PIQ. Experimental results demonstrate that the proposed model achieves superior performance on the PIQ dataset, validating its effectiveness. The code is available at \url{this https URL}.
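As a rough illustration, the fidelity loss used for pairwise learning-to-rank training can be sketched as follows (a minimal sketch assuming the model outputs scalar quality scores and that pairwise preferences are mapped through a logistic link; function names are illustrative, not the paper's):

```python
import math

def pref_from_scores(s_i: float, s_j: float) -> float:
    """Map a pair of predicted quality scores to the probability
    that image i is preferred over image j (logistic link)."""
    return 1.0 / (1.0 + math.exp(-(s_i - s_j)))

def fidelity_loss(p_pred: float, p_true: float) -> float:
    """Fidelity loss between predicted and ground-truth pairwise
    preference probabilities; it is 0 iff the two coincide."""
    return 1.0 - (math.sqrt(p_pred * p_true)
                  + math.sqrt((1.0 - p_pred) * (1.0 - p_true)))
```

Because the loss depends only on the rank-induced preference probabilities, it tolerates inconsistent absolute scores across annotation sessions, which is the stated motivation for using it on PIQ.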
https://arxiv.org/abs/2405.08555
Acoustic scene classification (ASC) is highly important in the real world. Recently, deep learning-based methods have been widely employed for acoustic scene classification. However, current methods are not lightweight enough, and their performance is not satisfactory. To solve these problems, we propose a deep space separable distillation network. Firstly, the network performs high-low frequency decomposition on the log-mel spectrogram, significantly reducing computational complexity while maintaining model performance. Secondly, we specially design three lightweight operators for ASC, including Separable Convolution (SC), Orthonormal Separable Convolution (OSC), and Separable Partial Convolution (SPC). These operators exhibit highly efficient feature extraction capabilities in acoustic scene classification tasks. The experimental results demonstrate that the proposed method achieves a performance gain of 9.8% compared to currently popular deep learning methods, while also having a smaller parameter count and lower computational complexity.
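To see why separable operators are lighter, the parameter counts of a standard versus a depthwise-separable convolution can be compared (a generic sketch of depthwise separability; the paper's SC/OSC/SPC operators may differ in detail):

```python
def standard_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out

def separable_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a depthwise-separable convolution:
    a k x k depthwise filter per input channel, followed by
    a 1 x 1 pointwise convolution that mixes channels."""
    return k * k * c_in + c_in * c_out
```

For a 3x3 layer with 64 input and 128 output channels, the separable form needs roughly an order of magnitude fewer weights, which is the kind of saving a lightweight ASC model relies on.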
https://arxiv.org/abs/2405.03567
This paper presents a baseline approach and an experimental protocol for a specific content verification problem: detecting discrepancies between the audio and video modalities in multimedia content. We first design and optimize an audio-visual scene classifier, to compare with existing classification baselines that use both modalities. Then, by applying this classifier separately to the audio and the visual modality, we can detect scene-class inconsistencies between them. To facilitate further research and provide a common evaluation platform, we introduce an experimental protocol and a benchmark dataset simulating such inconsistencies. Our approach achieves state-of-the-art results in scene classification and promising outcomes in audio-visual discrepancies detection, highlighting its potential in content verification applications.
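At its simplest, the discrepancy-detection step described above reduces to comparing the scene class predicted from each modality alone (a minimal sketch; the actual system may use calibrated scores or thresholds rather than a hard argmax):

```python
def scene_inconsistency(audio_probs: list, visual_probs: list) -> bool:
    """Flag a clip when the scene class predicted from the audio
    stream alone disagrees with the one predicted from the
    visual stream alone (same class ordering in both lists)."""
    a = max(range(len(audio_probs)), key=audio_probs.__getitem__)
    v = max(range(len(visual_probs)), key=visual_probs.__getitem__)
    return a != v
```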
https://arxiv.org/abs/2405.00384
Convolutional neural networks (ConvNets) have been successfully applied to satellite image scene classification. Human-labeled training datasets are essential for ConvNets to perform accurate classification. Errors in human-annotated training datasets are unavoidable due to the complexity of satellite images. However, the distribution of real-world human-annotated label noise on remote sensing images and its impact on ConvNets have not been investigated. To fill this research gap, this study, for the first time, collected real-world labels from 32 participants and explored how their annotated label noise affects three representative ConvNets (VGG16, GoogleNet, and ResNet-50) for remote sensing image scene classification. We found that: (1) human-annotated label noise exhibits significant class and instance dependence; (2) an additional 1% of human-annotated label noise in training data leads to a 0.5% reduction in the overall accuracy of ConvNet classification; (3) the error pattern of ConvNet predictions is strongly correlated with that of the participants' labels. To uncover the mechanism underlying the impact of human labeling errors on ConvNets, we further compared it with three types of simulated label noise: uniform noise, class-dependent noise, and instance-dependent noise. Our results show that the impact of human-annotated label noise on ConvNets differs significantly from all three types of simulated label noise, while both class dependence and instance dependence contribute to the impact of human-annotated label noise on ConvNets. These observations necessitate a reevaluation of the handling of noisy labels, and we anticipate that our real-world label noise dataset will facilitate the future development and assessment of label-noise learning algorithms.
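The uniform simulated noise that the study compares against can be sketched as follows (a generic noise-injection routine, not the authors' code; parameter names are illustrative):

```python
import random

def inject_uniform_noise(labels: list, n_classes: int,
                         rate: float, seed: int = 0) -> list:
    """Flip each label with probability `rate` to a uniformly
    drawn *different* class -- the 'uniform' simulated label
    noise, with no class or instance dependence."""
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < rate:
            y = rng.choice([c for c in range(n_classes) if c != y])
        noisy.append(y)
    return noisy
```

Class-dependent noise would replace the uniform draw with a per-class transition matrix, and instance-dependent noise would condition the flip probability on the image itself; the paper's point is that real human noise behaves like neither in isolation.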
https://arxiv.org/abs/2305.12106
As AI workloads increase in scope, generalization becomes challenging for small task-specific models, and their demand for large amounts of labeled training samples increases. In contrast, Foundation Models (FMs) are trained with internet-scale unlabeled data via self-supervised learning and have been shown to adapt to various tasks with minimal fine-tuning. Although large FMs have demonstrated significant impact in natural language processing and computer vision, efforts toward FMs for geospatial applications have been restricted to smaller models, as pretraining larger models requires very large computing resources equipped with state-of-the-art hardware accelerators. Current satellite constellations collect 100+ TB of data a day, resulting in images that are billions of pixels and multimodal in nature. Such geospatial data poses unique challenges, opening up new opportunities to develop FMs. We investigate billion-scale FMs and HPC training profiles for geospatial applications by pretraining on publicly available data. We study, end to end, the performance and impact of scaling the model size. Our larger 3B-parameter model achieves up to a 30% improvement in top-1 scene classification accuracy compared to a 100M-parameter model. Moreover, we detail performance experiments on the Frontier supercomputer, America's first exascale system, where we study different model and data parallel approaches using PyTorch's Fully Sharded Data Parallel library. Specifically, we study variants of the Vision Transformer architecture (ViT), conducting performance analysis for ViT models with sizes up to 15B parameters. By discussing throughput and performance bottlenecks under different parallelism configurations, we offer insights on how to leverage such leadership-class HPC resources when developing large models for geospatial imagery applications.
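A back-of-the-envelope view of where the parameter counts come from: in a transformer block, attention contributes roughly 4·d² weights (Q, K, V, and output projections) and the MLP roughly 2·r·d² with expansion ratio r. The estimate below is a rough sketch that ignores embeddings, biases, and norms, and does not reflect the authors' exact configurations:

```python
def vit_params(depth: int, d_model: int, mlp_ratio: int = 4) -> int:
    """Rough ViT parameter count: per block, QKV + output
    projections (4*d^2) plus the MLP (2*mlp_ratio*d^2)."""
    per_block = 4 * d_model ** 2 + 2 * mlp_ratio * d_model ** 2
    return depth * per_block
```

Plugging in ViT-B-like numbers (depth 12, width 768) recovers the familiar ~85M scale, and widening/deepening toward multi-billion-parameter configurations quickly exceeds single-accelerator memory, which is what motivates sharded data parallelism.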
https://arxiv.org/abs/2404.11706
Indoor scenes are usually characterized by scattered objects and their relationships, which turns indoor scene classification into a challenging computer vision task. Despite the significant performance boost achieved in classification tasks in recent years through deep-learning-based methods, limitations such as inter-category ambiguity and intra-category variation have been holding back their performance. To overcome such issues, gathering semantic information has been shown to be a promising route toward a more complete and discriminative feature representation of indoor scenes. Therefore, the work described in this paper uses semantic information obtained from both object detection and semantic segmentation techniques. While object detection techniques provide the 2D locations of objects, making it possible to obtain spatial distributions between objects, semantic segmentation techniques provide pixel-level information from which spatial distributions and shape-related features of the segmentation categories can be obtained. Hence, a novel approach is proposed that uses a semantic segmentation mask to provide a Hu-moments-based shape characterization of the segmentation categories, designated Segmentation-based Hu-Moments Features (SHMFs). Moreover, a three-main-branch network, designated GOS$^2$F$^2$App, that exploits deep-learning-based global features, object-based features, and semantic-segmentation-based features is also proposed. GOS$^2$F$^2$App was evaluated on two indoor scene benchmark datasets, SUN RGB-D and NYU Depth V2, where, to the best of our knowledge, state-of-the-art results were achieved on both, providing evidence of the effectiveness of the proposed approach.
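The Hu-moment shape characterization behind SHMFs can be illustrated with the first Hu invariant of a binary segmentation mask (a minimal pure-Python sketch; the paper presumably uses the full set of seven invariants per segmentation category):

```python
def first_hu_moment(mask: list) -> float:
    """First Hu invariant (eta20 + eta02) of a binary mask
    given as a 2D list of 0/1; invariant to translation and
    scale of the shape."""
    pts = [(x, y) for y, row in enumerate(mask)
                  for x, v in enumerate(row) if v]
    m00 = float(len(pts))                      # zeroth moment (area)
    cx = sum(x for x, _ in pts) / m00          # centroid
    cy = sum(y for _, y in pts) / m00
    mu20 = sum((x - cx) ** 2 for x, _ in pts)  # central moments
    mu02 = sum((y - cy) ** 2 for _, y in pts)
    norm = m00 ** 2          # mu_pq / m00^(1 + (p+q)/2) with p+q = 2
    return (mu20 + mu02) / norm
```

Because the invariant is unchanged when a shape is shifted within the image, it characterizes the segmentation category's shape rather than its position, which is what makes it usable as a per-category descriptor.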
https://arxiv.org/abs/2404.07739
In the realm of geospatial analysis, the diversity of remote sensors, encompassing both optical and microwave technologies, offers a wealth of distinct observational capabilities. Recognizing this, we present msGFM, a multisensor geospatial foundation model that effectively unifies data from four key sensor modalities. This integration spans an expansive dataset of two million multisensor images. msGFM is uniquely adept at handling both paired and unpaired sensor data. For data originating from identical geolocations, our model employs an innovative cross-sensor pretraining approach in masked image modeling, enabling the synthesis of joint representations from diverse sensors. msGFM, incorporating four remote sensors, upholds strong performance, forming a comprehensive model adaptable to various sensor types. msGFM has demonstrated enhanced proficiency in a range of both single-sensor and multisensor downstream tasks. These include scene classification, segmentation, cloud removal, and pan-sharpening. A key discovery of our research is that representations derived from natural images are not always compatible with the distinct characteristics of geospatial remote sensors, underscoring the limitations of existing representations in this field. Our work can serve as a guide for developing multisensor geospatial pretraining models, paving the way for more advanced geospatial capabilities.
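Masked image modeling, the pretraining objective mentioned above, starts by hiding a random subset of patches so the encoder only sees the rest; a minimal sketch of that masking step (the ratio and names are illustrative, not msGFM's actual configuration):

```python
import random

def random_patch_mask(n_patches: int, mask_ratio: float,
                      seed: int = 0) -> tuple:
    """Choose which patch indices to hide for masked image
    modeling; the encoder sees only the visible patches and
    the decoder reconstructs the masked ones."""
    rng = random.Random(seed)
    n_mask = int(n_patches * mask_ratio)
    masked = set(rng.sample(range(n_patches), n_mask))
    visible = [i for i in range(n_patches) if i not in masked]
    return sorted(masked), visible
```

In a cross-sensor setting, one could mask patches in one sensor's image and reconstruct them from a co-located image of another sensor, encouraging joint representations; the precise pairing scheme is the paper's contribution and is not reproduced here.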
https://arxiv.org/abs/2404.01260
Remote sensing image classification forms the foundation of various understanding tasks, serving a crucial function in remote sensing image interpretation. The recent advancements of Convolutional Neural Networks (CNNs) and Transformers have markedly enhanced classification accuracy. Nonetheless, remote sensing scene classification remains a significant challenge, especially given the complexity and diversity of remote sensing scenarios and the variability of spatiotemporal resolutions. The capacity for whole-image understanding can provide more precise semantic cues for scene discrimination. In this paper, we introduce RSMamba, a novel architecture for remote sensing image classification. RSMamba is based on the State Space Model (SSM) and incorporates an efficient, hardware-aware design known as the Mamba. It integrates the advantages of both a global receptive field and linear modeling complexity. To overcome the limitation of the vanilla Mamba, which can only model causal sequences and is not adaptable to two-dimensional image data, we propose a dynamic multi-path activation mechanism to augment Mamba's capacity to model non-causal data. Notably, RSMamba maintains the inherent modeling mechanism of the vanilla Mamba, yet exhibits superior performance across multiple remote sensing image classification datasets. This indicates that RSMamba holds significant potential to function as the backbone of future visual foundation models. The code will be available at \url{this https URL}.
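The linear modeling complexity that SSMs provide can be illustrated with a scalar discrete state-space recurrence (a toy sketch of the recurrence underlying Mamba-style models, not the selective mechanism or the paper's multi-path activation):

```python
def ssm_scan(xs: list, a: float = 0.9, b: float = 1.0,
             c: float = 1.0) -> list:
    """Discrete linear state-space recurrence
        h_t = a * h_{t-1} + b * x_t,   y_t = c * h_t
    with a scalar state. Cost is linear in sequence length,
    unlike the quadratic cost of full attention."""
    h, ys = 0.0, []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys
```

The causal, left-to-right nature of this scan is exactly the limitation the abstract refers to for 2D images: a single pass imposes an ordering on patches, which is why a multi-path (multi-direction) mechanism is needed for non-causal image data.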
https://arxiv.org/abs/2403.19654
In the realm of Federated Learning (FL) applied to remote sensing image classification, this study introduces and assesses several innovative communication strategies. Our exploration includes feature-centric communication, pseudo-weight amalgamation, and a combined method utilizing both weights and features. Experiments conducted on two public scene classification datasets unveil the effectiveness of these strategies, showcasing accelerated convergence, heightened privacy, and reduced network information exchange. This research provides valuable insights into the implications of feature-centric communication in FL, offering potential applications tailored for remote sensing scenarios.
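As a point of reference for the communication strategies above, plain federated weight averaging (FedAvg) can be sketched as follows (a minimal unweighted sketch; the paper's feature-centric and pseudo-weight strategies replace or augment this exchange):

```python
def federated_average(client_weights: list) -> list:
    """Plain FedAvg: element-wise mean of the clients' weight
    vectors, which the server broadcasts back to all clients.
    Each client vector must have the same length."""
    n = len(client_weights)
    return [sum(ws) / n for ws in zip(*client_weights)]
```

Exchanging compact features instead of full weight vectors is what yields the reduced network traffic the abstract reports, at the cost of a different privacy trade-off.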
https://arxiv.org/abs/2403.13575
Foundation models have reshaped the landscape of Remote Sensing (RS) by enhancing various image interpretation tasks. Pretraining is an active research topic, encompassing supervised and self-supervised learning methods to initialize model weights effectively. However, transferring the pretrained models to downstream tasks may encounter task discrepancy due to their formulation of pretraining as image classification or object discrimination tasks. In this study, we explore the Multi-Task Pretraining (MTP) paradigm for RS foundation models to address this issue. Using a shared encoder and task-specific decoder architecture, we conduct multi-task supervised pretraining on the SAMRS dataset, encompassing semantic segmentation, instance segmentation, and rotated object detection. MTP supports both convolutional neural networks and vision transformer foundation models with over 300 million parameters. The pretrained models are finetuned on various RS downstream tasks, such as scene classification, horizontal and rotated object detection, semantic segmentation, and change detection. Extensive experiments across 14 datasets demonstrate the superiority of our models over existing ones of similar size and their competitive performance compared to larger state-of-the-art models, thus validating the effectiveness of MTP.
https://arxiv.org/abs/2403.13430
Crime in the 21st century is split between a virtual and a real world. However, the former has become a global menace to people's well-being and security in the latter. The challenges it presents must be faced with unified global cooperation, and we must rely more than ever on automated yet trustworthy tools to combat the ever-growing nature of online offenses. Over 10 million child sexual abuse reports are submitted to the US National Center for Missing & Exploited Children every year, and over 80% originate from online sources. Therefore, investigation centers and clearinghouses cannot manually process and correctly investigate all imagery. In light of that, reliable automated tools that can securely and efficiently deal with this data are paramount. In this sense, the scene recognition task looks for contextual cues in the environment, making it possible to group and classify child sexual abuse data without requiring training on sensitive material. The scarcity and limitations of working with child sexual abuse images motivate self-supervised learning, a machine-learning methodology that leverages unlabeled data to produce powerful representations that can be more easily transferred to target tasks. This work shows that self-supervised deep learning models pre-trained on scene-centric data can reach 71.6% balanced accuracy on our indoor scene classification task and, on average, 2.2 percentage points better performance than a fully supervised version. We cooperated with Brazilian Federal Police experts to evaluate our indoor classification model on actual child abuse material. The results demonstrate a notable discrepancy between the features observed in widely used scene datasets and those depicted in sensitive materials.
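The 71.6% figure above is balanced accuracy, i.e., the mean of per-class recalls, which is robust to class imbalance; a minimal sketch:

```python
def balanced_accuracy(y_true: list, y_pred: list) -> float:
    """Mean of per-class recalls: each class contributes
    equally regardless of how many samples it has."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, y in enumerate(y_true) if y == c]
        hit = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(hit / len(idx))
    return sum(recalls) / len(classes)
```

A classifier that always predicts the majority class can score high plain accuracy but only 1/K balanced accuracy over K classes, which is why the metric is preferred for skewed scene datasets.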
https://arxiv.org/abs/2403.01183
Most state-of-the-art computer vision models heavily depend on data. However, many datasets exhibit extreme class imbalance, which has been shown to negatively impact model performance. Among the training-time and data-generation solutions that have been explored, one subset that leverages existing data is importance sampling. A good deal of this work focuses primarily on the CIFAR-10 and CIFAR-100 datasets, which fail to be representative of the scale, composition, and complexity of current state-of-the-art datasets. In this work, we explore and compare three techniques that derive from importance sampling: loss reweighting, undersampling, and oversampling. Specifically, we compare the effect of these techniques on the performance of two encoders on an impactful satellite imagery dataset, Planet's Amazon Rainforest dataset, in preparation for another work. Furthermore, we perform supplemental experimentation on a scene classification dataset, ADE20K, to test on a contrasting domain and clarify our results. Across both types of encoders, we find that up-weighting the loss for underrepresented classes and undersampling have a negligible effect on the performance on those classes. Additionally, our results suggest oversampling generally improves performance for the same underrepresented classes. Interestingly, our findings also indicate that there may exist some redundancy in the data in the Planet dataset. Our work aims to provide a foundation for further work on the Planet dataset and similar domain-specific datasets. We open-source our code at this https URL for future work on other satellite imagery datasets as well.
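The loss-reweighting technique compared above is commonly implemented with inverse-frequency class weights (a generic sketch; the exact weighting scheme used in the paper is not specified here):

```python
def inverse_frequency_weights(labels: list) -> dict:
    """Loss re-weighting: weight class c by N / (K * n_c), so
    rare classes contribute as much total loss as frequent
    ones (N samples, K classes, n_c samples in class c)."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    n, k = len(labels), len(counts)
    return {c: n / (k * m) for c, m in counts.items()}
```

Oversampling achieves a similar effect by duplicating minority-class samples in the data loader instead of scaling the loss; the paper's finding is that, on Planet's dataset, the sampling route helps where the weighting route does not.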
https://arxiv.org/abs/2402.18742
Multi-modal sensor data fusion takes advantage of complementary or reinforcing information from each sensor and can boost overall performance in applications such as scene classification and target detection. This paper presents a new method for fusing multi-modal and multi-resolution remote sensor data without requiring pixel-level training labels, which can be difficult to obtain. Previously, we developed a Multiple Instance Multi-Resolution Fusion (MIMRF) framework that addresses label uncertainty for fusion, but it can be slow to train due to the large search space for the fuzzy measures used to integrate sensor data sources. We propose a new method based on binary fuzzy measures, which reduces the search space and significantly improves the efficiency of the MIMRF framework. We present experimental results on synthetic data and a real-world remote sensing detection task and show that the proposed MIMRF-BFM algorithm can effectively and efficiently perform multi-resolution fusion given remote sensing data with uncertainty.
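The fuzzy-measure fusion at the core of the MIMRF framework is the discrete Choquet integral; a minimal sketch (the measure is given explicitly as a dict here, whereas MIMRF learns it from data, and the proposed MIMRF-BFM restricts its values to binary):

```python
def choquet_integral(h: list, g: dict) -> float:
    """Discrete Choquet integral of source outputs h = [h_1..h_n]
    w.r.t. a fuzzy measure g mapping frozensets of source indices
    to [0, 1] (monotone, with g(all sources) = 1)."""
    order = sorted(range(len(h)), key=lambda i: h[i], reverse=True)
    total, prev = 0.0, 0.0
    subset = set()
    for i in order:               # walk sources from largest output down
        subset.add(i)
        gval = g[frozenset(subset)]
        total += h[i] * (gval - prev)
        prev = gval
    return total
```

With an additive measure the integral reduces to a weighted mean, while other measures interpolate between min- and max-like fusion; restricting the measure to binary values shrinks the search space from a continuum to 2^(2^n - 2) candidates, which is the efficiency gain the abstract describes.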
https://arxiv.org/abs/2402.05045
Acoustic scene classification (ASC) is a crucial research problem in computational auditory scene analysis, and it aims to recognize the unique acoustic characteristics of an environment. One of the challenges of the ASC task is the domain shift caused by a distribution gap between training and testing data. Since 2018, ASC challenges have focused on the generalization of ASC models across different recording devices. Although substantial progress in device generalization has been achieved in recent years, the challenge of domain shift between different regions, involving characteristics such as time, space, culture, and language, remains insufficiently explored. In addition, considering the abundance of unlabeled acoustic scene data in the real world, it is important to study possible ways to utilize these unlabeled data. Therefore, we introduce the task Semi-supervised Acoustic Scene Classification under Domain Shift in the ICME 2024 Grand Challenge. We encourage participants to innovate with semi-supervised learning techniques, aiming to develop more robust ASC models under domain shift.
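A common semi-supervised starting point for such a task is confidence-thresholded pseudo-labeling (a generic sketch, not a baseline prescribed by the challenge; the function name and threshold are illustrative):

```python
def pseudo_label(probs: list, threshold: float = 0.95) -> list:
    """Keep only unlabeled clips whose most confident predicted
    scene class exceeds `threshold`; return (clip index, class)
    pairs to be added to the labeled training pool."""
    kept = []
    for i, p in enumerate(probs):
        c = max(range(len(p)), key=p.__getitem__)
        if p[c] >= threshold:
            kept.append((i, c))
    return kept
```

Under domain shift the model's confidence is often miscalibrated on the target region, so naive thresholding can reinforce errors; handling that failure mode is precisely the kind of innovation the challenge invites.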
https://arxiv.org/abs/2402.02694
Deep neural networks have achieved promising progress in remote sensing (RS) image classification, for which the training process requires abundant samples for each class. However, it is time-consuming and unrealistic to annotate labels for each RS category, given that the RS target database is increasing dynamically. Zero-shot learning (ZSL) allows for identifying novel classes that are not seen during training, which provides a promising solution to the aforementioned problem. However, previous ZSL models mainly depend on manually labeled attributes or word embeddings extracted from language models to transfer knowledge from seen classes to novel classes. Besides, pioneering ZSL models use convolutional neural networks pre-trained on ImageNet, which focus on the main objects appearing in each image, neglecting the background context that also matters in RS scene classification. To address the above problems, we propose to collect visually detectable attributes automatically. We predict attributes for each class by depicting the semantic-visual similarity between attributes and images. In this way, the attribute annotation process is accomplished by machine instead of by humans as in other methods. Moreover, we propose a Deep Semantic-Visual Alignment (DSVA) model that takes advantage of the self-attention mechanism in the transformer to associate local image regions, integrating background context information for prediction. The DSVA model further utilizes the attribute attention maps to focus on the informative image regions that are essential for knowledge transfer in ZSL, and maps the visual images into attribute space to perform ZSL classification. With extensive experiments, we show that our model outperforms other state-of-the-art models by a large margin on a challenging large-scale RS scene classification benchmark.
https://arxiv.org/abs/2402.02094
Although neural models have achieved remarkable performance, they still face doubts due to their lack of transparency. To this end, model prediction explanation is attracting more and more attention. However, current methods rarely incorporate external knowledge and still suffer from three limitations: (1) neglecting concept completeness: merely selecting concepts may not be sufficient for prediction; (2) lacking concept fusion: failing to merge semantically equivalent concepts; (3) difficulty in manipulating model behavior: explanations are not verified on the original model. To address these issues, we propose a novel knowledge-aware neuron interpretation framework to explain model predictions for image scene classification. Specifically, for concept completeness, we present core concepts of a scene based on a knowledge graph, ConceptNet, to gauge the completeness of concepts. Our method, incorporating complete concepts, provides better prediction explanations than the baselines. Furthermore, for concept fusion, we introduce a knowledge graph-based method, Concept Filtering, which produces a gain of over 23 percentage points on neuron behaviors for neuron interpretation. Finally, we propose Model Manipulation, which studies whether the core concepts based on ConceptNet can be employed to manipulate model behavior. The results show that core concepts can improve the performance of the original model by over 26%.
https://arxiv.org/abs/2401.15820
In this work, we aim to establish a Bayesian adaptive learning framework by focusing on estimating latent variables in deep neural network (DNN) models. Latent variables indeed encode both transferable distributional information and structural relationships. Thus the distributions of the source latent variables (prior) can be combined with the knowledge learned from the target data (likelihood) to yield the distributions of the target latent variables (posterior), with the goal of addressing acoustic mismatches between training and testing conditions. The prior knowledge transfer is accomplished through Variational Bayes (VB). In addition, we also investigate Maximum a Posteriori (MAP) based Bayesian adaptation. Experimental results on device adaptation in acoustic scene classification show that our proposed approaches can obtain good improvements on target devices and consistently outperform other cutting-edge algorithms.
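The MAP flavor of this prior-likelihood combination can be illustrated on a Gaussian mean, where the posterior estimate interpolates the source (prior) mean and the target sample mean (a textbook sketch under a conjugate prior, not the paper's full DNN latent-variable formulation; `prior_count` is an illustrative pseudo-count):

```python
def map_adapt_mean(prior_mean: float, prior_count: float,
                   target_samples: list) -> float:
    """MAP estimate of a Gaussian mean: a pseudo-count-weighted
    interpolation between the source prior mean and the target
    sample mean. More target data pulls the estimate toward
    the target statistics."""
    n = len(target_samples)
    x_bar = sum(target_samples) / n
    return (prior_count * prior_mean + n * x_bar) / (prior_count + n)
```

With little target data the estimate stays near the prior, and as target data accumulates it converges to the target mean, which is the behavior one wants when adapting to a new recording device.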
https://arxiv.org/abs/2401.13766
Computer-based scene understanding has influenced fields ranging from urban planning to autonomous vehicle performance, yet little is known about how well these technologies work across social differences. We investigate the biases of deep convolutional neural networks (dCNNs) in scene classification, using nearly one million images from global and US sources, including user-submitted home photographs and Airbnb listings. We applied statistical models to quantify the impact of socioeconomic indicators such as family income, Human Development Index (HDI), and demographic factors from public data sources (CIA and US Census) on dCNN performance. Our analyses revealed significant socioeconomic bias, where pretrained dCNNs demonstrated lower classification accuracy, lower classification confidence, and a higher tendency to assign labels that could be offensive when applied to homes (e.g., "ruin", "slum"), especially in images from homes with lower socioeconomic status (SES). This trend is consistent across two datasets of international images and within the diverse economic and racial landscapes of the United States. This research contributes to understanding biases in computer vision, emphasizing the need for more inclusive and representative training datasets. By mitigating the bias in the computer vision pipelines, we can ensure fairer and more equitable outcomes for applied computer vision, including home valuation and smart home security systems. There is urgency in addressing these biases, which can significantly impact critical decisions in urban development and resource allocation. Our findings also motivate the development of AI systems that better understand and serve diverse communities, moving towards technology that equitably benefits all sectors of society.
https://arxiv.org/abs/2401.13097
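The per-group bias measurement described above (lower accuracy and lower confidence for lower-SES homes) can be illustrated with a minimal sketch. The records, group labels, and numbers below are hypothetical stand-ins, not data from the study:

```python
from statistics import mean

# Hypothetical per-image records: (SES group, prediction correct?, model confidence)
records = [
    ("low",  False, 0.41), ("low",  True,  0.55), ("low",  False, 0.38),
    ("high", True,  0.88), ("high", True,  0.91), ("high", False, 0.73),
]

def group_metrics(records):
    """Aggregate classification accuracy and mean confidence per SES group."""
    groups = {}
    for ses, correct, conf in records:
        groups.setdefault(ses, []).append((correct, conf))
    return {
        ses: {
            "accuracy": mean(1.0 if c else 0.0 for c, _ in rows),
            "mean_confidence": mean(conf for _, conf in rows),
        }
        for ses, rows in groups.items()
    }

metrics = group_metrics(records)
# The reported bias corresponds to metrics["low"] trailing metrics["high"]
# on both accuracy and mean confidence.
```

In the study itself these group-level gaps are then tested with statistical models against income, HDI, and census covariates; the sketch only shows the aggregation step.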
Image summary, an abridged version of the original visual content, can be used to represent the scene. Thus, tasks such as scene classification, identification, indexing, etc., can be performed efficiently using the unique summary. Saliency is the most commonly used technique for generating the relevant image summary. However, the definition of saliency is subjective in nature and depends upon the application. Existing saliency detection methods using RGB-D data mainly focus on color, texture, and depth features. Consequently, the generated summary contains either foreground objects or non-stationary objects. However, applications such as scene identification require stationary characteristics of the scene, unlike state-of-the-art methods. This paper proposes a novel volumetric saliency-guided framework for indoor scene classification. The results highlight the efficacy of the proposed method.
https://arxiv.org/abs/2401.16227
Deep learning models are essential for scene classification, change detection, land cover segmentation, and other remote sensing image understanding tasks. Most backbones of existing remote sensing deep learning models are typically initialized by pre-trained weights obtained from ImageNet pre-training (IMP). However, domain gaps exist between remote sensing images and natural images (e.g., ImageNet), making deep learning models initialized by pre-trained weights of IMP perform poorly for remote sensing image understanding. Although some pre-training methods are studied in the remote sensing community, current remote sensing pre-training methods face the problem of vague generalization by only using remote sensing images. In this paper, we propose a novel remote sensing pre-training framework, Generic Knowledge Boosted Remote Sensing Pre-training (GeRSP), to learn robust representations from remote sensing and natural images for remote sensing understanding tasks. GeRSP contains two pre-training branches: (1) A self-supervised pre-training branch is adopted to learn domain-related representations from unlabeled remote sensing images. (2) A supervised pre-training branch is integrated into GeRSP for general knowledge learning from labeled natural images. Moreover, GeRSP combines two pre-training branches using a teacher-student architecture to simultaneously learn representations with general and special knowledge, which generates a powerful pre-trained model for deep learning model initialization. Finally, we evaluate GeRSP and other remote sensing pre-training methods on three downstream tasks, i.e., object detection, semantic segmentation, and scene classification. The extensive experimental results consistently demonstrate that GeRSP can effectively learn robust representations in a unified manner, improving the performance of remote sensing downstream tasks.
https://arxiv.org/abs/2401.04614
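The way GeRSP couples its two pre-training branches through a teacher-student architecture can be sketched generically: the student is trained on a weighted sum of a self-supervised loss (unlabeled remote sensing images) and a supervised loss (labeled natural images), while the teacher tracks the student via an exponential moving average (EMA) of its parameters. The momentum value and loss weighting below are illustrative assumptions, not GeRSP's actual hyperparameters:

```python
def ema_update(teacher_params, student_params, momentum=0.99):
    """Move teacher weights toward the student via an exponential moving average."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

def combined_loss(self_supervised_loss, supervised_loss, weight=0.5):
    """Weighted sum of the two branch losses that drives the student update."""
    return weight * self_supervised_loss + (1.0 - weight) * supervised_loss

# Toy parameter vectors standing in for network weights.
teacher = [1.0, 2.0]
student = [0.0, 0.0]
teacher = ema_update(teacher, student)   # teacher drifts slowly toward the student
loss = combined_loss(0.8, 0.4)           # one scalar objective for the student step
```

The EMA keeps the teacher a slowly-varying average of the student, which is the standard mechanism for letting one model distill both the general (natural-image) and domain-specific (remote-sensing) representations into a single set of pre-trained weights.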