Deep learning classifiers are prone to latching onto dominant confounders present in a dataset rather than onto the causal markers associated with the target class, leading to poor generalization and biased predictions. Although explainability via counterfactual image generation has been successful at exposing the problem, bias mitigation strategies that permit accurate explainability in the presence of dominant and diverse artifacts remain lacking. In this work, we propose the DeCoDEx framework and show how an external, pre-trained binary artifact detector can be leveraged during inference to guide a diffusion-based counterfactual image generator towards accurate explainability. Experiments on the CheXpert dataset, using both synthetic artifacts and real visual artifacts (support devices), show that the proposed method successfully synthesizes counterfactual images that change the causal pathology markers associated with Pleural Effusion while preserving or ignoring the visual artifacts. Augmenting ERM and Group-DRO classifiers with the DeCoDEx-generated images substantially improves the results across underrepresented groups that are out of distribution for each class. The code is made publicly available at this https URL.
https://arxiv.org/abs/2405.09288
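The guidance mechanism described in the abstract above lends itself to a compact sketch. Below is a minimal, hypothetical illustration (not the authors' released code) of how an external artifact detector could steer one reverse-diffusion step: the noise estimate is adjusted with gradients that push the image toward the counterfactual class while penalizing any drift in the artifact detector's output. All module names, weights, and the omitted sampler scaling are placeholders.

```python
import torch
import torch.nn.functional as F

def guided_denoise_step(x_t, t, eps_model, pathology_clf, artifact_det,
                        target_class, art_ref, w_clf=7.5, w_art=3.0):
    """One guided reverse-diffusion step (schematic). `art_ref` is the
    artifact probability of the factual image, which the counterfactual
    should preserve so only pathology markers change."""
    x_t = x_t.detach().requires_grad_(True)
    eps = eps_model(x_t, t)

    # Gradient signal 1: flip the pathology classifier to the target class.
    tgt = torch.full((x_t.size(0),), target_class,
                     device=x_t.device, dtype=torch.long)
    clf_loss = F.cross_entropy(pathology_clf(x_t), tgt)

    # Gradient signal 2: keep the artifact detector's output unchanged.
    art_loss = (torch.sigmoid(artifact_det(x_t)) - art_ref).pow(2).mean()

    grad = torch.autograd.grad(w_clf * clf_loss + w_art * art_loss, x_t)[0]
    return eps.detach() + grad  # guided noise estimate fed back to the sampler
```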
The ambiguous appearance, tiny scale, and fine-grained classes of objects in remote sensing imagery inevitably lead to noisy category labels in detection datasets. However, the effects and treatment of label noise are underexplored in modern oriented remote sensing object detectors. To address this issue, we propose a robust oriented remote sensing object detection method built on a dynamic loss decay (DLD) mechanism, inspired by the two-phase ``early-learning'' and ``memorization'' dynamics of deep neural networks on clean and noisy samples. Specifically, we first observe the end point of the early-learning phase, termed EL, after which the models begin to memorize the false labels that significantly degrade detection accuracy. Secondly, under the guidance of this training indicator, the losses of all samples are ranked in descending order, and we adaptively decay the top K largest losses (bad samples) in the following epochs, since these large losses are most likely computed against wrong labels. Experimental results show that the method achieves excellent noise resistance on multiple public datasets, such as HRSC2016 and DOTA-v1.0/v2.0, with synthetic category label noise. Our solution also won second place in the ``fine-grained object detection based on sub-meter remote sensing imagery'' track with noisy labels of the 2023 National Big Data and Computing Intelligence Challenge.
https://arxiv.org/abs/2405.09024
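The decay step itself is simple. Here is a minimal sketch of our reading of the abstract above (not the released code), with a fixed decay factor standing in for the paper's adaptive schedule:

```python
import torch

def dynamic_loss_decay(losses, epoch, el_epoch, k, decay=0.5):
    """Mean loss with the top-K largest per-sample losses down-weighted
    after the early-learning end point (EL), since those losses most
    likely come from wrongly labeled samples.

    losses: 1-D tensor of per-sample losses for the current batch.
    """
    if epoch <= el_epoch:              # early-learning phase: train as usual
        return losses.mean()
    k = min(k, losses.numel())
    _, idx = torch.topk(losses, k)     # indices of suspect (largest) losses
    weights = torch.ones_like(losses)
    weights[idx] = decay               # decay instead of hard-dropping
    return (weights * losses).mean()

# Example: after EL (epoch 12 > el_epoch 10), the two largest losses are halved.
losses = torch.tensor([0.2, 5.0, 0.3, 4.0])
print(dynamic_loss_decay(losses, epoch=12, el_epoch=10, k=2))  # tensor(1.25)
```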
CLIP models perform remarkably well on zero-shot classification and retrieval tasks. But recent studies have shown that the learnt representations in CLIP are not well suited for dense prediction tasks like object detection, semantic segmentation, or depth estimation. More recently, multi-stage training methods for CLIP models were introduced to mitigate its weak performance on such downstream tasks. In this work, we find that simply improving the quality of captions in image-text datasets improves the quality of CLIP's visual representations, resulting in significant improvements on downstream dense prediction vision tasks. In fact, we find that CLIP pretraining with good quality captions can surpass recent supervised, self-supervised, and weakly supervised pretraining methods. We show that when a CLIP model with ViT-B/16 as the image encoder is trained on well-aligned image-text pairs, it obtains 12.1% higher mIoU and 11.5% lower RMSE on semantic segmentation and depth estimation tasks over recent state-of-the-art Masked Image Modeling (MIM) pretraining methods like Masked Autoencoder (MAE). We find that mobile architectures also benefit significantly from CLIP pretraining. A recent mobile vision architecture, MCi2, with CLIP pretraining obtains performance similar to Swin-L pretrained on ImageNet-22k on the semantic segmentation task, while being 6.1$\times$ smaller. Moreover, we show that improving caption quality results in $10\times$ data efficiency when finetuning for dense prediction tasks.
https://arxiv.org/abs/2405.08911
Many query-based approaches for 3D Multi-Object Tracking (MOT) adopt the tracking-by-attention paradigm, utilizing track queries for identity-consistent detection and object queries for identity-agnostic track spawning. Tracking-by-attention, however, entangles detection and tracking in one query embedding for both tasks, which is sub-optimal. Other approaches resemble the tracking-by-detection paradigm, detecting objects using decoupled track and detection queries followed by a subsequent association. These methods, however, do not leverage synergies between the detection and association tasks. Combining the strengths of both paradigms, we introduce ADA-Track, a novel end-to-end framework for 3D MOT from multi-view cameras. We introduce a learnable data association module based on edge-augmented cross-attention, leveraging appearance and geometric features. Furthermore, we integrate this association module into the decoder layer of a DETR-based 3D detector, enabling simultaneous DETR-like query-to-image cross-attention for detection and query-to-query cross-attention for data association. By stacking these decoder layers, queries are alternately refined for the detection and association tasks, effectively harnessing the task dependencies. We evaluate our method on the nuScenes dataset and demonstrate the advantage of our approach compared to the two previous paradigms. Code is available at this https URL.
https://arxiv.org/abs/2405.08909
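Structurally, the key idea above is a decoder layer that interleaves detection and association attention. The sketch below is a simplified stand-in: the paper's edge-augmented cross-attention with geometric features is reduced here to plain multi-head attention, and all dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class AlternatingDecoderLayer(nn.Module):
    """One decoder layer in the spirit of ADA-Track: query-to-image
    cross-attention refines detections, then query-to-query attention
    associates track queries with detection queries."""
    def __init__(self, d=256, heads=8):
        super().__init__()
        self.det_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.assoc_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d) for _ in range(3))

    def forward(self, track_q, det_q, img_feats):
        # Detection: all queries attend to multi-view image features.
        q = torch.cat([track_q, det_q], dim=1)
        q = self.norm1(q + self.det_attn(q, img_feats, img_feats)[0])
        # Association: track queries attend to detection queries.
        t, d = q[:, :track_q.size(1)], q[:, track_q.size(1):]
        t = self.norm2(t + self.assoc_attn(t, d, d)[0])
        q = torch.cat([t, d], dim=1)
        return self.norm3(q + self.ffn(q))

layer = AlternatingDecoderLayer()
out = layer(torch.rand(2, 5, 256), torch.rand(2, 30, 256), torch.rand(2, 900, 256))
```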
Datasets labelled by human annotators are widely used in the training and testing of machine learning models. In recent years, researchers have paid increasing attention to label quality. However, it is not always possible to objectively determine whether an assigned label is correct. The present work investigates this ambiguity in the annotation of autonomous driving datasets as an important dimension of data quality. Our experiments show that excluding highly ambiguous data from training improves the performance of a state-of-the-art pedestrian detector in terms of LAMR, precision, and F1 score, while saving training time and annotation costs. Furthermore, we demonstrate that, in order to safely remove ambiguous instances while keeping the retained training data representative, an understanding of the properties of the dataset and the class under investigation is crucial.
https://arxiv.org/abs/2405.08794
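As a toy illustration of the curation step above (scoring the ambiguity itself is the hard part and is not shown), one might filter a training list by an ambiguity score while reporting per-class counts so representativeness can be checked. The 'label' and 'ambiguity' keys and the threshold are hypothetical:

```python
from collections import Counter

def filter_ambiguous(samples, max_ambiguity=0.7):
    """Drop instances whose annotation-ambiguity score exceeds a threshold
    and report per-class counts before/after the filtering."""
    kept = [s for s in samples if s["ambiguity"] <= max_ambiguity]
    before = Counter(s["label"] for s in samples)
    after = Counter(s["label"] for s in kept)
    for cls in before:
        print(f"{cls}: kept {after.get(cls, 0)}/{before[cls]}")
    return kept

data = [{"label": "pedestrian", "ambiguity": 0.1},
        {"label": "pedestrian", "ambiguity": 0.9},  # highly ambiguous: dropped
        {"label": "cyclist", "ambiguity": 0.3}]
train_set = filter_ambiguous(data)
```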
The diversity of real-world environments necessitates that neural network models expand from closed category settings to accommodate novel emerging categories. In this paper, we study open-vocabulary object detection (OVD), which facilitates the detection of novel object classes under the supervision of only base annotations and open-vocabulary knowledge. However, we find that the inadequacy of neighboring relationships between regions during the alignment process inevitably constrains the performance of recent distillation-based OVD strategies. To this end, we propose Neighboring Region Attention Alignment (NRAA), which performs alignment within the attention mechanism over a set of neighboring regions to boost open-vocabulary inference. Specifically, for a given proposal region, we randomly explore the neighboring boxes and apply our proposed neighboring region attention (NRA) mechanism to extract relationship information. This interaction information is then seamlessly fed into the distillation procedure to assist the alignment between the detector and pre-trained vision-language models (VLMs). Extensive experiments validate that our proposed model exhibits superior performance on open-vocabulary benchmarks.
https://arxiv.org/abs/2405.08593
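A schematic of the two moving parts as we read the abstract above, with all shapes, the jittering rule, and the cosine alignment objective being our assumptions: neighboring boxes are sampled around a proposal, their features enrich the proposal via attention, and the enriched feature is aligned with the frozen VLM's embedding.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sample_neighbor_boxes(box, n=4, jitter=0.2):
    """Randomly shift a proposal box (x1, y1, x2, y2) to get n neighbors."""
    w, h = box[2] - box[0], box[3] - box[1]
    offsets = (torch.rand(n, 2) - 0.5) * jitter * torch.stack([w, h])
    return box.unsqueeze(0) + torch.cat([offsets, offsets], dim=1)

class NeighborRegionAlign(nn.Module):
    """Enrich a proposal feature with neighboring-region context, then align
    it to the pre-trained VLM's embedding of the same region (distillation)."""
    def __init__(self, d=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, proposal_feat, neighbor_feats, vlm_embed):
        # proposal_feat: (B, d); neighbor_feats: (B, N, d); vlm_embed: (B, d)
        q = proposal_feat.unsqueeze(1)
        ctx = torch.cat([q, neighbor_feats], dim=1)
        enriched = self.attn(q, ctx, ctx)[0].squeeze(1)
        return 1 - F.cosine_similarity(enriched, vlm_embed, dim=-1).mean()

box = torch.tensor([10.0, 20.0, 110.0, 220.0])
neighbors = sample_neighbor_boxes(box)   # (4, 4) jittered boxes
```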
In recent years, deep learning has greatly streamlined the process of generating realistic fake face images. Aware of the dangers, researchers have developed various tools to spot these counterfeits. Yet none has asked the fundamental question: which digital manipulations make a real photographic face image fake, while others do not? In this paper, we put face forgery in a semantic context and define that computational methods that alter semantic face attributes beyond human discrimination thresholds are sources of face forgery. Guided by our new definition, we construct a large face forgery image dataset, where each image is associated with a set of labels organized in a hierarchical graph. Our dataset enables two new testing protocols to probe the generalization of face forgery detectors. Moreover, we propose a semantics-oriented face forgery detection method that captures label relations and prioritizes the primary task (i.e., real or fake face detection). We show that the proposed dataset successfully exposes the weaknesses of current detectors when used as the test set and consistently improves their generalizability when used as the training set. Additionally, we demonstrate the superiority of our semantics-oriented method over traditional binary and multi-class classification-based detectors.
https://arxiv.org/abs/2405.08487
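One way to read "captures label relations and prioritizes the primary task" is a weighted multi-task objective over the hierarchical attribute labels; a minimal sketch, with weights and the flat treatment of the hierarchy being our assumptions:

```python
import torch
import torch.nn.functional as F

def semantics_oriented_loss(real_fake_logits, attr_logits, real_fake_gt,
                            attr_gt, primary_w=1.0, attr_w=0.3):
    """Multi-task objective: auxiliary losses on hierarchical attribute
    labels regularize the detector, while the primary real-vs-fake task
    keeps the largest weight. Weights are hypothetical."""
    primary = F.cross_entropy(real_fake_logits, real_fake_gt)
    auxiliary = sum(F.cross_entropy(l, g) for l, g in zip(attr_logits, attr_gt))
    return primary_w * primary + attr_w * auxiliary
```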
In this work, we introduce a novel method for calculating the 6DoF pose of an object from a single RGB-D image. Unlike existing methods that either directly predict objects' poses or rely on sparse keypoints for pose recovery, our approach addresses this challenging task using dense correspondence: we regress the object coordinates for each visible pixel. Our method builds on existing object detection methods. We incorporate a re-projection mechanism that adjusts the camera's intrinsic matrix to accommodate cropping of the RGB-D images. Moreover, we transform the 3D object coordinates into a residual representation, which effectively reduces the output space and yields superior performance. We conducted extensive experiments to validate the efficacy of our approach for 6D pose estimation. Our approach outperforms most previous methods, especially in occlusion scenarios, and demonstrates notable improvements over the state-of-the-art methods. Our code is available on this https URL.
https://arxiv.org/abs/2405.08483
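Two of the ingredients above are easy to make concrete. When a detected object is cropped out of the frame (and resized), the pinhole intrinsics must follow the crop, and regressing coordinates relative to a per-object reference shrinks the output range. A sketch under assumed conventions, not the paper's exact implementation:

```python
import numpy as np

def adjust_intrinsics_for_crop(K, x0, y0, scale=1.0):
    """Shift the principal point by the crop origin (x0, y0), then scale
    focal lengths and principal point by the resize factor."""
    K = K.astype(np.float64).copy()
    K[0, 2] -= x0   # cx follows the crop
    K[1, 2] -= y0   # cy follows the crop
    K[:2] *= scale  # fx, fy, cx, cy follow the resize
    return K

def to_residual_coords(obj_coords, reference):
    """Express per-pixel 3D object coordinates relative to a per-object
    reference point (e.g., the coordinate mean), reducing the output space."""
    return obj_coords - reference

K = np.array([[600.0, 0, 320], [0, 600.0, 240], [0, 0, 1]])
print(adjust_intrinsics_for_crop(K, x0=100, y0=50, scale=2.0))
```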
Reports regarding the misuse of $\textit{Generative AI}$ ($\textit{GenAI}$) to create harmful deepfakes are emerging daily. Recently, defensive watermarking, which enables $\textit{GenAI}$ providers to hide fingerprints in their images for later use in deepfake detection, has been on the rise. Yet, its potential has not been fully explored. We present $\textit{UnMarker}$ -- the first practical $\textit{universal}$ attack on defensive watermarking. Unlike existing attacks, $\textit{UnMarker}$ requires no detector feedback, no unrealistic knowledge of the scheme or similar models, and no advanced denoising pipelines that may not be available. Instead, building on an in-depth analysis of the watermarking paradigm that reveals that robust schemes must construct their watermarks in the spectral amplitudes, $\textit{UnMarker}$ employs two novel adversarial optimizations to disrupt the spectra of watermarked images, erasing the watermarks. Evaluations against the $\textit{SOTA}$ prove its effectiveness: it not only defeats traditional schemes while retaining superior quality compared to existing attacks, but also breaks $\textit{semantic}$ watermarks that alter the image's structure, reducing the best detection rate to $43\%$ and rendering them useless. To our knowledge, $\textit{UnMarker}$ is the first practical attack on $\textit{semantic}$ watermarks, which have been deemed the future of robust watermarking. $\textit{UnMarker}$ casts doubt on the very potential of this countermeasure and exposes its paradoxical nature, as designing schemes for robustness inevitably compromises other robustness aspects.
https://arxiv.org/abs/2405.08363
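The core insight above, that robust watermarks must live in the spectral amplitudes, suggests toy attacks along the following lines: optimize a multiplicative perturbation of the FFT magnitudes (phase kept) under an image-space fidelity penalty. This is our simplified stand-in, not the paper's actual objectives or filters.

```python
import torch

def disrupt_spectral_amplitudes(img, steps=200, lr=1e-2, fidelity_w=50.0):
    """img: (C, H, W) float tensor in [0, 1]. Returns a version whose FFT
    magnitudes are adversarially perturbed, trading spectral deviation
    against pixel-space fidelity."""
    spec = torch.fft.fft2(img)
    amp, phase = spec.abs(), spec.angle()
    # Small random init so the |delta| term has a non-zero gradient.
    delta = (0.01 * torch.randn_like(amp)).requires_grad_(True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        recon = torch.fft.ifft2(torch.polar(amp * (1 + delta), phase)).real
        # Push amplitudes away from their original values (erasing any
        # amplitude-domain watermark) while staying visually close.
        loss = -delta.abs().mean() + fidelity_w * (recon - img).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return torch.fft.ifft2(torch.polar(amp * (1 + delta), phase)).real

out = disrupt_spectral_amplitudes(torch.rand(3, 64, 64))
```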
With the rapid advancement of generative AI, multimodal deepfakes, which manipulate both audio and visual modalities, have drawn increasing public concern. Deepfake detection has consequently emerged as a crucial strategy for countering these growing threats. However, as a key factor in training and validating deepfake detectors, most existing deepfake datasets primarily focus on the visual modality, and the few that are multimodal employ outdated techniques and limit their audio content to a single language, thereby failing to represent the cutting-edge advancements and globalization trends in current deepfake technologies. To address this gap, we propose a novel, multilingual, and multimodal deepfake dataset: PolyGlotFake. It includes content in seven languages, created using a variety of cutting-edge and popular text-to-speech, voice cloning, and lip-sync technologies. We conduct comprehensive experiments using state-of-the-art detection methods on the PolyGlotFake dataset. These experiments demonstrate the dataset's significant challenges and its practical value in advancing research into multimodal deepfake detection.
https://arxiv.org/abs/2405.08838
In large-scale disaster events, the planning of optimal rescue routes depends on the object detection ability at the disaster scene, with one of the main challenges being the presence of dense and occluded objects. Existing methods, which are typically based on the RGB modality, struggle to distinguish targets with similar colors and textures in crowded environments and are unable to identify obscured objects. To this end, we first construct two multimodal dense and occluded vehicle detection datasets for large-scale events, utilizing RGB and height map modalities. Based on these datasets, we propose a multimodal collaboration network for dense and occluded vehicle detection, MuDet for short. MuDet hierarchically enhances the completeness of discriminable information within and across modalities and differentiates between simple and complex samples. MuDet includes three main modules: Unimodal Feature Hierarchical Enhancement (Uni-Enh), Multimodal Cross Learning (Mul-Lea), and the Hard-easy Discriminative (He-Dis) pattern. Uni-Enh and Mul-Lea enhance the features within each modality and facilitate the cross-integration of features from the two heterogeneous modalities. He-Dis effectively separates densely occluded vehicle targets, which show significant intra-class differences and minimal inter-class differences, by defining and thresholding confidence values, thereby suppressing the complex background. Experimental results on two re-labeled multimodal benchmark datasets, the 4K-SAI-LCS dataset and the ISPRS Potsdam dataset, demonstrate the robustness and generalization of MuDet. The codes of this work are available openly at \url{this https URL}.
https://arxiv.org/abs/2405.08251
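The He-Dis idea of separating easy from hard samples by confidence can be sketched as a simple partition; the thresholds below are placeholders, not the paper's values:

```python
import torch

def hard_easy_split(confidences, hi=0.7, lo=0.3):
    """Partition predictions by confidence: confident detections are 'easy',
    mid-confidence ones are 'hard' (densely occluded candidates), and
    low-confidence ones are suppressed as complex background."""
    easy = confidences >= hi
    hard = (confidences < hi) & (confidences > lo)
    background = confidences <= lo
    return easy, hard, background

easy, hard, bg = hard_easy_split(torch.tensor([0.9, 0.5, 0.1]))
```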
Cyberharassment is a critical, socially relevant cybersecurity problem because of the adverse effects it can have on targeted groups or individuals. While progress has been made in understanding cyberharassment, its detection, attacks on artificial intelligence (AI) based cyberharassment systems, and the social problems in cyberharassment detectors, little has been done in designing experiential learning educational materials that engage students in this emerging area of social cybersecurity in the era of AI. Experiential learning opportunities are usually provided through capstone projects and engineering design courses in STEM programs such as computer science. While capstone projects are an excellent example of experiential learning, given the interdisciplinary nature of this emerging social cybersecurity problem, it can be challenging to use them to engage non-computing students without prior knowledge of AI. This motivated us to develop a hands-on lab platform that provides experiential learning experiences to non-computing students with little or no background knowledge in AI; we also discuss the lessons learned in developing this lab. In this lab, used by social science students at North Carolina A&T State University across two semesters (spring and fall) in 2022, students are given a detailed lab manual and complete a set of well-defined tasks. Through this process, students learn AI concepts and the application of AI to cyberharassment detection. Using pre- and post-surveys, we asked students to rate their knowledge or skills in AI and their understanding of the concepts learned. The results revealed that the students moderately understood the concepts of AI and cyberharassment.
https://arxiv.org/abs/2405.08125
Many commercial and open-source models claim to detect machine-generated text with very high accuracy (99\% or higher). However, very few of these detectors are evaluated on shared benchmark datasets, and even when they are, the datasets used for evaluation are insufficiently challenging -- lacking variation in sampling strategies, adversarial attacks, and open-source generative models. In this work we present RAID: the largest and most challenging benchmark dataset for machine-generated text detection. RAID includes over 6 million generations spanning 11 models, 8 domains, 11 adversarial attacks, and 4 decoding strategies. Using RAID, we evaluate the out-of-domain and adversarial robustness of 8 open- and 4 closed-source detectors and find that current detectors are easily fooled by adversarial attacks, variations in sampling strategies, repetition penalties, and unseen generative models. We release our dataset and tools to encourage further exploration into detector robustness.
https://arxiv.org/abs/2405.07940
Autonomous driving systems require a quick and robust perception of the nearby environment to carry out their routines effectively. With the aim of avoiding collisions and driving safely, autonomous driving systems rely heavily on object detection. However, 2D object detections alone are insufficient; more information, such as relative velocity and distance, is required for safer planning. Monocular 3D object detectors try to solve this problem by directly predicting 3D bounding boxes and object velocities given a camera image. Recent research estimates time-to-contact in a per-pixel manner and suggests that it is a more effective measure than velocity and depth combined. However, per-pixel time-to-contact requires object detection to serve its purpose effectively and hence increases overall computational requirements, as two different models need to run. To address this issue, we propose per-object time-to-contact estimation by extending object detection models to additionally predict the time-to-contact attribute for each object. We compare our proposed approach with existing time-to-contact methods and provide benchmarking results on well-known datasets. Our proposed approach achieves higher precision than prior art while using a single image.
https://arxiv.org/abs/2405.07698
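Extending a detection head with a per-object time-to-contact output is a one-line change at the architecture level. A minimal sketch follows; the inverse-TTC parametrization is a common stability trick and our assumption, not necessarily the paper's choice:

```python
import torch
import torch.nn as nn

class DetectionHeadWithTTC(nn.Module):
    """Classification + box regression + one extra per-object scalar:
    (inverse) time-to-contact, trained alongside the usual targets."""
    def __init__(self, in_dim=256, num_classes=10):
        super().__init__()
        self.cls = nn.Linear(in_dim, num_classes)
        self.box = nn.Linear(in_dim, 4)
        self.inv_ttc = nn.Linear(in_dim, 1)

    def forward(self, obj_feats):   # obj_feats: (B, N, in_dim)
        return self.cls(obj_feats), self.box(obj_feats), self.inv_ttc(obj_feats)

head = DetectionHeadWithTTC()
logits, boxes, inv_ttc = head(torch.rand(2, 100, 256))
ttc_seconds = 1.0 / inv_ttc.clamp(min=1e-6)   # recover TTC from its inverse
```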
Monocular 3D object detection aims for precise 3D localization and identification of objects from a single-view image. Despite recent progress, it often struggles when handling pervasive object occlusions, which tend to complicate and degrade the prediction of object dimensions, depths, and orientations. We design MonoMAE, a monocular 3D detector inspired by Masked Autoencoders that addresses the object occlusion issue by masking and reconstructing objects in the feature space. MonoMAE consists of two novel designs. The first is depth-aware masking, which selectively masks certain parts of non-occluded object queries in the feature space to simulate occluded object queries for network training. It masks non-occluded object queries by adaptively balancing the masked and preserved query portions according to the depth information. The second is lightweight query completion, which works with the depth-aware masking to learn to reconstruct and complete the masked object queries. With the proposed object occlusion and completion, MonoMAE learns enriched 3D representations that achieve superior monocular 3D detection performance, qualitatively and quantitatively, for both occluded and non-occluded objects. Additionally, MonoMAE learns generalizable representations that work well in new domains.
https://arxiv.org/abs/2405.07696
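A plausible reading of the depth-aware masking step above, with the nearer-objects-masked-more rule and all ranges being our assumptions rather than the paper's exact scheme:

```python
import torch

def depth_aware_mask(queries, depths, d_min=5.0, d_max=60.0, max_ratio=0.7):
    """Simulate occlusion by zeroing a depth-dependent fraction of each
    non-occluded query's channels: the masking ratio is adapted per query
    from its depth (here, nearer objects are masked more heavily).

    queries: (N, C) object query features; depths: (N,) depths in meters.
    """
    t = ((depths - d_min) / (d_max - d_min)).clamp(0.0, 1.0)
    ratios = max_ratio * (1.0 - t)                    # per-query mask ratio
    keep = torch.rand_like(queries) >= ratios.unsqueeze(-1)
    return queries * keep

masked = depth_aware_mask(torch.rand(32, 256), torch.rand(32) * 60)
```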
Depth images and thermal images contain spatial geometry information and surface temperature information, respectively, which can act as complementary information for the RGB modality. However, the quality of the depth and thermal images is often unreliable in some challenging scenarios, which results in performance degradation for two-modal salient object detection (SOD). Meanwhile, some researchers have paid attention to the triple-modal SOD task, where they attempt to explore the complementarity of the RGB image, the depth image, and the thermal image. However, existing triple-modal SOD methods fail to perceive the quality of depth maps and thermal images, which leads to performance degradation when dealing with scenes with low-quality depth and thermal images. Therefore, we propose a quality-aware selective fusion network (QSF-Net) to conduct VDT salient object detection; it contains three subnets: the initial feature extraction subnet, the quality-aware region selection subnet, and the region-guided selective fusion subnet. Firstly, besides extracting features, the initial feature extraction subnet generates a preliminary prediction map from each modality via a shrinkage pyramid architecture. Then, we design a weakly-supervised quality-aware region selection subnet to generate the quality-aware maps. Concretely, we first find the high-quality and low-quality regions by using the preliminary predictions, which further constitute the pseudo label that can be used to train this subnet. Finally, the region-guided selective fusion subnet purifies the initial features under the guidance of the quality-aware maps, and then fuses the triple-modal features and refines the edge details of the prediction maps through the intra-modality and inter-modality attention (IIA) module and the edge refinement (ER) module, respectively. Extensive experiments are performed on the VDT-2048 dataset.
https://arxiv.org/abs/2405.07655
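The weak supervision described above can be sketched as an agreement test on the preliminary prediction maps; the agreement rule and tolerance below are our assumptions, not the paper's exact construction:

```python
import torch

def quality_pseudo_label(prelim_pred, reference, tol=0.25):
    """Pseudo label for the quality-aware region selection subnet: regions
    where a modality's preliminary saliency map agrees with a reference map
    are marked high-quality (1); disagreeing regions are low-quality (0)."""
    return ((prelim_pred - reference).abs() <= tol).float()

label = quality_pseudo_label(torch.rand(1, 1, 224, 224),
                             torch.rand(1, 1, 224, 224))
```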
Deep Neural Networks (DNNs) require large amounts of annotated training data for good performance. Often this data is generated using manual labeling (error-prone and time-consuming) or rendering (requiring geometry and material information). Both approaches make it difficult or uneconomic to apply them to many small-scale applications. A fast and straightforward approach to acquiring the necessary training data would allow the adoption of deep learning for even the smallest of applications. Chroma keying is the process of replacing a color (usually blue or green) with another background. Instead of chroma keying, we propose luminance keying for fast and straightforward training image acquisition. We deploy a black screen with high light absorption (99.99\%) to record roughly 1-minute-long videos of our target objects, circumventing typical problems of chroma keying, such as color bleeding or color overlap between the background color and the object color. Next, we automatically mask our objects using simple brightness thresholding, removing the need for manual annotation. Finally, we automatically place the objects on random backgrounds and train a 2D object detector. We extensively evaluate performance on the widely-used YCB-V object set and compare favourably to other conventional techniques such as rendering, without needing 3D meshes, materials, or any other information about our target objects, and in a fraction of the time needed by other approaches. Our work demonstrates highly accurate training data acquisition that allows training state-of-the-art networks within minutes.
https://arxiv.org/abs/2405.07653
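The acquisition pipeline above reduces to a few lines of array processing. A minimal NumPy sketch, with the brightness threshold and luminance proxy as hypothetical choices:

```python
import numpy as np

def luminance_key(frame, background, threshold=30):
    """Composite an object recorded on a light-absorbing black screen onto a
    random background: pixels brighter than the threshold are the object.

    frame, background: (H, W, 3) uint8 arrays of the same size.
    Returns the composite image and the binary object mask."""
    luminance = frame.mean(axis=2)     # simple brightness proxy
    mask = luminance > threshold       # object = bright pixels
    out = background.copy()
    out[mask] = frame[mask]
    return out, mask

frame = np.zeros((480, 640, 3), np.uint8)
frame[200:280, 300:380] = 180          # stand-in "object" patch
bg = np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8)
composite, mask = luminance_key(frame, bg)
# `mask` doubles as the annotation: its bounding box labels the 2D detector.
```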
Deep neural network (DNN) models are widely used for object detection in automated driving systems (ADS). Yet, such models are prone to errors which can have serious safety implications. Introspection and self-assessment models that aim to detect such errors are therefore of paramount importance for the safe deployment of ADS. Current research on this topic has focused on techniques to monitor the integrity of the perception mechanism in ADS. Existing introspection models in the literature, however, largely concentrate on detecting perception errors by assigning equal importance to all parts of the input data frame fed to the perception module. This generic approach overlooks the varying safety significance of different objects within a scene, which obscures the recognition of safety-critical errors and poses challenges in assessing the reliability of perception in specific, crucial instances. Motivated by this shortcoming of the state of the art, this paper proposes a novel method that integrates the analysis of the raw activation patterns of the underlying DNNs, employed by the perception module, with spatial filtering techniques. This novel approach enhances the accuracy of runtime introspection of DNN-based 3D object detections by selectively focusing on an area of interest in the data, thereby contributing to the safety and efficacy of ADS perception self-assessment processes.
https://arxiv.org/abs/2405.07600
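The spatial-filtering idea can be illustrated with a toy monitor that summarizes activations only inside a safety-relevant region of interest; the shapes and statistics here are placeholders, not the paper's design:

```python
import torch

def roi_activation_summary(activations, roi):
    """Restrict a DNN activation map to a region of interest and compute
    summary statistics an introspection model could consume.

    activations: (C, H, W) raw activations from the perception backbone.
    roi: (x1, y1, x2, y2) in feature-map coordinates.
    """
    x1, y1, x2, y2 = roi
    patch = activations[:, y1:y2, x1:x2]
    return torch.stack([patch.mean(), patch.std(), patch.abs().max()])

summary = roi_activation_summary(torch.rand(256, 50, 50), (10, 10, 30, 30))
```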
Object detection techniques for Unmanned Aerial Vehicles (UAVs) rely on Deep Neural Networks (DNNs), which are vulnerable to adversarial attacks. Nonetheless, existing algorithms for generating adversarial patches in the UAV domain pay very little attention to the naturalness of those patches. Moreover, imposing constraints directly on adversarial patches makes it difficult to generate patches that appear natural to the human eye while ensuring a high attack success rate. We notice that patches look natural when their overall color is consistent with the environment. Therefore, we propose a new method named Environmental Matching Attack (EMA) to address the issue of optimizing the adversarial patch under color constraints. To the best of our knowledge, this paper is the first to consider natural patches in the domain of UAVs. The EMA method exploits the strong prior knowledge of a pretrained stable diffusion model to guide the optimization direction of the adversarial patch, where the text guidance can restrict the color of the patch. To better match the environment, the contrast and brightness of the patch are appropriately adjusted. Instead of optimizing the adversarial patch itself, we optimize an adversarial perturbation patch initialized to zero, so that the model can better trade off attack performance and naturalness. Experiments conducted on the DroneVehicle and Carpk datasets show that our work reaches nearly the same attack performance in the digital attack (a drop of no more than 2 mAP$\%$), surpasses the baseline method in specific physical scenarios, and exhibits a significant advantage in terms of naturalness in visualization and color difference with the environment.
https://arxiv.org/abs/2405.07595
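The zero-initialized perturbation-patch trick above can be sketched independently of the diffusion prior (omitted here); the detector loss and color penalty below are placeholders, not the paper's objectives:

```python
import torch

def optimize_perturbation_patch(patch_base, detector_loss_fn, env_color,
                                steps=100, lr=5e-3, color_w=1.0):
    """Optimize an additive perturbation (initialized to zero) on top of a
    base patch, trading attack strength against color consistency with the
    environment. `detector_loss_fn` maps a patch to the attack objective.

    patch_base: (3, H, W) in [0, 1]; env_color: (3,) mean environment color.
    """
    delta = torch.zeros_like(patch_base, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        patch = (patch_base + delta).clamp(0, 1)
        # Attack term (placeholder) plus a pull toward the environment color.
        color_pen = (patch.mean(dim=(1, 2)) - env_color).pow(2).sum()
        loss = detector_loss_fn(patch) + color_w * color_pen
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (patch_base + delta).detach().clamp(0, 1)
```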
Berry picking has long-standing traditions in Finland, yet it is challenging and can be dangerous. The integration of drones equipped with advanced imaging techniques represents a transformative leap forward, optimising harvests and promising sustainable practices. We propose WildBe, the first image dataset of wild berries captured in peatlands and under the canopy of Finnish forests using drones. Unlike previous and related datasets, WildBe includes new varieties of berries, such as bilberries, cloudberries, lingonberries, and crowberries, captured under severe light variations and in cluttered environments. WildBe features 3,516 images with a total of 18,468 annotated bounding boxes. We carry out a comprehensive analysis of WildBe using six popular object detectors, assessing their effectiveness in berry detection across different forest regions and camera types. We will release WildBe publicly.
https://arxiv.org/abs/2405.07550