Accurate detection of vulvovaginal candidiasis is critical for women's health, yet its sparse distribution and visually ambiguous characteristics pose significant challenges for accurate identification by pathologists and neural networks alike. Our eye-tracking data reveals that areas garnering sustained attention - yet not marked by experts after deliberation - are often aligned with false positives of neural networks. Leveraging this finding, we introduce Gaze-DETR, a pioneering method that integrates gaze data to enhance neural network precision by diminishing false positives. Gaze-DETR incorporates a universal gaze-guided warm-up protocol applicable across various detection methods and a gaze-guided rectification strategy specifically designed for DETR-based models. Our comprehensive tests confirm that Gaze-DETR surpasses existing leading methods, showcasing remarkable improvements in detection accuracy and generalizability.
https://arxiv.org/abs/2405.09463
Localizing oneself during endoscopic procedures can be problematic due to the lack of distinguishable textures and landmarks, as well as difficulties caused by the endoscopic device itself, such as a limited field of view and challenging lighting conditions. Expert knowledge shaped by years of experience is required for localization within the human body during endoscopic procedures. In this work, we present a deep learning method based on anatomy recognition that constructs a surgical path in an unsupervised manner from surgical videos, modelling relative location and variations due to different viewing angles. At inference time, the model can map an unseen video's frames onto the path and estimate the viewing angle, aiming to provide guidance, for instance, to reach a particular destination. We test the method on a dataset consisting of surgical videos of transsphenoidal adenomectomies, as well as on a synthetic dataset. An online tool that lets researchers upload their surgical videos to obtain anatomy detections, together with the weights of the trained YOLOv7 model, is available at: this https URL.
https://arxiv.org/abs/2405.09355
Deep learning classifiers are prone to latching onto dominant confounders present in a dataset rather than on the causal markers associated with the target class, leading to poor generalization and biased predictions. Although explainability via counterfactual image generation has been successful at exposing the problem, bias mitigation strategies that permit accurate explainability in the presence of dominant and diverse artifacts remain unsolved. In this work, we propose the DeCoDEx framework and show how an external, pre-trained binary artifact detector can be leveraged during inference to guide a diffusion-based counterfactual image generator towards accurate explainability. Experiments on the CheXpert dataset, using both synthetic artifacts and real visual artifacts (support devices), show that the proposed method successfully synthesizes the counterfactual images that change the causal pathology markers associated with Pleural Effusion while preserving or ignoring the visual artifacts. Augmentation of ERM and Group-DRO classifiers with the DeCoDEx generated images substantially improves the results across underrepresented groups that are out of distribution for each class. The code is made publicly available at this https URL.
https://arxiv.org/abs/2405.09288
Despite the recent ubiquity of large language models and their high zero-shot prompted performance across a wide range of tasks, it is still not known how well they perform on tasks which require processing of potentially idiomatic language. In particular, how well do such models perform in comparison to encoder-only models fine-tuned specifically for idiomaticity tasks? In this work, we attempt to answer this question by looking at the performance of a range of LLMs (both local and software-as-a-service models) on three idiomaticity datasets: SemEval 2022 Task 2a, FLUTE, and MAGPIE. Overall, we find that whilst these models do give competitive performance, they do not match the results of fine-tuned task-specific models, even at the largest scales (e.g. for GPT-4). Nevertheless, we do see consistent performance improvements across model scale. Additionally, we investigate prompting approaches to improve performance, and discuss the practicalities of using LLMs for these tasks.
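A toy sketch of the zero-shot prompting setup evaluated here: a single classification prompt asking whether a potentially idiomatic expression is used idiomatically or literally. The template wording is our own assumption; any chat-style LLM client could consume the string it builds.

```python
def idiomaticity_prompt(sentence: str, expression: str) -> str:
    """Build a zero-shot prompt asking an LLM to classify idiomatic usage.

    The phrasing is illustrative only; datasets such as SemEval 2022 Task 2a
    frame the task as idiomatic-vs-literal classification of an expression
    in context.
    """
    return (
        "In the sentence below, is the given expression used idiomatically "
        "or literally? Answer with one word.\n"
        f"Sentence: {sentence}\n"
        f"Expression: {expression}\n"
        "Answer:"
    )

prompt = idiomaticity_prompt("He kicked the bucket last night.", "kicked the bucket")
```

The returned string would be sent as a user message to whichever local or hosted model is being evaluated.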
https://arxiv.org/abs/2405.09279
In this work, we present Score MUsic Graph (SMUG)-Explain, a framework for generating and visualizing explanations of graph neural networks applied to arbitrary prediction tasks on musical scores. Our system allows the user to visualize the contribution of input notes (and note features) to the network output, directly in the context of the musical score. We provide an interactive interface based on the music notation engraving library Verovio. We showcase the usage of SMUG-Explain on the task of cadence detection in classical music. All code is available on this https URL.
https://arxiv.org/abs/2405.09241
We propose a new graph convolutional block, called MusGConv, specifically designed for the efficient processing of musical score data and motivated by general perceptual principles. It focuses on two fundamental dimensions of music, pitch and rhythm, and considers both relative and absolute representations of these components. We evaluate our approach on four different musical understanding problems: monophonic voice separation, harmonic analysis, cadence detection, and composer identification which, in abstract terms, translate to different graph learning problems, namely, node classification, link prediction, and graph classification. Our experiments demonstrate that MusGConv improves the performance on three of the aforementioned tasks while being conceptually very simple and efficient. We interpret this as evidence that it is beneficial to include perception-informed processing of fundamental musical concepts when developing graph network applications on musical score data.
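As a hedged illustration of the relative-plus-absolute idea, the sketch below builds an edge feature for two connected notes from the pitch interval and inter-onset interval (relative) alongside the destination note's raw pitch and onset (absolute). The exact feature layout is an assumption for illustration, not MusGConv's definition.

```python
import numpy as np

def edge_features(src_note, dst_note):
    """Combine relative and absolute pitch/rhythm for one score-graph edge.

    src_note, dst_note: dicts with MIDI 'pitch' and 'onset' (in beats).
    Returns [pitch interval, inter-onset interval, dst pitch, dst onset]:
    the first two entries are relative, the last two absolute.
    """
    return np.array([
        dst_note["pitch"] - src_note["pitch"],  # relative pitch (semitones)
        dst_note["onset"] - src_note["onset"],  # relative rhythm (beats)
        dst_note["pitch"],                      # absolute pitch
        dst_note["onset"],                      # absolute onset
    ], dtype=float)

c4 = {"pitch": 60, "onset": 0.0}  # middle C on the downbeat
e4 = {"pitch": 64, "onset": 1.5}  # E a major third above, 1.5 beats later
feat = edge_features(c4, e4)
```

A message-passing layer would then condition each message on such edge vectors rather than on node features alone.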
https://arxiv.org/abs/2405.09224
Our study addresses a significant gap in online hate speech detection research by focusing on homophobia, an area often neglected in sentiment analysis research. Utilising advanced sentiment analysis models, particularly BERT, and traditional machine learning methods, we developed a nuanced approach to identify homophobic content on X/Twitter. This research is pivotal due to the persistent underrepresentation of homophobia in detection models. Our findings reveal that while BERT outperforms traditional methods, the choice of validation technique can impact model performance. This underscores the importance of contextual understanding in detecting nuanced hate speech. By releasing the largest open-source labelled English dataset for homophobia detection known to us, an analysis of various models' performance and our strongest BERT-based model, we aim to enhance online safety and inclusivity. Future work will extend to broader LGBTQIA+ hate speech detection, addressing the challenges of sourcing diverse datasets. Through this endeavour, we contribute to the larger effort against online hate, advocating for a more inclusive digital landscape. Our study not only offers insights into the effective detection of homophobic content by improving on previous research results, but it also lays groundwork for future advancements in hate speech analysis.
https://arxiv.org/abs/2405.09221
Due to the increasing need for effective security measures and the integration of cameras in commercial products, a huge amount of visual data is created today. Law enforcement agencies (LEAs) are inspecting images and videos to find radicalization, propaganda for terrorist organizations and illegal products on darknet markets. This is time consuming. Instead of an undirected search, LEAs would like to adapt to new crimes and threats, and focus only on data from specific locations, persons or objects, which requires flexible interpretation of image content. Visual concept detection with deep convolutional neural networks (CNNs) is a crucial component to understand the image content. This paper has five contributions. The first contribution allows image-based geo-localization to estimate the origin of an image. CNNs and geotagged images are used to create a model that determines the location of an image by its pixel values. The second contribution enables analysis of fine-grained concepts to distinguish sub-categories in a generic concept. The proposed method encompasses data acquisition and cleaning and concept hierarchies. The third contribution is the recognition of person attributes (e.g., glasses or moustache) to enable query by textual description for a person. The person-attribute problem is treated as a specific sub-task of concept classification. The fourth contribution is an intuitive image annotation tool based on active learning. Active learning allows users to define novel concepts flexibly and train CNNs with minimal annotation effort. The fifth contribution increases the flexibility for LEAs in the query definition by using query expansion. Query expansion maps user queries to known and detectable concepts. Therefore, no prior knowledge of the detectable concepts is required for the users. The methods are validated on data with varying locations (popular and non-touristic locations), varying person attributes (CelebA dataset), and varying number of annotations.
https://arxiv.org/abs/2405.09194
Anomaly detection and localization without any manual annotations or prior knowledge is a challenging task in the unsupervised-learning setting. Existing works achieve excellent performance in anomaly detection, but with complex networks or cumbersome pipelines. To address this issue, this paper explores a simple but effective architecture for anomaly detection. It consists of a well pre-trained encoder to extract hierarchical feature representations and a decoder to reconstruct these intermediate features from the encoder. In particular, it does not require any data augmentation or anomalous images for training. Anomalies can be detected when the decoder fails to reconstruct features well; the errors of hierarchical feature reconstruction are then aggregated into an anomaly map to achieve anomaly localization. Comparing the features of the encoder and decoder leads to more accurate and robust localization results than the single-feature or pixel-by-pixel comparisons of conventional works. Experiment results show that the proposed method outperforms the state-of-the-art methods on the MNIST, Fashion-MNIST, CIFAR-10, and MVTec Anomaly Detection datasets on both anomaly detection and localization.
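The aggregation step described above can be sketched as follows: per-layer reconstruction errors between encoder and decoder features are upsampled to a common resolution and averaged into an anomaly map. The cosine-distance error and nearest-neighbour upsampling are illustrative assumptions, not necessarily the paper's exact choices.

```python
import numpy as np

def layer_error_map(enc_feat, dec_feat):
    """Per-pixel cosine distance between encoder and decoder features.

    enc_feat, dec_feat: (C, H, W) arrays from the same network stage.
    Returns an (H, W) error map: 1 - cosine similarity along channels.
    """
    num = (enc_feat * dec_feat).sum(axis=0)
    den = np.linalg.norm(enc_feat, axis=0) * np.linalg.norm(dec_feat, axis=0) + 1e-8
    return 1.0 - num / den

def anomaly_map(pairs, out_hw):
    """Upsample each layer's error map to out_hw and average them."""
    acc = np.zeros(out_hw)
    for enc, dec in pairs:
        err = layer_error_map(enc, dec)
        # nearest-neighbour upsampling keeps the sketch dependency-free
        ry, rx = out_hw[0] // err.shape[0], out_hw[1] // err.shape[1]
        acc += np.repeat(np.repeat(err, ry, axis=0), rx, axis=1)
    return acc / len(pairs)

rng = np.random.default_rng(0)
pairs = [(rng.normal(size=(8, 16, 16)), rng.normal(size=(8, 16, 16))),
         (rng.normal(size=(16, 8, 8)), rng.normal(size=(16, 8, 8)))]
amap = anomaly_map(pairs, (32, 32))
```

Pixels where the decoder reproduces the encoder features get near-zero error, so anomalous regions dominate the averaged map.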
https://arxiv.org/abs/2405.09148
Current speaker diarization systems rely on an external voice activity detection model prior to speaker embedding extraction on the detected speech segments. In this paper, we establish that the attention system of a speaker embedding extractor acts as a weakly supervised internal VAD model and performs equally or better than comparable supervised VAD systems. Subsequently, speaker diarization can be performed efficiently by extracting the VAD logits and corresponding speaker embedding simultaneously, alleviating the need and computational overhead of an external VAD model. We provide an extensive analysis of the behavior of the frame-level attention system in current speaker verification models and propose a novel speaker diarization pipeline using ECAPA2 speaker embeddings for both VAD and embedding extraction. The proposed strategy gains state-of-the-art performance on the AMI, VoxConverse and DIHARD III diarization benchmarks.
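A minimal sketch of reading frame-level attention scores as internal VAD logits, with a toy linear scorer standing in for the ECAPA2 attention module described above: frames that the pooling attention emphasises are treated as speech. The scorer, its weights, and the threshold are assumptions for illustration.

```python
import numpy as np

def attention_vad(frames, w, b, threshold=0.0):
    """Reuse pooling-attention scores as voice activity logits.

    frames: (T, D) frame-level features; w: (D,) attention projection; b: bias.
    Returns the per-frame scores (the 'VAD logits') and a boolean speech mask.
    """
    logits = frames @ w + b          # same scores an attentive pooling layer computes
    return logits, logits > threshold

rng = np.random.default_rng(0)
speech = rng.normal(loc=1.0, size=(5, 4))    # toy 'speech' frames
silence = rng.normal(loc=-1.0, size=(5, 4))  # toy 'silence' frames
frames = np.vstack([speech, silence])
w = np.ones(4) / 4.0                          # toy attention projection
logits, mask = attention_vad(frames, w, b=0.0)
```

In the pipeline above, these logits would come for free from the embedding extractor, so no separate VAD forward pass is needed.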
https://arxiv.org/abs/2405.09142
Although face analysis has achieved remarkable improvements in the past few years, designing a multi-task face analysis model is still challenging. Most face analysis tasks are studied as separate problems and do not benefit from the synergy among related tasks. In this work, we propose a novel task-adaptive multi-task face analysis method named Q-Face, which simultaneously performs multiple face analysis tasks with a unified model. We fuse the features from multiple layers of a large-scale pre-trained model so that the whole model can use both local and global facial information to support multiple tasks. Furthermore, we design a task-adaptive module that performs cross-attention between a set of query vectors and the fused multi-stage features and finally adaptively extracts the desired features for each face analysis task. Extensive experiments show that our method can perform multiple tasks simultaneously and achieves state-of-the-art performance on face expression recognition, action unit detection, face attribute analysis, age estimation, and face pose estimation. Compared to conventional methods, our method opens up new possibilities for multi-task face analysis and shows potential in both accuracy and efficiency.
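The task-adaptive module can be sketched as single-head cross-attention between a set of query vectors and the fused features. The shapes and the single-head simplification are assumptions for illustration, not Q-Face's exact architecture.

```python
import numpy as np

def cross_attention(queries, feats):
    """Single-head cross-attention: task queries attend over fused tokens.

    queries: (Q, D) learned task query vectors.
    feats:   (N, D) fused multi-stage feature tokens.
    Returns (Q, D) task-specific feature vectors.
    """
    scores = queries @ feats.T / np.sqrt(feats.shape[1])  # scaled dot product
    scores -= scores.max(axis=1, keepdims=True)           # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)               # softmax over tokens
    return attn @ feats

rng = np.random.default_rng(0)
task_queries = rng.normal(size=(3, 8))   # e.g. one query per face task
fused_tokens = rng.normal(size=(20, 8))  # fused multi-layer features
out = cross_attention(task_queries, fused_tokens)
```

Each task head would then consume its own row of `out`, so all tasks share one backbone while extracting different features.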
https://arxiv.org/abs/2405.09059
The detection and tracking of small targets in passive optical remote sensing (PORS) has broad applications. However, most previously proposed methods seldom utilize the abundant temporal features formed by target motion, resulting in poor detection and tracking performance for low signal-to-clutter ratio (SCR) targets. In this article, we analyze the difficulty of realizing effective detection based on spatial features and its feasibility based on temporal features. According to this analysis, we use multiple frames as a detection unit and propose a detection method based on temporal energy selective scaling (TESS). Specifically, we investigate the composition of the intensity temporal profiles (ITPs) formed by pixels over a multi-frame detection unit. For a target-present pixel, the target passing through the pixel brings a weak transient disturbance to the ITP and changes its statistical properties. We use a well-designed function to amplify the transient disturbance, suppress the background and noise components, and output the trajectory of the target on the multi-frame detection unit. Subsequently, to resolve the contradiction between the detection rate and the false alarm rate brought by traditional threshold segmentation, we associate the temporal and spatial features of the output trajectory and propose a trajectory extraction method based on the 3D Hough transform. Finally, we model the trajectory of the target and propose a trajectory-based multi-target tracking method. Experiments in multiple scenarios prove the superiority of our proposed methods over various state-of-the-art detection and tracking methods.
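To make the ITP idea concrete, the sketch below scores a pixel's temporal profile by its largest robust deviation: a dim target passing through the pixel leaves a weak transient that stands out against a stationary background. The median/MAD scoring function is an assumed stand-in for the paper's TESS function, not its actual form.

```python
import numpy as np

def transient_score(itp):
    """Score how strongly a transient disturbance stands out in an ITP.

    itp: (T,) intensity time series of one pixel across the frame stack.
    Robust statistics (median, MAD) keep a single transient from inflating
    the background estimate, which amplifies the disturbance relative to it.
    """
    med = np.median(itp)
    mad = np.median(np.abs(itp - med)) + 1e-8  # robust noise scale
    return np.abs(itp - med).max() / mad

rng = np.random.default_rng(0)
background = rng.normal(100.0, 1.0, size=64)   # stationary background pixel
target_pixel = background.copy()
target_pixel[30] += 8.0                        # weak transient as the target passes
```

Thresholding such scores per pixel would yield candidate trajectory points for the subsequent 3D Hough trajectory extraction.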
https://arxiv.org/abs/2405.09054
Recent advancements in deep learning for 3D models have propelled breakthroughs in generation, detection, and scene understanding. However, the effectiveness of these algorithms hinges on large training datasets. We address the challenge by introducing Efficient 3D Seam Carving (E3SC), a novel 3D model augmentation method based on seam carving, which progressively deforms only part of the input model while ensuring the overall semantics are unchanged. Experiments show that our approach is capable of producing diverse and high-quality augmented 3D shapes across various types and styles of input models, achieving considerable improvements over previous methods. Quantitative evaluations demonstrate that our method effectively enhances the novelty and quality of shapes generated by other subsequent 3D generation algorithms.
https://arxiv.org/abs/2405.09050
The ambiguous appearance, tiny scale, and fine-grained classes of objects in remote sensing imagery inevitably lead to noisy annotations in the category labels of detection datasets. However, the effects and treatment of label noise are underexplored in modern oriented remote sensing object detectors. To address this issue, we propose a robust oriented remote sensing object detection method with a dynamic loss decay (DLD) mechanism, inspired by the two-phase ``early-learning'' and ``memorization'' dynamics of deep neural networks on clean and noisy samples. To be specific, we first observe the end point of the early-learning phase, termed EL, after which the models begin to memorize the false labels that significantly degrade detection accuracy. Secondly, under the guidance of this training indicator, the losses of all samples are ranked in descending order, and we adaptively decay the top K largest losses (bad samples) in the following epochs, since these large losses are very likely computed with wrong labels. Experimental results show that the method achieves excellent noise resistance on multiple public datasets, such as HRSC2016 and DOTA-v1.0/v2.0 with synthetic category label noise. Our solution also won second place in the noisy-label track ``fine-grained object detection based on sub-meter remote sensing imagery'' of the 2023 National Big Data and Computing Intelligence Challenge.
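The loss-decay step can be sketched directly: before the observed EL epoch every sample keeps full weight, and afterwards the K largest losses are down-weighted, since they most likely stem from noisy labels. The decay factor and K below are placeholder hyper-parameters, not the paper's values.

```python
import numpy as np

def dynamic_loss_decay(losses, epoch, el_epoch, k, decay=0.1):
    """Return per-sample loss weights under a dynamic loss decay scheme.

    losses: 1-D array of per-sample losses.
    Before `el_epoch` all weights are 1; afterwards the k largest losses
    (suspected noisy-label samples) are scaled down by `decay`.
    """
    weights = np.ones_like(losses)
    if epoch > el_epoch and k > 0:
        top_k = np.argsort(losses)[::-1][:k]  # indices of the k largest losses
        weights[top_k] = decay
    return weights

losses = np.array([0.2, 3.5, 0.4, 2.9, 0.1])
w_before = dynamic_loss_decay(losses, epoch=2, el_epoch=5, k=2)  # early learning
w_after = dynamic_loss_decay(losses, epoch=8, el_epoch=5, k=2)   # after EL
```

The training objective would then be `(weights * losses).mean()`, so clean samples keep driving the gradients after EL.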
https://arxiv.org/abs/2405.09024
This paper proposes leveraging vision-language pretraining on bone X-rays paired with French reports to address downstream tasks of interest in bone radiography. A practical processing pipeline is introduced to anonymize and process the French medical reports. Pretraining then consists of the self-supervised alignment of visual and textual embedding spaces derived from deep model encoders. The resulting image encoder is then used to handle various downstream tasks, including quantification of osteoarthritis, estimation of bone age on pediatric wrists, and bone fracture and anomaly detection. Our approach demonstrates competitive performance on downstream tasks compared to alternatives requiring a significantly larger amount of human expert annotations. Our work stands as the first study to integrate French reports to shape an embedding space devoted to bone X-ray representations, capitalizing on the large quantity of paired image and report data available in a hospital. By relying on generic vision-language deep models in a language-specific scenario, it contributes to the deployment of vision models for wider healthcare applications.
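The self-supervised alignment of visual and textual embedding spaces can be sketched as a CLIP-style symmetric contrastive loss over a batch of paired X-ray and report embeddings. The shapes, temperature, and exact loss form are assumptions for illustration, not the paper's confirmed objective.

```python
import numpy as np

def symmetric_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over paired, L2-normalized (B, D) embeddings.

    The i-th image and i-th report form a positive pair; every other
    pairing in the batch is a negative.
    """
    logits = img_emb @ txt_emb.T / temperature  # (B, B) similarity matrix
    labels = np.arange(len(logits))             # positives on the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average of image->text and text->image cross-entropies
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
loss_aligned = symmetric_contrastive_loss(emb, emb)         # correct pairing
loss_shuffled = symmetric_contrastive_loss(emb, emb[::-1])  # wrong pairing
```

Minimizing this loss pulls each X-ray embedding toward its own report and away from the other reports in the batch.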
https://arxiv.org/abs/2405.08932
CLIP models perform remarkably well on zero-shot classification and retrieval tasks. But recent studies have shown that learnt representations in CLIP are not well suited for dense prediction tasks like object detection, semantic segmentation or depth estimation. More recently, multi-stage training methods for CLIP models were introduced to mitigate the weak performance of CLIP on downstream tasks. In this work, we find that simply improving the quality of captions in image-text datasets improves the quality of CLIP's visual representations, resulting in significant improvement on downstream dense prediction vision tasks. In fact, we find that CLIP pretraining with good quality captions can surpass recent supervised, self-supervised and weakly supervised pretraining methods. We show that when a CLIP model with ViT-B/16 as the image encoder is trained on well-aligned image-text pairs, it obtains 12.1% higher mIoU and 11.5% lower RMSE on semantic segmentation and depth estimation tasks than recent state-of-the-art Masked Image Modeling (MIM) pretraining methods like Masked Autoencoder (MAE). We find that mobile architectures also benefit significantly from CLIP pretraining. A recent mobile vision architecture, MCi2, with CLIP pretraining obtains similar performance to Swin-L, pretrained on ImageNet-22k, on the semantic segmentation task while being 6.1$\times$ smaller. Moreover, we show that improving caption quality results in $10\times$ data efficiency when finetuning for dense prediction tasks.
https://arxiv.org/abs/2405.08911
Many query-based approaches for 3D Multi-Object Tracking (MOT) adopt the tracking-by-attention paradigm, utilizing track queries for identity-consistent detection and object queries for identity-agnostic track spawning. Tracking-by-attention, however, entangles detection and tracking queries in one embedding for both the detection and tracking task, which is sub-optimal. Other approaches resemble the tracking-by-detection paradigm, detecting objects using decoupled track and detection queries followed by a subsequent association. These methods, however, do not leverage synergies between the detection and association task. Combining the strengths of both paradigms, we introduce ADA-Track, a novel end-to-end framework for 3D MOT from multi-view cameras. We introduce a learnable data association module based on edge-augmented cross-attention, leveraging appearance and geometric features. Furthermore, we integrate this association module into the decoder layer of a DETR-based 3D detector, enabling simultaneous DETR-like query-to-image cross-attention for detection and query-to-query cross-attention for data association. By stacking these decoder layers, queries are refined for the detection and association task alternately, effectively harnessing the task dependencies. We evaluate our method on the nuScenes dataset and demonstrate the advantage of our approach compared to the two previous paradigms. Code is available at this https URL.
https://arxiv.org/abs/2405.08909
In the realm of autonomous driving, robust perception under out-of-distribution conditions is paramount for the safe deployment of vehicles. Challenges such as adverse weather, sensor malfunctions, and environmental unpredictability can severely impact the performance of autonomous systems. The 2024 RoboDrive Challenge was crafted to propel the development of driving perception technologies that can withstand and adapt to these real-world variabilities. Focusing on four pivotal tasks -- BEV detection, map segmentation, semantic occupancy prediction, and multi-view depth estimation -- the competition laid down a gauntlet to innovate and enhance system resilience against typical and atypical disturbances. This year's challenge consisted of five distinct tracks and attracted 140 registered teams from 93 institutes across 11 countries, resulting in nearly one thousand submissions evaluated through our servers. The competition culminated in 15 top-performing solutions, which introduced a range of innovative approaches including advanced data augmentation, multi-sensor fusion, self-supervised learning for error correction, and new algorithmic strategies to enhance sensor robustness. These contributions significantly advanced the state of the art, particularly in handling sensor inconsistencies and environmental variability. Participants, through collaborative efforts, pushed the boundaries of current technologies, showcasing their potential in real-world scenarios. Extensive evaluations and analyses provided insights into the effectiveness of these solutions, highlighting key trends and successful strategies for improving the resilience of driving perception systems. This challenge has set a new benchmark in the field, providing a rich repository of techniques expected to guide future research in this field.
https://arxiv.org/abs/2405.08816
Datasets labelled by human annotators are widely used in the training and testing of machine learning models. In recent years, researchers are increasingly paying attention to label quality. However, it is not always possible to objectively determine whether an assigned label is correct or not. The present work investigates this ambiguity in the annotation of autonomous driving datasets as an important dimension of data quality. Our experiments show that excluding highly ambiguous data from the training improves model performance of a state-of-the-art pedestrian detector in terms of LAMR, precision and F1 score, thereby saving training time and annotation costs. Furthermore, we demonstrate that, in order to safely remove ambiguous instances and ensure the retained representativeness of the training data, an understanding of the properties of the dataset and class under investigation is crucial.
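A minimal sketch of the exclusion step, assuming each training instance carries an ambiguity score such as the disagreement rate among repeated annotations; the 0.5 threshold and the sample format are placeholders, not values from the paper.

```python
def filter_ambiguous(samples, max_ambiguity=0.5):
    """Keep samples whose annotation-ambiguity score is acceptable.

    samples: iterable of (sample_id, ambiguity) pairs, ambiguity in [0, 1],
    e.g. the fraction of annotators who disagreed on the label.
    Returns the ids of samples retained for training.
    """
    return [sid for sid, amb in samples if amb <= max_ambiguity]

train = [("img_001", 0.1), ("img_002", 0.8), ("img_003", 0.4), ("img_004", 0.9)]
kept = filter_ambiguous(train)
```

As the abstract cautions, the threshold should be chosen with the dataset and class properties in mind so the retained data stays representative.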
https://arxiv.org/abs/2405.08794
Out-of-distribution (OOD) detection is critical when deploying machine learning models in the real world. Outlier exposure methods, which incorporate auxiliary outlier data in the training process, can drastically improve OOD detection performance compared to approaches without advanced training strategies. We introduce Hopfield Boosting, a boosting approach, which leverages modern Hopfield energy (MHE) to sharpen the decision boundary between the in-distribution and OOD data. Hopfield Boosting encourages the model to concentrate on hard-to-distinguish auxiliary outlier examples that lie close to the decision boundary between in-distribution and auxiliary outlier data. Our method achieves a new state-of-the-art in OOD detection with outlier exposure, improving the FPR95 metric from 2.28 to 0.92 on CIFAR-10 and from 11.76 to 7.94 on CIFAR-100.
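A minimal sketch of the modern Hopfield energy (MHE) score that Hopfield Boosting builds on, with in-distribution features acting as stored patterns. Constant terms of the full energy are dropped since they do not affect ranking; this illustrates the energy itself, not the boosting scheme.

```python
import numpy as np

def modern_hopfield_energy(query, patterns, beta=1.0):
    """E(q) = -beta^-1 * logsumexp(beta * X q) + 0.5 * ||q||^2.

    patterns: (N, D) stored (in-distribution) feature vectors X.
    Lower energy means the query lies close to stored patterns, so the
    energy can serve as an OOD score near the decision boundary.
    """
    s = beta * (patterns @ query)                      # (N,) scaled similarities
    lse = s.max() + np.log(np.exp(s - s.max()).sum())  # stable log-sum-exp
    return -lse / beta + 0.5 * (query @ query)

patterns = np.array([[1.0, 0.0], [0.0, 1.0]])          # toy stored patterns
e_near = modern_hopfield_energy(np.array([0.9, 0.1]), patterns, beta=4.0)
e_far = modern_hopfield_energy(np.array([-1.0, -1.0]), patterns, beta=4.0)
```

Queries near the stored in-distribution patterns receive lower energy than far-away ones, which is the signal a boosted OOD detector can sharpen.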
https://arxiv.org/abs/2405.08766