Synthetic aperture radar (SAR) is essential for actively acquiring information for Earth observation. SAR Automatic Target Recognition (ATR) focuses on detecting and classifying various target categories under different image conditions. Current deep learning-based SAR ATR methods are typically designed for specific datasets and applications, and the varied target characteristics, scene backgrounds, and sensor parameters across ATR datasets challenge their generalization. This paper aims to achieve general SAR ATR based on a foundation model with Self-Supervised Learning (SSL). Our motivation is to break through the limitations of specific datasets and conditions and to obtain universal perceptual capabilities across targets, scenes, and sensors. We propose a foundation model named SARATR-X, covering four aspects: the pre-training dataset, the model backbone, SSL, and the evaluation tasks. First, we integrated 14 datasets with various target categories and imaging conditions into a pre-training dataset. Second, different model backbones were compared to find the approaches most suitable for remote-sensing images. Third, we applied two-stage training and SAR gradient features to ensure the diversity and scalability of SARATR-X. Finally, SARATR-X achieves competitive or superior performance on 5 datasets with 8 task settings, showing that a foundation model can achieve universal SAR ATR. We believe it is time to embrace foundation models for SAR image interpretation in the era of ever-growing big data.
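The abstract does not spell out the SAR gradient features, so here is a minimal, hypothetical PyTorch sketch of the general idea: computing a gradient-magnitude map from a SAR image to serve as a reconstruction target in masked image modeling. The function name, the plain Sobel operator, and the log-amplitude input are all assumptions (SAR work often prefers ratio-based gradient operators under multiplicative speckle):

```python
import torch
import torch.nn.functional as F

def sar_gradient_features(img: torch.Tensor) -> torch.Tensor:
    """img: (B, 1, H, W) log-amplitude SAR images; returns the gradient magnitude.

    A plain Sobel magnitude stands in for the paper's SAR gradient features;
    an MIM objective would reconstruct this map for masked patches instead
    of raw (speckle-dominated) pixels.
    """
    kx = torch.tensor([[-1., 0., 1.],
                       [-2., 0., 2.],
                       [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(img, kx, padding=1)
    gy = F.conv2d(img, ky, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)
```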
https://arxiv.org/abs/2405.09365
Recent years have witnessed the rapid development of short videos, which usually contain both visual and audio modalities. Background music is important to short videos and can significantly influence the viewers' emotions. However, at present, the background music of a short video is generally chosen by the video producer, and automatic music recommendation methods for short videos are lacking. This paper introduces MVBind, an innovative Music-Video embedding space Binding model for cross-modal retrieval. MVBind operates as a self-supervised approach, acquiring inherent knowledge of intermodal relationships directly from data, without the need for manual annotations. Additionally, to compensate for the lack of a corresponding music-video pair dataset for short videos, we construct SVM-10K (Short Video with Music-10K), a dataset consisting mainly of meticulously selected short videos. On this dataset, MVBind achieves significantly improved performance compared to other baseline methods. The constructed dataset and code will be released to facilitate future research.
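As a sketch of what such a binding objective might look like (the paper does not specify its loss here; the symmetric InfoNCE form and the temperature value are assumptions), a minimal PyTorch version that pulls paired video and music embeddings together in a shared space:

```python
import torch
import torch.nn.functional as F

def binding_loss(video_emb: torch.Tensor, music_emb: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE over a batch of paired (video, music) embeddings.

    video_emb, music_emb: (B, D) outputs of the two modality encoders;
    matched pairs sit on the diagonal of the similarity matrix.
    """
    v = F.normalize(video_emb, dim=-1)
    m = F.normalize(music_emb, dim=-1)
    logits = v @ m.t() / temperature                      # (B, B) cosine similarities
    targets = torch.arange(v.size(0), device=v.device)    # diagonal = positives
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```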
https://arxiv.org/abs/2405.09286
Model Predictive Control (MPC)-based trajectory planning has been widely used in robotics, and incorporating Control Barrier Function (CBF) constraints into MPC can greatly improve its obstacle avoidance efficiency. Unfortunately, traditional optimizers are resource-consuming and slow to solve such non-convex constrained optimization problems (COPs), while learning-based methods struggle to satisfy the non-convex constraints. In this paper, we propose SOMTP, a self-supervised learning-based optimizer for CBF-MPC trajectory planning. Specifically, SOMTP first employs problem transcription to satisfy most of the constraints. A differentiable SLPG correction is then proposed to move the solution closer to the safe set, and it is subsequently converted into the guide policy for the following training process. After that, inspired by the Augmented Lagrangian Method (ALM), we propose a training algorithm that integrates guide-policy constraints to enable the optimizer network to converge to a feasible solution. Finally, experiments show that the proposed algorithm has better feasibility than other learning-based methods and can provide solutions much faster than traditional optimizers with similar optimality.
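To make the ALM-inspired training concrete, here is a hedged sketch of an augmented-Lagrangian loss for inequality constraints g(x) <= 0 (e.g., a discrete CBF condition written as g = (1 - gamma) * h(x_k) - h(x_{k+1})). The simplified penalty form and all names are assumptions, not the paper's exact formulation:

```python
import torch

def augmented_lagrangian(f_val: torch.Tensor, g_val: torch.Tensor,
                         lam: torch.Tensor, rho: float) -> torch.Tensor:
    """ALM-style objective for inequality constraints g(x) <= 0.

    f_val: scalar planning cost; g_val: (m,) constraint values; lam: (m,) multipliers.
    """
    viol = torch.clamp(g_val, min=0.0)                  # only violations are penalized
    return f_val + (lam * g_val).sum() + 0.5 * rho * (viol ** 2).sum()

def dual_update(lam: torch.Tensor, g_val: torch.Tensor, rho: float) -> torch.Tensor:
    # Multiplier ascent between training rounds, projected to stay nonnegative.
    return torch.clamp(lam + rho * g_val.detach(), min=0.0)
```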
https://arxiv.org/abs/2405.09212
This paper proposes leveraging vision-language pretraining on bone X-rays paired with French reports to address downstream tasks of interest in bone radiography. A practical processing pipeline is introduced to anonymize and process French medical reports. Pretraining then consists of the self-supervised alignment of visual and textual embedding spaces derived from deep model encoders. The resulting image encoder is then used to handle various downstream tasks, including quantification of osteoarthritis, estimation of bone age on pediatric wrists, and bone fracture and anomaly detection. Our approach demonstrates competitive performance on downstream tasks compared to alternatives requiring a significantly larger amount of human expert annotations. Our work stands as the first study to integrate French reports to shape an embedding space devoted to bone X-ray representations, capitalizing on the large quantity of paired image and report data available in a hospital. By relying on generic vision-language deep models in a language-specific scenario, it contributes to the deployment of vision models for wider healthcare applications.
https://arxiv.org/abs/2405.08932
CLIP models perform remarkably well on zero-shot classification and retrieval tasks. But recent studies have shown that learnt representations in CLIP are not well suited for dense prediction tasks like object detection, semantic segmentation or depth estimation. More recently, multi-stage training methods for CLIP models were introduced to mitigate the weak performance of CLIP on downstream tasks. In this work, we find that simply improving the quality of captions in image-text datasets improves the quality of CLIP's visual representations, resulting in significant improvement on downstream dense prediction vision tasks. In fact, we find that CLIP pretraining with good-quality captions can surpass recent supervised, self-supervised and weakly supervised pretraining methods. We show that when a CLIP model with ViT-B/16 as the image encoder is trained on well-aligned image-text pairs, it obtains 12.1% higher mIoU and 11.5% lower RMSE on semantic segmentation and depth estimation tasks than recent state-of-the-art Masked Image Modeling (MIM) pretraining methods like Masked Autoencoder (MAE). We find that mobile architectures also benefit significantly from CLIP pretraining. A recent mobile vision architecture, MCi2, with CLIP pretraining obtains similar performance to Swin-L, pretrained on ImageNet-22k, for the semantic segmentation task while being 6.1$\times$ smaller. Moreover, we show that improving caption quality results in $10\times$ data efficiency when finetuning for dense prediction tasks.
https://arxiv.org/abs/2405.08911
Current video summarization methods primarily depend on supervised computer vision techniques, which demand time-consuming manual annotations. Further, the annotations are always subjective, which makes this task more challenging. To address these issues, we analyze the feasibility of transforming video summarization into a text summarization task and leverage Large Language Models (LLMs) to boost video summarization. This paper proposes a novel self-supervised framework for video summarization guided by LLMs. Our method begins by generating captions for video frames, which are then synthesized into a text summary by an LLM. Subsequently, we measure the semantic distance between the frame captions and the text summary. Notably, we propose a novel loss function that optimizes our model according to the diversity of the video. Finally, the summarized video is generated by selecting the frames whose captions are most similar to the text summary. Our model achieves competitive results against other state-of-the-art methods and paves a novel pathway in video summarization.
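The final selection step has a direct implementation: embed the per-frame captions and the LLM summary with any sentence encoder, then keep the frames closest to the summary. A minimal sketch (the top-k rule and function names are assumptions):

```python
import numpy as np

def select_keyframes(frame_caption_embs: np.ndarray, summary_emb: np.ndarray, k: int = 10):
    """Pick the frames whose caption embeddings are closest to the LLM text summary.

    frame_caption_embs: (N, D) sentence embeddings of per-frame captions;
    summary_emb: (D,) embedding of the LLM-generated summary.
    """
    f = frame_caption_embs / np.linalg.norm(frame_caption_embs, axis=1, keepdims=True)
    s = summary_emb / np.linalg.norm(summary_emb)
    sims = f @ s                       # cosine similarity of each caption to the summary
    return np.argsort(-sims)[:k]       # indices of the top-k frames
```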
https://arxiv.org/abs/2405.08890
In the realm of autonomous driving, robust perception under out-of-distribution conditions is paramount for the safe deployment of vehicles. Challenges such as adverse weather, sensor malfunctions, and environmental unpredictability can severely impact the performance of autonomous systems. The 2024 RoboDrive Challenge was crafted to propel the development of driving perception technologies that can withstand and adapt to these real-world variabilities. Focusing on four pivotal tasks -- BEV detection, map segmentation, semantic occupancy prediction, and multi-view depth estimation -- the competition threw down the gauntlet to innovate and enhance system resilience against typical and atypical disturbances. This year's challenge consisted of five distinct tracks and attracted 140 registered teams from 93 institutes across 11 countries, resulting in nearly one thousand submissions evaluated through our servers. The competition culminated in 15 top-performing solutions, which introduced a range of innovative approaches including advanced data augmentation, multi-sensor fusion, self-supervised learning for error correction, and new algorithmic strategies to enhance sensor robustness. These contributions significantly advanced the state of the art, particularly in handling sensor inconsistencies and environmental variability. Participants, through collaborative efforts, pushed the boundaries of current technologies, showcasing their potential in real-world scenarios. Extensive evaluations and analyses provided insights into the effectiveness of these solutions, highlighting key trends and successful strategies for improving the resilience of driving perception systems. This challenge has set a new benchmark in the field, providing a rich repository of techniques expected to guide future research.
https://arxiv.org/abs/2405.08816
The superior performance of modern visual backbones usually comes with a costly training procedure. We address this issue by generalizing the idea of curriculum learning beyond its original formulation, i.e., training models on easier-to-harder data. Specifically, we reformulate the training curriculum as a soft-selection function, which uncovers progressively more difficult patterns within each example during training, instead of performing easier-to-harder sample selection. Our work is inspired by an intriguing observation on the learning dynamics of visual backbones: during the earlier stages of training, the model predominantly learns to recognize some 'easier-to-learn' discriminative patterns in the data. Observed through the frequency and spatial domains, these patterns comprise lower-frequency components and the natural image contents without distortion or data augmentation. Motivated by these findings, we propose a curriculum where the model always leverages all the training data at every learning stage, yet exposure to the 'easier-to-learn' patterns of each example comes first, with harder patterns gradually introduced as training progresses. To implement this idea in a computationally efficient way, we introduce a cropping operation in the Fourier spectrum of the inputs, enabling the model to learn from only the lower-frequency components. Then we show that exposing the contents of natural images can be readily achieved by modulating the intensity of data augmentation. Finally, we integrate these aspects and design curriculum schedules with tailored search algorithms. The resulting method, EfficientTrain++, is simple, general, yet surprisingly effective. It reduces the training time of a wide variety of popular models by 1.5-3.0x on ImageNet-1K/22K without sacrificing accuracy. It also demonstrates efficacy in self-supervised learning (e.g., MAE).
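The Fourier cropping step is easy to reproduce: keep only the central (low-frequency) block of the shifted 2D spectrum and invert it on the smaller grid, which both low-pass filters and downsamples the image. A sketch under the assumption of square crops:

```python
import torch

def low_freq_crop(x: torch.Tensor, out_size: int) -> torch.Tensor:
    """Crop the centered 2D spectrum of x to keep only low frequencies.

    x: (B, C, H, W) image batch; returns a (B, C, out_size, out_size) image
    containing only the lower-frequency content of the original.
    """
    X = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    h, w = x.shape[-2:]
    top, left = (h - out_size) // 2, (w - out_size) // 2
    X_crop = X[..., top:top + out_size, left:left + out_size]
    x_low = torch.fft.ifft2(torch.fft.ifftshift(X_crop, dim=(-2, -1)))
    return x_low.real * (out_size ** 2) / (h * w)   # rescale amplitudes for the smaller grid
```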
https://arxiv.org/abs/2405.08768
This paper addresses the problem of self-supervised general-purpose audio representation learning. We explore the use of Joint-Embedding Predictive Architectures (JEPA) for this task, which consists of splitting an input mel-spectrogram into two parts (context and target), computing neural representations for each, and training the neural network to predict the target representations from the context representations. We investigate several design choices within this framework and study their influence through extensive experiments by evaluating our models on various audio classification benchmarks, including environmental sounds, speech and music downstream tasks. We focus notably on which part of the input data is used as context or target and show experimentally that it significantly impacts the model's quality. In particular, we notice that some effective design choices in the image domain lead to poor performance on audio, thus highlighting major differences between these two modalities.
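A minimal sketch of the context/target split on a mel-spectrogram, assuming the target is a contiguous time block and the context is the input with that block masked (zeroing is a simplification; JEPA-style models typically drop the target tokens from the context encoder entirely):

```python
import torch

def split_context_target(mel: torch.Tensor, t_start: int, t_len: int):
    """Split a batch of mel-spectrograms into context and target along time.

    mel: (B, n_mels, T). Training would push predictor(encoder(context)) toward
    a stop-gradient/EMA encoding of the target block.
    """
    target = mel[..., t_start:t_start + t_len].clone()
    context = mel.clone()
    context[..., t_start:t_start + t_len] = 0.0   # hide the target from the context
    return context, target
```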
https://arxiv.org/abs/2405.08679
Depth estimation plays a crucial role in various tasks within endoscopic surgery, including navigation, surface reconstruction, and augmented reality visualization. Despite the significant achievements of foundation models in vision tasks, including depth estimation, their direct application to the medical domain often results in suboptimal performance. This highlights the need for efficient methods to adapt these models to endoscopic depth estimation. We propose Endoscopic Depth Any Camera (EndoDAC), an efficient self-supervised depth estimation framework that adapts foundation models to endoscopic scenes. Specifically, we develop Dynamic Vector-Based Low-Rank Adaptation (DV-LoRA) and employ Convolutional Neck blocks to tailor the foundation model to the surgical domain, using remarkably few trainable parameters. Given that camera information is not always accessible, we also introduce a self-supervised adaptation strategy that estimates camera intrinsics using the pose encoder. Our framework can be trained solely on monocular surgical videos from any camera, ensuring minimal training costs. Experiments demonstrate that our approach obtains superior performance even with fewer training epochs and without access to the ground-truth camera intrinsics. Code is available at this https URL.
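For orientation, a plain LoRA adapter around a frozen linear layer is sketched below; it is a stand-in for the paper's DV-LoRA, whose dynamic-vector scaling is not reproduced here, and the rank and alpha values are placeholders:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: y = Wx + (alpha/r) * B(A x)."""

    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False           # foundation-model weights stay frozen
        self.A = nn.Linear(base.in_features, r, bias=False)
        self.B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.B.weight)         # adapter starts as a zero update
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.B(self.A(x))
```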
https://arxiv.org/abs/2405.08672
Self-supervised learning (SSL) is an approach to extract useful feature representations from unlabeled data and enable fine-tuning on downstream tasks with limited labeled examples. Self-pretraining is an SSL approach that uses the curated task dataset both for pretraining the networks and for fine-tuning them. The availability of large, diverse, and uncurated public medical image sets provides the opportunity to apply SSL in the "wild" and potentially extract features robust to imaging variations. However, the benefit of wild- versus self-pretraining has not been studied for medical image analysis. In this paper, we compare the robustness of wild- versus self-pretrained transformer (vision transformer [ViT] and hierarchical shifted-window [Swin]) models to computed tomography (CT) imaging differences for non-small cell lung cancer (NSCLC) segmentation. Wild-pretrained Swin models outperformed self-pretrained Swin for the various imaging acquisitions, while ViT resulted in similar accuracy for both wild- and self-pretrained models. The masked image prediction pretext task, which forces networks to learn local structure, resulted in higher accuracy than the contrastive task that models global image information. Wild-pretrained models showed higher feature reuse at the lower-level layers and greater feature differentiation close to the output layer after fine-tuning. Hence, we conclude that wild-pretrained networks were more robust to the analyzed CT imaging differences for lung tumor segmentation than self-pretrained methods, and that the Swin architecture benefited from such pretraining more than ViT.
https://arxiv.org/abs/2405.08657
Self-supervised learning has shown great success in Speech Recognition. However, it has been observed that finetuning all layers of the learned model leads to lower performance compared to resetting the top layers. This phenomenon is attributed to the "autoencoder" behavior: top layers contain information closer to the input and are less suitable for tasks that require linguistic information, such as Speech Recognition. To better our understanding of this behavior, we propose to study the evolution of high-level information within the model during pretraining. We focus on the HuBERT model, which exhibits a less pronounced "autoencoder" behavior. By experimentally exploring various factors that may have an impact, we aim to improve the training procedure and enhance the top layers of HuBERT for high-level tasks. Furthermore, our experiments demonstrate that these improvements in the training procedure result in faster convergence and competitive performance on downstream tasks.
https://arxiv.org/abs/2405.08402
From a feature matching perspective, optical flow estimation for event cameras involves identifying event correspondences by comparing feature similarity across accompanying event frames. In this work, we introduce an effective and robust high-dimensional (HD) feature descriptor for event frames, built on Vector Symbolic Architectures (VSA). The topological similarity among neighboring variables within VSA enhances the representation similarity of feature descriptors for flow-matching points, while its structured symbolic representation capacity facilitates feature fusion from both event polarities and multiple spatial scales. Based on this HD feature descriptor, we propose a novel feature matching framework for event-based optical flow, encompassing both model-based (VSA-Flow) and self-supervised learning (VSA-SM) methods. In VSA-Flow, accurate optical flow estimation validates the effectiveness of the HD feature descriptor. In VSA-SM, a novel similarity maximization method based on the HD feature descriptor is proposed to learn optical flow in a self-supervised way from events alone, eliminating the need for auxiliary grayscale images. Evaluation results demonstrate that our VSA-based method achieves superior accuracy compared to both model-based and self-supervised learning methods on the DSEC benchmark, while remaining competitive with both on the MVSEC benchmark. This contribution marks a significant advancement in event-based optical flow within the feature matching methodology.
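The core VSA operations are compact enough to show directly. Below is a sketch with random bipolar hypervectors, where binding (elementwise product) composes channels and bundling (majority vote) fuses them, as might be used to combine event polarities and spatial scales; the dimensionality and the usage example are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8192                                        # hypervector dimensionality

def random_hv() -> np.ndarray:
    return rng.choice([-1.0, 1.0], size=D)      # random bipolar hypervector

def bind(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    return a * b                                # result is dissimilar to both inputs

def bundle(*hvs: np.ndarray) -> np.ndarray:
    return np.sign(np.sum(hvs, axis=0))         # majority vote: similar to every input

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# e.g., fuse two polarity channels tagged with a shared scale code:
pos, neg, scale0 = random_hv(), random_hv(), random_hv()
descriptor = bundle(bind(pos, scale0), bind(neg, scale0))
```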
https://arxiv.org/abs/2405.08300
The ability of deep networks to learn superior representations hinges on leveraging the proper inductive biases, given the inherent properties of datasets. In tabular domains, it is critical to handle heterogeneous features (both categorical and numerical) in a unified manner and to grasp irregular functions such as piecewise-constant functions. To address these challenges in the self-supervised learning framework, we propose a novel pretext task based on the classical binning method. The idea is straightforward: reconstruct the bin indices (either orders or classes) rather than the original values. This pretext task gives the encoder an inductive bias to capture the irregular dependencies, mapping from continuous inputs to discretized bins, and mitigates feature heterogeneity by setting all features to have category-type targets. Our empirical investigations confirm several advantages of binning: capturing the irregular function, compatibility with encoder architectures and additional modifications, standardizing all features into equal sets, grouping similar values within a feature, and providing ordering information. Comprehensive evaluations across diverse tabular datasets corroborate that our method consistently improves tabular representation learning performance for a wide range of downstream tasks. The code is available at this https URL.
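The pretext targets are cheap to construct; here is a sketch of quantile binning that turns every numerical column into class-type targets (the function name and bin count are assumptions):

```python
import numpy as np

def quantile_bin_indices(X: np.ndarray, n_bins: int = 10) -> np.ndarray:
    """Map each numerical feature to its quantile-bin index in {0, ..., n_bins-1}.

    X: (n_samples, n_features). The pretext task then reconstructs these
    discrete indices (e.g., with cross-entropy) from a corrupted view of X,
    instead of regressing the raw continuous values.
    """
    idx = np.empty(X.shape, dtype=np.int64)
    for j in range(X.shape[1]):
        edges = np.quantile(X[:, j], np.linspace(0, 1, n_bins + 1)[1:-1])
        idx[:, j] = np.searchsorted(edges, X[:, j])   # equal-mass bins per feature
    return idx
```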
https://arxiv.org/abs/2405.07414
Machine unlearning is a complex process that requires the model to diminish the influence of the training data while keeping the loss of accuracy to a minimum. Despite the numerous studies on machine unlearning in recent years, the majority have focused primarily on supervised learning models, leaving research on contrastive learning models relatively underexplored. With the conviction that self-supervised learning harbors promising potential, surpassing or rivaling that of supervised learning, we set out to investigate machine unlearning methods centered on contrastive learning models. In this study, we introduce a novel gradient constraint-based approach for training the model to effectively achieve machine unlearning. Our method requires only a minimal number of training epochs and the identification of the data slated for unlearning. Remarkably, our approach performs well not only on contrastive learning models but also on supervised learning models, showcasing its versatility and adaptability across learning paradigms.
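The abstract does not detail the gradient constraint, so the following is only one plausible reading: ascend the loss on the forget set while projecting out any component of the update that conflicts with the retain-set gradient (a PCGrad-style projection). All names and the projection rule are assumptions:

```python
import torch

def constrained_unlearning_step(params, g_forget, g_retain, lr: float = 1e-3):
    """One hypothetical gradient-constrained unlearning update.

    params, g_forget, g_retain: matching lists of parameter tensors and their
    gradients on the forget set and on the retained data.
    """
    for p, gf, gr in zip(params, g_forget, g_retain):
        d = gf                                         # ascent direction on the forget loss
        dot = (d * gr).sum()
        if dot > 0:                                    # step would also raise the retain loss
            d = d - dot / (gr.norm() ** 2 + 1e-12) * gr
        p.data.add_(lr * d)                            # p <- p + lr * d (gradient ascent)
```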
https://arxiv.org/abs/2405.07317
This study introduces a novel Supervised Info-enhanced Contrastive Learning framework for EEG-based Emotion Recognition (SI-CLEER). SI-CLEER employs multi-granularity contrastive learning to create robust EEG contextual representations, potentially improving emotion recognition effectiveness. Unlike existing methods guided solely by classification loss, we propose a joint learning model that combines a self-supervised contrastive learning loss with a supervised classification loss. This model optimizes both loss functions, capturing subtle EEG signal differences specific to emotion detection. Extensive experiments demonstrate SI-CLEER's robustness and superior accuracy on the SEED dataset compared to state-of-the-art methods. Furthermore, we analyze electrode performance, highlighting the significance of central frontal and temporal brain region EEGs in emotion detection. This study offers a universally applicable approach with potential benefits for diverse EEG classification tasks.
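A minimal sketch of such a joint objective, assuming an InfoNCE-style contrastive term over two augmented views and a simple weighting (the exact losses, granularities, and weights in SI-CLEER may differ):

```python
import torch
import torch.nn.functional as F

def joint_loss(logits, z1, z2, labels, alpha: float = 0.5, tau: float = 0.1):
    """Weighted sum of supervised cross-entropy and a contrastive term.

    logits: (B, n_classes) classifier outputs; z1, z2: (B, D) embeddings of two
    augmented views of the same EEG segments; labels: (B,) emotion labels.
    """
    ce = F.cross_entropy(logits, labels)
    p1, p2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    sim = p1 @ p2.t() / tau
    targets = torch.arange(z1.size(0), device=z1.device)   # positives on the diagonal
    nce = F.cross_entropy(sim, targets)
    return alpha * ce + (1 - alpha) * nce
```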
https://arxiv.org/abs/2405.07260
Video-language pre-training is a typical and challenging problem that aims at learning visual and textual representations from large-scale data in a self-supervised way. Existing pre-training approaches either capture the correspondence of image-text pairs or utilize the temporal ordering of frames. However, they do not explicitly explore the natural synchronization between audio and the other two modalities. In this work, we propose an enhanced framework for Video-Language pre-training with Synchronized Audio, termed VLSA, that can learn tri-modal representations in a unified self-supervised transformer. Specifically, our VLSA jointly aggregates embeddings of local patches and global tokens for video, text, and audio. Furthermore, we utilize local-patch masked modeling to learn modality-aware features, and leverage global audio matching to capture audio-guided features for video and text. We conduct extensive experiments on retrieval across text, video, and audio. Our simple model, pre-trained on only 0.9M data, achieves improved results over state-of-the-art baselines. In addition, qualitative visualizations vividly showcase the superiority of our VLSA in learning discriminative visual-textual representations.
https://arxiv.org/abs/2405.07202
Precision agriculture involves the application of advanced technologies to improve agricultural productivity, efficiency, and profitability while minimizing waste and environmental impact. Deep learning approaches enable automated decision-making for many visual tasks. However, in the agricultural domain, variability in growth stages and environmental conditions, such as weather and lighting, presents significant challenges to developing deep learning-based techniques that generalize across different conditions. The resource-intensive nature of creating extensive annotated datasets that capture these variabilities further hinders the widespread adoption of these approaches. To tackle these issues, we introduce a semi-self-supervised domain adaptation technique based on deep convolutional neural networks with a probabilistic diffusion process, requiring minimal manual data annotation. Using only three manually annotated images and a selection of video clips from wheat fields, we generated a large-scale computationally annotated dataset of image-mask pairs and a large dataset of unannotated images extracted from video frames. We developed a two-branch convolutional encoder-decoder model architecture that uses both synthesized image-mask pairs and unannotated images, enabling effective adaptation to real images. The proposed model achieved a Dice score of 80.7\% on an internal test dataset and a Dice score of 64.8\% on an external test set, composed of images from five countries and spanning 18 domains, indicating its potential to develop generalizable solutions that could encourage the wider adoption of advanced technologies in agriculture.
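For reference, the reported Dice scores follow the standard overlap metric Dice = 2|P ∩ G| / (|P| + |G|); a small sketch for binary masks:

```python
import numpy as np

def dice_score(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7) -> float:
    """Dice coefficient between a predicted and a ground-truth binary mask."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return float((2.0 * inter + eps) / (pred.sum() + gt.sum() + eps))
```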
https://arxiv.org/abs/2405.07157
This research develops advanced methodologies for Large Language Models (LLMs) to better manage linguistic behaviors related to emotions and ethics. We introduce DIKE, an adversarial framework that enhances the LLMs' ability to internalize and reflect global human values, adapting to varied cultural contexts to promote transparency and trust among users. The methodology involves detailed modeling of emotions, classification of linguistic behaviors, and implementation of ethical guardrails. Our innovative approaches include mapping emotions and behaviors using self-supervised learning techniques, refining these guardrails through adversarial reviews, and systematically adjusting outputs to ensure ethical alignment. This framework establishes a robust foundation for AI systems to operate with ethical integrity and cultural sensitivity, paving the way for more responsible and context-aware AI interactions.
https://arxiv.org/abs/2405.07076
Garment manipulation (e.g., unfolding, folding, and hanging clothes) is essential for future robots to accomplish home-assistant tasks, yet it is highly challenging due to the diversity of garment configurations, geometries, and deformations. Although able to manipulate similarly shaped garments within a certain task, previous works mostly have to design different policies for different tasks, cannot generalize to garments with diverse geometries, and often rely heavily on human-annotated data. In this paper, we leverage the property that garments in a certain category have similar structures, and learn topological dense (point-level) visual correspondences among garments at the category level, under different deformations, in a self-supervised manner. The topological correspondence can be easily adapted into a functional correspondence that guides manipulation policies for various downstream tasks, with only one-shot or few-shot demonstrations. Experiments on garments in 3 different categories and 3 representative tasks in diverse scenarios, using one or two arms, taking one or more steps, and inputting flat or messy garments, demonstrate the effectiveness of our proposed method. Project page: this https URL.
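Once per-point descriptors are learned, reading off a dense correspondence reduces to nearest-neighbor matching in feature space; a hedged sketch (the cosine metric and argmax rule are assumptions about how the learned features would be used):

```python
import torch
import torch.nn.functional as F

def dense_correspondence(feat_src: torch.Tensor, feat_tgt: torch.Tensor) -> torch.Tensor:
    """Match each source point to its nearest target point in feature space.

    feat_src: (N, D) and feat_tgt: (M, D) per-point descriptors of two garments
    from the same category; returns, for each source point, the index of its
    corresponding target point.
    """
    a = F.normalize(feat_src, dim=-1)
    b = F.normalize(feat_tgt, dim=-1)
    sim = a @ b.t()                   # (N, M) cosine similarities
    return sim.argmax(dim=-1)         # correspondence via nearest neighbor
```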
https://arxiv.org/abs/2405.06903