Recent advances in aerial robotics have enabled the use of multirotor vehicles for autonomous payload transportation. Resorting only to classical methods to reliably model a quadrotor carrying a cable-slung load poses significant challenges. On the other hand, purely data-driven learning methods do not comply by design with the problem's physical constraints, especially in states that are not densely represented in the training data. In this work, we explore the use of physics-informed neural networks to learn an end-to-end model of the multirotor-slung-load system and, at a given time, estimate a sequence of future system states. An LSTM encoder-decoder with an attention mechanism is used to capture the dynamics of the system. To guarantee cohesiveness between the multiple predicted states of the system, we propose the use of a physics-based term in the loss function, which includes a discretized physical model derived from first principles together with slack variables that allow for a small mismatch between expected and predicted values. To train the model, a dataset using a real-world quadrotor carrying a slung load was curated and is made available. Prediction results are presented and corroborate the feasibility of the approach. The proposed method outperforms both the first-principles physical model and a comparable neural network model trained without the proposed physics regularization.
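The physics-based loss term with slack variables can be illustrated with a minimal sketch. The function and variable names below are illustrative, and a simple forward-Euler discretization stands in for the paper's discretized first-principles model:

```python
import numpy as np

def physics_residual_loss(states, dynamics, dt, slack=0.05):
    """Penalize mismatch between consecutive predicted states and a
    discretized physical model, allowing a small slack margin.
    `states`: (T, D) array of predicted states; `dynamics`: callable
    returning the state derivative. Names and the slack formulation
    are illustrative, not the paper's exact implementation."""
    loss = 0.0
    for t in range(len(states) - 1):
        # Forward-Euler discretization of the physical model.
        expected = states[t] + dt * dynamics(states[t])
        residual = np.abs(states[t + 1] - expected)
        # Only violations beyond the slack margin are penalized.
        loss += np.sum(np.maximum(residual - slack, 0.0) ** 2)
    return loss / (len(states) - 1)
```

For a pendulum-like toy system, a trajectory rolled out exactly under the discretized model incurs zero loss, while any deviation beyond the slack margin is penalized quadratically.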
https://arxiv.org/abs/2405.09428
Contrastive pretraining provides robust representations by ensuring their invariance to different image transformations while simultaneously preventing representational collapse. Equivariant contrastive learning, on the other hand, provides representations sensitive to specific image transformations while remaining invariant to others. By introducing equivariance to time-induced transformations, such as disease-related anatomical changes in longitudinal imaging, the model can effectively capture such changes in the representation space. In this work, we propose a Time-equivariant Contrastive Learning (TC) method. First, an encoder embeds two unlabeled scans from different time points of the same patient into the representation space. Next, a temporal equivariance module is trained to predict the representation of a later visit based on the representation from one of the previous visits and the corresponding time interval with a novel regularization loss term while preserving the invariance property to irrelevant image transformations. On a large longitudinal dataset, our model clearly outperforms existing equivariant contrastive methods in predicting progression from intermediate age-related macular degeneration (AMD) to advanced wet-AMD within a specified time-window.
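As a rough illustration of the temporal equivariance module, the sketch below uses a single linear layer to predict the later visit's representation from an earlier one and the elapsed time interval; the paper's module and regularization loss are richer than this, and all names here are assumptions:

```python
import numpy as np

def equivariance_loss(z_early, z_late, dt, W, b):
    """Hedged sketch of a temporal-equivariance objective: a linear
    module predicts the later visit's representation from the earlier
    one and the time interval `dt`. W (D, D) and b (D,) stand in for
    learnable parameters."""
    pred = z_early + dt * (z_early @ W + b)   # displacement scaled by elapsed time
    return float(np.mean((pred - z_late) ** 2))
```

With zero weights the module predicts no change, so the loss vanishes exactly when the two visits' representations coincide.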
https://arxiv.org/abs/2405.09404
This paper explores a novel multi-modal alternating learning paradigm pursuing a reconciliation between the exploitation of uni-modal features and the exploration of cross-modal interactions. This is motivated by the fact that current paradigms of multi-modal learning tend to explore multi-modal features simultaneously. The resulting gradient prohibits further exploitation of the features in the weak modality, leading to modality competition, where the dominant modality overpowers the learning process. To address this issue, we study the modality-alternating learning paradigm to achieve reconcilement. Specifically, we propose a new method called ReconBoost to update a fixed modality each time. Herein, the learning objective is dynamically adjusted with a reconcilement regularization against competition with the historical models. By choosing a KL-based reconcilement, we show that the proposed method resembles Friedman's Gradient-Boosting (GB) algorithm, where the updated learner can correct errors made by others and help enhance the overall performance. The major difference with the classic GB is that we only preserve the newest model for each modality to avoid overfitting caused by ensembling strong learners. Furthermore, we propose a memory consolidation scheme and a global rectification scheme to make this strategy more effective. Experiments over six multi-modal benchmarks speak to the efficacy of the method. We release the code at this https URL.
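The modality-alternating, boosting-flavored update can be caricatured as follows: only one modality's logits move per round, against the frozen sum of the others, so each learner corrects the residual errors of the ensemble. This toy omits the KL-based reconcilement regularizer and uses plain cross-entropy gradients; everything here is illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def alternating_update(logits_a, logits_b, labels, lr=0.5, steps=20):
    """Toy sketch of modality-alternating learning: one modality's
    logits are updated per round while the other is held fixed
    (gradient-boosting flavor). Uses the cross-entropy gradient on the
    fused logits; arrays are modified in place and also returned."""
    onehot = np.eye(logits_a.shape[1])[labels]
    for step in range(steps):
        fused = logits_a + logits_b
        grad = softmax(fused) - onehot          # d(cross-entropy)/d(logits)
        if step % 2 == 0:
            logits_a -= lr * grad               # update modality A only
        else:
            logits_b -= lr * grad               # update modality B only
    return logits_a, logits_b
```

Each round moves the fused prediction toward the labels without one modality's gradient overpowering the other's update step.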
https://arxiv.org/abs/2405.09321
Recent strides in model predictive control (MPC) underscore a dependence on numerical advancements to efficiently and accurately solve large-scale problems. Given the substantial number of variables characterizing typical whole-body optimal control (OC) problems, often numbering in the thousands, exploiting the sparse structure of the numerical problem becomes crucial to meet computational demands, typically in the range of a few milliseconds. A fundamental building block for computing Newton or Sequential Quadratic Programming (SQP) steps in direct optimal control methods involves addressing the linear quadratic regulator (LQR) problem. This paper concentrates on equality-constrained problems featuring implicit system dynamics and dual regularization, a characteristic found in advanced interior-point or augmented Lagrangian solvers. Here, we introduce a parallel algorithm designed for solving an LQR problem with dual regularization. Leveraging a rewriting of the LQR recursion through block elimination, we first enhance the efficiency of the serial algorithm, then generalize it to handle parametric problems. This extension enables us to split decision variables and solve multiple subproblems concurrently. Our algorithm is implemented in our nonlinear numerical optimal control library ALIGATOR. It showcases improved performance over previous serial formulations, and we validate its efficacy by deploying it in the model predictive control of a real quadruped robot. This paper follows up on our prior work on augmented Lagrangian methods for numerical optimal control with implicit dynamics and constraints.
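For context, the serial building block being parallelized is the classic backward Riccati recursion for the LQR problem. A textbook sketch, not the ALIGATOR implementation, and without the dual regularization or implicit dynamics of the paper:

```python
import numpy as np

def lqr_backward_pass(A, B, Q, R, N):
    """Serial Riccati recursion for the finite-horizon LQR problem.
    Returns per-stage feedback gains K_t, ordered from t = 0 to N-1.
    A (n, n), B (n, m), Q (n, n), R (m, m) are the usual LQR matrices."""
    P = Q.copy()
    gains = []
    for _ in range(N):
        # K = (R + B'PB)^{-1} B'PA ;  P <- Q + A'P(A - BK)
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]
```

For a discrete-time double integrator, the gain at the first stage stabilizes the closed loop (spectral radius of A - BK below one), which is the standard sanity check for this recursion.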
https://arxiv.org/abs/2405.09197
Point cloud filtering is a fundamental 3D vision task, which aims to remove noise while recovering the underlying clean surfaces. State-of-the-art methods remove noise by moving noisy points along stochastic trajectories to the clean surfaces. These methods often require regularization within the training objective and/or during post-processing, to ensure fidelity. In this paper, we introduce StraightPCF, a new deep learning based method for point cloud filtering. It works by moving noisy points along straight paths, thus reducing discretization errors while ensuring faster convergence to the clean surfaces. We model noisy patches as intermediate states between high noise patch variants and their clean counterparts, and design the VelocityModule to infer a constant flow velocity from the former to the latter. This constant flow leads to straight filtering trajectories. In addition, we introduce a DistanceModule that scales the straight trajectory using an estimated distance scalar to attain convergence near the clean surface. Our network is lightweight with only $\sim530K$ parameters, 17% the size of IterativePFN (a recent state-of-the-art point cloud filtering network). Extensive experiments on both synthetic and real-world data show our method achieves state-of-the-art results. Our method also demonstrates nice distributions of filtered points without the need for regularization. The implementation code can be found at: this https URL.
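The straight-trajectory filtering step can be sketched as a single displacement per point: a unit direction from a VelocityModule-like callable, scaled by a DistanceModule-like scalar. Both callables below are stand-ins for the learned networks:

```python
import numpy as np

def straight_filter(noisy, velocity_fn, distance_fn):
    """Sketch of straight-trajectory filtering: each noisy point moves
    along one constant direction (VelocityModule stand-in) scaled by a
    predicted distance (DistanceModule stand-in). `noisy` is (N, 3)."""
    v = velocity_fn(noisy)
    v = v / np.linalg.norm(v, axis=-1, keepdims=True)   # unit flow direction
    d = distance_fn(noisy)[..., None]                   # scalar distance per point
    return noisy + d * v                                # one straight step
```

With an oracle velocity pointing at the plane z = 0 and the true distance, every point lands exactly on the surface in a single straight step, which is the behavior the constant-flow model aims for.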
https://arxiv.org/abs/2405.08322
Tensors serve as a crucial tool in the representation and analysis of complex, multi-dimensional data. As data volumes continue to expand, there is an increasing demand for developing optimization algorithms that can directly operate on tensors to deliver fast and effective computations. Many problems in real-world applications can be formulated as the task of recovering high-order tensors characterized by sparse and/or low-rank structures. In this work, we propose novel Kaczmarz algorithms with a power of the $\ell_1$-norm regularization for reconstructing high-order tensors by exploiting sparsity and/or low-rankness of tensor data. In addition, we develop both a block and an accelerated variant, along with a thorough convergence analysis of these algorithms. A variety of numerical experiments on both synthetic and real-world datasets demonstrate the effectiveness and significant potential of the proposed methods in image and video processing tasks, such as image sequence destriping and video deconvolution.
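As a hedged illustration of the idea in the matrix (vector-solution) case, a sparse Kaczmarz iteration interleaves a standard Kaczmarz projection with soft-thresholding, the proximal map of the $\ell_1$ term. The paper's algorithms operate on high-order tensors and include block and accelerated variants not shown here:

```python
import numpy as np

def sparse_kaczmarz(A, b, lam=0.05, sweeps=500):
    """Cyclic sparse Kaczmarz sketch for a consistent system Ax = b:
    a Kaczmarz step on the auxiliary variable v, then soft-thresholding
    (the prox of lam * ||.||_1) to obtain a sparse iterate x."""
    m, n = A.shape
    v = np.zeros(n)
    x = np.zeros(n)
    for _ in range(sweeps):
        for i in range(m):
            a = A[i]
            v = v - (a @ x - b[i]) / (a @ a) * a        # row projection step
            x = np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)  # soft-threshold
    return x
```

On a small consistent system with a sparse solution, the iterate drives the residual to (near) zero while the thresholding keeps small coefficients exactly at zero.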
https://arxiv.org/abs/2405.08275
This paper explores test-agnostic long-tail recognition, a challenging long-tail task where the test label distributions are unknown and arbitrarily imbalanced. We argue that the variation in these distributions can be broken down hierarchically into global and local levels. The global ones reflect a broad range of diversity, while the local ones typically arise from milder changes, often focused on a particular neighbor. Traditional methods predominantly use a Mixture-of-Expert (MoE) approach, targeting a few fixed test label distributions that exhibit substantial global variations. However, the local variations are left unconsidered. To address this issue, we propose a new MoE strategy, $\mathsf{DirMixE}$, which assigns experts to different Dirichlet meta-distributions of the label distribution, each targeting a specific aspect of local variations. Additionally, the diversity among these Dirichlet meta-distributions inherently captures global variations. This dual-level approach also leads to a more stable objective function, allowing us to sample different test distributions better to quantify the mean and variance of performance outcomes. Theoretically, we show that our proposed objective benefits from enhanced generalization by virtue of the variance-based regularization. Comprehensive experiments across multiple benchmarks confirm the effectiveness of $\mathsf{DirMixE}$. The code is available at \url{this https URL}.
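A toy sketch of the Dirichlet mixture-of-experts idea: each expert is paired with a Dirichlet meta-distribution over test label priors, a sampled prior reweights that expert's output via logit adjustment, and the experts are averaged. This is a loose illustration, not the paper's training objective:

```python
import numpy as np

def dirmixe_predict(expert_logits, alphas, rng):
    """Illustrative Dirichlet mixture-of-experts inference: for each
    expert, sample a label prior from its Dirichlet meta-distribution
    (parameter `alpha`), apply a logit adjustment, and average the
    resulting probabilities across experts."""
    probs = np.zeros_like(expert_logits[0])
    for logits, alpha in zip(expert_logits, alphas):
        prior = rng.dirichlet(alpha)                  # sampled test label prior
        adj = logits + np.log(prior)                  # logit adjustment
        e = np.exp(adj - adj.max(axis=-1, keepdims=True))
        probs += e / e.sum(axis=-1, keepdims=True)
    return probs / len(expert_logits)
```

Sampling different priors per call is what lets one quantify the mean and variance of performance over simulated test distributions.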
https://arxiv.org/abs/2405.07780
Transportation of samples across different domains is a central task in several machine learning problems. A sensible requirement for domain transfer tasks in computer vision and language domains is the sparsity of the transportation map, i.e., the transfer algorithm aims to modify the least number of input features while transporting samples across the source and target domains. In this work, we propose Elastic Net Optimal Transport (ENOT) to address the sparse distribution transfer problem. The ENOT framework utilizes the $L_1$-norm and $L_2$-norm regularization mechanisms to find a sparse and stable transportation map between the source and target domains. To compute the ENOT transport map, we consider the dual formulation of the ENOT optimization task and prove that the sparsified gradient of the optimal potential function in the ENOT's dual representation provides the ENOT transport map. Furthermore, we demonstrate the application of the ENOT framework to perform feature selection for sparse domain transfer. We present the numerical results of applying ENOT to several domain transfer problems for synthetic Gaussian mixtures and real image and text data. Our empirical results indicate the success of the ENOT framework in identifying a sparse domain transport map.
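The sparsifying effect of the elastic-net penalty is visible in its proximal map, which soft-thresholds (the $L_1$ part) and then shrinks (the $L_2$ part). In ENOT this sparsification acts on the gradient of the optimal dual potential; the sketch below just applies the standard prox to a displacement vector:

```python
import numpy as np

def elastic_net_prox(g, l1, l2):
    """Proximal map of l1*||.||_1 + (l2/2)*||.||^2 applied entrywise:
    soft-threshold by l1, then shrink by 1/(1 + l2). Entries smaller
    than l1 in magnitude become exactly zero, giving a sparse result."""
    return np.sign(g) * np.maximum(np.abs(g) - l1, 0.0) / (1.0 + l2)
```

For example, with l1 = 0.1 and l2 = 1.0, the input (0.05, -0.5, 2.0) maps to (0, -0.2, 0.95): the small entry is zeroed while the others are shrunk, which is exactly the sparse-and-stable behavior the abstract describes.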
https://arxiv.org/abs/2405.07489
This study introduces a novel data augmentation technique, ADLDA, aimed at mitigating the negative impact of data distribution shifts caused by the data augmentation process in computer vision tasks. ADLDA partitions augmented data into distinct subdomains and incorporates domain labels, combined with domain adaptation techniques, to optimize data representation in the model's feature space. Experimental results demonstrate that ADLDA significantly enhances model performance across multiple datasets, particularly in neural network architectures with complex feature extraction layers. Furthermore, ADLDA improves the model's ability to locate and recognize key features, showcasing potential in object recognition and image segmentation tasks. This paper's contribution provides an effective data augmentation regularization method for the field of computer vision, aiding the enhancement of robustness and accuracy in deep learning models.
https://arxiv.org/abs/2405.06893
We introduce an Invertible Symbolic Regression (ISR) method. It is a machine learning technique that generates analytical relationships between inputs and outputs of a given dataset via invertible maps (or architectures). The proposed ISR method naturally combines the principles of Invertible Neural Networks (INNs) and Equation Learner (EQL), a neural network-based symbolic architecture for function learning. In particular, we transform the affine coupling blocks of INNs into a symbolic framework, resulting in an end-to-end differentiable symbolic invertible architecture that allows for efficient gradient-based learning. The proposed ISR framework also relies on sparsity promoting regularization, allowing the discovery of concise and interpretable invertible expressions. We show that ISR can serve as a (symbolic) normalizing flow for density estimation tasks. Furthermore, we highlight its practical applicability in solving inverse problems, including a benchmark inverse kinematics problem, and notably, a geoacoustic inversion problem in oceanography aimed at inferring posterior distributions of underlying seabed parameters from acoustic signals.
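The invertible primitive that ISR renders symbolic is the affine coupling block. A minimal sketch, with arbitrary callables standing in for the learned symbolic expressions s and t:

```python
import numpy as np

def coupling_forward(x1, x2, s_fn, t_fn):
    """RealNVP-style affine coupling: the first partition passes through
    unchanged, the second is scaled and shifted by functions of the
    first. Invertible regardless of how complex s_fn and t_fn are."""
    return x1, x2 * np.exp(s_fn(x1)) + t_fn(x1)

def coupling_inverse(y1, y2, s_fn, t_fn):
    """Exact inverse of the coupling block (no need to invert s or t)."""
    return y1, (y2 - t_fn(y1)) * np.exp(-s_fn(y1))
```

Invertibility holds by construction, so the round trip recovers the input exactly even for nonlinear (e.g. symbolic) s and t, which is what allows the architecture to act as a normalizing flow.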
https://arxiv.org/abs/2405.06848
This letter introduces a novel framework for dense Visual Simultaneous Localization and Mapping (VSLAM) based on Gaussian Splatting. Recently, Gaussian Splatting-based SLAM has yielded promising results, but relies on RGB-D input and is weak in tracking. To address these limitations, we uniquely integrate advanced sparse visual odometry with a dense Gaussian Splatting scene representation for the first time, thereby eliminating the dependency on depth maps typical of Gaussian Splatting-based SLAM systems and enhancing tracking robustness. Here, the sparse visual odometry tracks camera poses in the RGB stream, while Gaussian Splatting handles map reconstruction. These components are interconnected through a Multi-View Stereo (MVS) depth estimation network. We also propose a depth smooth loss to reduce the negative effect of estimated depth maps. Furthermore, the consistency in scale between the sparse visual odometry and the dense Gaussian map is preserved by the Sparse-Dense Adjustment Ring (SDAR). We have evaluated our system across various synthetic and real-world datasets. The accuracy of our pose estimation surpasses existing methods and achieves state-of-the-art performance. Additionally, it outperforms previous monocular methods in terms of novel view synthesis fidelity, matching the results of neural SLAM systems that utilize RGB-D input.
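A common form of depth smoothness loss is edge-aware: depth gradients are penalized less across strong image edges. The sketch below shows this standard formulation, which may differ from the paper's exact loss:

```python
import numpy as np

def depth_smooth_loss(depth, image):
    """Edge-aware depth smoothness sketch: absolute depth gradients are
    down-weighted where the (grayscale) image has strong gradients, so
    depth discontinuities are tolerated at image edges."""
    dz_x = np.abs(np.diff(depth, axis=1))
    dz_y = np.abs(np.diff(depth, axis=0))
    wx = np.exp(-np.abs(np.diff(image, axis=1)))   # low weight at image edges
    wy = np.exp(-np.abs(np.diff(image, axis=0)))
    return float((dz_x * wx).mean() + (dz_y * wy).mean())
```

A perfectly flat depth map incurs zero loss; noisy estimated depth is penalized, which is how such a term suppresses artifacts from the MVS depth network.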
https://arxiv.org/abs/2405.06241
Model Inversion (MI) attacks aim to reconstruct private training data by abusing access to machine learning models. Contemporary MI attacks have achieved impressive attack performance, posing serious threats to privacy. Meanwhile, all existing MI defense methods rely on regularization that is in direct conflict with the training objective, resulting in noticeable degradation in model utility. In this work, we take a different perspective, and propose a novel and simple Transfer Learning-based Defense against Model Inversion (TL-DMI) to render MI-robust models. Particularly, by leveraging TL, we limit the number of layers encoding sensitive information from private training dataset, thereby degrading the performance of MI attack. We conduct an analysis using Fisher Information to justify our method. Our defense is remarkably simple to implement. Without bells and whistles, we show in extensive experiments that TL-DMI achieves state-of-the-art (SOTA) MI robustness. Our code, pre-trained models, demo and inverted data are available at: this https URL
https://arxiv.org/abs/2405.05588
Event-based semantic segmentation (ESS) is a fundamental yet challenging task for event camera sensing. The difficulties in interpreting and annotating event data limit its scalability. While domain adaptation from images to event data can help to mitigate this issue, there exist data representational differences that require additional effort to resolve. In this work, for the first time, we synergize information from image, text, and event-data domains and introduce OpenESS to enable scalable ESS in an open-world, annotation-efficient manner. We achieve this goal by transferring the semantically rich CLIP knowledge from image-text pairs to event streams. To pursue better cross-modality adaptation, we propose a frame-to-event contrastive distillation and a text-to-event semantic consistency regularization. Experimental results on popular ESS benchmarks show that our approach outperforms existing methods. Notably, we achieve 53.93% and 43.31% mIoU on DDD17 and DSEC-Semantic without using either event or frame labels.
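The frame-to-event contrastive distillation can be sketched as an InfoNCE-style loss where paired frame/event embeddings are positives and all other pairs are negatives. This is a generic formulation, not necessarily the paper's exact objective:

```python
import numpy as np

def contrastive_distill_loss(z_frame, z_event, tau=0.1):
    """InfoNCE-style sketch of frame-to-event distillation: row i of
    `z_frame` and row i of `z_event` come from the same sample
    (positive pair); all cross pairs are negatives. Both inputs are
    (N, D) embedding matrices."""
    zf = z_frame / np.linalg.norm(z_frame, axis=1, keepdims=True)
    ze = z_event / np.linalg.norm(z_event, axis=1, keepdims=True)
    logits = zf @ ze.T / tau                       # cosine similarities / temperature
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))      # -log p(positive | row)
```

Minimizing this pulls each event embedding toward the CLIP-derived frame embedding of the same scene while pushing it away from other scenes' embeddings.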
https://arxiv.org/abs/2405.05259
Efficient data utilization is crucial for advancing 3D scene understanding in autonomous driving, where reliance on heavily human-annotated LiDAR point clouds challenges fully supervised methods. Addressing this, our study extends into semi-supervised learning for LiDAR semantic segmentation, leveraging the intrinsic spatial priors of driving scenes and multi-sensor complements to augment the efficacy of unlabeled datasets. We introduce LaserMix++, an evolved framework that integrates laser beam manipulations from disparate LiDAR scans and incorporates LiDAR-camera correspondences to further assist data-efficient learning. Our framework is tailored to enhance 3D scene consistency regularization by incorporating multi-modality, including 1) multi-modal LaserMix operation for fine-grained cross-sensor interactions; 2) camera-to-LiDAR feature distillation that enhances LiDAR feature learning; and 3) language-driven knowledge guidance generating auxiliary supervisions using open-vocabulary models. The versatility of LaserMix++ enables applications across LiDAR representations, establishing it as a universally applicable solution. Our framework is rigorously validated through theoretical analysis and extensive experiments on popular driving perception datasets. Results demonstrate that LaserMix++ markedly outperforms fully supervised alternatives, achieving comparable accuracy with five times fewer annotations and significantly improving the supervised-only baselines. This substantial advancement underscores the potential of semi-supervised approaches in reducing the reliance on extensive labeled data in LiDAR-based 3D scene understanding systems.
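The core LaserMix operation can be sketched as interleaving inclination-angle bins from two LiDAR scans; the full method also mixes labels and, in LaserMix++, camera features. The bin count and partitioning details below are illustrative:

```python
import numpy as np

def lasermix(points_a, points_b, n_bins=4):
    """Sketch of the LaserMix operation: partition two LiDAR scans into
    inclination-angle bins and take alternating bins from each scan.
    `points_a`, `points_b` are (N, 3) xyz arrays."""
    def inclination(p):
        # Angle above the sensor's horizontal plane, in [-pi/2, pi/2].
        return np.arctan2(p[:, 2], np.linalg.norm(p[:, :2], axis=1))
    edges = np.linspace(-np.pi / 2, np.pi / 2, n_bins + 1)
    mixed = []
    for k in range(n_bins):
        src = points_a if k % 2 == 0 else points_b   # alternate source scans
        inc = inclination(src)
        mask = (inc >= edges[k]) & (inc < edges[k + 1])
        mixed.append(src[mask])
    return np.concatenate(mixed)
```

The mixed scan preserves the scene-level spatial prior (each inclination band looks like a real scan) while combining content from two frames, which is what makes it useful for consistency regularization.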
https://arxiv.org/abs/2405.05258
Medical Image Synthesis (MIS) plays an important role in the intelligent medical field, which greatly saves the economic and time costs of medical diagnosis. However, due to the complexity of medical images and similar characteristics of different tissue cells, existing methods face great challenges in meeting their biological consistency. To this end, we propose the Hybrid Augmented Generative Adversarial Network (HAGAN) to maintain the authenticity of structural texture and tissue cells. HAGAN contains an Attention Mixed (AttnMix) Generator, a Hierarchical Discriminator and a Reverse Skip Connection between Discriminator and Generator. The AttnMix consistency differentiable regularization encourages the perception of structural and textural variations between real and fake images, which improves the pathological integrity of synthetic images and the accuracy of features in local areas. The Hierarchical Discriminator introduces pixel-by-pixel discriminant feedback to the generator for enhancing the saliency and discriminance of global and local details simultaneously. The Reverse Skip Connection further improves the accuracy of fine details by fusing real and synthetic distribution features. Our experimental evaluations on three datasets of different scales, i.e., COVID-CT, ACDC and BraTS2018, demonstrate that HAGAN outperforms the existing methods and achieves state-of-the-art performance at both high and low resolutions.
https://arxiv.org/abs/2405.04902
Change Detection (CD) aims to identify pixels with semantic changes between images. However, annotating massive numbers of pixel-level images is labor-intensive and costly, especially for multi-temporal images, which require pixel-wise comparisons by human experts. Considering the excellent performance of visual language models (VLMs) for zero-shot, open-vocabulary, etc. with prompt-based reasoning, it is promising to utilize VLMs to make better CD under limited labeled data. In this paper, we propose a VLM guidance-based semi-supervised CD method, namely DiffMatch. The insight of DiffMatch is to synthesize free change labels using VLMs to provide additional supervision signals for unlabeled data. However, almost all current VLMs are designed for single-temporal images and cannot be directly applied to bi- or multi-temporal images. Motivated by this, we first propose a VLM-based mixed change event generation (CEG) strategy to yield pseudo labels for unlabeled CD data. Since the additional supervised signals provided by these VLM-driven pseudo labels may conflict with the pseudo labels from the consistency regularization paradigm (e.g. FixMatch), we propose the dual projection head for de-entangling different signal sources. Further, we explicitly decouple the bi-temporal images semantic representation through two auxiliary segmentation decoders, which are also guided by VLM. Finally, to make the model more adequately capture change representations, we introduce metric-aware supervision by feature-level contrastive loss in auxiliary branches. Extensive experiments show the advantage of DiffMatch. For instance, DiffMatch improves the FixMatch baseline by +5.3 IoU on WHU-CD and by +2.4 IoU on LEVIR-CD with 5% labels. In addition, our CEG strategy, in an un-supervised manner, can achieve performance far superior to state-of-the-art un-supervised CD methods.
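A minimal version of deriving a change pseudo-label from two single-temporal predictions (e.g. VLM outputs) marks a pixel as changed wherever the predicted classes differ; the paper's CEG strategy mixes several such event types, which this sketch omits:

```python
import numpy as np

def change_pseudo_label(sem_t1, sem_t2):
    """Derive a binary change mask from two single-temporal semantic
    maps: 1 where the predicted class differs between time points,
    0 elsewhere. Inputs are integer class-id arrays of equal shape."""
    return (sem_t1 != sem_t2).astype(np.uint8)
```

This is the sense in which single-temporal VLMs, which cannot ingest bi-temporal inputs directly, can still supply free change supervision for unlabeled image pairs.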
https://arxiv.org/abs/2405.04788
Despite notable successes of Reinforcement Learning (RL), the prevalent use of an online learning paradigm prevents its widespread adoption, especially in hazardous or costly scenarios. Offline RL has emerged as an alternative solution, learning from pre-collected static datasets. However, this offline learning introduces a new challenge known as distributional shift, degrading the performance when the policy is evaluated on scenarios that are Out-Of-Distribution (OOD) from the training dataset. Most existing offline RL resolves this issue by regularizing policy learning within the information supported by the given dataset. However, such regularization overlooks the potential for high-reward regions that may exist beyond the dataset. This motivates exploring novel offline learning techniques that can make improvements beyond the data support without compromising policy performance, potentially by learning causation (cause-and-effect) instead of correlation from the dataset. In this paper, we propose the MOOD-CRL (Model-based Offline OOD-Adapting Causal RL) algorithm, which aims to address the challenge of extrapolation for offline policy training through causal inference instead of policy-regularizing methods. Specifically, Causal Normalizing Flow (CNF) is developed to learn the transition and reward functions for data generation and augmentation in offline policy evaluation and training. Based on the data-invariant, physics-based qualitative causal graph and the observational data, we develop a novel learning scheme for CNF to learn the quantitative structural causal model. As a result, CNF gains predictive and counterfactual reasoning capabilities for sequential decision-making tasks, revealing a high potential for OOD adaptation. Our CNF-based offline RL approach is validated through empirical evaluations, outperforming model-free and model-based methods by a significant margin.
https://arxiv.org/abs/2405.03892
We developed a deep learning classifier of rectal cancer response (tumor vs. no-tumor) to total neoadjuvant treatment (TNT) from endoscopic images acquired before, during, and following TNT. We further evaluated the network's ability in a near out-of-distribution (OOD) problem to identify local regrowth (LR) from follow-up endoscopy images acquired several months to years after completing TNT. We addressed endoscopic image variability by using optimal mass transport-based image harmonization. We evaluated multiple training regularization schemes to study the ResNet-50 network's in-distribution and near-OOD generalization ability. Test time augmentation resulted in the most considerable accuracy improvement. Image harmonization resulted in slight accuracy improvement for the near-OOD cases. Our results suggest that off-the-shelf deep learning classifiers can detect rectal cancer from endoscopic images at various stages of therapy for surveillance.
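Test time augmentation, the scheme that gave the largest accuracy gain here, can be sketched as averaging model outputs over a small family of augmented views. Horizontal flips are used below as the augmentation family, which is an assumption rather than the study's exact protocol:

```python
import numpy as np

def tta_predict(model, image, flips=(False, True)):
    """Minimal test-time-augmentation sketch: run the model on the
    original and horizontally flipped image and average the outputs.
    `model` maps a 2D array to a probability vector."""
    preds = []
    for flip in flips:
        view = image[:, ::-1] if flip else image   # horizontal flip
        preds.append(model(view))
    return np.mean(preds, axis=0)
```

Averaging over views reduces sensitivity to nuisance variation in endoscopic images, the same variability the optimal-mass-transport harmonization targets at the intensity level.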
https://arxiv.org/abs/2405.03762
In this paper, we introduce a new method for the task of interaction transfer. Given an example interaction between a source object and an agent, our method can automatically infer both surface and spatial relationships for the agent and target objects within the same category, yielding more accurate and valid transfers. Specifically, our method characterizes the example interaction using a combined spatial and surface representation. We correspond the agent points and object points related to the representation to the target object space using a learned spatial and surface correspondence field, which represents objects as deformed and rotated signed distance fields. With the corresponded points, an optimization is performed under the constraints of our spatial and surface interaction representation and additional regularization. Experiments conducted on human-chair and hand-mug interaction transfer tasks show that our approach can handle larger geometry and topology variations between source and target shapes, significantly outperforming state-of-the-art methods.
https://arxiv.org/abs/2405.03221
Deep learning has emerged as a promising approach for learning the nonlinear mapping between diffusion-weighted MR images and tissue parameters, which enables automatic and deep understanding of the brain microstructures. However, the efficiency and accuracy in the multi-parametric estimations are still limited since previous studies tend to estimate multi-parametric maps with dense sampling and isolated signal modeling. This paper proposes DeepMpMRI, a unified framework for fast and high-fidelity multi-parametric estimation from various diffusion models using sparsely sampled q-space data. DeepMpMRI is equipped with a newly designed tensor-decomposition-based regularizer to effectively capture fine details by exploiting the correlation across parameters. In addition, we introduce a Nesterov-based adaptive learning algorithm that optimizes the regularization parameter dynamically to enhance the performance. DeepMpMRI is an extendable framework capable of incorporating flexible network architecture. Experimental results demonstrate the superiority of our approach over 5 state-of-the-art methods in simultaneously estimating multi-parametric maps for various diffusion models with fine-grained details both quantitatively and qualitatively, achieving 4.5 - 22.5$\times$ acceleration compared to the dense sampling of a total of 270 diffusion gradients.
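The Nesterov-style accelerated update referenced for the adaptive scheme follows the generic lookahead-gradient pattern below; the paper applies such a scheme to the regularization parameter rather than to a plain quadratic, so this shows only the underlying mechanics:

```python
import numpy as np

def nesterov_minimize(grad, x0, lr=0.1, momentum=0.9, steps=200):
    """Generic Nesterov accelerated gradient sketch: evaluate the
    gradient at a momentum lookahead point, then update velocity and
    iterate. `grad` is the gradient of the objective."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(steps):
        lookahead = x + momentum * v            # peek ahead along the momentum
        v = momentum * v - lr * grad(lookahead) # velocity update at lookahead
        x = x + v
    return x
```

On a 1D quadratic the iterates converge to the minimizer far faster than plain gradient descent with the same step size, which is the motivation for using an accelerated update to tune the regularization weight online.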
https://arxiv.org/abs/2405.03159