Haze severely degrades the visual quality of remote sensing images and hampers the performance of automotive navigation, intelligent monitoring, and urban management. The emerging denoising diffusion probabilistic model (DDPM) exhibits significant potential for dense haze removal with its strong generation ability. Since remote sensing images contain extensive small-scale texture structures, it is important to effectively restore image details from hazy images. However, the current wisdom of DDPM fails to preserve image details and color fidelity well, limiting its dehazing capacity for remote sensing images. In this paper, we propose a novel unified Fourier-aware diffusion model for remote sensing image dehazing, termed RSHazeDiff. From a new perspective, RSHazeDiff explores the conditional DDPM to improve image quality in dense hazy scenarios, and it makes three key contributions. First, RSHazeDiff refines the training phase of the diffusion process by performing noise estimation and reconstruction constraints in a coarse-to-fine fashion. Thus, it remedies the unsatisfactory results caused by the simple noise estimation constraint in DDPM. Second, by taking the frequency information as important prior knowledge during iterative sampling steps, RSHazeDiff can preserve more texture details and color fidelity in dehazed images. Third, we design a global compensated learning module that uses the Fourier transform to capture the global dependency features of input images, which can effectively mitigate the effects of boundary artifacts when processing fixed-size patches. Experiments on both synthetic and real-world benchmarks validate the favorable performance of RSHazeDiff over multiple state-of-the-art methods. Source code will be released at this https URL.
Haze severely degrades the visual quality of remote sensing images and hampers automotive navigation, intelligent monitoring, and urban management. The emerging denoising diffusion probabilistic model (DDPM) shows significant potential for dense haze removal thanks to its strong generative ability. Because remote sensing images contain extensive small-scale texture structures, it is important to effectively restore image details from hazy images. However, existing DDPM approaches fail to preserve image details and color fidelity well, which limits their dehazing capacity for remote sensing images. In this paper, we propose a novel unified Fourier-aware diffusion model for remote sensing image dehazing, termed RSHazeDiff. From a new perspective, RSHazeDiff explores the conditional DDPM to improve image quality in dense hazy scenarios and makes three key contributions. First, RSHazeDiff refines the training phase of the diffusion process by performing noise estimation and reconstruction constraints in a coarse-to-fine fashion, thereby remedying the unsatisfactory results caused by the simple noise-estimation constraint in DDPM. Second, by taking frequency information as important prior knowledge during the iterative sampling steps, RSHazeDiff preserves more texture details and color fidelity in dehazed images. Third, we design a global compensated learning module that uses the Fourier transform to capture the global dependency features of input images, which effectively mitigates boundary artifacts when processing fixed-size patches. Experiments on both synthetic and real-world benchmarks validate the favorable performance of RSHazeDiff over multiple state-of-the-art methods. The source code will be released at https://www.xxxxxx.
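As a rough illustration of the frequency-prior idea described above (not RSHazeDiff's exact formulation), the sketch below computes a Fourier-domain loss comparing the amplitude and phase of a dehazed estimate against a clean reference; the function name and weights are hypothetical.

```python
import torch
import torch.fft

def frequency_prior_loss(pred, target, amp_weight=1.0, pha_weight=0.1):
    """L1 penalty on Fourier amplitude (and, lightly, phase) of a dehazed
    prediction vs. a clean reference; amplitude carries most color/contrast
    information, phase carries structure."""
    pred_f = torch.fft.fft2(pred, dim=(-2, -1))
    tgt_f = torch.fft.fft2(target, dim=(-2, -1))
    amp_loss = (pred_f.abs() - tgt_f.abs()).abs().mean()
    pha_loss = (torch.angle(pred_f) - torch.angle(tgt_f)).abs().mean()
    return amp_weight * amp_loss + pha_weight * pha_loss

# usage on a batch of (B, C, H, W) images
x_hat = torch.rand(2, 3, 64, 64)   # dehazed estimate
x_gt = torch.rand(2, 3, 64, 64)    # clean reference
loss = frequency_prior_loss(x_hat, x_gt)
```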
https://arxiv.org/abs/2405.09083
Reports regarding the misuse of $\textit{Generative AI}$ ($\textit{GenAI}$) to create harmful deepfakes are emerging daily. Recently, defensive watermarking, which enables $\textit{GenAI}$ providers to hide fingerprints in their images to later use for deepfake detection, has been on the rise. Yet, its potential has not been fully explored. We present $\textit{UnMarker}$ -- the first practical $\textit{universal}$ attack on defensive watermarking. Unlike existing attacks, $\textit{UnMarker}$ requires no detector feedback, no unrealistic knowledge of the scheme or similar models, and no advanced denoising pipelines that may not be available. Instead, it is the product of an in-depth analysis of the watermarking paradigm, which reveals that robust schemes must construct their watermarks in the spectral amplitudes; $\textit{UnMarker}$ therefore employs two novel adversarial optimizations to disrupt the spectra of watermarked images, erasing the watermarks. Evaluations against the $\textit{SOTA}$ prove its effectiveness, not only defeating traditional schemes while retaining superior quality compared to existing attacks but also breaking $\textit{semantic}$ watermarks that alter the image's structure, reducing the best detection rate to $43\%$ and rendering them useless. To our knowledge, $\textit{UnMarker}$ is the first practical attack on $\textit{semantic}$ watermarks, which have been deemed the future of robust watermarking. $\textit{UnMarker}$ casts doubt on the very potential of this countermeasure and exposes its paradoxical nature, as designing schemes for robustness inevitably compromises other aspects of robustness.
Reports of Generative AI (GenAI) being misused to create harmful deepfakes are emerging daily. Recently, defensive watermarking, which lets GenAI providers hide fingerprints in their images for later deepfake detection, has been on the rise; yet its potential has not been fully explored. We present UnMarker, the first practical universal attack on defensive watermarking. Unlike existing attacks, UnMarker requires no detector feedback, no unrealistic knowledge of the scheme or similar models, and no advanced denoising pipelines that may not be available. Instead, building on an in-depth analysis of the watermarking paradigm which reveals that robust schemes must construct their watermarks in the spectral amplitudes, UnMarker employs two novel adversarial optimizations to disrupt the spectra of watermarked images and erase the watermarks. Evaluations against state-of-the-art schemes prove its effectiveness: it not only defeats traditional schemes while retaining superior image quality compared to existing attacks, but also breaks semantic watermarks that alter the image's structure, reducing the best detection rate to 43% and rendering them useless. To our knowledge, UnMarker is the first practical attack on semantic watermarks, which have been deemed the future of robust watermarking. UnMarker casts doubt on the very potential of this countermeasure and exposes its paradoxical nature, since designing schemes for robustness inevitably compromises other aspects of robustness.
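To make the spectral-amplitude argument concrete, here is a toy sketch (my own, not UnMarker's learned adversarial optimization) that perturbs only the Fourier amplitudes of an image while keeping the phase; a watermark encoded in the amplitudes would be disturbed, while most visible structure, carried by the phase, survives.

```python
import numpy as np

def perturb_spectral_amplitude(img, strength=0.05, seed=0):
    """Multiply the Fourier amplitude of each channel by a random field while
    keeping the phase, then invert the FFT. Toy illustration of amplitude-space
    disruption only; quality-preserving constraints are omitted."""
    rng = np.random.default_rng(seed)
    out = np.empty_like(img, dtype=np.float64)
    for c in range(img.shape[2]):
        spec = np.fft.fft2(img[..., c])
        amp, phase = np.abs(spec), np.angle(spec)
        amp = amp * (1.0 + strength * rng.standard_normal(amp.shape))
        out[..., c] = np.real(np.fft.ifft2(amp * np.exp(1j * phase)))
    return np.clip(out, 0.0, 1.0)

watermarked = np.random.rand(64, 64, 3)       # stand-in for a watermarked image
attacked = perturb_spectral_amplitude(watermarked)
```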
https://arxiv.org/abs/2405.08363
In recent years, diffusion models (DMs) have become a popular method for generating synthetic data. By producing samples of higher quality, they quickly surpassed generative adversarial networks (GANs) and became the current state-of-the-art method in generative modeling. However, their potential has not yet been exploited in radar, where the lack of available training data is a long-standing problem. In this work, a specific type of DM, namely the denoising diffusion probabilistic model (DDPM), is adapted to the SAR domain. We investigate the network choice and specific diffusion parameters for conditional and unconditional SAR image generation. In our experiments, we show that DDPM qualitatively and quantitatively outperforms state-of-the-art GAN-based methods for SAR image generation. Finally, we show that DDPM benefits from pretraining on large-scale clutter data, generating SAR images of even higher quality.
In recent years, diffusion models (DMs) have become a popular method for generating synthetic data. By producing samples of higher quality, they quickly surpassed generative adversarial networks (GANs) and became the state of the art in generative modeling. However, their potential has not yet been exploited in radar, where the lack of available training data is a long-standing problem. In this work, a specific type of DM, the denoising diffusion probabilistic model (DDPM), is adapted to the SAR domain. We investigate the network choice and specific diffusion parameters for conditional and unconditional SAR image generation. Our experiments show that DDPM qualitatively and quantitatively outperforms state-of-the-art GAN-based methods for SAR image generation. Finally, we show that DDPM benefits from pretraining on large-scale clutter data, generating SAR images of even higher quality.
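For readers unfamiliar with the objective being adapted here, a minimal sketch of the standard (unconditional) DDPM training step follows; `model` stands for any noise-prediction network taking a noisy image and a timestep, and a SAR-specific condition such as a clutter crop or class label would enter as an extra input.

```python
import torch

def ddpm_training_step(model, x0, T=1000, beta_start=1e-4, beta_end=0.02):
    """One DDPM training step: sample a timestep, noise the clean image with
    the closed-form forward process, and regress the injected noise."""
    betas = torch.linspace(beta_start, beta_end, T, device=x0.device)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)              # \bar{alpha}_t
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)  # per-sample step
    eps = torch.randn_like(x0)
    a = alpha_bar[t].view(-1, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * eps               # q(x_t | x_0)
    eps_hat = model(x_t, t)                                     # noise prediction
    return torch.nn.functional.mse_loss(eps_hat, eps)
```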
https://arxiv.org/abs/2405.07776
Integrating the different data modalities of cancer patients can significantly improve the predictive performance of patient survival. However, most existing methods ignore the simultaneous utilization of rich semantic features at different scales in pathology images. When collecting multimodal data and extracting features, there is a likelihood of encountering intra-modality missing data, introducing noise into the multimodal data. To address these challenges, this paper proposes a new end-to-end framework, FORESEE, for robustly predicting patient survival by mining multimodal information. Specifically, the cross-fusion transformer effectively utilizes features at the cellular level, tissue level, and tumor heterogeneity level to correlate prognosis through a cross-scale feature cross-fusion method. This enhances the ability of pathological image feature representation. Secondly, the hybrid attention encoder (HAE) uses the denoising contextual attention module to obtain the contextual relationship features and local detail features of the molecular data. HAE's channel attention module obtains global features of molecular data. Furthermore, to address the issue of missing information within modalities, we propose an asymmetrically masked triplet masked autoencoder to reconstruct lost information within modalities. Extensive experiments demonstrate the superiority of our method over state-of-the-art methods on four benchmark datasets in both complete and missing settings.
Integrating the different data modalities of cancer patients can significantly improve the predictive performance of patient survival. However, most existing methods ignore the simultaneous use of rich semantic features at different scales in pathology images. When collecting multimodal data and extracting features, intra-modality missing data are likely to be encountered, introducing noise into the multimodal data. To address these challenges, this paper proposes a new end-to-end framework, FORESEE, for robustly predicting patient survival by mining multimodal information. Specifically, a cross-fusion transformer effectively exploits features at the cellular level, tissue level, and tumor-heterogeneity level and correlates them with prognosis through a cross-scale feature cross-fusion method, enhancing the representational ability of pathological image features. Second, the hybrid attention encoder (HAE) uses a denoising contextual attention module to obtain the contextual relationship features and local detail features of the molecular data, while its channel attention module obtains global features of the molecular data. Furthermore, to address missing information within modalities, we propose an asymmetrically masked triplet masked autoencoder to reconstruct the lost information. Extensive experiments demonstrate the superiority of our method over state-of-the-art methods on four benchmark datasets in both complete and missing settings.
https://arxiv.org/abs/2405.07702
Pedestrian trajectory prediction plays a pivotal role in the realms of autonomous driving and smart cities. Despite extensive prior research employing sequence and generative models, the unpredictable nature of pedestrians, influenced by their social interactions and individual preferences, presents challenges marked by uncertainty and multimodality. In response, we propose the Energy Plan Denoising (EPD) model for stochastic trajectory prediction. EPD initially provides a coarse estimation of the distribution of future trajectories, termed the Plan, utilizing the Langevin Energy Model. Subsequently, it refines this estimation through denoising via the Probabilistic Diffusion Model. By initiating denoising with the Plan, EPD effectively reduces the need for iterative steps, thereby enhancing efficiency. Furthermore, EPD differs from conventional approaches by modeling the distribution of trajectories instead of individual trajectories. This allows for the explicit modeling of pedestrian intrinsic uncertainties and eliminates the need for multiple denoising operations. A single denoising operation produces a distribution from which multiple samples can be drawn, significantly enhancing efficiency. Moreover, EPD's fine-tuning of the Plan contributes to improved model performance. We validate EPD on two publicly available datasets, where it achieves state-of-the-art results. Additionally, ablation experiments underscore the contributions of individual modules, affirming the efficacy of the proposed approach.
Pedestrian trajectory prediction plays a pivotal role in autonomous driving and smart cities. Although prior research has made extensive use of sequence and generative models, the unpredictable nature of pedestrians, shaped by their social interactions and individual preferences, poses challenges marked by uncertainty and multimodality. In response, we propose the Energy Plan Denoising (EPD) model for stochastic trajectory prediction. EPD first uses the Langevin energy model to provide a coarse estimate of the distribution of future trajectories, termed the Plan, and then refines this estimate by denoising with the probabilistic diffusion model. By initiating denoising from the Plan, EPD effectively reduces the number of iterative steps required, improving efficiency. Moreover, unlike conventional approaches, EPD models the distribution of trajectories rather than individual trajectories, which allows the intrinsic uncertainty of pedestrians to be modeled explicitly and removes the need for multiple denoising operations: a single denoising operation yields a distribution from which multiple samples can be drawn, greatly improving efficiency. In addition, EPD's fine-tuning of the Plan further improves model performance. We validate EPD on two publicly available datasets, where it achieves state-of-the-art results, and ablation experiments confirm the contributions of the individual modules.
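A minimal sketch of the Langevin dynamics used to draw a coarse Plan from a learned energy model is given below; the energy function, step size, and number of steps are illustrative stand-ins, and EPD's subsequent diffusion-based refinement is not shown.

```python
import torch

def langevin_plan(energy_fn, n_samples, dim, steps=50, step_size=0.01):
    """Draw a coarse set of future-trajectory samples (the "Plan") with
    unadjusted Langevin dynamics on an energy function E(y):
        y <- y - 0.5 * step_size * grad E(y) + sqrt(step_size) * noise"""
    y = torch.randn(n_samples, dim, requires_grad=True)
    for _ in range(steps):
        energy = energy_fn(y).sum()
        grad, = torch.autograd.grad(energy, y)
        with torch.no_grad():
            y = y - 0.5 * step_size * grad + step_size ** 0.5 * torch.randn_like(y)
        y.requires_grad_(True)
    return y.detach()

# toy quadratic energy pulling samples toward the origin (stand-in for a learned model)
plan = langevin_plan(lambda y: 0.5 * (y ** 2).sum(dim=-1), n_samples=20, dim=24)
```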
https://arxiv.org/abs/2405.07164
Denoising diffusion models (DDM) have gained recent traction in medical image translation given improved training stability over adversarial models. DDMs learn a multi-step denoising transformation to progressively map random Gaussian-noise images onto target-modality images, while receiving stationary guidance from source-modality images. As this denoising transformation diverges significantly from the task-relevant source-to-target transformation, DDMs can suffer from weak source-modality guidance. Here, we propose a novel self-consistent recursive diffusion bridge (SelfRDB) for improved performance in medical image translation. Unlike DDMs, SelfRDB employs a novel forward process with start- and end-points defined based on target and source images, respectively. Intermediate image samples across the process are expressed via a normal distribution with mean taken as a convex combination of start-end points, and variance from additive noise. Unlike regular diffusion bridges that prescribe zero variance at start-end points and high variance at mid-point of the process, we propose a novel noise scheduling with monotonically increasing variance towards the end-point in order to boost generalization performance and facilitate information transfer between the two modalities. To further enhance sampling accuracy in each reverse step, we propose a novel sampling procedure where the network recursively generates a transient-estimate of the target image until convergence onto a self-consistent solution. Comprehensive analyses in multi-contrast MRI and MRI-CT translation indicate that SelfRDB offers superior performance against competing methods.
Denoising diffusion models (DDMs) have recently gained traction in medical image translation because they offer improved training stability over adversarial models. DDMs learn a multi-step denoising transformation that progressively maps random Gaussian-noise images onto target-modality images while receiving stationary guidance from source-modality images. Because this denoising transformation diverges significantly from the task-relevant source-to-target transformation, DDMs can suffer from weak source-modality guidance. Here, we propose a novel self-consistent recursive diffusion bridge (SelfRDB) for improved performance in medical image translation. Unlike DDMs, SelfRDB employs a novel forward process whose start and end points are defined by the target and source images, respectively. Intermediate image samples across the process are expressed via a normal distribution whose mean is a convex combination of the start and end points and whose variance comes from additive noise. Unlike regular diffusion bridges, which prescribe zero variance at the start and end points and high variance at the mid-point of the process, we propose a novel noise schedule with monotonically increasing variance towards the end point, in order to boost generalization performance and facilitate information transfer between the two modalities. To further enhance sampling accuracy in each reverse step, we propose a novel sampling procedure in which the network recursively generates a transient estimate of the target image until it converges onto a self-consistent solution. Comprehensive analyses on multi-contrast MRI and MRI-CT translation indicate that SelfRDB offers superior performance over competing methods.
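The forward bridge process described above can be sketched in a few lines; the linear mean interpolation follows the convex-combination idea, while the linear noise schedule below is only a stand-in for SelfRDB's monotonically increasing variance schedule.

```python
import torch

def bridge_forward_sample(x_target, y_source, t, sigma_max=1.0):
    """Sample an intermediate image of a diffusion bridge at time t in [0, 1]:
    the mean is a convex combination of the start point (target image, t=0)
    and the end point (source image, t=1), and the noise std grows
    monotonically toward the end point."""
    mean = (1.0 - t) * x_target + t * y_source
    std = sigma_max * t                      # monotonically increasing variance
    return mean + std * torch.randn_like(mean)

x0 = torch.rand(1, 1, 64, 64)   # target-modality image (start point)
y1 = torch.rand(1, 1, 64, 64)   # source-modality image (end point)
x_mid = bridge_forward_sample(x0, y1, t=0.7)
```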
https://arxiv.org/abs/2405.06789
We propose a gradient flow procedure for generative modeling by transporting particles from an initial source distribution to a target distribution, where the gradient field on the particles is given by a noise-adaptive Wasserstein Gradient of the Maximum Mean Discrepancy (MMD). The noise-adaptive MMD is trained on data distributions corrupted by increasing levels of noise, obtained via a forward diffusion process, as commonly used in denoising diffusion probabilistic models. The result is a generalization of MMD Gradient Flow, which we call Diffusion-MMD-Gradient Flow or DMMD. The divergence training procedure is related to discriminator training in Generative Adversarial Networks (GAN), but does not require adversarial training. We obtain competitive empirical performance in unconditional image generation on CIFAR10, MNIST, CELEB-A (64 x64) and LSUN Church (64 x 64). Furthermore, we demonstrate the validity of the approach when MMD is replaced by a lower bound on the KL divergence.
We propose a gradient flow procedure for generative modeling that transports particles from an initial source distribution to a target distribution, where the gradient field acting on the particles is given by a noise-adaptive Wasserstein gradient of the maximum mean discrepancy (MMD). The noise-adaptive MMD is trained on data distributions corrupted by increasing levels of noise, obtained via a forward diffusion process, as commonly used in denoising diffusion probabilistic models. The result is a generalization of MMD gradient flow, which we call Diffusion-MMD-Gradient Flow, or DMMD. The divergence training procedure is related to discriminator training in generative adversarial networks (GANs) but does not require adversarial training. We obtain competitive empirical performance in unconditional image generation on CIFAR10, MNIST, CELEB-A (64x64), and LSUN Church (64x64). Furthermore, we demonstrate the validity of the approach when the MMD is replaced by a lower bound on the KL divergence.
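For intuition, a plain (non noise-adaptive) MMD gradient-flow step looks like the following; DMMD replaces the fixed Gaussian-kernel witness with a learned, noise-adaptive one trained on diffused data.

```python
import torch

def mmd2(x, y, bandwidth=1.0):
    """Biased estimate of squared MMD between particle sets x and y
    under a Gaussian kernel."""
    def k(a, b):
        d2 = torch.cdist(a, b) ** 2
        return torch.exp(-d2 / (2.0 * bandwidth ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

def mmd_flow_step(particles, data, lr=0.5):
    """One MMD gradient-flow step: move the particles along the negative
    gradient of MMD^2 with respect to their positions."""
    particles = particles.clone().requires_grad_(True)
    loss = mmd2(particles, data)
    grad, = torch.autograd.grad(loss, particles)
    return (particles - lr * grad).detach()

data = torch.randn(256, 2) + torch.tensor([3.0, 0.0])   # toy target distribution
parts = torch.randn(256, 2)                              # source particles
for _ in range(100):
    parts = mmd_flow_step(parts, data)
```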
https://arxiv.org/abs/2405.06780
Thanks to the powerful generative capacity of diffusion models, recent years have witnessed rapid progress in human motion generation. Existing diffusion-based methods employ disparate network architectures and training strategies. The effect of the design of each component is still unclear. In addition, the iterative denoising process consumes considerable computational overhead, which is prohibitive for real-time scenarios such as virtual characters and humanoid robots. For this reason, we first conduct a comprehensive investigation into network architectures, training strategies, and inference processes. Based on this analysis, we tailor each component for efficient high-quality human motion generation. Despite the promising performance, the tailored model still suffers from foot skating, which is a ubiquitous issue in diffusion-based solutions. To eliminate foot skating, we identify foot-ground contact and correct foot motions along the denoising process. By organically combining these well-designed components together, we present StableMoFusion, a robust and efficient framework for human motion generation. Extensive experimental results show that our StableMoFusion performs favorably against current state-of-the-art methods. Project page: this https URL
Thanks to the powerful generative capacity of diffusion models, recent years have seen rapid progress in human motion generation. Existing diffusion-based methods employ disparate network architectures and training strategies, and the effect of each component's design remains unclear. In addition, the iterative denoising process incurs considerable computational overhead, which is prohibitive for real-time scenarios such as virtual characters and humanoid robots. We therefore first conduct a comprehensive investigation into network architectures, training strategies, and inference processes, and based on this analysis we tailor each component for efficient, high-quality human motion generation. Despite its promising performance, the tailored model still suffers from foot skating, a ubiquitous issue in diffusion-based solutions. To eliminate foot skating, we identify foot-ground contact and correct foot motions along the denoising process. By organically combining these well-designed components, we present StableMoFusion, a robust and efficient framework for human motion generation. Extensive experimental results show that StableMoFusion performs favorably against current state-of-the-art methods. Project page: this https URL
https://arxiv.org/abs/2405.05691
Diffusion Models (DMs) have exhibited superior performance in generating high-quality and diverse images. However, this exceptional performance comes at the cost of expensive architectural design, particularly due to the attention module heavily used in leading models. Existing works mainly adopt a retraining process to enhance DM efficiency. This is computationally expensive and not very scalable. To this end, we introduce the Attention-driven Training-free Efficient Diffusion Model (AT-EDM) framework that leverages attention maps to perform run-time pruning of redundant tokens, without the need for any retraining. Specifically, for single-denoising-step pruning, we develop a novel ranking algorithm, Generalized Weighted Page Rank (G-WPR), to identify redundant tokens, and a similarity-based recovery method to restore tokens for the convolution operation. In addition, we propose a Denoising-Steps-Aware Pruning (DSAP) approach to adjust the pruning budget across different denoising timesteps for better generation quality. Extensive evaluations show that AT-EDM performs favorably against prior art in terms of efficiency (e.g., 38.8% FLOPs saving and up to 1.53x speed-up over Stable Diffusion XL) while maintaining nearly the same FID and CLIP scores as the full model. Project webpage: this https URL.
Diffusion models (DMs) have exhibited superior performance in generating high-quality and diverse images, but this exceptional performance comes at the cost of expensive architectural design, particularly due to the attention modules heavily used in leading models. Existing works mainly adopt a retraining process to enhance DM efficiency, which is computationally expensive and not very scalable. To this end, we introduce the Attention-driven Training-free Efficient Diffusion Model (AT-EDM) framework, which leverages attention maps to prune redundant tokens at run time without any retraining. Specifically, for single-denoising-step pruning, we develop a novel ranking algorithm, Generalized Weighted Page Rank (G-WPR), to identify redundant tokens, and a similarity-based recovery method to restore tokens for the convolution operation. In addition, we propose a Denoising-Steps-Aware Pruning (DSAP) approach that adjusts the pruning budget across different denoising timesteps for better generation quality. Extensive evaluations show that AT-EDM performs favorably against prior art in terms of efficiency (e.g., 38.8% FLOPs savings and up to a 1.53x speed-up over Stable Diffusion XL) while maintaining nearly the same FID and CLIP scores as the full model. Project webpage: this https URL.
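A simplified sketch of attention-driven token pruning is shown below; it ranks tokens by the attention they receive and keeps the top fraction, which is only a crude stand-in for the G-WPR ranking and similarity-based recovery in AT-EDM.

```python
import torch

def prune_tokens_by_attention(tokens, attn, keep_ratio=0.6):
    """Rank tokens by how much attention they receive (column sums of the
    attention map, averaged over heads) and keep the top fraction."""
    # tokens: (B, N, C); attn: (B, heads, N, N), softmax-normalized over keys
    scores = attn.mean(dim=1).sum(dim=1)            # (B, N) attention received
    k = max(1, int(keep_ratio * tokens.shape[1]))
    idx = scores.topk(k, dim=1).indices             # indices of kept tokens
    kept = torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1]))
    return kept, idx

B, H, N, C = 2, 8, 1024, 320
x = torch.randn(B, N, C)
a = torch.softmax(torch.randn(B, H, N, N), dim=-1)
x_pruned, kept_idx = prune_tokens_by_attention(x, a)
```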
https://arxiv.org/abs/2405.05252
Diffusion models are a powerful generative framework, but come with expensive inference. Existing acceleration methods often compromise image quality or fail under complex conditioning when operating in an extremely low-step regime. In this work, we propose a novel distillation framework tailored to enable high-fidelity, diverse sample generation using just one to three steps. Our approach comprises three key components: (i) Backward Distillation, which mitigates training-inference discrepancies by calibrating the student on its own backward trajectory; (ii) Shifted Reconstruction Loss that dynamically adapts knowledge transfer based on the current time step; and (iii) Noise Correction, an inference-time technique that enhances sample quality by addressing singularities in noise prediction. Through extensive experiments, we demonstrate that our method outperforms existing competitors in quantitative metrics and human evaluations. Remarkably, it achieves performance comparable to the teacher model using only three denoising steps, enabling efficient high-quality generation.
Diffusion models are a powerful generative framework but come with expensive inference. Existing acceleration methods often compromise image quality or fail under complex conditioning when operating in an extremely low-step regime. In this work, we propose a novel distillation framework tailored to enable high-fidelity, diverse sample generation using just one to three steps. Our approach comprises three key components: (i) Backward Distillation, which mitigates training-inference discrepancies by calibrating the student on its own backward trajectory; (ii) a Shifted Reconstruction Loss that dynamically adapts knowledge transfer based on the current time step; and (iii) Noise Correction, an inference-time technique that enhances sample quality by addressing singularities in noise prediction. Through extensive experiments, we demonstrate that our method outperforms existing competitors in quantitative metrics and human evaluations. Remarkably, it achieves performance comparable to the teacher model using only three denoising steps, enabling efficient high-quality generation.
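A hedged sketch of the backward-distillation idea: the student is unrolled along its own few-step backward trajectory and matched to the teacher at each visited state. The timestep schedule, the matching target, and the simplified state update (no re-noising between steps) are assumptions, not the paper's exact procedure.

```python
import torch

def backward_distillation_loss(student, teacher, z_T, steps=(999, 666, 333)):
    """Unroll the student from pure noise z_T along its own backward
    trajectory; at each visited state, match the teacher's denoised estimate.
    `student(z, t)` and `teacher(z, t)` both return a denoised image."""
    z = z_T
    loss = 0.0
    for t in steps:
        t_batch = torch.full((z.shape[0],), t, device=z.device, dtype=torch.long)
        x_student = student(z, t_batch)
        with torch.no_grad():
            x_teacher = teacher(z, t_batch)
        loss = loss + torch.nn.functional.mse_loss(x_student, x_teacher)
        z = x_student.detach()   # next state comes from the student's own output
    return loss / len(steps)
```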
https://arxiv.org/abs/2405.05224
The 3D Human Pose Estimation (3D HPE) task uses 2D images or videos to predict human joint coordinates in 3D space. Despite recent advancements in deep learning-based methods, they mostly ignore the capability of coupling accessible texts and naturally feasible knowledge of humans, missing out on valuable implicit supervision to guide the 3D HPE task. Moreover, previous efforts often study this task from the perspective of the whole human body, neglecting fine-grained guidance hidden in different body parts. To this end, we present a new Fine-Grained Prompt-Driven Denoiser based on a diffusion model for 3D HPE, named \textbf{FinePOSE}. It consists of three core blocks enhancing the reverse process of the diffusion model: (1) Fine-grained Part-aware Prompt learning (FPP) block constructs fine-grained part-aware prompts via coupling accessible texts and naturally feasible knowledge of body parts with learnable prompts to model implicit guidance. (2) Fine-grained Prompt-pose Communication (FPC) block establishes fine-grained communications between learned part-aware prompts and poses to improve the denoising quality. (3) Prompt-driven Timestamp Stylization (PTS) block integrates learned prompt embedding and temporal information related to the noise level to enable adaptive adjustment at each denoising step. Extensive experiments on public single-human pose estimation datasets show that FinePOSE outperforms state-of-the-art methods. We further extend FinePOSE to multi-human pose estimation. Achieving 34.3mm average MPJPE on the EgoHumans dataset demonstrates the potential of FinePOSE to deal with complex multi-human scenarios. Code is available at this https URL.
3D 人类姿态估计(3D HPE)任务使用 2D 图像或视频预测人体关节坐标在 3D 空间中。尽管在基于深度学习的方法中最近取得了进展,但它们通常忽视了将可访问的文本和人类自然可行的知识相结合的能力,从而错过了对引导 3D HPE 任务的宝贵隐式监督。此外,之前的努力通常从整个人体角度研究这个问题,忽视了隐藏在身体不同部位的微小指导。因此,我们提出了一个基于扩散模型的细粒度提示驱动去噪器,名为 FinePOSE。它包括三个核心模块: (1)细粒度部分感知提示学习(FPP)模块通过将可访问的文本和人体部位的自然可行知识与可学习提示相结合来构建细粒度部分感知提示。 (2)细粒度提示-姿态通信(FPC)模块建立可学习部分感知提示和学习到的姿态之间的细粒度通信,以提高去噪质量。 (3)提示驱动时间样式化(PTS)模块将学习到的提示嵌入和与噪声水平相关的时序信息集成在一起,以实现在每个去噪步骤的自适应调整。在公开的人体姿态估计数据集上进行广泛的实验证明,FinePOSE 超越了最先进的方法。我们进一步将 FinePOSE 扩展到多人类姿态估计。在实现 egoHumans 数据集上的平均 MPJPE 值为 34.3mm 证明了 FinePOSE 在处理复杂的人体场景方面具有潜力。代码可在此处访问:<https:// this URL>
https://arxiv.org/abs/2405.05216
Refining 3D LiDAR data has attracted growing interest, motivated by recent techniques such as supervised learning or generative model-based methods. Existing approaches have shown the possibilities for using diffusion models to generate refined LiDAR data with high fidelity, although the performance and speed of such methods have been limited. These limitations make real-time execution difficult, causing the approaches to struggle in real-world tasks such as autonomous navigation and human-robot interaction. In this work, we introduce a novel approach based on conditional diffusion models for fast and high-quality sparse-to-dense upsampling of 3D scene point clouds through an image representation. Our method employs denoising diffusion probabilistic models trained with conditional inpainting masks, which have been shown to give high performance on image completion tasks. We introduce a series of experiments, including multiple datasets, sampling steps, and conditional masks, to determine the ideal configuration, striking a balance between performance and inference speed. This paper illustrates that our method outperforms the baselines in sampling speed and quality on upsampling tasks using the KITTI-360 dataset. Furthermore, we illustrate the generalization ability of our approach by simultaneously training on real-world and synthetic datasets, introducing variance in quality and environments.
Refining 3D LiDAR data has attracted growing interest, motivated by recent techniques such as supervised learning and generative model-based methods. Existing approaches have shown that diffusion models can generate refined LiDAR data with high fidelity, although the performance and speed of such methods have been limited. These limitations make real-time execution difficult, so these approaches struggle in real-world tasks such as autonomous navigation and human-robot interaction. In this work, we introduce a novel approach based on conditional diffusion models for fast, high-quality sparse-to-dense upsampling of 3D scene point clouds through an image representation. Our method employs denoising diffusion probabilistic models trained with conditional inpainting masks, which have been shown to perform well on image completion tasks. We present a series of experiments, covering multiple datasets, sampling steps, and conditional masks, to determine the ideal configuration that balances performance and inference speed. The paper shows that our method outperforms the baselines in sampling speed and quality on upsampling tasks using the KITTI-360 dataset. Furthermore, we illustrate the generalization ability of our approach by training simultaneously on real-world and synthetic datasets, introducing variation in quality and environments.
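A rough sketch of conditional inpainting during reverse diffusion (in the spirit of RePaint-style sampling, which may differ from the exact masking used here): after each reverse step on the range image, the known sparse measurements are re-imposed at the appropriate noise level. `denoise_step` and `alpha_bar` are assumed helpers.

```python
import torch

def masked_reverse_step(denoise_step, x_t, t, known, mask, alpha_bar):
    """One conditional-inpainting reverse step: take an ordinary reverse
    diffusion step on the full range image, then overwrite the known (sparse)
    pixels with a forward-noised copy of the measurements so that sampling
    stays consistent with the condition."""
    x_prev = denoise_step(x_t, t)                       # unconditional reverse step
    if t > 0:
        a = alpha_bar[t - 1]
        known_noisy = a.sqrt() * known + (1 - a).sqrt() * torch.randn_like(known)
    else:
        known_noisy = known
    return mask * known_noisy + (1.0 - mask) * x_prev
```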
https://arxiv.org/abs/2405.04889
Understanding how humans would behave during hand-object interaction is vital for applications in service robot manipulation and extended reality. To achieve this, some recent works have been proposed to simultaneously predict hand trajectories and object affordances on human egocentric videos. They are regarded as the representation of future hand-object interactions, indicating potential human motion and motivation. However, the existing approaches mostly adopt the autoregressive paradigm for unidirectional prediction, which lacks mutual constraints within the holistic future sequence, and accumulates errors along the time axis. Meanwhile, these works basically overlook the effect of camera egomotion on first-person view predictions. To address these limitations, we propose a novel diffusion-based interaction prediction method, namely Diff-IP2D, to forecast future hand trajectories and object affordances concurrently in an iterative non-autoregressive manner. We transform the sequential 2D images into latent feature space and design a denoising diffusion model to predict future latent interaction features conditioned on past ones. Motion features are further integrated into the conditional denoising process to enable Diff-IP2D aware of the camera wearer's dynamics for more accurate interaction prediction. The experimental results show that our method significantly outperforms the state-of-the-art baselines on both the off-the-shelf metrics and our proposed new evaluation protocol. This highlights the efficacy of leveraging a generative paradigm for 2D hand-object interaction prediction. The code of Diff-IP2D will be released at this https URL.
Understanding how humans behave during hand-object interaction is vital for applications in service robot manipulation and extended reality. To achieve this, some recent works simultaneously predict hand trajectories and object affordances on human egocentric videos; these are regarded as representations of future hand-object interactions, indicating potential human motion and motivation. However, existing approaches mostly adopt an autoregressive paradigm for unidirectional prediction, which lacks mutual constraints within the holistic future sequence and accumulates errors along the time axis. Meanwhile, these works largely overlook the effect of camera egomotion on first-person-view predictions. To address these limitations, we propose Diff-IP2D, a novel diffusion-based interaction prediction method that forecasts future hand trajectories and object affordances concurrently in an iterative, non-autoregressive manner. We transform the sequential 2D images into a latent feature space and design a denoising diffusion model that predicts future latent interaction features conditioned on past ones. Motion features are further integrated into the conditional denoising process so that Diff-IP2D is aware of the camera wearer's dynamics, yielding more accurate interaction prediction. Experimental results show that our method significantly outperforms state-of-the-art baselines on both off-the-shelf metrics and our proposed new evaluation protocol. This highlights the efficacy of leveraging a generative paradigm for 2D hand-object interaction prediction. The code of Diff-IP2D will be released at this https URL.
https://arxiv.org/abs/2405.04370
Continuous Conditional Generative Modeling (CCGM) aims to estimate the distribution of high-dimensional data, typically images, conditioned on scalar continuous variables known as regression labels. While Continuous conditional Generative Adversarial Networks (CcGANs) were initially designed for this task, their adversarial training mechanism remains vulnerable to extremely sparse or imbalanced data, resulting in suboptimal outcomes. To enhance the quality of generated images, a promising alternative is to replace CcGANs with Conditional Diffusion Models (CDMs), renowned for their stable training process and ability to produce more realistic images. However, existing CDMs encounter challenges when applied to CCGM tasks due to several limitations such as inadequate U-Net architectures and deficient model fitting mechanisms for handling regression labels. In this paper, we introduce Continuous Conditional Diffusion Models (CCDMs), the first CDM designed specifically for the CCGM task. CCDMs address the limitations of existing CDMs by introducing specially designed conditional diffusion processes, a modified denoising U-Net with a custom-made conditioning mechanism, a novel hard vicinal loss for model fitting, and an efficient conditional sampling procedure. With comprehensive experiments on four datasets with varying resolutions ranging from 64x64 to 192x192, we demonstrate the superiority of the proposed CCDM over state-of-the-art CCGM models, establishing new benchmarks in CCGM. Extensive ablation studies validate the model design and implementation configuration of the proposed CCDM. Our code is publicly available at this https URL.
Continuous conditional generative modeling (CCGM) aims to estimate the distribution of high-dimensional data, typically images, conditioned on scalar continuous variables known as regression labels. Although continuous conditional generative adversarial networks (CcGANs) were initially designed for this task, their adversarial training mechanism remains vulnerable to extremely sparse or imbalanced data, resulting in suboptimal outcomes. To enhance the quality of generated images, a promising alternative is to replace CcGANs with conditional diffusion models (CDMs), renowned for their stable training process and ability to produce more realistic images. However, existing CDMs encounter challenges when applied to CCGM tasks due to several limitations, such as inadequate U-Net architectures and deficient model-fitting mechanisms for handling regression labels. In this paper, we introduce continuous conditional diffusion models (CCDMs), the first CDM designed specifically for the CCGM task. CCDMs address the limitations of existing CDMs by introducing specially designed conditional diffusion processes, a modified denoising U-Net with a custom conditioning mechanism, a novel hard vicinal loss for model fitting, and an efficient conditional sampling procedure. With comprehensive experiments on four datasets with resolutions ranging from 64x64 to 192x192, we demonstrate the superiority of the proposed CCDM over state-of-the-art CCGM models, establishing new benchmarks for CCGM. Extensive ablation studies validate the model design and implementation configuration of the proposed CCDM. Our code is publicly available at this https URL.
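A minimal sketch of the hard-vicinity idea behind the hard vicinal loss is shown below: when fitting the model at a given regression label, only samples whose labels fall within a small vicinity are used. The actual CCDM loss additionally reweights the denoising objective; the threshold here is illustrative.

```python
import torch

def hard_vicinal_batch(images, labels, y_target, kappa=0.02):
    """Hard vicinity selection: keep only samples whose regression labels lie
    within +/- kappa of the target label y_target."""
    in_vicinity = (labels - y_target).abs() <= kappa
    return images[in_vicinity], labels[in_vicinity]

imgs = torch.randn(512, 3, 64, 64)
labs = torch.rand(512)                 # normalized regression labels in [0, 1]
batch_imgs, batch_labs = hard_vicinal_batch(imgs, labs, y_target=0.5)
```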
https://arxiv.org/abs/2405.03546
Geospatial data has been transformative for the monitoring of the Earth, yet, as in the case of (geo)physical monitoring, the measurements can have variable spatial and temporal sampling and may be associated with a significant level of perturbations degrading the signal quality. Denoising geospatial data is, therefore, essential, yet often challenging because the observations may comprise noise coming from different origins, including both environmental signals and instrumental artifacts, which are spatially and temporally correlated, thus hard to disentangle. This study addresses the denoising of multivariate time series acquired by irregularly distributed networks of sensors, requiring specific methods to handle the spatiotemporal correlation of the noise and the signal of interest. Specifically, our method focuses on the denoising of geodetic position time series, used to monitor ground displacement worldwide with centimeter-to-millimeter precision. Among the signals affecting GNSS data, slow slip events (SSEs) are of interest to seismologists. These are transients of deformation that emerge only weakly compared to other signals. Here, we design SSEdenoiser, a multi-station spatiotemporal graph-based attentive denoiser that learns latent characteristics of GNSS noise to reveal SSE-related displacement with sub-millimeter precision. It is based on the key combination of graph recurrent networks and spatiotemporal Transformers. The proposed method is applied to the Cascadia subduction zone, where SSEs occur along with bursts of tectonic tremors, a seismic rumbling identified from independent seismic recordings. The extracted events match the spatiotemporal evolution of tremors. This good space-time correlation of the denoised GNSS signals with the tremors validates the proposed denoising procedure.
Geospatial data have been transformative for monitoring the Earth; yet, as in (geo)physical monitoring, the measurements can have variable spatial and temporal sampling and may be affected by significant perturbations that degrade signal quality. Denoising geospatial data is therefore essential, yet often challenging, because the observations may contain noise from different origins, including both environmental signals and instrumental artifacts, which are spatially and temporally correlated and thus hard to disentangle. This study addresses the denoising of multivariate time series acquired by irregularly distributed networks of sensors, which requires specific methods to handle the spatiotemporal correlation of the noise and of the signal of interest. Specifically, our method focuses on denoising geodetic position time series, which are used to monitor ground displacement worldwide with centimeter-to-millimeter precision. Among the signals affecting GNSS data, slow slip events (SSEs) are of interest to seismologists; these are transients of deformation that emerge only weakly compared with other signals. Here, we design SSEdenoiser, a multi-station spatiotemporal graph-based attentive denoiser that learns the latent characteristics of GNSS noise to reveal SSE-related displacement with sub-millimeter precision, built on the key combination of graph recurrent networks and spatiotemporal Transformers. The proposed method is applied to the Cascadia subduction zone, where SSEs occur along with bursts of tectonic tremor, a seismic rumbling identified from independent seismic recordings. The extracted events match the spatiotemporal evolution of the tremor, and this good space-time correlation of the denoised GNSS signals with the tremor validates the proposed denoising procedure.
https://arxiv.org/abs/2405.03320
Score matching with Langevin dynamics (SMLD) method has been successfully applied to accelerated MRI. However, the hyperparameters in the sampling process require subtle tuning, otherwise the results can be severely corrupted by hallucination artifacts, particularly with out-of-distribution test data. In this study, we propose a novel workflow in which SMLD results are regarded as additional priors to guide model-driven network training. First, we adopted a pretrained score network to obtain samples as preliminary guidance images (PGI) without the need for network retraining, parameter tuning and in-distribution test data. Although PGIs are corrupted by hallucination artifacts, we believe that they can provide extra information through effective denoising steps to facilitate reconstruction. Therefore, we designed a denoising module (DM) in the second step to improve the quality of PGIs. The features are extracted from the components of Langevin dynamics and the same score network with fine-tuning; hence, we can directly learn the artifact patterns. Third, we designed a model-driven network whose training is guided by denoised PGIs (DGIs). DGIs are densely connected with intermediate reconstructions in each cascade to enrich the features and are periodically updated to provide more accurate guidance. Our experiments on different sequences revealed that despite the low average quality of PGIs, the proposed workflow can effectively extract valuable information to guide the network training, even with severely reduced training data and sampling steps. Our method outperforms other cutting-edge techniques by effectively mitigating hallucination artifacts, yielding robust and high-quality reconstruction results.
Score matching with Langevin dynamics (SMLD) has been successfully applied to accelerated MRI. However, the hyperparameters of the sampling process require careful tuning; otherwise the results can be severely corrupted by hallucination artifacts, particularly with out-of-distribution test data. In this study, we propose a novel workflow in which SMLD results are regarded as additional priors to guide model-driven network training. First, we adopt a pretrained score network to obtain samples as preliminary guidance images (PGIs), without the need for network retraining, parameter tuning, or in-distribution test data. Although PGIs are corrupted by hallucination artifacts, we believe they can provide extra information through effective denoising steps to facilitate reconstruction. We therefore design a denoising module (DM) in the second step to improve the quality of the PGIs; its features are extracted from the components of the Langevin dynamics and from the same score network with fine-tuning, so the artifact patterns can be learned directly. Third, we design a model-driven network whose training is guided by the denoised PGIs (DGIs). DGIs are densely connected with the intermediate reconstructions in each cascade to enrich the features and are periodically updated to provide more accurate guidance. Our experiments on different sequences show that, despite the low average quality of the PGIs, the proposed workflow effectively extracts valuable information to guide network training, even with severely reduced training data and sampling steps. By effectively mitigating hallucination artifacts, our method outperforms other cutting-edge techniques and yields robust, high-quality reconstruction results.
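For context, annealed Langevin sampling with a pretrained score network (the SMLD procedure that produces the PGIs) can be sketched as follows; in the accelerated-MRI setting a data-consistency step with the undersampled k-space measurements is typically interleaved, which is omitted here, and `score_net` is an assumed interface.

```python
import torch

def annealed_langevin(score_net, shape, sigmas, steps_per_level=20, eps=2e-5):
    """Annealed Langevin dynamics: for each noise level sigma (largest to
    smallest), run Langevin updates driven by the learned score s(x, sigma)."""
    x = torch.rand(shape)                       # start from a simple prior
    for sigma in sigmas:
        step = eps * (sigma / sigmas[-1]) ** 2  # standard SMLD step scaling
        for _ in range(steps_per_level):
            noise = torch.randn_like(x)
            x = x + 0.5 * step * score_net(x, sigma) + step ** 0.5 * noise
    return x
```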
https://arxiv.org/abs/2405.02958
Multi-agent systems (MAS) need to adaptively cope with dynamic environments, changing agent populations, and diverse tasks. However, most of the multi-agent systems cannot easily handle them, due to the complexity of the state and task space. The social impact theory regards the complex influencing factors as forces acting on an agent, emanating from the environment, other agents, and the agent's intrinsic motivation, referring to the social force. Inspired by this concept, we propose a novel gradient-based state representation for multi-agent reinforcement learning. To non-trivially model the social forces, we further introduce a data-driven method, where we employ denoising score matching to learn the social gradient fields (SocialGFs) from offline samples, e.g., the attractive or repulsive outcomes of each force. During interactions, the agents take actions based on the multi-dimensional gradients to maximize their own rewards. In practice, we integrate SocialGFs into the widely used multi-agent reinforcement learning algorithms, e.g., MAPPO. The empirical results reveal that SocialGFs offer four advantages for multi-agent systems: 1) they can be learned without requiring online interaction, 2) they demonstrate transferability across diverse tasks, 3) they facilitate credit assignment in challenging reward settings, and 4) they are scalable with the increasing number of agents.
Multi-agent systems (MAS) need to cope adaptively with dynamic environments, changing agent populations, and diverse tasks. However, most multi-agent systems cannot handle them easily due to the complexity of the state and task spaces. Social impact theory regards the complex influencing factors as forces acting on an agent, emanating from the environment, from other agents, and from the agent's intrinsic motivation, referred to as the social force. Inspired by this concept, we propose a novel gradient-based state representation for multi-agent reinforcement learning. To model the social forces non-trivially, we further introduce a data-driven method in which denoising score matching is used to learn social gradient fields (SocialGFs) from offline samples, e.g., the attractive or repulsive outcomes of each force. During interaction, agents take actions based on the multi-dimensional gradients to maximize their own rewards. In practice, we integrate SocialGFs into widely used multi-agent reinforcement learning algorithms such as MAPPO. The empirical results show that SocialGFs offer four advantages for multi-agent systems: 1) they can be learned without online interaction, 2) they transfer across diverse tasks, 3) they facilitate credit assignment in challenging reward settings, and 4) they scale with an increasing number of agents.
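A minimal sketch of the denoising score matching objective used to learn such gradient fields from offline samples; `score_net` is a hypothetical network mapping a (noisy) state to a gradient vector, and the single fixed noise level is a simplification.

```python
import torch

def denoising_score_matching_loss(score_net, x, sigma=0.1):
    """Denoising score matching: perturb offline samples with Gaussian noise
    and regress the score of the perturbation kernel, whose target at x_noisy
    is -(x_noisy - x) / sigma^2. The learned field plays the role of a social
    gradient field (SocialGF) guiding agents at execution time."""
    noise = sigma * torch.randn_like(x)
    x_noisy = x + noise
    target = -noise / (sigma ** 2)              # score of N(x, sigma^2 I) at x_noisy
    pred = score_net(x_noisy)
    return ((pred - target) ** 2).sum(dim=-1).mean()
```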
https://arxiv.org/abs/2405.01839
Expressive voice conversion (VC) conducts speaker identity conversion for emotional speakers by jointly converting speaker identity and emotional style. Emotional style modeling for arbitrary speakers in expressive VC has not been extensively explored. Previous approaches have relied on vocoders for speech reconstruction, which makes speech quality heavily dependent on the performance of vocoders. A major challenge of expressive VC lies in emotion prosody modeling. To address these challenges, this paper proposes a fully end-to-end expressive VC framework based on a conditional denoising diffusion probabilistic model (DDPM). We utilize speech units derived from self-supervised speech models as content conditioning, along with deep features extracted from speech emotion recognition and speaker verification systems to model emotional style and speaker identity. Objective and subjective evaluations show the effectiveness of our framework. Codes and samples are publicly available.
Expressive voice conversion (VC) performs speaker identity conversion for emotional speakers by jointly converting speaker identity and emotional style. Emotional style modeling for arbitrary speakers in expressive VC has not been extensively explored. Previous approaches have relied on vocoders for speech reconstruction, which makes speech quality heavily dependent on vocoder performance; a major challenge of expressive VC lies in modeling emotion prosody. To address these challenges, this paper proposes a fully end-to-end expressive VC framework based on a conditional denoising diffusion probabilistic model (DDPM). We use speech units derived from self-supervised speech models as content conditioning, together with deep features extracted from speech emotion recognition and speaker verification systems, to model emotional style and speaker identity. Objective and subjective evaluations show the effectiveness of our framework. Code and samples are publicly available.
https://arxiv.org/abs/2405.01730
Denoising hyperspectral images (HSIs) is a crucial preprocessing procedure due to the noise originating from intra-imaging mechanisms and environmental factors. Utilizing domain-specific knowledge of HSIs, such as spectral correlation, spatial self-similarity, and spatial-spectral correlation, is essential for deep learning-based denoising. Existing methods are often constrained by running time, space complexity, and computational complexity, employing strategies that explore these priors separately. While the strategies can avoid some redundant information, considering that hyperspectral images are 3-D images with strong spatial continuity and spectral correlation, this kind of strategy inevitably overlooks subtle long-range spatial-spectral information that positively impacts image restoration. This paper proposes a Spatial-Spectral Selective State Space Model-based U-shaped network, termed Spatial-Spectral U-Mamba (SSUMamba), for hyperspectral image denoising. We can obtain complete global spatial-spectral correlation within a module thanks to the linear space complexity in State Space Model (SSM) computations. We introduce a Spatial-Spectral Alternating Scan (SSAS) strategy for HSI data, which helps model the information flow in multiple directions in 3-D HSIs. Experimental results demonstrate that our method outperforms several compared methods. The source code will be available at this https URL.
Denoising hyperspectral images (HSIs) is a crucial preprocessing step because of noise originating from intra-imaging mechanisms and environmental factors. Exploiting domain-specific knowledge of HSIs, such as spectral correlation, spatial self-similarity, and spatial-spectral correlation, is essential for deep learning-based denoising. Existing methods are often constrained by running time, space complexity, and computational complexity, and adopt strategies that explore these priors separately. While such strategies can avoid some redundant information, hyperspectral images are 3-D data with strong spatial continuity and spectral correlation, so these strategies inevitably overlook subtle long-range spatial-spectral information that benefits image restoration. This paper proposes a Spatial-Spectral Selective State Space Model-based U-shaped network, termed Spatial-Spectral U-Mamba (SSUMamba), for hyperspectral image denoising. Thanks to the linear space complexity of State Space Model (SSM) computations, complete global spatial-spectral correlation can be captured within a single module. We introduce a Spatial-Spectral Alternating Scan (SSAS) strategy for HSI data, which helps model the flow of information in multiple directions within 3-D HSIs. Experimental results demonstrate that our method outperforms several compared methods. The source code will be available at this https URL.
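The alternating-scan mechanism can be illustrated by flattening the 3-D cube along several axis orders and their reversals so that a state space model sees information flowing in multiple directions; the exact set of directions used by SSAS may differ from this sketch.

```python
import torch

def alternating_scans(hsi):
    """Flatten a 3-D hyperspectral cube (B, bands, H, W) into 1-D token
    sequences along several axis orders plus their reversed copies."""
    b = hsi.shape[0]
    scans = [
        hsi.permute(0, 2, 3, 1).reshape(b, -1),   # H -> W -> bands
        hsi.permute(0, 3, 2, 1).reshape(b, -1),   # W -> H -> bands
        hsi.reshape(b, -1),                        # bands -> H -> W
    ]
    # reversed copies give backward scans along the same orders
    return scans + [s.flip(-1) for s in scans]

cube = torch.randn(1, 31, 64, 64)   # toy HSI: 31 bands, 64x64 pixels
sequences = alternating_scans(cube)
```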
https://arxiv.org/abs/2405.01726
Large-scale Text-to-Image (T2I) diffusion models demonstrate significant generation capabilities based on textual prompts. Based on the T2I diffusion models, text-guided image editing research aims to empower users to manipulate generated images by altering the text prompts. However, existing image editing techniques are prone to editing unintended regions beyond the intended target area, primarily due to inaccuracies in cross-attention maps. To address this problem, we propose Localization-aware Inversion (LocInv), which exploits segmentation maps or bounding boxes as extra localization priors to refine the cross-attention maps in the denoising phases of the diffusion process. Through the dynamic updating of tokens corresponding to noun words in the textual input, we are compelling the cross-attention maps to closely align with the correct noun and adjective words in the text prompt. Based on this technique, we achieve fine-grained image editing over particular objects while preventing undesired changes to other regions. Our method LocInv, based on the publicly available Stable Diffusion, is extensively evaluated on a subset of the COCO dataset, and consistently obtains superior results both quantitatively and qualitatively. The code will be released at this https URL
Large-scale text-to-image (T2I) diffusion models demonstrate significant generation capabilities from textual prompts. Building on T2I diffusion models, text-guided image editing research aims to let users manipulate generated images by altering the text prompts. However, existing image editing techniques tend to edit unintended regions beyond the intended target area, primarily due to inaccuracies in the cross-attention maps. To address this problem, we propose Localization-aware Inversion (LocInv), which exploits segmentation maps or bounding boxes as extra localization priors to refine the cross-attention maps during the denoising phases of the diffusion process. By dynamically updating the tokens corresponding to noun words in the textual input, we compel the cross-attention maps to align closely with the correct noun and adjective words in the text prompt. Based on this technique, we achieve fine-grained image editing of particular objects while preventing undesired changes to other regions. Our method LocInv, built on the publicly available Stable Diffusion, is extensively evaluated on a subset of the COCO dataset and consistently obtains superior results both quantitatively and qualitatively. The code will be released at this https URL.
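As a toy illustration of what refining a cross-attention map with a localization prior means, the sketch below blends a noun token's attention map with a segmentation mask and renormalizes; LocInv itself obtains this effect by optimizing token representations during inversion rather than by direct blending.

```python
import torch

def refine_cross_attention(attn, seg_mask, strength=0.5):
    """Push the cross-attention map of a noun token toward a localization
    prior (segmentation mask or filled bounding box) by blending and
    renormalizing over spatial positions."""
    # attn: (heads, H*W) attention over spatial positions for one text token
    # seg_mask: (H*W,) binary localization prior for the corresponding object
    prior = seg_mask / (seg_mask.sum() + 1e-8)
    refined = (1.0 - strength) * attn + strength * prior.unsqueeze(0)
    return refined / refined.sum(dim=-1, keepdim=True)

attn = torch.softmax(torch.randn(8, 32 * 32), dim=-1)
mask = (torch.rand(32 * 32) > 0.7).float()
attn_refined = refine_cross_attention(attn, mask)
```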
https://arxiv.org/abs/2405.01496