We present Hunyuan-DiT, a text-to-image diffusion transformer with fine-grained understanding of both English and Chinese. To construct Hunyuan-DiT, we carefully design the transformer structure, text encoder, and positional encoding. We also build from scratch a whole data pipeline to update and evaluate data for iterative model optimization. For fine-grained language understanding, we train a Multimodal Large Language Model to refine the captions of the images. Finally, Hunyuan-DiT can perform multi-turn multimodal dialogue with users, generating and refining images according to the context. Through our holistic human evaluation protocol with more than 50 professional human evaluators, Hunyuan-DiT sets a new state-of-the-art in Chinese-to-image generation compared with other open-source models. Code and pretrained models are publicly available at this http URL
https://arxiv.org/abs/2405.08748
Multi-objective Bayesian optimization (MOBO) has shown promising performance on various expensive multi-objective optimization problems (EMOPs). However, effectively modeling complex distributions of the Pareto optimal solutions is difficult with limited function evaluations. Existing Pareto set learning algorithms may exhibit considerable instability in such expensive scenarios, leading to significant deviations between the obtained solution set and the Pareto set (PS). In this paper, we propose a novel Composite Diffusion Model based Pareto Set Learning algorithm, namely CDM-PSL, for expensive MOBO. CDM-PSL includes both unconditional and conditional diffusion models for generating high-quality samples. In addition, we introduce an information-entropy-based weighting method to balance the different objectives of EMOPs. This method is integrated with the guiding strategy, ensuring that all objectives are appropriately balanced and given due consideration during the optimization process. Extensive experimental results on both synthetic benchmarks and real-world problems demonstrate that our proposed algorithm attains superior performance compared with various state-of-the-art MOBO algorithms.
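As a rough illustration of the entropy-based balancing idea, the sketch below computes per-objective weights with the classical information-entropy weighting scheme; the exact formula in CDM-PSL may differ, and all names here are illustrative.

```python
import numpy as np

def entropy_weights(F):
    """Information-entropy weighting over M objectives (one plausible
    reading of CDM-PSL's balancing step, not the paper's exact formula).

    F: (N, M) array of objective values for N evaluated solutions.
    Returns one weight per objective, summing to 1.
    """
    # Normalize each objective column into a probability distribution.
    P = F - F.min(axis=0, keepdims=True)
    P = P / (P.sum(axis=0, keepdims=True) + 1e-12)
    # Shannon entropy of each objective across the population, in [0, 1].
    H = -(P * np.log(P + 1e-12)).sum(axis=0) / np.log(len(F))
    # Lower-entropy objectives (more concentrated values) carry more
    # discriminating information and receive larger weights.
    d = 1.0 - H
    return d / (d.sum() + 1e-12)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    F = rng.random((100, 3))      # toy population with 3 objectives
    print(entropy_weights(F))
```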
https://arxiv.org/abs/2405.08674
Existing text-to-image models struggle to follow complex text prompts, raising the need for extra grounding inputs for better controllability. In this work, we propose to decompose a scene into visual primitives - denoted as dense blob representations - that contain fine-grained details of the scene while being modular, human-interpretable, and easy-to-construct. Based on blob representations, we develop a blob-grounded text-to-image diffusion model, termed BlobGEN, for compositional generation. Particularly, we introduce a new masked cross-attention module to disentangle the fusion between blob representations and visual features. To leverage the compositionality of large language models (LLMs), we introduce a new in-context learning approach to generate blob representations from text prompts. Our extensive experiments show that BlobGEN achieves superior zero-shot generation quality and better layout-guided controllability on MS-COCO. When augmented by LLMs, our method exhibits superior numerical and spatial correctness on compositional image generation benchmarks. Project page: this https URL.
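The masked cross-attention idea can be sketched in a few lines of PyTorch: visual features attend only to the blob embeddings whose regions cover them. Shapes and names below are assumptions for illustration, not BlobGEN's actual module.

```python
import torch
import torch.nn.functional as F

def masked_cross_attention(x, blobs, region_mask):
    """Minimal sketch of blob-grounded masked cross-attention.

    x:           (B, HW, C) visual features (queries)
    blobs:       (B, K, C)  per-blob embeddings (keys/values)
    region_mask: (B, HW, K) bool, True where location i may attend to blob k
    """
    scale = x.shape[-1] ** -0.5
    attn = torch.einsum("bic,bkc->bik", x, blobs) * scale
    # Disentangle the fusion: each location only sees blobs covering it.
    attn = attn.masked_fill(~region_mask, float("-inf"))
    attn = F.softmax(attn, dim=-1)
    # Locations covered by no blob yield NaN rows after softmax; zero them.
    attn = torch.nan_to_num(attn)
    return x + torch.einsum("bik,bkc->bic", attn, blobs)
```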
https://arxiv.org/abs/2405.08246
We present Infinite Texture, a method for generating arbitrarily large texture images from a text prompt. Our approach fine-tunes a diffusion model on a single texture, and learns to embed that statistical distribution in the output domain of the model. We seed this fine-tuning process with a sample texture patch, which can be optionally generated from a text-to-image model like DALL-E 2. At generation time, our fine-tuned diffusion model is used through a score aggregation strategy to generate output texture images of arbitrary resolution on a single GPU. We compare synthesized textures from our method to existing work in patch-based and deep learning texture synthesis methods. We also showcase two applications of our generated textures in 3D rendering and texture transfer.
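Score aggregation of this kind is commonly implemented by denoising overlapping tiles and averaging the predicted noise in the overlaps. Below is a generic sketch under that assumption; the tile size, stride, and `eps_model` are placeholders rather than the paper's actual configuration.

```python
import torch

def aggregate_scores(eps_model, z, t, tile=64, stride=48):
    """Average per-tile noise predictions over a latent larger than the
    training resolution, so a model fine-tuned on small texture crops can
    denoise arbitrarily large canvases on one GPU.

    z: (B, C, H, W) noisy latent with H, W >= tile.
    """
    B, C, H, W = z.shape
    eps = torch.zeros_like(z)
    count = torch.zeros_like(z)
    # Cover the full canvas, including a final edge-aligned tile per axis.
    ys = sorted({*range(0, H - tile + 1, stride), H - tile})
    xs = sorted({*range(0, W - tile + 1, stride), W - tile})
    for y in ys:
        for x in xs:
            patch = z[:, :, y:y + tile, x:x + tile]
            eps[:, :, y:y + tile, x:x + tile] += eps_model(patch, t)
            count[:, :, y:y + tile, x:x + tile] += 1
    return eps / count  # overlaps are averaged, giving a seamless score
```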
https://arxiv.org/abs/2405.08210
Generating diverse and high-quality 3D assets automatically poses a fundamental yet challenging task in 3D computer vision. Despite extensive efforts in 3D generation, existing optimization-based approaches struggle to produce large-scale 3D assets efficiently. Meanwhile, feed-forward methods often focus on generating only a single category or a few categories, limiting their generalizability. Therefore, we introduce a diffusion-based feed-forward framework to address these challenges with a single model. To handle the large diversity and complexity in geometry and texture across categories efficiently, we 1) adopt improved triplane to guarantee efficiency; 2) introduce the 3D-aware transformer to aggregate the generalized 3D knowledge with specialized 3D features; and 3) devise the 3D-aware encoder/decoder to enhance the generalized 3D knowledge. Building upon our 3D-aware Diffusion model with TransFormer, DiffTF, we propose a stronger version for 3D generation, i.e., DiffTF++. It boils down to two parts: multi-view reconstruction loss and triplane refinement. Specifically, we utilize multi-view reconstruction loss to fine-tune the diffusion model and triplane decoder, thereby avoiding the negative influence caused by reconstruction errors and improving texture synthesis. By eliminating the mismatch between the two stages, the generative performance is enhanced, especially in texture. Additionally, a 3D-aware refinement process is introduced to filter out artifacts and refine triplanes, resulting in the generation of more intricate and reasonable details. Extensive experiments on ShapeNet and OmniObject3D convincingly demonstrate the effectiveness of our proposed modules and the state-of-the-art 3D object generation performance with large diversity, rich semantics, and high quality.
https://arxiv.org/abs/2405.08055
As humans, we aspire to create media content that is both freely willed and readily controlled. Thanks to the prominent development of generative techniques, we now can easily utilize 2D diffusion methods to synthesize images controlled by raw sketch or designated human poses, and even progressively edit/regenerate local regions with masked inpainting. However, similar workflows in 3D modeling tasks are still unavailable due to the lack of controllability and efficiency in 3D generation. In this paper, we present a novel controllable and interactive 3D assets modeling framework, named Coin3D. Coin3D allows users to control the 3D generation using a coarse geometry proxy assembled from basic shapes, and introduces an interactive generation workflow to support seamless local part editing while delivering responsive 3D object previewing within a few seconds. To this end, we develop several techniques, including the 3D adapter that applies volumetric coarse shape control to the diffusion model, proxy-bounded editing strategy for precise part editing, progressive volume cache to support responsive preview, and volume-SDS to ensure consistent mesh reconstruction. Extensive experiments of interactive generation and editing on diverse shape proxies demonstrate that our method achieves superior controllability and flexibility in the 3D assets generation task.
https://arxiv.org/abs/2405.08054
The proliferation of edge devices has brought Federated Learning (FL) to the forefront as a promising paradigm for decentralized and collaborative model training while preserving the privacy of clients' data. However, FL struggles with a significant performance reduction and poor convergence when confronted with Non-Independent and Identically Distributed (Non-IID) data distributions among participating clients. While previous efforts, such as client drift mitigation and advanced server-side model fusion techniques, have shown some success in addressing this challenge, they often overlook the root cause of the performance reduction - the absence of identical data accurately mirroring the global data distribution among clients. In this paper, we introduce Gen-FedSD, a novel approach that harnesses the powerful capability of state-of-the-art text-to-image foundation models to bridge the significant Non-IID performance gaps in FL. In Gen-FedSD, each client constructs textual prompts for each class label and leverages an off-the-shelf state-of-the-art pre-trained Stable Diffusion model to synthesize high-quality data samples. The generated synthetic data is tailored to each client's unique local data gaps and distribution disparities, effectively making the final augmented local data IID. Through extensive experimentation, we demonstrate that Gen-FedSD achieves state-of-the-art performance and significant communication cost savings across various datasets and Non-IID settings.
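A client-side synthesis step in this spirit can be sketched with the off-the-shelf diffusers library. The prompt template, checkpoint, and sample counts below are illustrative assumptions, not the paper's exact configuration.

```python
# pip install diffusers transformers torch
import torch
from diffusers import StableDiffusionPipeline

# Each client synthesizes samples for the class labels its local data
# lacks, then mixes them into local training to approach an IID split.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

missing_classes = ["truck", "frog", "ship"]  # under-represented locally
synthetic = {}
for label in missing_classes:
    prompt = f"a photo of a {label}"         # assumed prompt template
    out = pipe(prompt, num_images_per_prompt=8, num_inference_steps=30)
    synthetic[label] = out.images            # PIL images for augmentation
```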
https://arxiv.org/abs/2405.07925
Text-to-image generative models have become a prominent and powerful tool that excels at generating high-resolution realistic images. However, guiding the generative process of these models to consider detailed forms of conditioning reflecting style and/or structure information remains an open problem. In this paper, we present LoRAdapter, an approach that unifies both style and structure conditioning under the same formulation using a novel conditional LoRA block that enables zero-shot control. LoRAdapter is an efficient, powerful, and architecture-agnostic approach to conditioning text-to-image diffusion models; it enables fine-grained control over conditioning during generation and outperforms recent state-of-the-art approaches.
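One plausible reading of a conditional LoRA block is a frozen linear layer plus a low-rank update whose inner activations are gated by a condition embedding. The sketch below is an assumption-laden illustration, not LoRAdapter's verified design.

```python
import torch
import torch.nn as nn

class ConditionalLoRA(nn.Module):
    """Frozen base projection plus a condition-modulated low-rank update."""
    def __init__(self, base: nn.Linear, cond_dim: int, rank: int = 8):
        super().__init__()
        self.base = base                          # frozen pretrained layer
        self.base.requires_grad_(False)
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)            # start as the base layer
        self.to_gate = nn.Linear(cond_dim, rank)  # condition -> rank gains

    def forward(self, x, cond):
        # x: (B, T, in_features); cond: (B, cond_dim), e.g. a style embedding
        gate = self.to_gate(cond).unsqueeze(1)    # (B, 1, rank)
        h = self.down(x) * gate                   # modulate low-rank path
        return self.base(x) + self.up(h)
```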
https://arxiv.org/abs/2405.07913
Breast cancer is a significant cause of death from cancer in women globally, highlighting the need for improved diagnostic imaging to enhance patient outcomes. Accurate tumour identification is essential for diagnosis, treatment, and monitoring, emphasizing the importance of advanced imaging technologies that provide detailed views of tumour characteristics and disease. Synthetic correlated diffusion imaging (CDI$^s$) is a recent method that has shown promise for prostate cancer delineation compared to current MRI images. In this paper, we explore tuning the coefficients in the computation of CDI$^s$ for breast cancer tumour delineation by maximizing the area under the receiver operating characteristic curve (AUC) using a Nelder-Mead simplex optimization strategy. We show that the best AUC is achieved by the CDI$^s$ - Optimized modality, outperforming the best gold-standard modality by 0.0044. Notably, the optimized CDI$^s$ modality also achieves AUC values over 0.02 higher than the Unoptimized CDI$^s$ value, demonstrating the importance of optimizing the CDI$^s$ exponents for the specific cancer application.
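The optimization loop itself is straightforward to reproduce with SciPy's Nelder-Mead implementation. The sketch below uses synthetic signals, labels, and b-values, and a toy `cdi()` stand-in for the actual CDI$^s$ computation.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
b = np.array([50.0, 400.0, 800.0, 1000.0])       # assumed b-values
signals = rng.uniform(0.1, 1.0, (200, b.size))   # per-voxel DWI signals
labels = rng.integers(0, 2, 200)                 # 1 = tumour voxel

def cdi(coeffs, S):
    # Toy correlated-diffusion signal: channels raised to tunable
    # exponents, the quantity whose coefficients the paper optimizes.
    return np.prod(S ** coeffs, axis=1)

def neg_auc(coeffs):
    # Nelder-Mead minimizes, so negate the AUC to maximize it.
    return -roc_auc_score(labels, cdi(coeffs, signals))

res = minimize(neg_auc, x0=np.ones(b.size), method="Nelder-Mead")
print("best AUC:", -res.fun, "exponents:", res.x)
```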
https://arxiv.org/abs/2405.08049
Breast cancer was diagnosed in over 7.8 million women between 2015 and 2020. Grading plays a vital role in breast cancer treatment planning. However, the current tumor grading method involves extracting tissue from patients, leading to stress, discomfort, and high medical costs. A recent paper leveraging volumetric deep radiomic features from synthetic correlated diffusion imaging (CDI$^s$) for breast cancer grade prediction showed immense promise for noninvasive grading methods. Motivated by the impact of CDI$^s$ optimization for prostate cancer delineation, this paper examines using optimized CDI$^s$ to improve breast cancer grade prediction. We fuse the optimized CDI$^s$ signal with diffusion-weighted imaging (DWI) to create a multiparametric MRI for each patient. Using a larger patient cohort and training across all the layers of a pretrained MONAI model, we achieve a leave-one-out cross-validation accuracy of 95.79%, over 8% higher than that previously reported.
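The leave-one-out protocol behind the reported accuracy can be sketched as follows; the logistic-regression classifier and random features are stand-ins for the fine-tuned MONAI model and the real volumetric radiomic features.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((40, 16))        # one feature vector per patient (placeholder)
y = rng.integers(0, 2, 40)      # tumour grade label (placeholder)

correct = 0
for train_idx, test_idx in LeaveOneOut().split(X):
    # Refit on all patients but one, then test on the held-out patient.
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    correct += int(clf.predict(X[test_idx])[0] == y[test_idx][0])
print("LOOCV accuracy:", correct / len(X))
```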
https://arxiv.org/abs/2405.07861
In 2020, 685,000 deaths across the world were attributed to breast cancer, underscoring the critical need for innovative and effective breast cancer treatment. Neoadjuvant chemotherapy has recently gained popularity as a promising treatment strategy for breast cancer, attributed to its efficacy in shrinking large tumors and leading to pathologic complete response. However, the current process of recommending neoadjuvant chemotherapy relies on the subjective evaluation of medical experts, which contains inherent biases and significant uncertainty. A recent study, utilizing volumetric deep radiomic features extracted from synthetic correlated diffusion imaging (CDI$^s$), demonstrated significant potential in noninvasive breast cancer pathologic complete response prediction. Inspired by the positive outcomes of optimizing CDI$^s$ for prostate cancer delineation, this research investigates the application of optimized CDI$^s$ to enhance breast cancer pathologic complete response prediction. Using multiparametric MRI that fuses optimized CDI$^s$ with diffusion-weighted imaging (DWI), we obtain a leave-one-out cross-validation accuracy of 93.28%, over 5.5% higher than that previously reported.
https://arxiv.org/abs/2405.07854
In recent years, diffusion models (DMs) have become a popular method for generating synthetic data. By achieving samples of higher quality, they quickly surpassed generative adversarial networks (GANs) to become the current state-of-the-art method in generative modeling. However, their potential has not yet been exploited in radar, where the lack of available training data is a long-standing problem. In this work, a specific type of DM, namely the denoising diffusion probabilistic model (DDPM), is adapted to the SAR domain. We investigate the network choice and specific diffusion parameters for conditional and unconditional SAR image generation. In our experiments, we show that DDPM qualitatively and quantitatively outperforms state-of-the-art GAN-based methods for SAR image generation. Finally, we show that DDPM profits from pretraining on large-scale clutter data, generating SAR images of even higher quality.
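The DDPM objective being adapted here is the standard noise-prediction loss. Below is a minimal training-step sketch, assuming a linear beta schedule and a placeholder `eps_model`; the paper's network and schedule choices may differ.

```python
import torch
import torch.nn.functional as F

def ddpm_loss(eps_model, x0, T=1000):
    """One standard DDPM training step: add noise at a random timestep and
    train the network to predict that noise.

    x0: (B, C, H, W) clean images (here, SAR image patches).
    """
    betas = torch.linspace(1e-4, 0.02, T)            # assumed schedule
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)
    t = torch.randint(0, T, (x0.shape[0],))
    a = alpha_bar[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps       # forward noising
    return F.mse_loss(eps_model(x_t, t), eps)
```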
https://arxiv.org/abs/2405.07776
Expanding a dictionary of pre-selected keywords is crucial for tasks in information retrieval, such as database query and online data collection. Here we propose Local Graph-based Dictionary Expansion (LGDE), a method that uses tools from manifold learning and network science for the data-driven discovery of keywords starting from a seed dictionary. At the heart of LGDE lies the creation of a word similarity graph derived from word embeddings and the application of local community detection based on graph diffusion to discover semantic neighbourhoods of pre-defined seed keywords. The diffusion in the local graph manifold allows the exploration of the complex nonlinear geometry of word embeddings and can capture word similarities based on paths of semantic association. We validate our method on a corpus of hate speech-related posts from Reddit and Gab and show that LGDE enriches the list of keywords and achieves significantly better performance than threshold methods based on direct word similarities. We further demonstrate the potential of our method through a real-world use case from communication science, where LGDE is evaluated quantitatively on data collected and analysed by domain experts by expanding a conspiracy-related dictionary.
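The core pipeline (build a kNN similarity graph over word embeddings, diffuse from the seed keywords, and keep high-mass words) can be sketched as follows. The personalized-PageRank diffusion and all hyperparameters are illustrative choices, not LGDE's exact algorithm.

```python
import numpy as np

def expand_dictionary(emb, vocab, seeds, k=10, alpha=0.85, steps=30, top=20):
    """Diffuse from seed keywords on a kNN word similarity graph and return
    the words with the highest diffusion mass.

    emb: (V, D) unit-normalized word embeddings; vocab: list of V words;
    seeds: words assumed to be present in vocab.
    """
    S = emb @ emb.T                              # cosine similarity
    np.fill_diagonal(S, -np.inf)                 # no self-edges
    A = np.zeros_like(S)
    idx = np.argsort(-S, axis=1)[:, :k]          # k nearest neighbours
    rows = np.repeat(np.arange(len(vocab)), k)
    A[rows, idx.ravel()] = 1.0
    A = np.maximum(A, A.T)                       # symmetrize the graph
    P = A / A.sum(axis=1, keepdims=True)         # random-walk transition
    r = np.isin(vocab, seeds).astype(float)
    r /= r.sum()                                 # restart on the seed set
    x = r.copy()
    for _ in range(steps):                       # personalized PageRank
        x = alpha * (P.T @ x) + (1 - alpha) * r
    ranked = [vocab[i] for i in np.argsort(-x) if vocab[i] not in seeds]
    return ranked[:top]
```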
https://arxiv.org/abs/2405.07764
Singing Accompaniment Generation (SAG), which generates instrumental music to accompany input vocals, is crucial to developing human-AI symbiotic art creation systems. The state-of-the-art method, SingSong, utilizes a multi-stage autoregressive (AR) model for SAG; however, this method is extremely slow as it generates semantic and acoustic tokens recursively, making it unsuitable for real-time applications. In this paper, we aim to develop a Fast SAG method that can create high-quality and coherent accompaniments. A non-AR diffusion-based framework is developed which, by carefully designing the conditions inferred from the vocal signals, generates the Mel spectrogram of the target accompaniment directly. With diffusion and Mel spectrogram modeling, the proposed method significantly simplifies the AR token-based SingSong framework and largely accelerates generation. We also design semantic projection and prior projection blocks, as well as a set of loss functions, to ensure that the generated accompaniment has semantic and rhythmic coherence with the vocal signal. Through intensive experimental studies, we demonstrate that the proposed method generates better samples than SingSong and accelerates generation by at least 30 times. Audio samples and code are available at this https URL.
https://arxiv.org/abs/2405.07682
Existing Blind image Super-Resolution (BSR) methods focus on estimating either kernel or degradation information, but have long overlooked essential content details. In this paper, we propose a novel BSR approach, the Content-aware Degradation-driven Transformer (CDFormer), to capture both degradation and content representations. However, low-resolution images cannot provide enough content details, so we introduce a diffusion-based module $CDFormer_{diff}$ to first learn a Content Degradation Prior (CDP) from both low- and high-resolution images, and then approximate the real distribution given only low-resolution information. Moreover, we apply an adaptive SR network $CDFormer_{SR}$ that effectively utilizes the CDP to refine features. Compared to previous diffusion-based SR methods, we treat the diffusion model as an estimator, which overcomes the limitations of expensive sampling time and excessive diversity. Experiments show that CDFormer outperforms existing methods, establishing new state-of-the-art performance on various benchmarks under blind settings. Code and models will be available at this https URL.
https://arxiv.org/abs/2405.07648
Care-giving and assistive robotics, driven by advancements in AI, offer promising solutions to meet the growing demand for care, particularly in the context of increasing numbers of individuals requiring assistance. This creates a pressing need for efficient and safe assistive devices, particularly in light of heightened demand due to war-related injuries. While cost has been a barrier to accessibility, technological progress is able to democratize these solutions. Safety remains a paramount concern, especially given the intricate interactions between assistive robots and humans. This study explores the application of reinforcement learning (RL) and imitation learning in improving policy design for assistive robots. The proposed approach makes risky policies safer without additional environmental interactions. Through experimentation in simulated environments, the enhancement of conventional RL approaches in tasks related to assistive robotics is demonstrated.
https://arxiv.org/abs/2405.07603
Object detection techniques for Unmanned Aerial Vehicles (UAVs) rely on Deep Neural Networks (DNNs), which are vulnerable to adversarial attacks. Nonetheless, adversarial patches generated by existing algorithms in the UAV domain pay very little attention to the naturalness of adversarial patches. Moreover, imposing constraints directly on adversarial patches makes it difficult to generate patches that appear natural to the human eye while ensuring a high attack success rate. We notice that patches are natural-looking when their overall color is consistent with the environment. Therefore, we propose a new method named Environmental Matching Attack (EMA) to address the issue of optimizing the adversarial patch under color constraints. To the best of our knowledge, this paper is the first to consider natural patches in the domain of UAVs. The EMA method exploits the strong prior knowledge of a pretrained Stable Diffusion model to guide the optimization direction of the adversarial patch, where text guidance can restrict the color of the patch. To better match the environment, the contrast and brightness of the patch are appropriately adjusted. Instead of optimizing the adversarial patch itself, we optimize an adversarial perturbation patch initialized to zero, so that the model can better trade off attack performance and naturalness. Experiments conducted on the DroneVehicle and Carpk datasets show that our work reaches nearly the same attack performance in the digital attack (no greater than 2 in mAP$\%$), surpasses the baseline method in specific physical scenarios, and exhibits a significant advantage in naturalness, both in visualization and in color difference with the environment.
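The zero-initialized perturbation trick can be sketched as optimizing a residual on top of an environment-colored base patch; `detector_loss` and all hyperparameters below are placeholders rather than the paper's actual attack objective.

```python
import torch

def optimize_patch(detector_loss, env_patch, steps=500, lr=0.01):
    """Optimize a zero-initialized residual delta on top of a base patch
    whose colors already match the environment, so the result stays
    natural while attacking the detector.

    detector_loss: callable mapping a patch to the attack objective,
    e.g. the detector's object confidence on patched scenes (placeholder).
    """
    delta = torch.zeros_like(env_patch, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        patch = (env_patch + delta).clamp(0, 1)  # keep a valid image
        loss = detector_loss(patch)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (env_patch + delta.detach()).clamp(0, 1)
```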
https://arxiv.org/abs/2405.07595
Unveiling the real appearance of retouched faces to prevent malicious users from deceptive advertising and economic fraud has become an increasing concern in the era of the digital economy. This article makes the first attempt to investigate the face retouching reversal (FRR) problem. We first collect an FRR dataset, named deepFRR, which contains 50,000 StyleGAN-generated high-resolution (1024x1024) facial images and their corresponding retouched versions produced by a commercial online API. To the best of our knowledge, deepFRR is the first FRR dataset tailored for training deep FRR models. We then propose a novel diffusion-based FRR approach (FRRffusion) for the FRR task. Our FRRffusion consists of a coarse-to-fine two-stage network: a diffusion-based Facial Morpho-Architectonic Restorer (FMAR) generates the basic contours of low-resolution faces in the first stage, while a Transformer-based Hyperrealistic Facial Detail Generator (HFDG) creates high-resolution facial details in the second stage. Tested on deepFRR, our FRRffusion surpasses the GP-UNIT and Stable Diffusion methods by a large margin on four widespread quantitative metrics. In particular, in a qualitative evaluation with 85 subjects, the de-retouched images produced by FRRffusion are visually much closer to the raw face images than both the retouched face images and those restored by GP-UNIT and Stable Diffusion. These results validate the efficacy of our work, bridging the gap between FRR and generic image restoration tasks. The dataset and code are available at this https URL.
https://arxiv.org/abs/2405.07582
Many robotic systems, such as mobile manipulators or quadrotors, cannot be equipped with high-end GPUs due to space, weight, and power constraints. These constraints prevent these systems from leveraging recent developments in visuomotor policy architectures that require high-end GPUs to achieve fast policy inference. In this paper, we propose Consistency Policy, a faster and similarly powerful alternative to Diffusion Policy for learning visuomotor robot control. By virtue of its fast inference speed, Consistency Policy can enable low-latency decision making in resource-constrained robotic setups. A Consistency Policy is distilled from a pretrained Diffusion Policy by enforcing self-consistency along the Diffusion Policy's learned trajectories. We compare Consistency Policy with Diffusion Policy and other related speed-up methods across 6 simulation tasks as well as two real-world tasks where we demonstrate inference on a laptop GPU. For all these tasks, Consistency Policy speeds up inference by an order of magnitude compared to the fastest alternative method and maintains competitive success rates. We also show that the Consistency Policy training procedure is robust to the pretrained Diffusion Policy's quality, a useful result that helps practitioners avoid extensive testing of the pretrained model. Key design decisions that enabled this performance are the choice of consistency objective, reduced initial sample variance, and the choice of preset chaining steps. Code and training details will be released publicly.
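The distillation objective enforces self-consistency along the teacher's trajectories. Below is a minimal sketch, assuming a `teacher_step` that takes one ODE step along the pretrained Diffusion Policy's flow; all names and signatures are illustrative, and practical recipes typically use an EMA copy of the student as the target.

```python
import torch
import torch.nn.functional as F

def consistency_loss(student, teacher_step, a_t, t, t_next):
    """Self-consistency objective for distilling a Consistency Policy: the
    student's prediction from a noisy action at time t must match its own
    stopped-gradient prediction from the teacher's one-step denoised
    action at the adjacent time t_next.
    """
    a_next = teacher_step(a_t, t, t_next)   # one step along the pretrained
                                            # Diffusion Policy's trajectory
    pred_t = student(a_t, t)                # f_theta(a_t, t)
    with torch.no_grad():
        pred_next = student(a_next, t_next) # target, no gradient flow
    return F.mse_loss(pred_t, pred_next)
```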
https://arxiv.org/abs/2405.07503
The increasing prominence of e-commerce has underscored the importance of Virtual Try-On (VTON). However, previous studies predominantly focus on the 2D realm and rely heavily on extensive data for training. Research on 3D VTON primarily centers on garment-body shape compatibility, a topic extensively covered in 2D VTON. Thanks to advances in 3D scene editing, a 2D diffusion model has now been adapted for 3D editing via multi-viewpoint editing. In this work, we propose GaussianVTON, an innovative 3D VTON pipeline integrating Gaussian Splatting (GS) editing with 2D VTON. To facilitate a seamless transition from 2D to 3D VTON, we propose, for the first time, the use of only images as editing prompts for 3D editing. To further address issues, e.g., face blurring, garment inaccuracy, and degraded viewpoint quality during editing, we devise a three-stage refinement strategy to gradually mitigate potential issues. Furthermore, we introduce a new editing strategy termed Edit Recall Reconstruction (ERR) to tackle the limitations of previous editing strategies in leading to complex geometric changes. Our comprehensive experiments demonstrate the superiority of GaussianVTON, offering a novel perspective on 3D VTON while also establishing a novel starting point for image-prompting 3D scene editing.
https://arxiv.org/abs/2405.07472