In this paper, we delve into a new task known as small object editing (SOE), which focuses on text-based image inpainting within a constrained, small-sized area. Despite the remarkable success achieved by current image inpainting approaches, applying them to the SOE task generally results in failure cases such as Object Missing, Text-Image Mismatch, and Distortion. These failures stem from the limited presence of small-sized objects in training datasets and the downsampling operations employed by U-Net models, which hinder accurate generation. To overcome these challenges, we introduce a novel training-based approach, SOEDiff, aimed at enhancing the capability of baseline models like StableDiffusion in editing small-sized objects while minimizing training costs. Specifically, our method involves two key components: SO-LoRA, which efficiently fine-tunes low-rank matrices, and a Cross-Scale Score Distillation loss, which leverages high-resolution predictions from a pre-trained teacher diffusion model. Our method yields significant improvements on test datasets collected from MSCOCO and OpenImage, validating its effectiveness in small object editing. In particular, when comparing SOEDiff with the SD-I model on the OpenImage-f dataset, we observe a 0.99 improvement in CLIP-Score and a reduction of 2.87 in FID. Our project page can be found at this https URL.
https://arxiv.org/abs/2405.09114
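As a concrete illustration of the two components named in the abstract above, here is a minimal PyTorch-style sketch; the class and function names (SOLoRALinear, cross_scale_distill_loss), the rank and scaling hyper-parameters, and the exact way the teacher's high-resolution prediction is compared with the student's are all assumptions, not the released SOEDiff code.

```python
# Illustrative sketch only: layer/loss names, rank, and scaling are assumptions,
# not the released SOEDiff implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SOLoRALinear(nn.Module):
    """A frozen linear layer augmented with a trainable low-rank (LoRA-style) update."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)           # keep pre-trained weights frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)        # start as a zero perturbation of the base layer
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

def cross_scale_distill_loss(student_eps, teacher_eps_hires):
    """Match the student's noise prediction on the small edited region to the
    teacher's prediction computed on an upsampled (high-resolution) crop."""
    teacher_resized = F.interpolate(teacher_eps_hires, size=student_eps.shape[-2:],
                                    mode="bilinear", align_corners=False)
    return F.mse_loss(student_eps, teacher_resized.detach())
```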
As humans, we aspire to create media content that is both freely willed and readily controlled. Thanks to the prominent development of generative techniques, we now can easily utilize 2D diffusion methods to synthesize images controlled by raw sketch or designated human poses, and even progressively edit/regenerate local regions with masked inpainting. However, similar workflows in 3D modeling tasks are still unavailable due to the lack of controllability and efficiency in 3D generation. In this paper, we present a novel controllable and interactive 3D assets modeling framework, named Coin3D. Coin3D allows users to control the 3D generation using a coarse geometry proxy assembled from basic shapes, and introduces an interactive generation workflow to support seamless local part editing while delivering responsive 3D object previewing within a few seconds. To this end, we develop several techniques, including the 3D adapter that applies volumetric coarse shape control to the diffusion model, proxy-bounded editing strategy for precise part editing, progressive volume cache to support responsive preview, and volume-SDS to ensure consistent mesh reconstruction. Extensive experiments of interactive generation and editing on diverse shape proxies demonstrate that our method achieves superior controllability and flexibility in the 3D assets generation task.
https://arxiv.org/abs/2405.08054
In NeRF-aided editing tasks, object movement presents difficulties in supervision generation due to the introduction of variability in object positions. Moreover, the removal operations of certain scene objects often lead to empty regions, presenting challenges for NeRF models in inpainting them effectively. We propose an implicit ray transformation strategy, allowing for direct manipulation of the 3D object's pose by operating on the neural-point in NeRF rays. To address the challenge of inpainting potential empty regions, we present a plug-and-play inpainting module, dubbed differentiable neural-point resampling (DNR), which interpolates those regions in 3D space at the original ray locations within the implicit space, thereby facilitating object removal & scene inpainting tasks. Importantly, employing DNR effectively narrows the gap between ground truth and predicted implicit features, potentially increasing the mutual information (MI) of the features across rays. Then, we leverage DNR and ray transformation to construct a point-based editable NeRF pipeline PR^2T-NeRF. Results primarily evaluated on 3D object removal & inpainting tasks indicate that our pipeline achieves state-of-the-art performance. In addition, our pipeline supports high-quality rendering visualization for diverse editing operations without necessitating extra supervision.
https://arxiv.org/abs/2405.07306
Emerging unsupervised reconstruction techniques based on implicit neural representation (INR), such as NeRP, CoIL, and SCOPE, have shown unique capabilities in CT linear inverse imaging. In this work, we propose a novel unsupervised density neural representation (Diner) to tackle the challenging problem of CT metal artifacts when scanned objects contain metals. The drastic variation of linear attenuation coefficients (LACs) of metals over X-ray spectra leads to a nonlinear beam hardening effect (BHE) in CT measurements. Recovering CT images from metal-affected measurements therefore poses a complicated nonlinear inverse problem. Existing metal artifact reduction (MAR) techniques mostly formulate MAR as an image inpainting task, which ignores the energy-induced BHE and produces suboptimal performance. Instead, our Diner introduces an energy-dependent polychromatic CT forward model to the INR framework, addressing the nonlinear nature of the MAR problem. Specifically, we decompose the energy-dependent LACs into energy-independent densities and energy-dependent mass attenuation coefficients (MACs) by fully considering the physical model of X-ray absorption. Using the densities as pivot variables and the MACs as known prior knowledge, the LACs can be accurately reconstructed from the raw measurements. Technically, we represent the unknown density map as an implicit function of coordinates. Combined with a novel differentiable forward model simulating the physical acquisition from the densities to the measurements, our Diner optimizes a multi-layer perceptron network to approximate the implicit function by minimizing the prediction errors between the estimated and real measurements. Experimental results on simulated and real datasets confirm the superiority of our unsupervised Diner over popular supervised techniques in MAR performance and robustness.
https://arxiv.org/abs/2405.07047
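The nonlinear beam-hardening effect that Diner models can be made concrete with a small numerical sketch of a polychromatic forward projection; the single-material simplification, variable names, and toy numbers below are assumptions for illustration, not the paper's forward-model implementation.

```python
# Illustrative sketch of an energy-dependent (polychromatic) CT forward model,
# assuming a discretized spectrum and a single material's mass attenuation
# coefficients; variable names and the simplification are assumptions.
import numpy as np

def polychromatic_projection(density_line_integrals, spectrum, mac):
    """
    density_line_integrals : (n_rays,) line integrals of the density map along each ray
    spectrum               : (n_energies,) normalized X-ray spectrum, sums to 1
    mac                    : (n_energies,) mass attenuation coefficients of the material
    Returns the measured log-attenuation per ray, including beam hardening.
    """
    # Beer-Lambert attenuation per energy bin, then integrate over the spectrum.
    transmitted = np.exp(-np.outer(density_line_integrals, mac))  # (n_rays, n_energies)
    intensity = transmitted @ spectrum                            # (n_rays,)
    return -np.log(np.clip(intensity, 1e-12, None))

# Toy usage: with a single energy bin this reduces to the usual linear model;
# with several bins, harder rays are attenuated less, producing beam hardening.
rays = np.array([0.0, 1.0, 2.0])
spec = np.array([0.2, 0.5, 0.3])          # three energy bins
mac = np.array([0.35, 0.25, 0.20])        # attenuation decreases with energy
print(polychromatic_projection(rays, spec, mac))
```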
Text-driven 3D indoor scene generation holds broad applications, ranging from gaming and smart homes to AR/VR. Fast and high-fidelity scene generation is paramount for ensuring user-friendly experiences. However, existing methods are characterized by lengthy generation processes or necessitate the intricate manual specification of motion parameters, which introduces inconvenience for users. Furthermore, these methods often rely on narrow-field viewpoint iterative generations, compromising global consistency and overall scene quality. To address these issues, we propose FastScene, a framework for fast and higher-quality 3D scene generation that maintains scene consistency. Specifically, given a text prompt, we generate a panorama and estimate its depth, since the panorama encompasses information about the entire scene and exhibits explicit geometric constraints. To obtain high-quality novel views, we introduce the Coarse View Synthesis (CVS) and Progressive Novel View Inpainting (PNVI) strategies, ensuring both scene consistency and view quality. Subsequently, we utilize Multi-View Projection (MVP) to form perspective views, and apply 3D Gaussian Splatting (3DGS) for scene reconstruction. Comprehensive experiments demonstrate that FastScene surpasses other methods in both generation speed and quality with better scene consistency. Notably, guided only by a text prompt, FastScene can generate a 3D scene within a mere 15 minutes, which is at least one hour faster than state-of-the-art methods, making it a paradigm for user-friendly scene generation.
https://arxiv.org/abs/2405.05768
Point-based representations have recently gained popularity in novel view synthesis, for their unique advantages, e.g., intuitive geometric representation, simple manipulation, and faster convergence. However, based on our observations, these point-based neural re-rendering methods are only expected to perform well under ideal conditions and suffer from noisy, patchy points and unbounded scenes, which are challenging to handle but de facto common in real applications. To this end, we revisit one such influential method, known as Neural Point-based Graphics (NPBG), as our baseline, and propose Robust Point-based Graphics (RPBG). We analyze in depth the factors that prevent NPBG from achieving satisfactory renderings on generic datasets, and accordingly reform the pipeline to make it more robust to varying in-the-wild datasets. Inspired by the practices in image restoration, we greatly enhance the neural renderer to enable the attention-based correction of point visibility and the inpainting of incomplete rasterization, with only acceptable overheads. We also seek a simple and lightweight alternative for environment modeling and an iterative method to alleviate the problem of poor geometry. By thorough evaluation on a wide range of datasets with different shooting conditions and camera trajectories, RPBG stably outperforms the baseline by a large margin, and exhibits great robustness compared to state-of-the-art NeRF-based variants. Code available at this https URL.
https://arxiv.org/abs/2405.05663
Refining 3D LiDAR data has attracted growing interest, motivated by recent techniques such as supervised learning and generative model-based methods. Existing approaches have shown that diffusion models can generate refined LiDAR data with high fidelity, although the performance and speed of such methods have been limited. These limitations make real-time execution difficult, causing the approaches to struggle in real-world tasks such as autonomous navigation and human-robot interaction. In this work, we introduce a novel approach based on conditional diffusion models for fast and high-quality sparse-to-dense upsampling of 3D scene point clouds through an image representation. Our method employs denoising diffusion probabilistic models trained with conditional inpainting masks, which have been shown to give high performance on image completion tasks. We present a series of experiments, covering multiple datasets, sampling steps, and conditional masks, to determine the ideal configuration, striking a balance between performance and inference speed. We show that our method outperforms the baselines in sampling speed and quality on upsampling tasks using the KITTI-360 dataset. Furthermore, we illustrate the generalization ability of our approach by simultaneously training on real-world and synthetic datasets, introducing variance in quality and environments.
https://arxiv.org/abs/2405.04889
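For readers unfamiliar with conditional inpainting masks in diffusion sampling, the following is a generic, RePaint-style masked reverse step on a range image; the model signature, noise-schedule handling, and mask convention are assumptions and are not taken from the paper's pipeline.

```python
# Generic masked-conditioning DDPM reverse step for diffusion-based inpainting
# of a sparse range image; a sketch of the idea, not the authors' code.
import torch

@torch.no_grad()
def masked_ddpm_step(model, x_t, t, known, mask, alphas, alphas_cumprod):
    """
    x_t   : (B, 1, H, W) current noisy range image
    known : (B, 1, H, W) sparse input range image (valid where mask == 1)
    mask  : (B, 1, H, W) 1 for known (sparse) pixels, 0 for pixels to densify
    """
    a_t = alphas[t]
    abar_t = alphas_cumprod[t]
    abar_prev = alphas_cumprod[t - 1] if t > 0 else torch.ones_like(alphas_cumprod[0])

    eps = model(x_t, torch.full((x_t.shape[0],), t, device=x_t.device))
    # Standard DDPM posterior mean for the region being densified.
    mean = (x_t - (1 - a_t) / torch.sqrt(1 - abar_t) * eps) / torch.sqrt(a_t)
    noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    x_prev_unknown = mean + torch.sqrt(1 - a_t) * noise

    # Diffuse the known sparse measurements forward to timestep t-1 so the
    # conditioning stays consistent with the current noise level.
    x_prev_known = torch.sqrt(abar_prev) * known + torch.sqrt(1 - abar_prev) * torch.randn_like(known)

    return mask * x_prev_known + (1 - mask) * x_prev_unknown
```

The paper's study of sampling steps and conditional masks would then correspond, in a sketch like this, to varying the schedule length and how `mask` is constructed.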
We propose a pipeline that leverages Stable Diffusion to improve inpainting results in the context of defurnishing -- the removal of furniture items from indoor panorama images. Specifically, we illustrate how increased context, domain-specific model fine-tuning, and improved image blending can produce high-fidelity inpaints that are geometrically plausible without needing to rely on room layout estimation. We demonstrate qualitative and quantitative improvements over other furniture removal techniques.
https://arxiv.org/abs/2405.03682
Despite the emergence of successful NeRF inpainting methods built upon explicit RGB and depth 2D inpainting supervisions, these methods are inherently constrained by the capabilities of their underlying 2D inpainters. This is due to two key reasons: (i) independently inpainting constituent images results in view-inconsistent imagery, and (ii) 2D inpainters struggle to ensure high-quality geometry completion and alignment with inpainted RGB images. To overcome these limitations, we propose a novel approach called MVIP-NeRF that harnesses the potential of diffusion priors for NeRF inpainting, addressing both appearance and geometry aspects. MVIP-NeRF performs joint inpainting across multiple views to reach a consistent solution, which is achieved via an iterative optimization process based on Score Distillation Sampling (SDS). Apart from recovering the rendered RGB images, we also extract normal maps as a geometric representation and define a normal SDS loss that motivates accurate geometry inpainting and alignment with the appearance. Additionally, we formulate a multi-view SDS score function to distill generative priors simultaneously from different view images, ensuring consistent visual completion when dealing with large view variations. Our experimental results show better appearance and geometry recovery than previous NeRF inpainting methods.
https://arxiv.org/abs/2405.02859
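To make the SDS-based optimization concrete, here is a hedged sketch of a single score-distillation term applied to a differentiable rendering; the diffusion-prior call signature and the weighting w(t) are assumptions, not the paper's implementation.

```python
# Hedged sketch of a Score Distillation Sampling (SDS) style loss on a NeRF
# rendering; the `diffusion` wrapper and weighting are placeholders/assumptions.
import torch

def sds_loss(diffusion, rendering, text_emb, t, alphas_cumprod):
    """
    rendering : (B, C, H, W) differentiable NeRF output (an RGB image or a normal map)
    Returns a surrogate loss whose gradient w.r.t. `rendering` equals
    w(t) * (eps_pred - eps), the usual SDS gradient.
    """
    abar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(rendering)
    noisy = torch.sqrt(abar) * rendering + torch.sqrt(1 - abar) * eps
    with torch.no_grad():
        eps_pred = diffusion(noisy, t, text_emb)   # frozen pre-trained diffusion prior
    grad = (1 - abar) * (eps_pred - eps)           # w(t) = 1 - abar is one common choice
    # Surrogate: differentiating this scalar w.r.t. `rendering` reproduces `grad`.
    return (grad.detach() * rendering).sum()
```

In the multi-view, geometry-aware setting described in the abstract, one would accumulate such terms over renderings from several sampled views and add an analogous term on rendered normal maps.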
Video-based remote photoplethysmography (rPPG) has emerged as a promising technology for non-contact vital sign monitoring, especially under controlled conditions. However, the accurate measurement of vital signs in real-world scenarios faces several challenges, including artifacts induced by video codecs, low-light noise, degradation, low dynamic range, occlusions, and hardware and network constraints. In this article, we systematically and comprehensively investigate these issues, measuring their detrimental effects on the quality of rPPG measurements. Additionally, we propose practical strategies for mitigating these challenges to improve the dependability and resilience of video-based rPPG systems. We detail methods for effective biosignal recovery in the presence of network limitations and present denoising and inpainting techniques aimed at preserving video frame integrity. Through extensive evaluations and direct comparisons, we demonstrate the effectiveness of these approaches in enhancing rPPG measurements under challenging environments, contributing to the development of more reliable and effective remote vital sign monitoring technologies.
https://arxiv.org/abs/2405.01230
In Virtual Product Placement (VPP) applications, the discrete integration of specific brand products into images or videos has emerged as a challenging yet important task. This paper introduces a novel three-stage fully automated VPP system. In the first stage, a language-guided image segmentation model identifies optimal regions within images for product inpainting. In the second stage, Stable Diffusion (SD), fine-tuned with a few example product images, is used to inpaint the product into the previously identified candidate regions. The final stage introduces an "Alignment Module", which is designed to effectively sieve out low-quality images. Comprehensive experiments demonstrate that the Alignment Module ensures the presence of the intended product in every generated image and enhances the average quality of images by 35%. The results presented in this paper demonstrate the effectiveness of the proposed VPP system, which holds significant potential for transforming the landscape of virtual advertising and marketing strategies.
https://arxiv.org/abs/2405.01130
Denoising diffusion models have recently gained prominence as powerful tools for a variety of image generation and manipulation tasks. Building on this, we propose a novel tool for real-time editing of images that provides users with fine-grained region-targeted supervision in addition to existing prompt-based controls. Our editing technique, termed Layered Diffusion Brushes, leverages prompt-guided and region-targeted alteration of intermediate denoising steps, enabling precise modifications while maintaining the integrity and context of the input image. We provide an editor based on Layered Diffusion Brushes that incorporates well-known image editing concepts such as layer masks, visibility toggles, and independent manipulation of layers, regardless of their order. Our system renders a single edit on a 512x512 image within 140 ms using a high-end consumer GPU, enabling real-time feedback and rapid exploration of candidate edits. We validated our method and editing system through a user study involving both natural images (using inversion) and generated images, showcasing its usability and effectiveness compared to existing techniques such as InstructPix2Pix and Stable Diffusion Inpainting for refining images. Our approach demonstrates efficacy across a range of tasks, including object attribute adjustments, error correction, and sequential prompt-based object placement and manipulation, demonstrating its versatility and potential for enhancing creative workflows.
https://arxiv.org/abs/2405.00313
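The idea of altering intermediate denoising steps only inside a brushed region can be sketched as masked latent blending against a cached generation trajectory; the diffusers-like UNet/scheduler calls and the caching convention below are assumptions, so this illustrates the general mechanism rather than the paper's editor.

```python
# Sketch of region-targeted editing of intermediate denoising latents; names,
# the caching convention, and the API style are assumptions.
import torch

@torch.no_grad()
def masked_prompt_edit(unet, scheduler, cached_latents, edit_emb, mask, start_step):
    """
    cached_latents : list of length num_steps + 1 holding the latents of the original
                     generation; cached_latents[0] is the initial noise and
                     cached_latents[i + 1] is the latent after scheduler step i.
    edit_emb       : text embedding of the brush prompt
    mask           : (1, 1, h, w) latent-space mask of the brushed region
    start_step     : intermediate step index at which the edit is injected
    """
    x = cached_latents[start_step].clone()
    for i, t in enumerate(scheduler.timesteps[start_step:], start=start_step):
        eps = unet(x, t, encoder_hidden_states=edit_emb).sample
        x = scheduler.step(eps, t, x).prev_sample
        # Outside the brushed region, snap back to the cached trajectory so the
        # rest of the image is reproduced exactly.
        x = mask * x + (1 - mask) * cached_latents[i + 1]
    return x
```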
Current state-of-the-art methods for video inpainting typically rely on optical flow or attention-based approaches to inpaint masked regions by propagating visual information across frames. While such approaches have led to significant progress on standard benchmarks, they struggle with tasks that require the synthesis of novel content that is not present in other frames. In this paper we reframe video inpainting as a conditional generative modeling problem and present a framework for solving such problems with conditional video diffusion models. We highlight the advantages of using a generative approach for this task, showing that our method is capable of generating diverse, high-quality inpaintings and synthesizing new content that is spatially, temporally, and semantically consistent with the provided context.
https://arxiv.org/abs/2405.00251
3D scene generation has quickly become a challenging new research direction, fueled by consistent improvements of 2D generative diffusion models. Most prior work in this area generates scenes by iteratively stitching newly generated frames with existing geometry. These works often depend on pre-trained monocular depth estimators to lift the generated images into 3D, fusing them with the existing scene representation. These approaches are then often evaluated via a text metric, measuring the similarity between the generated images and a given text prompt. In this work, we make two fundamental contributions to the field of 3D scene generation. First, we note that lifting images to 3D with a monocular depth estimation model is suboptimal as it ignores the geometry of the existing scene. We thus introduce a novel depth completion model, trained via teacher distillation and self-training to learn the 3D fusion process, resulting in improved geometric coherence of the scene. Second, we introduce a new benchmarking scheme for scene generation methods that is based on ground truth geometry, and thus measures the quality of the structure of the scene.
https://arxiv.org/abs/2404.19758
Recent advancements in image inpainting, particularly through diffusion modeling, have yielded promising outcomes. However, when tested in scenarios involving the completion of images based on the foreground objects, current methods that aim to inpaint an image in an end-to-end manner encounter challenges such as "over-imagination", inconsistency between foreground and background, and limited diversity. In response, we introduce Anywhere, a pioneering multi-agent framework designed to address these issues. Anywhere utilizes a sophisticated pipeline framework comprising various agents such as Visual Language Model (VLM), Large Language Model (LLM), and image generation models. This framework consists of three principal components: the prompt generation module, the image generation module, and the outcome analyzer. The prompt generation module conducts a semantic analysis of the input foreground image, leveraging VLM to predict relevant language descriptions and LLM to recommend optimal language prompts. In the image generation module, we employ a text-guided canny-to-image generation model to create a template image based on the edge map of the foreground image and language prompts, and an image refiner to produce the outcome by blending the input foreground and the template image. The outcome analyzer employs VLM to evaluate image content rationality, aesthetic score, and foreground-background relevance, triggering prompt and image regeneration as needed. Extensive experiments demonstrate that our Anywhere framework excels in foreground-conditioned image inpainting, mitigating "over-imagination", resolving foreground-background discrepancies, and enhancing diversity. It successfully elevates foreground-conditioned image inpainting to produce more reliable and diverse results.
https://arxiv.org/abs/2404.18598
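The three-module loop described above can be summarized as orchestration pseudocode; every agent interface (vlm, llm, canny2img, refiner) and the score thresholds are placeholders, not the Anywhere implementation.

```python
# Hedged pseudocode of the prompt-generation / image-generation / outcome-analysis
# loop; all agent interfaces and thresholds are placeholders, not a real API.
def anywhere_pipeline(foreground_img, vlm, llm, canny2img, refiner, max_retries=3):
    result = None
    for attempt in range(max_retries):
        # 1. Prompt generation: VLM describes the foreground, LLM proposes a prompt.
        description = vlm.describe(foreground_img)
        prompt = llm.suggest_background_prompt(description)

        # 2. Image generation: edge-conditioned template, then blend in the foreground.
        edges = canny2img.extract_edges(foreground_img)
        template = canny2img.generate(edges, prompt)
        result = refiner.blend(foreground_img, template)

        # 3. Outcome analysis: VLM scores rationality, aesthetics, and relevance,
        #    triggering prompt and image regeneration when any score is too low.
        scores = vlm.evaluate(result, foreground_img, prompt)
        if all(s >= thr for s, thr in zip(scores, (0.5, 0.5, 0.5))):
            return result
    return result  # fall back to the last attempt
```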
Image editing has advanced significantly with the introduction of text-conditioned diffusion models. Despite this progress, seamlessly adding objects to images based on textual instructions without requiring user-provided input masks remains a challenge. We address this by leveraging the insight that removing objects (Inpaint) is significantly simpler than its inverse process of adding them (Paint), attributed to the utilization of segmentation mask datasets alongside inpainting models that inpaint within these masks. Capitalizing on this realization, by implementing an automated and extensive pipeline, we curate a filtered large-scale image dataset containing pairs of images and their corresponding object-removed versions. Using these pairs, we train a diffusion model to inverse the inpainting process, effectively adding objects into images. Unlike other editing datasets, ours features natural target images instead of synthetic ones; moreover, it maintains consistency between source and target by construction. Additionally, we utilize a large Vision-Language Model to provide detailed descriptions of the removed objects and a Large Language Model to convert these descriptions into diverse, natural-language instructions. We show that the trained model surpasses existing ones both qualitatively and quantitatively, and release the large-scale dataset alongside the trained models for the community.
https://arxiv.org/abs/2404.18212
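The curation idea, inverting inpainting to obtain add-object training pairs, can be sketched as follows; the component interfaces and field names are placeholders rather than the released pipeline.

```python
# Sketch of building (object-removed, original) training pairs by inpainting
# inside segmentation masks and captioning the removed object; all interfaces
# (inpainter, vlm, llm) are placeholders/assumptions.
def build_pair(image, object_mask, inpainter, vlm, llm):
    # The inpainted image (object removed) becomes the *source*; the original
    # image, which still contains the object, becomes the *target*.
    source = inpainter.inpaint(image, object_mask)
    target = image

    # Describe the removed object and turn the description into a natural
    # "add ..." editing instruction.
    description = vlm.describe_region(target, object_mask)
    instruction = llm.to_instruction(f"add {description}")
    return {"source": source, "target": target, "instruction": instruction}
```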
Existing image inpainting methods have achieved remarkable accomplishments in generating visually appealing results, often accompanied by a trend toward creating more intricate structural textures. However, while these models excel at creating more realistic image content, they often leave noticeable traces of tampering, posing a significant threat to security. In this work, we take the anti-forensic capabilities into consideration, firstly proposing an end-to-end training framework for anti-forensic image inpainting named SafePaint. Specifically, we innovatively formulated image inpainting as two major tasks: semantically plausible content completion and region-wise optimization. The former is similar to current inpainting methods that aim to restore the missing regions of corrupted images. The latter, through domain adaptation, endeavors to reconcile the discrepancies between the inpainted region and the unaltered area to achieve anti-forensic goals. Through comprehensive theoretical analysis, we validate the effectiveness of domain adaptation for anti-forensic performance. Furthermore, we meticulously crafted a region-wise separated attention (RWSA) module, which not only aligns with our objective of anti-forensics but also enhances the performance of the model. Extensive qualitative and quantitative evaluations show our approach achieves comparable results to existing image inpainting methods while offering anti-forensic capabilities not available in other methods.
https://arxiv.org/abs/2404.18136
We introduce ObjectAdd, a training-free diffusion modification method to add user-expected objects into a user-specified area. The motivation for ObjectAdd stems from two observations: first, describing everything in one prompt can be difficult, and second, users often need to add objects into the generated image. To accommodate real-world use, ObjectAdd maintains accurate image consistency after adding objects through technical innovations in: (1) embedding-level concatenation to ensure the text embeddings coalesce correctly; (2) object-driven layout control with latent and attention injection to ensure objects occupy the user-specified area; (3) prompted image inpainting in an attention-refocusing and object-expansion fashion to ensure the rest of the image stays the same. Given a text-prompted image, ObjectAdd allows users to specify a box and an object, and achieves: (1) adding the object inside the box area; (2) preserving the exact content outside the box area; (3) flawless fusion between the two areas.
https://arxiv.org/abs/2404.17230
Video Wire Inpainting (VWI) is a prominent application in video inpainting, aimed at flawlessly removing wires in films or TV series, offering significant time and labor savings compared to manual frame-by-frame removal. However, wire removal poses greater challenges due to the wires being longer and slimmer than objects typically targeted in general video inpainting tasks, and often intersecting with people and background objects irregularly, which adds complexity to the inpainting process. Recognizing the limitations posed by existing video wire datasets, which are characterized by their small size, poor quality, and limited variety of scenes, we introduce a new VWI dataset with a novel mask generation strategy, namely Wire Removal Video Dataset 2 (WRV2) and Pseudo Wire-Shaped (PWS) Masks. WRV2 dataset comprises over 4,000 videos with an average length of 80 frames, designed to facilitate the development and efficacy of inpainting models. Building upon this, our research proposes the Redundancy-Aware Transformer (Raformer) method that addresses the unique challenges of wire removal in video inpainting. Unlike conventional approaches that indiscriminately process all frame patches, Raformer employs a novel strategy to selectively bypass redundant parts, such as static background segments devoid of valuable information for inpainting. At the core of Raformer is the Redundancy-Aware Attention (RAA) module, which isolates and accentuates essential content through a coarse-grained, window-based attention mechanism. This is complemented by a Soft Feature Alignment (SFA) module, which refines these features and achieves end-to-end feature alignment. Extensive experiments on both the traditional video inpainting datasets and our proposed WRV2 dataset demonstrate that Raformer outperforms other state-of-the-art methods.
https://arxiv.org/abs/2404.15802
The scarcity of green spaces in urban environments constitutes a critical challenge, with multiple adverse effects on the health and well-being of citizens. Small-scale interventions, e.g., pocket parks, are a viable solution, but come with multiple constraints involving the design and implementation over a specific area. In this study, we harness the capabilities of generative AI for multi-scale intervention planning, focusing on nature-based solutions. By leveraging image-to-image and image inpainting algorithms, we propose a methodology to address the green space deficit in urban areas. Focusing on two alleys in Thessaloniki where greenery is lacking, we demonstrate the efficacy of our approach in visualizing NBS interventions. Our findings underscore the transformative potential of emerging technologies in shaping the future of urban intervention planning processes.
https://arxiv.org/abs/2404.15492