Text-driven 3D indoor scene generation holds broad applications, ranging from gaming and smart homes to AR/VR applications. Fast and high-fidelity scene generation is paramount for ensuring user-friendly experiences. However, existing methods are characterized by lengthy generation processes or necessitate the intricate manual specification of motion parameters, which introduces inconvenience for users. Furthermore, these methods often rely on narrow-field viewpoint iterative generations, compromising global consistency and overall scene quality. To address these issues, we propose FastScene, a framework for fast and higher-quality 3D scene generation, while maintaining the scene consistency. Specifically, given a text prompt, we generate a panorama and estimate its depth, since the panorama encompasses information about the entire scene and exhibits explicit geometric constraints. To obtain high-quality novel views, we introduce the Coarse View Synthesis (CVS) and Progressive Novel View Inpainting (PNVI) strategies, ensuring both scene consistency and view quality. Subsequently, we utilize Multi-View Projection (MVP) to form perspective views, and apply 3D Gaussian Splatting (3DGS) for scene reconstruction. Comprehensive experiments demonstrate FastScene surpasses other methods in both generation speed and quality with better scene consistency. Notably, guided only by a text prompt, FastScene can generate a 3D scene within a mere 15 minutes, which is at least one hour faster than state-of-the-art methods, making it a paradigm for user-friendly scene generation.
https://arxiv.org/abs/2405.05768
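The Multi-View Projection (MVP) step described above maps the inpainted panoramas into pinhole-camera views for 3DGS reconstruction. Below is a minimal sketch of such an equirectangular-to-perspective projection with nearest-neighbour sampling; the function name and parameters are illustrative assumptions, not FastScene's actual implementation.

```python
import numpy as np

def equirect_to_perspective(pano, fov_deg=90.0, yaw_deg=0.0, pitch_deg=0.0, out_hw=(512, 512)):
    """Render a pinhole-camera view from an equirectangular panorama (H, W, 3).

    Cast a ray per output pixel, rotate it by the camera yaw/pitch, convert to
    longitude/latitude, and sample the panorama (nearest neighbour for brevity).
    """
    H, W = out_hw
    f = 0.5 * W / np.tan(0.5 * np.radians(fov_deg))           # focal length in pixels
    xs, ys = np.meshgrid(np.arange(W) - W / 2 + 0.5,
                         np.arange(H) - H / 2 + 0.5)
    dirs = np.stack([xs, ys, np.full_like(xs, f)], axis=-1)   # camera rays: x right, y down, z forward
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)

    yaw, pitch = np.radians(yaw_deg), np.radians(pitch_deg)
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(pitch), -np.sin(pitch)],
                   [0, np.sin(pitch),  np.cos(pitch)]])
    Ry = np.array([[ np.cos(yaw), 0, np.sin(yaw)],
                   [0, 1, 0],
                   [-np.sin(yaw), 0, np.cos(yaw)]])
    dirs = dirs @ (Ry @ Rx).T                                 # rotate rays into the panorama frame

    lon = np.arctan2(dirs[..., 0], dirs[..., 2])              # [-pi, pi]
    lat = np.arcsin(np.clip(dirs[..., 1], -1.0, 1.0))         # [-pi/2, pi/2]
    ph, pw = pano.shape[:2]
    u = ((lon / (2 * np.pi) + 0.5) * pw).astype(int) % pw
    v = np.clip(((lat / np.pi + 0.5) * ph).astype(int), 0, ph - 1)
    return pano[v, u]

# toy usage: a 45-degree-yaw view from a blank panorama
view = equirect_to_perspective(np.zeros((1024, 2048, 3)), yaw_deg=45.0)
```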
Existing VLMs can track in-the-wild 2D video objects while current generative models provide powerful visual priors for synthesizing novel views for the highly under-constrained 2D-to-3D object lifting. Building upon this exciting progress, we present DreamScene4D, the first approach that can generate three-dimensional dynamic scenes of multiple objects from monocular in-the-wild videos with large object motion across occlusions and novel viewpoints. Our key insight is to design a "decompose-then-recompose" scheme to factorize both the whole video scene and each object's 3D motion. We first decompose the video scene by using open-vocabulary mask trackers and an adapted image diffusion model to segment, track, and amodally complete the objects and background in the video. Each object track is mapped to a set of 3D Gaussians that deform and move in space and time. We also factorize the observed motion into multiple components to handle fast motion. The camera motion can be inferred by re-rendering the background to match the video frames. For the object motion, we first model the object-centric deformation of the objects by leveraging rendering losses and multi-view generative priors in an object-centric frame, then optimize object-centric to world-frame transformations by comparing the rendered outputs against the perceived pixel and optical flow. Finally, we recompose the background and objects and optimize for relative object scales using monocular depth prediction guidance. We show extensive results on the challenging DAVIS, Kubric, and self-captured videos, detail some limitations, and provide future directions. Besides 4D scene generation, our results show that DreamScene4D enables accurate 2D point motion tracking by projecting the inferred 3D trajectories to 2D, while never explicitly trained to do so.
https://arxiv.org/abs/2405.02280
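The motion factorization above composes an object-centric deformation with an object-to-world rigid transform per frame. A minimal sketch of that composition (array shapes and names are illustrative, not the paper's API):

```python
import numpy as np

def compose_motion(canonical_pts, deformation, R_ow, t_ow):
    """Compose factored motion for one frame.

    canonical_pts : (N, 3) Gaussian centers in the object-centric canonical frame
    deformation   : (N, 3) per-point object-centric displacement for this frame
    R_ow, t_ow    : (3, 3) rotation and (3,) translation taking the object frame
                    to the world frame for this frame
    Returns world-frame positions; rendering and flow losses would be computed
    on splats placed at these positions.
    """
    obj_pts = canonical_pts + deformation          # local, object-centric motion
    return obj_pts @ R_ow.T + t_ow                 # global, object-to-world motion

# toy usage: a rigid yaw of 30 degrees plus a small local wobble
pts = np.random.rand(100, 3)
theta = np.radians(30)
R = np.array([[np.cos(theta), -np.sin(theta), 0],
              [np.sin(theta),  np.cos(theta), 0],
              [0, 0, 1]])
world = compose_motion(pts, 0.01 * np.sin(pts), R, np.array([0.0, 0.0, 2.0]))
```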
We present EchoScene, an interactive and controllable generative model that generates 3D indoor scenes on scene graphs. EchoScene leverages a dual-branch diffusion model that dynamically adapts to scene graphs. Existing methods struggle to handle scene graphs due to varying numbers of nodes, multiple edge combinations, and manipulator-induced node-edge operations. EchoScene overcomes this by associating each node with a denoising process and enables collaborative information exchange, enhancing controllable and consistent generation aware of global constraints. This is achieved through an information echo scheme in both shape and layout branches. At every denoising step, all processes share their denoising data with an information exchange unit that combines these updates using graph convolution. The scheme ensures that the denoising processes are influenced by a holistic understanding of the scene graph, facilitating the generation of globally coherent scenes. The resulting scenes can be manipulated during inference by editing the input scene graph and sampling the noise in the diffusion model. Extensive experiments validate our approach, which maintains scene controllability and surpasses previous methods in generation fidelity. Moreover, the generated scenes are of high quality and thus directly compatible with off-the-shelf texture generation. Code and trained models are open-sourced.
https://arxiv.org/abs/2405.00915
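The information echo scheme runs one denoising process per scene-graph node and mixes their per-step updates through a shared exchange unit. The toy loop below illustrates the idea with a symmetric-normalized adjacency acting as a single graph-convolution step; the update rule, dimensions, and step count are illustrative assumptions, not EchoScene's trained networks.

```python
import numpy as np

def echo_denoise(x, adj, steps=50, rng=None):
    """Toy 'information echo' loop.

    x   : (N, D) per-node latents, one denoising process per scene-graph node
    adj : (N, N) binary adjacency of the scene graph
    Each step, every node produces its own update, the exchange unit mixes the
    updates over the graph with a normalized graph convolution, and all nodes
    apply the mixed update, so every process is informed by the whole graph.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    a_hat = adj + np.eye(len(adj))                          # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

    for _ in range(steps):
        local_update = -0.05 * x + 0.01 * rng.standard_normal(x.shape)  # stand-in for each node's denoiser output
        shared_update = norm @ local_update                             # exchange unit: graph convolution over updates
        x = x + shared_update
    return x

latents = echo_denoise(np.random.randn(6, 32), adj=np.ones((6, 6)) - np.eye(6))
```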
Neural Radiance Fields (NeRF) have shown impressive results in 3D reconstruction and generating novel views. A key challenge within NeRF is the editing of reconstructed scenes, such as object removal, which requires maintaining consistency across multiple views and ensuring high-quality synthesised perspectives. Previous studies have incorporated depth priors, typically from LiDAR or sparse depth measurements provided by COLMAP, to improve the performance of object removal in NeRF. However, these methods are either costly or time-consuming. In this paper, we propose a novel approach that integrates monocular depth estimates with NeRF-based object removal models to significantly reduce time consumption and enhance the robustness and quality of scene generation and object removal. We conducted a thorough evaluation of COLMAP's dense depth reconstruction on the KITTI dataset to verify its accuracy in depth map generation. Our findings suggest that COLMAP can serve as an effective alternative to a ground truth depth map where such information is missing or costly to obtain. Additionally, we integrated various monocular depth estimation methods into the removal NeRF model, i.e., SpinNeRF, to assess their capacity to improve object removal performance. Our experimental results highlight the potential of monocular depth estimation to substantially improve NeRF applications.
https://arxiv.org/abs/2405.00630
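A common way to substitute monocular depth for LiDAR or COLMAP priors is to align it to whatever sparse metric depth is available with a least-squares scale and shift before feeding it to the removal model. A sketch of that standard alignment step (not necessarily the paper's exact procedure):

```python
import numpy as np

def align_monocular_depth(mono_depth, sparse_depth, valid_mask):
    """Fit scale s and shift b so that s * mono + b matches sparse metric depth
    at the pixels where sparse depth exists, then apply them to the full map.

    mono_depth   : (H, W) relative depth from a monocular estimator
    sparse_depth : (H, W) metric depth, valid only where valid_mask is True
                   (e.g. COLMAP points projected into the view)
    """
    d = mono_depth[valid_mask]
    z = sparse_depth[valid_mask]
    A = np.stack([d, np.ones_like(d)], axis=1)
    (s, b), *_ = np.linalg.lstsq(A, z, rcond=None)
    return s * mono_depth + b

# toy usage: recover a known scale/shift from a 5% sample of "sparse" depth
mono = np.random.rand(64, 64)
gt = 3.0 * mono + 0.5
mask = np.random.rand(64, 64) < 0.05
aligned = align_monocular_depth(mono, gt, mask)
```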
3D scene generation has quickly become a challenging new research direction, fueled by consistent improvements of 2D generative diffusion models. Most prior work in this area generates scenes by iteratively stitching newly generated frames with existing geometry. These works often depend on pre-trained monocular depth estimators to lift the generated images into 3D, fusing them with the existing scene representation. These approaches are then often evaluated via a text metric, measuring the similarity between the generated images and a given text prompt. In this work, we make two fundamental contributions to the field of 3D scene generation. First, we note that lifting images to 3D with a monocular depth estimation model is suboptimal as it ignores the geometry of the existing scene. We thus introduce a novel depth completion model, trained via teacher distillation and self-training to learn the 3D fusion process, resulting in improved geometric coherence of the scene. Second, we introduce a new benchmarking scheme for scene generation methods that is based on ground truth geometry, and thus measures the quality of the structure of the scene.
https://arxiv.org/abs/2404.19758
The evaluation and training of autonomous driving systems require diverse and scalable corner cases. However, most existing scene generation methods lack controllability, accuracy, and versatility, resulting in unsatisfactory generation results. To address this problem, we propose Dragtraffic, a generalized, point-based, and controllable traffic scene generation framework based on conditional diffusion. Dragtraffic enables non-experts to generate a variety of realistic driving scenarios for different types of traffic agents through an adaptive mixture expert architecture. We use a regression model to provide a general initial solution and a refinement process based on the conditional diffusion model to ensure diversity. User-customized context is introduced through cross-attention to ensure high controllability. Experiments on a real-world driving dataset show that Dragtraffic outperforms existing methods in terms of authenticity, diversity, and freedom.
https://arxiv.org/abs/2404.12624
With recent developments in Embodied Artificial Intelligence (EAI) research, there has been a growing demand for high-quality, large-scale interactive scene generation. While prior methods in scene synthesis have prioritized the naturalness and realism of the generated scenes, the physical plausibility and interactivity of scenes have been largely left unexplored. To address this disparity, we introduce PhyScene, a novel method dedicated to generating interactive 3D scenes characterized by realistic layouts, articulated objects, and rich physical interactivity tailored for embodied agents. Based on a conditional diffusion model for capturing scene layouts, we devise novel physics- and interactivity-based guidance mechanisms that integrate constraints from object collision, room layout, and object reachability. Through extensive experiments, we demonstrate that PhyScene effectively leverages these guidance functions for physically interactable scene synthesis, outperforming existing state-of-the-art scene synthesis methods by a large margin. Our findings suggest that the scenes generated by PhyScene hold considerable potential for facilitating diverse skill acquisition among agents within interactive environments, thereby catalyzing further advancements in embodied AI research. Project website: this http URL.
https://arxiv.org/abs/2404.09465
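The physics-based guidance can be read as steering the diffusion sampler with gradients of differentiable penalties such as pairwise object collision. A minimal sketch of one such guidance step on 2D floor-plan boxes; the box parameterization, penalty, and guidance scale are illustrative assumptions, and the room-layout and reachability terms are omitted.

```python
import torch

def collision_penalty(boxes):
    """Differentiable pairwise-overlap penalty for axis-aligned layout boxes.

    boxes: (N, 4) tensor of (cx, cy, w, h) footprints on the floor plane.
    """
    cx, cy, w, h = boxes.unbind(-1)
    x0, x1 = cx - w / 2, cx + w / 2
    y0, y1 = cy - h / 2, cy + h / 2
    ox = (torch.minimum(x1[:, None], x1[None, :]) -
          torch.maximum(x0[:, None], x0[None, :])).clamp(min=0)
    oy = (torch.minimum(y1[:, None], y1[None, :]) -
          torch.maximum(y0[:, None], y0[None, :])).clamp(min=0)
    overlap = ox * oy
    overlap = overlap - torch.diag(torch.diagonal(overlap))   # drop self-overlap
    return overlap.sum()

def collision_guidance_step(pred_boxes, guidance_scale=0.1):
    """Nudge the sampler's current layout prediction down the collision gradient.

    In a guided diffusion sampler this correction would be applied to the
    denoised layout estimate at each (or every few) denoising step(s).
    """
    boxes = pred_boxes.detach().requires_grad_(True)
    grad, = torch.autograd.grad(collision_penalty(boxes), boxes)
    return (boxes - guidance_scale * grad).detach()

layout = torch.rand(8, 4) * torch.tensor([4.0, 4.0, 1.5, 1.5])
layout = collision_guidance_step(layout)
```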
Cinemagraph is a unique form of visual media that combines elements of still photography and subtle motion to create a captivating experience. However, the majority of videos generated by recent works lack depth information and are confined to the constraints of 2D image space. In this paper, inspired by significant progress in the field of novel view synthesis (NVS) achieved by 3D Gaussian Splatting (3D-GS), we propose LoopGaussian to elevate cinemagraph from 2D image space to 3D space using 3D Gaussian modeling. To achieve this, we first employ the 3D-GS method to reconstruct 3D Gaussian point clouds from multi-view images of static scenes,incorporating shape regularization terms to prevent blurring or artifacts caused by object deformation. We then adopt an autoencoder tailored for 3D Gaussian to project it into feature space. To maintain the local continuity of the scene, we devise SuperGaussian for clustering based on the acquired features. By calculating the similarity between clusters and employing a two-stage estimation method, we derive an Eulerian motion field to describe velocities across the entire scene. The 3D Gaussian points then move within the estimated Eulerian motion field. Through bidirectional animation techniques, we ultimately generate a 3D Cinemagraph that exhibits natural and seamlessly loopable dynamics. Experiment results validate the effectiveness of our approach, demonstrating high-quality and visually appealing scene generation.
https://arxiv.org/abs/2404.08966
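Animating splats with an Eulerian motion field means repeatedly sampling a static velocity field at each point's current position; a seamless loop can be obtained by blending a forward pass from the first frame with a backward pass from the last, in the spirit of the bidirectional animation mentioned above. A toy sketch with an analytic velocity field (the field, step size, and linear blend weights are illustrative assumptions):

```python
import numpy as np

def velocity(p):
    """Toy Eulerian field: a gentle swirl around the z-axis, constant in time."""
    return 0.2 * np.stack([-p[:, 1], p[:, 0], np.zeros(len(p))], axis=1)

def advect(points, n_steps, dt, sign=1.0):
    """Integrate points through the velocity field; trajs[t] is the cloud at step t."""
    trajs = [points]
    for _ in range(n_steps):
        points = points + sign * dt * velocity(points)
        trajs.append(points)
    return trajs

def looped_positions(points, n_frames=60, dt=0.05):
    """Bidirectional blend: frame t mixes forward advection from frame 0 with
    backward advection from frame n, so frame n matches frame 0 and the clip loops."""
    fwd = advect(points, n_frames, dt, sign=+1.0)
    bwd = advect(points, n_frames, dt, sign=-1.0)[::-1]
    frames = []
    for t in range(n_frames + 1):
        w = t / n_frames
        frames.append((1.0 - w) * fwd[t] + w * bwd[t])
    return frames

frames = looped_positions(np.random.randn(500, 3))
```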
We introduce RealmDreamer, a technique for generation of general forward-facing 3D scenes from text descriptions. Our technique optimizes a 3D Gaussian Splatting representation to match complex text prompts. We initialize these splats by utilizing the state-of-the-art text-to-image generators, lifting their samples into 3D, and computing the occlusion volume. We then optimize this representation across multiple views as a 3D inpainting task with image-conditional diffusion models. To learn correct geometric structure, we incorporate a depth diffusion model by conditioning on the samples from the inpainting model, giving rich geometric structure. Finally, we finetune the model using sharpened samples from image generators. Notably, our technique does not require video or multi-view data and can synthesize a variety of high-quality 3D scenes in different styles, consisting of multiple objects. Its generality additionally allows 3D synthesis from a single image.
https://arxiv.org/abs/2404.07199
Many safety-critical applications, especially in autonomous driving, require reliable object detectors. They can be very effectively assisted by a method to search for and identify potential failures and systematic errors before these detectors are deployed. Systematic errors are characterized by combinations of attributes such as object location, scale, orientation, and color, as well as the composition of their respective backgrounds. To identify them, one must rely on something other than real images from a test set because they do not account for very rare but possible combinations of attributes. To overcome this limitation, we propose a pipeline for generating realistic synthetic scenes with fine-grained control, allowing the creation of complex scenes with multiple objects. Our approach, BEV2EGO, allows for a realistic generation of the complete scene with road-contingent control that maps 2D bird's-eye view (BEV) scene configurations to a first-person view (EGO). In addition, we propose a benchmark for controlled scene generation to select the most appropriate generative outpainting model for BEV2EGO. We further use it to perform a systematic analysis of multiple state-of-the-art object detection models and discover differences between them.
https://arxiv.org/abs/2404.07045
The increasing demand for virtual reality applications has highlighted the significance of crafting immersive 3D assets. We present a text-to-3D 360$^{\circ}$ scene generation pipeline that facilitates the creation of comprehensive 360$^{\circ}$ scenes for in-the-wild environments in a matter of minutes. Our approach utilizes the generative power of a 2D diffusion model and prompt self-refinement to create a high-quality and globally coherent panoramic image. This image acts as a preliminary "flat" (2D) scene representation. Subsequently, it is lifted into 3D Gaussians, employing splatting techniques to enable real-time exploration. To produce consistent 3D geometry, our pipeline constructs a spatially coherent structure by aligning the 2D monocular depth into a globally optimized point cloud. This point cloud serves as the initial state for the centroids of 3D Gaussians. In order to address invisible issues inherent in single-view inputs, we impose semantic and geometric constraints on both synthesized and input camera views as regularizations. These guide the optimization of Gaussians, aiding in the reconstruction of unseen regions. In summary, our method offers a globally consistent 3D scene within a 360$^{\circ}$ perspective, providing an enhanced immersive experience over existing techniques. Project website at: this http URL
https://arxiv.org/abs/2404.06903
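Lifting the panorama and its globally aligned depth into the initial point cloud for the Gaussian centroids is a direct back-projection of each equirectangular pixel along its viewing ray. A sketch under a common longitude/latitude convention (names and the stride are illustrative):

```python
import numpy as np

def panorama_to_point_cloud(depth, stride=4):
    """Back-project an equirectangular depth map (H, W) to 3D points.

    Each pixel corresponds to a longitude/latitude direction; scaling that unit
    direction by depth gives a world point. Such a cloud can seed the centroids
    of 3D Gaussians before optimization.
    """
    H, W = depth.shape
    v, u = np.mgrid[0:H:stride, 0:W:stride]
    lon = (u / W - 0.5) * 2 * np.pi               # [-pi, pi], around the vertical axis
    lat = (0.5 - v / H) * np.pi                   # [-pi/2, pi/2], +pi/2 at the top row
    dirs = np.stack([np.cos(lat) * np.sin(lon),   # x
                     np.sin(lat),                 # y (up)
                     np.cos(lat) * np.cos(lon)],  # z
                    axis=-1)
    return (depth[v, u][..., None] * dirs).reshape(-1, 3)

pts = panorama_to_point_cloud(np.full((512, 1024), 3.0))
```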
Text-to-3D generation has achieved remarkable success via large-scale text-to-image diffusion models. Nevertheless, there is no paradigm for scaling up the methodology to urban scale. Urban scenes, characterized by numerous elements, intricate arrangement relationships, and vast scale, present a formidable barrier to the interpretability of ambiguous textual descriptions for effective model optimization. In this work, we surmount the limitations by introducing a compositional 3D layout representation into text-to-3D paradigm, serving as an additional prior. It comprises a set of semantic primitives with simple geometric structures and explicit arrangement relationships, complementing textual descriptions and enabling steerable generation. Upon this, we propose two modifications -- (1) We introduce Layout-Guided Variational Score Distillation to address model optimization inadequacies. It conditions the score distillation sampling process with geometric and semantic constraints of 3D layouts. (2) To handle the unbounded nature of urban scenes, we represent 3D scene with a Scalable Hash Grid structure, incrementally adapting to the growing scale of urban scenes. Extensive experiments substantiate the capability of our framework to scale text-to-3D generation to large-scale urban scenes that cover over 1000m driving distance for the first time. We also present various scene editing demonstrations, showing the powers of steerable urban scene generation. Website: this https URL.
https://arxiv.org/abs/2404.06780
Generating higher-resolution human-centric scenes with details and controls remains a challenge for existing text-to-image diffusion models. This challenge stems from limited training image size, limited text encoder capacity (token limits), and the inherent difficulty of generating complex scenes involving multiple humans. While current methods attempt to address only the training-size limit, they often yield human-centric scenes with severe artifacts. We propose BeyondScene, a novel framework that overcomes these limitations, generating exquisite higher-resolution (over 8K) human-centric scenes with exceptional text-image correspondence and naturalness using existing pretrained diffusion models. BeyondScene takes a staged, hierarchical approach: it first generates a detailed base image that focuses on the crucial elements for creating multiple human instances and on detailed descriptions beyond the token limit of the diffusion model, and then seamlessly converts the base image to a higher-resolution output that exceeds the training image size and incorporates text- and instance-aware details via a novel instance-aware hierarchical enlargement process consisting of high-frequency injected forward diffusion and adaptive joint diffusion. BeyondScene surpasses existing methods in correspondence with detailed text descriptions and in naturalness, paving the way for advanced applications in higher-resolution human-centric scene creation beyond the capacity of pretrained diffusion models, without costly retraining. Project page: this https URL.
https://arxiv.org/abs/2404.04544
Text-to-3D scene generation holds immense potential for the gaming, film, and architecture sectors. Despite significant progress, existing methods struggle with maintaining high quality, consistency, and editing flexibility. In this paper, we propose DreamScene, a 3D Gaussian-based novel text-to-3D scene generation framework, to tackle the aforementioned three challenges mainly via two strategies. First, DreamScene employs Formation Pattern Sampling (FPS), a multi-timestep sampling strategy guided by the formation patterns of 3D objects, to form fast, semantically rich, and high-quality representations. FPS uses 3D Gaussian filtering for optimization stability, and leverages reconstruction techniques to generate plausible textures. Second, DreamScene employs a progressive three-stage camera sampling strategy, specifically designed for both indoor and outdoor settings, to effectively ensure object-environment integration and scene-wide 3D consistency. Last, DreamScene enhances scene editing flexibility by integrating objects and environments, enabling targeted adjustments. Extensive experiments validate DreamScene's superiority over current state-of-the-art techniques, heralding its wide-ranging potential for diverse applications. Code and demos will be released at this https URL .
https://arxiv.org/abs/2404.03575
In this paper, we introduce a novel approach for autonomous driving trajectory generation by harnessing the complementary strengths of diffusion probabilistic models (a.k.a., diffusion models) and transformers. Our proposed framework, termed the "World-Centric Diffusion Transformer" (WcDT), optimizes the entire trajectory generation process, from feature extraction to model inference. To enhance the scene diversity and stochasticity, the historical trajectory data is first preprocessed and encoded into latent space using Denoising Diffusion Probabilistic Models (DDPM) enhanced with Diffusion with Transformer (DiT) blocks. Then, the latent features, historical trajectories, HD map features, and historical traffic signal information are fused with various transformer-based encoders. The encoded traffic scenes are then decoded by a trajectory decoder to generate multimodal future trajectories. Comprehensive experimental results show that the proposed approach exhibits superior performance in generating both realistic and diverse trajectories, showing its potential for integration into automatic driving simulation systems.
https://arxiv.org/abs/2404.02082
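The latent encoding and sampling in such a pipeline follow the standard DDPM recipe: clean trajectories are noised into latents by the forward process, and samples are produced by iterating the reverse step with a learned noise predictor. A generic sketch of those two formulas; the DiT-based predictor conditioned on map, agent-history, and traffic-signal features is replaced by a plain `eps_pred` argument.

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule plus the derived alpha and cumulative-alpha terms."""
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    return betas, alphas, np.cumprod(alphas)

def noise_trajectories(traj, t, alpha_bars, rng):
    """DDPM forward process: map clean (N, L, 2) trajectories to noisy latents,
    x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps."""
    eps = rng.standard_normal(traj.shape)
    return np.sqrt(alpha_bars[t]) * traj + np.sqrt(1.0 - alpha_bars[t]) * eps, eps

def reverse_step(x_t, t, eps_pred, betas, alphas, alpha_bars, rng):
    """One standard DDPM reverse step given predicted noise eps_pred (here a
    placeholder for the conditioned transformer's output)."""
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_pred) / np.sqrt(alphas[t])
    return mean if t == 0 else mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)

betas, alphas, alpha_bars = make_schedule()
rng = np.random.default_rng(0)
x_t, eps = noise_trajectories(np.zeros((16, 80, 2)), t=500, alpha_bars=alpha_bars, rng=rng)
x_prev = reverse_step(x_t, 500, eps, betas, alphas, alpha_bars, rng)
```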
Synthesizing realistic and diverse indoor 3D scene layouts in a controllable fashion opens up applications in simulated navigation and virtual reality. As concise and robust representations of a scene, scene graphs have proven to be well-suited as the semantic control for the generated layout. We present a variant of the conditional variational autoencoder (cVAE) model to synthesize 3D scenes from scene graphs and floor plans. We exploit the properties of self-attention layers to capture high-level relationships between objects in a scene, and use these as the building blocks of our model. Our model leverages graph transformers to estimate the size, dimension, and orientation of the objects in a room while satisfying the relationships in the given scene graph. Our experiments show that self-attention layers lead to sparser (HOW MUCH) and more diverse (HOW MUCH) scenes. As part of this work, we publish the first large-scale dataset for conditioned scene generation from scene graphs, containing over XXX rooms (with floor plans and scene graphs).
https://arxiv.org/abs/2404.01887
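The graph-transformer building block amounts to self-attention over the object nodes followed by a head that regresses each object's box parameters. A toy PyTorch sketch (dimensions, the 7-D box parameterization, and the head are illustrative stand-ins; edge and floor-plan conditioning is omitted):

```python
import torch
import torch.nn as nn

class ObjectAttentionHead(nn.Module):
    """Toy self-attention block over scene-graph object nodes.

    Each node embedding attends to all others, then a small MLP regresses a
    7-D box (size 3, position 3, yaw 1) per object.
    """
    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.box_head = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                      nn.Linear(d_model, 7))

    def forward(self, node_emb):
        # node_emb: (batch, n_objects, d_model) embeddings of the object nodes
        mixed, _ = self.attn(node_emb, node_emb, node_emb)
        h = self.norm(node_emb + mixed)          # residual keeps per-node identity
        return self.box_head(h)                  # (batch, n_objects, 7) box parameters

boxes = ObjectAttentionHead()(torch.randn(2, 12, 128))
```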
Diffusion models (DMs) excel in photo-realistic image synthesis, but their adaptation to LiDAR scene generation poses a substantial hurdle. This is primarily because DMs operating in the point space struggle to preserve the curve-like patterns and 3D geometry of LiDAR scenes, which consumes much of their representation power. In this paper, we propose LiDAR Diffusion Models (LiDMs) to generate LiDAR-realistic scenes from a latent space tailored to capture the realism of LiDAR scenes by incorporating geometric priors into the learning pipeline. Our method targets three major desiderata: pattern realism, geometry realism, and object realism. Specifically, we introduce curve-wise compression to simulate real-world LiDAR patterns, point-wise coordinate supervision to learn scene geometry, and patch-wise encoding for a full 3D object context. With these three core designs, our method achieves competitive performance on unconditional LiDAR generation in 64-beam scenario and state of the art on conditional LiDAR generation, while maintaining high efficiency compared to point-based DMs (up to 107$\times$ faster). Furthermore, by compressing LiDAR scenes into a latent space, we enable the controllability of DMs with various conditions such as semantic maps, camera views, and text prompts. Our code and pretrained weights are available at this https URL.
https://arxiv.org/abs/2404.00815
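The curve-like patterns of LiDAR scenes come from the scan structure, which is exposed by representing the sweep as a range image: one row per beam, one column per azimuth bin. Curve-wise compression operates on that kind of representation. A standard spherical-projection sketch for a 64-beam setup (field-of-view bounds and resolution are illustrative):

```python
import numpy as np

def pointcloud_to_range_image(points, n_beams=64, n_cols=1024,
                              fov_up_deg=3.0, fov_down_deg=-25.0):
    """Project an (N, 3) LiDAR point cloud to an (n_beams, n_cols) range image.

    Rows index elevation (beam), columns index azimuth; each cell stores range.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1) + 1e-8
    azimuth = np.arctan2(y, x)                                  # [-pi, pi]
    elevation = np.arcsin(z / r)
    fov_up, fov_down = np.radians(fov_up_deg), np.radians(fov_down_deg)

    col = ((azimuth / (2 * np.pi) + 0.5) * n_cols).astype(int) % n_cols
    row = ((fov_up - elevation) / (fov_up - fov_down) * n_beams).astype(int)
    keep = (row >= 0) & (row < n_beams)

    image = np.zeros((n_beams, n_cols), dtype=np.float32)
    image[row[keep], col[keep]] = r[keep]                       # last point wins per cell (sketch)
    return image

img = pointcloud_to_range_image(np.random.randn(100000, 3) * 20)
```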
The generation of 3D scenes from user-specified conditions offers a promising avenue for alleviating the production burden in 3D applications. Previous studies required significant effort to realize the desired scene, owing to limited control conditions. We propose a method for controlling and generating 3D scenes under multimodal conditions using partial images, layout information represented in the top view, and text prompts. Combining these conditions to generate a 3D scene involves the following significant difficulties: (1) the creation of large datasets, (2) reflection on the interaction of multimodal conditions, and (3) domain dependence of the layout conditions. We decompose the process of 3D scene generation into 2D image generation from the given conditions and 3D scene generation from 2D images. 2D image generation is achieved by fine-tuning a pretrained text-to-image model with a small artificial dataset of partial images and layouts, and 3D scene generation is achieved by layout-conditioned depth estimation and neural radiance fields (NeRF), thereby avoiding the creation of large datasets. The use of a common representation of spatial information using 360-degree images allows for the consideration of multimodal condition interactions and reduces the domain dependence of the layout control. The experimental results qualitatively and quantitatively demonstrated that the proposed method can generate 3D scenes in diverse domains, from indoor to outdoor, according to multimodal conditions.
https://arxiv.org/abs/2404.00345
Recent techniques for text-to-4D generation synthesize dynamic 3D scenes using supervision from pre-trained text-to-video models. However, existing representations for motion, such as deformation models or time-dependent neural representations, are limited in the amount of motion they can generate: they cannot synthesize motion extending far beyond the bounding box used for volume rendering. The lack of a more flexible motion model contributes to the gap in realism between 4D generation methods and recent, near-photorealistic video generation models. Here, we propose TC4D: trajectory-conditioned text-to-4D generation, which factors motion into global and local components. We represent the global motion of a scene's bounding box using rigid transformation along a trajectory parameterized by a spline. We learn local deformations that conform to the global trajectory using supervision from a text-to-video model. Our approach enables the synthesis of scenes animated along arbitrary trajectories, compositional scene generation, and significant improvements to the realism and amount of generated motion, which we evaluate qualitatively and through a user study. Video results can be viewed on our website: this https URL.
https://arxiv.org/abs/2403.17920
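The global motion component is a rigid transform of the scene bounding box along a spline trajectory; a natural choice is to translate to the spline point and yaw toward the tangent direction at each time. A minimal sketch using a cubic spline through user-given waypoints (the heading rule and waypoints are illustrative assumptions, and the learned local deformation is omitted):

```python
import numpy as np
from scipy.interpolate import CubicSpline

def global_rigid_motion(points, waypoints, t):
    """Translate/rotate canonical points along a spline trajectory at time t in [0, 1].

    points    : (N, 3) canonical scene/bounding-box points
    waypoints : (K, 3) trajectory control points; a cubic spline interpolates them
    The heading is taken from the trajectory tangent in the xz-plane (yaw only).
    """
    u = np.linspace(0.0, 1.0, len(waypoints))
    spline = CubicSpline(u, waypoints, axis=0)
    pos = spline(t)                       # position on the trajectory
    tangent = spline(t, 1)                # first derivative = direction of travel
    yaw = np.arctan2(tangent[0], tangent[2])
    R = np.array([[ np.cos(yaw), 0, np.sin(yaw)],
                  [0, 1, 0],
                  [-np.sin(yaw), 0, np.cos(yaw)]])
    return points @ R.T + pos

waypts = np.array([[0, 0, 0], [1, 0, 2], [3, 0, 3], [5, 0, 2]], dtype=float)
frame = global_rigid_motion(np.random.rand(200, 3) - 0.5, waypts, t=0.4)
```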
Recent advancements in diffusion models for 2D and 3D content creation have sparked a surge of interest in generating 4D content. However, the scarcity of 3D scene datasets constrains current methodologies to primarily object-centric generation. To overcome this limitation, we present Comp4D, a novel framework for Compositional 4D Generation. Unlike conventional methods that generate a singular 4D representation of the entire scene, Comp4D innovatively constructs each 4D object within the scene separately. Utilizing Large Language Models (LLMs), the framework begins by decomposing an input text prompt into distinct entities and maps out their trajectories. It then constructs the compositional 4D scene by accurately positioning these objects along their designated paths. To refine the scene, our method employs a compositional score distillation technique guided by the pre-defined trajectories, utilizing pre-trained diffusion models across text-to-image, text-to-video, and text-to-3D domains. Extensive experiments demonstrate our outstanding 4D content creation capability compared to prior arts, showcasing superior visual quality, motion fidelity, and enhanced object interactions.
https://arxiv.org/abs/2403.16993