3D content creation plays a vital role in various applications, such as gaming, robotics simulation, and virtual reality. However, the process is labor-intensive and time-consuming, requiring skilled designers to invest considerable effort in creating a single 3D asset. To address this challenge, text-to-3D generation technologies have emerged as a promising solution for automating 3D creation. Leveraging the success of large vision-language models, these techniques aim to generate 3D content from textual descriptions. Despite recent advancements in this area, existing solutions still face significant limitations in generation quality and efficiency. In this survey, we conduct an in-depth investigation of the latest text-to-3D creation methods. We provide a comprehensive background on text-to-3D creation, including discussions of the datasets employed in training and the evaluation metrics used to assess the quality of generated 3D models. We then delve into the various 3D representations that serve as the foundation of the 3D generation process. Furthermore, we present a thorough comparison of the rapidly growing literature on generative pipelines, categorizing them into feedforward generation, optimization-based generation, and view-reconstruction approaches. By examining the strengths and weaknesses of these methods, we aim to shed light on their respective capabilities and limitations. Lastly, we point out several promising avenues for future research. With this survey, we hope to inspire researchers to further explore the potential of open-vocabulary, text-conditioned 3D content creation.
https://arxiv.org/abs/2405.09431
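One concrete example of the evaluation metrics such surveys discuss is CLIP-based text-image consistency: render the generated asset from several viewpoints and average the CLIP similarity between each view and the prompt. A minimal sketch, assuming rendered views are available as PIL images (the checkpoint choice is ours, not the survey's):

```python
# Hedged sketch: CLIP similarity between a text prompt and rendered views,
# a common automatic consistency metric for text-to-3D evaluation.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(prompt, views):
    """views: list of PIL images rendered from the generated 3D asset."""
    inputs = processor(text=[prompt], images=views,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    t = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    v = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    return (v @ t.T).mean().item()  # average view-prompt cosine similarity
```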
This research reports VascularPilot3D, the first 3D fully autonomous endovascular robot navigation system. As an exploration toward autonomous guidewire navigation, VascularPilot3D is developed as a complete navigation system based on intra-operative imaging (fluoroscopic X-ray in this study) and typical endovascular robots. VascularPilot3D adopts previously researched fast 3D-2D vessel registration algorithms and guidewire segmentation methods as its perception modules. We additionally propose three modules: a topology-constrained 2D-3D instrument end-point lifting method, a tree-based fast path planning algorithm, and a prior-free endovascular navigation strategy. VascularPilot3D is compatible with most mainstream endovascular robots. Ex-vivo experiments validate that VascularPilot3D achieves a 100% success rate across 25 trials and reduces the human surgeon's overall control loops by 18.38%. VascularPilot3D is promising for general clinical autonomous endovascular navigation.
https://arxiv.org/abs/2405.09375
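Since vessels form a tree, the route between any two points is unique, which is what makes tree-based planning fast. A minimal sketch of that idea, assuming a parent-pointer vessel tree (the data structure and names are ours, not the paper's):

```python
# Hedged sketch: on a tree-structured vessel graph, the route between two
# nodes is unique and found by walking parent pointers to a common ancestor.
def tree_path(parent, start, goal):
    """parent maps each node to its parent in the vessel tree (root -> None)."""
    def ancestors(n):
        chain = []
        while n is not None:
            chain.append(n)
            n = parent[n]
        return chain  # node .. root

    a, b = ancestors(start), ancestors(goal)
    b_set = set(b)
    common = next(n for n in a if n in b_set)        # lowest common ancestor
    up = a[:a.index(common) + 1]                     # start -> common
    down = list(reversed(b[:b.index(common)]))       # common -> goal
    return up + down

parent = {"root": None, "A": "root", "B": "A", "C": "A", "D": "root"}
print(tree_path(parent, "B", "D"))  # ['B', 'A', 'root', 'D']
```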
While content-based image retrieval (CBIR) has been extensively studied for natural images, its application to medical images presents ongoing challenges, primarily due to the 3D nature of medical images. Recent studies have shown the potential of pre-trained vision embeddings for CBIR in the context of radiology image retrieval. However, a benchmark for the retrieval of 3D volumetric medical images is still lacking, hindering the ability to objectively evaluate and compare the efficiency of proposed CBIR approaches in medical imaging. In this study, we extend previous work and establish a benchmark for region-based and multi-organ retrieval using the TotalSegmentator dataset (TS) with detailed multi-organ annotations. We benchmark embeddings derived from supervised models pre-trained on medical images against embeddings derived from unsupervised models pre-trained on non-medical images, for 29 coarse and 104 detailed anatomical structures at the volume and region levels. We adopt a late interaction re-ranking method inspired by text matching for image retrieval, comparing it against the method originally proposed for volume and region retrieval, and achieve a retrieval recall of 1.0 for diverse anatomical regions spanning a wide range of sizes. The findings and methodologies presented in this paper provide essential insights and benchmarks for the development and evaluation of CBIR approaches in medical imaging.
https://arxiv.org/abs/2405.09334
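The late-interaction re-ranking is described as inspired by text matching; in ColBERT-style late interaction, each query sub-embedding keeps only its best match among the candidate's sub-embeddings and the maxima are summed. A minimal sketch under that reading (the paper's exact formulation may differ):

```python
# Hedged sketch of ColBERT-style late interaction (MaxSim) for re-ranking.
import numpy as np

def late_interaction_score(query_emb, cand_emb):
    """query_emb: (Nq, D), cand_emb: (Nc, D); rows L2-normalized."""
    sim = query_emb @ cand_emb.T          # (Nq, Nc) cosine similarities
    return float(sim.max(axis=1).sum())   # MaxSim, summed over query parts

def rerank(query_emb, candidates):
    """candidates: list of (volume_id, emb); returns ids best-first."""
    scored = sorted(((late_interaction_score(query_emb, e), vid)
                     for vid, e in candidates), reverse=True)
    return [vid for _, vid in scored]
```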
In this paper, we present an innovative technique for the path planning of flying robots in a 3D environment in terms of Rough Mereology. The main goal was to construct an algorithm that generates mereological potential fields in 3-dimensional space. To avoid falling into local minima, we employ a weighted Euclidean distance. Moreover, a path from the start point to the target is searched while avoiding obstacles. The environment was created by connecting two cameras working in real time. The Python library OpenCV [1], which recognizes shapes and colors, was responsible for determining the gate and the elements of the world inside the map. The main purpose of this paper is to apply the given results to drones.
https://arxiv.org/abs/2405.09282
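A hedged sketch of the planning loop: descend a potential field whose attractive term is a weighted Euclidean distance to the target (the device used above to mitigate local minima), plus an illustrative repulsive term for obstacles. The weights, constants, and neighbor set are placeholders, not the paper's:

```python
# Hedged sketch: greedy descent on a potential field whose attractive term is
# a weighted Euclidean distance to the target; constants are illustrative.
import numpy as np

def potential(p, goal, obstacles, w, k_rep=1.0, influence=2.0):
    u = np.sqrt(np.sum(w * (p - goal) ** 2))       # weighted attractive term
    for obs in obstacles:                          # simple repulsive terms
        d = np.linalg.norm(p - obs)
        if d < influence:
            u += k_rep * (1.0 / max(d, 1e-6) - 1.0 / influence)
    return u

def step(p, goal, obstacles, w, step_size=0.1):
    """Move to the lowest-potential axis-aligned neighbor in 3D."""
    dirs = np.vstack([np.eye(3), -np.eye(3)])
    candidates = [p + step_size * d for d in dirs]
    return min(candidates, key=lambda c: potential(c, goal, obstacles, w))
```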
The detection and tracking of small targets in passive optical remote sensing (PORS) has broad applications. However, most previously proposed methods seldom utilize the abundant temporal features formed by target motion, resulting in poor detection and tracking performance for low signal-to-clutter ratio (SCR) targets. In this article, we analyze the difficulty of realizing effective detection with spatial features alone and its feasibility with temporal features. Based on this analysis, we use multiple frames as a detection unit and propose a detection method based on temporal energy selective scaling (TESS). Specifically, we investigate the composition of the intensity temporal profiles (ITPs) formed by pixels over a multi-frame detection unit. For a target-present pixel, the target passing through the pixel introduces a weak transient disturbance on the ITP and changes its statistical properties. We use a well-designed function to amplify the transient disturbance, suppress the background and noise components, and output the trajectory of the target over the multi-frame detection unit. Subsequently, to resolve the trade-off between detection rate and false-alarm rate introduced by traditional threshold segmentation, we associate the temporal and spatial features of the output trajectory and propose a trajectory extraction method based on the 3D Hough transform. Finally, we model the trajectory of the target and propose a trajectory-based multi-target tracking method. Experiments in multiple scenarios demonstrate the superiority of our proposed methods over various state-of-the-art detection and tracking methods.
https://arxiv.org/abs/2405.09054
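The abstract does not give the exact scaling function, but the general idea can be sketched: normalize each pixel's intensity temporal profile against its temporal baseline and nonlinearly amplify brief deviations, so a transient target crossing stands out from slowly varying clutter. A sketch under assumed choices (median baseline, power-law amplification), not the paper's actual function:

```python
# Hedged sketch of the idea behind temporal energy selective scaling: amplify
# brief deviations of each pixel's intensity temporal profile (ITP) from its
# baseline while suppressing slowly varying background and noise.
import numpy as np

def energy_map(cube, gamma=2.0):
    """cube: (T, H, W) multi-frame detection unit -> (H, W) response map."""
    baseline = np.median(cube, axis=0)               # per-pixel background level
    noise = np.std(cube, axis=0) + 1e-6              # per-pixel noise scale
    dev = (cube - baseline) / noise                  # normalized disturbance
    amplified = np.sign(dev) * np.abs(dev) ** gamma  # selective amplification
    return amplified.max(axis=0)                     # strongest transient per pixel
```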
Recent advancements in deep learning for 3D models have propelled breakthroughs in generation, detection, and scene understanding. However, the effectiveness of these algorithms hinges on large training datasets. We address this challenge by introducing Efficient 3D Seam Carving (E3SC), a novel 3D model augmentation method based on seam carving, which progressively deforms only part of the input model while ensuring the overall semantics remain unchanged. Experiments show that our approach is capable of producing diverse and high-quality augmented 3D shapes across various types and styles of input models, achieving considerable improvements over previous methods. Quantitative evaluations demonstrate that our method effectively enhances the novelty and quality of shapes generated by subsequent 3D generation algorithms.
https://arxiv.org/abs/2405.09050
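For readers unfamiliar with the underlying operator, here is the classic 2D seam-carving dynamic program that E3SC generalizes to 3D models; the energy and seam definitions follow the original image-resizing formulation, not E3SC itself:

```python
# Classic 2D seam carving: find the minimum-energy vertical path (one column
# index per row) via dynamic programming, then backtrack.
import numpy as np

def min_vertical_seam(energy):
    """energy: (H, W) non-negative map -> list of column indices, one per row."""
    h, w = energy.shape
    cost = energy.astype(float).copy()
    for i in range(1, h):
        for j in range(w):
            lo, hi = max(j - 1, 0), min(j + 2, w)
            cost[i, j] += cost[i - 1, lo:hi].min()
    seam = [int(np.argmin(cost[-1]))]
    for i in range(h - 2, -1, -1):       # backtrack from bottom row
        j = seam[-1]
        lo = max(j - 1, 0)
        seam.append(lo + int(np.argmin(cost[i, lo:min(j + 2, w)])))
    return seam[::-1]
```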
We perform a detailed theoretical analysis of an expectation-maximization-based algorithm recently proposed for solving a variation of the 3D registration problem, named multi-model 3D registration. Despite showing superior empirical results, the original work did not theoretically justify the conditions under which the EM approach converges to the ground truth. In this project, we aim to close this gap by establishing such conditions. In particular, the analysis revolves around probabilistic tail bounds that are developed and applied in various instances throughout the course. The problem studied in this project stands as another example, different from those seen in the course, in which tail bounds help advance our algorithmic understanding in a probabilistic way. We also provide self-contained background material on 3D registration.
https://arxiv.org/abs/2405.08991
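To make the analyzed algorithm concrete, a generic EM iteration for a mixture of rigid motions alternates soft assignment of correspondences (E-step) with a weighted Procrustes refit per motion (M-step). The Gaussian residual model below is a common choice and not necessarily the paper's exact setup:

```python
# Hedged sketch of one EM iteration for multi-model rigid registration.
import numpy as np

def kabsch(P, Q, w):
    """Weighted rigid fit (R, t) minimizing sum_i w_i ||R p_i + t - q_i||^2."""
    w = w / w.sum()
    mp, mq = w @ P, w @ Q                            # weighted centroids
    H = (P - mp).T @ ((Q - mq) * w[:, None])         # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))           # avoid reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, mq - R @ mp

def em_step(P, Q, motions, sigma=0.05):
    """P, Q: (N, 3) correspondences; motions: list of (R, t) hypotheses."""
    resid = np.stack([np.linalg.norm(P @ R.T + t - Q, axis=1)
                      for R, t in motions])                     # (K, N)
    resp = np.exp(-resid ** 2 / (2 * sigma ** 2))
    resp /= resp.sum(axis=0, keepdims=True) + 1e-12             # E-step
    return [kabsch(P, Q, resp[k]) for k in range(len(motions))] # M-step
```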
Many query-based approaches for 3D Multi-Object Tracking (MOT) adopt the tracking-by-attention paradigm, utilizing track queries for identity-consistent detection and object queries for identity-agnostic track spawning. Tracking-by-attention, however, entangles detection and tracking queries in one embedding for both the detection and tracking tasks, which is sub-optimal. Other approaches resemble the tracking-by-detection paradigm, detecting objects using decoupled track and detection queries followed by a subsequent association. These methods, however, do not leverage synergies between the detection and association tasks. Combining the strengths of both paradigms, we introduce ADA-Track, a novel end-to-end framework for 3D MOT from multi-view cameras. We propose a learnable data association module based on edge-augmented cross-attention, leveraging appearance and geometric features. Furthermore, we integrate this association module into the decoder layer of a DETR-based 3D detector, enabling simultaneous DETR-like query-to-image cross-attention for detection and query-to-query cross-attention for data association. By stacking these decoder layers, queries are refined for the detection and association tasks alternately, effectively harnessing the task dependencies. We evaluate our method on the nuScenes dataset and demonstrate the advantage of our approach compared to the two previous paradigms. Code is available at this https URL.
https://arxiv.org/abs/2405.08909
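A hedged sketch of edge-augmented cross-attention for association: the attention logits between track and detection queries are biased by a projection of pairwise edge features (appearance and geometric cues). Shapes and the edge encoding below are illustrative assumptions, not ADA-Track's exact design:

```python
# Hedged sketch: cross-attention whose logits are biased by pairwise edge
# features, yielding a soft track-detection association matrix.
import torch
import torch.nn as nn

class EdgeAugmentedCrossAttention(nn.Module):
    def __init__(self, dim, edge_dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.edge_bias = nn.Linear(edge_dim, 1)  # scalar bias per (track, det)

    def forward(self, tracks, dets, edges):
        """tracks: (T, D), dets: (N, D), edges: (T, N, E) pairwise features."""
        logits = self.q(tracks) @ self.k(dets).T / tracks.shape[-1] ** 0.5
        logits = logits + self.edge_bias(edges).squeeze(-1)   # (T, N)
        attn = logits.softmax(dim=-1)       # soft association matrix
        return attn @ self.v(dets), attn
```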
The Prostate Imaging Reporting and Data System (PI-RADS) is pivotal in the diagnosis of clinically significant prostate cancer through MRI imaging. Current deep learning-based PI-RADS scoring methods often lack the incorporation of the essential PI-RADS clinical guidelines (PICG) utilized by radiologists, potentially compromising scoring accuracy. This paper introduces a novel approach that adapts a multi-modal large language model (MLLM) to incorporate PICG into PI-RADS scoring without additional annotations or network parameters. We present a two-stage fine-tuning process aimed at adapting MLLMs originally trained on natural images to the MRI data domain while effectively integrating the PICG. In the first stage, we develop a domain adapter layer specifically tailored for processing 3D MRI image inputs and design the MLLM instructions to differentiate MRI modalities effectively. In the second stage, we translate PICG into guiding instructions for the model to generate PICG-guided image features. Through feature distillation, we align scoring network features with the PICG-guided image features, enabling the scoring network to effectively incorporate the PICG information. We develop our model on a public dataset and evaluate it on a challenging real-world in-house dataset. Experimental results demonstrate that our approach improves the performance of current scoring networks.
https://arxiv.org/abs/2405.08786
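The feature-distillation step can be sketched as pulling the scoring network's features toward the frozen, PICG-guided MLLM features; the normalized-MSE form below is a common distillation objective and may differ from the paper's exact loss:

```python
# Hedged sketch of feature distillation against PICG-guided teacher features.
import torch
import torch.nn.functional as F

def distillation_loss(scoring_feat, picg_feat):
    """Both tensors: (B, D). picg_feat comes from the frozen, PICG-guided MLLM."""
    s = F.normalize(scoring_feat, dim=-1)
    t = F.normalize(picg_feat.detach(), dim=-1)  # teacher is not updated
    return F.mse_loss(s, t)
```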
We present an analytic solution to the 3D Dubins path problem for paths composed of an initial circular arc, a straight component, and a final circular arc, commonly called CSC paths. By modeling the start and goal configurations of the path as the base frame and final frame of an RRPRR manipulator, we treat this as an inverse kinematics problem. The kinematic features of the 3D Dubins path are built into the constraints of our manipulator model. Furthermore, we show that the number of solutions is not constant, with up to seven valid CSC path solutions even in non-singular regions. An implementation of the solution is available at this https URL.
https://arxiv.org/abs/2405.08710
Autonomous intersection management (AIM) poses significant challenges due to the intricate nature of real-world traffic scenarios and the need for a highly expensive centralised server in charge of simultaneously controlling all the vehicles. This study addresses such issues by proposing a novel distributed approach to AIM utilizing multi-agent reinforcement learning (MARL). We show that by leveraging the 3D surround-view technology of advanced assistance systems, autonomous vehicles can accurately navigate intersection scenarios without needing any centralised controller. The contributions of this paper thus include a MARL-based algorithm for the autonomous management of a 4-way intersection, as well as a new strategy called prioritised scenario replay for improved training efficacy. We validate our approach as an innovative alternative to conventional centralised AIM techniques, ensuring the full reproducibility of our results. Specifically, experiments conducted in virtual environments using the SMARTS platform highlight its superiority over benchmarks across various metrics.
https://arxiv.org/abs/2405.08655
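A hedged sketch of prioritized scenario replay: training scenarios are re-sampled with probability proportional to a difficulty priority, for example one driven by recent failures. The weighting and decay below are illustrative assumptions, not the paper's scheme:

```python
# Hedged sketch: sample intersection scenarios for training in proportion to
# a failure-driven priority, so hard scenarios are replayed more often.
import random

class PrioritizedScenarioReplay:
    def __init__(self, scenarios, alpha=1.0):
        self.scenarios = list(scenarios)
        self.priority = {s: 1.0 for s in self.scenarios}
        self.alpha = alpha  # sharpness of the prioritization

    def sample(self):
        weights = [self.priority[s] ** self.alpha for s in self.scenarios]
        return random.choices(self.scenarios, weights=weights, k=1)[0]

    def update(self, scenario, success):
        # Failures raise a scenario's priority; successes let it decay.
        self.priority[scenario] = 0.9 * self.priority[scenario] + \
            (0.1 if success else 1.0)
```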
Neural Radiance Fields (NeRF) are a novel implicit method for achieving high-resolution 3D reconstruction and representation. Since the first NeRF work was proposed, NeRF has gained strong momentum and is booming in the areas of 3D modeling, representation, and reconstruction. However, the original work and most follow-up research based on NeRF are static, which limits their practical applications. Therefore, more researchers have become interested in and focused on the study of dynamic NeRF, which is more feasible and useful in practical applications and situations. Compared with static NeRF, implementing dynamic NeRF is more difficult and complex, but it holds greater potential for the future and even forms the basis of editable NeRF. In this review, we give a detailed and thorough account of the development and key implementation principles of dynamic NeRF. Our analysis of the main principles and developments covers the period from 2021 to 2023 and includes most of the dynamic NeRF projects. Moreover, with specially designed figures and tables, we provide a detailed comparison and analysis of the different features of the various dynamic NeRF methods. We also analyze and discuss the key methods for implementing a dynamic NeRF. The body of referenced papers is large, and the statements and comparisons are multidimensional. By reading this review, readers can easily grasp the whole development history and the main design methods and principles of dynamic NeRF.
https://arxiv.org/abs/2405.08609
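Many of the dynamic NeRF methods covered in such reviews use a deformation-field formulation: a time-conditioned MLP warps each sample point into a canonical frame, where a standard static NeRF is queried. A minimal sketch of that pattern (positional encoding omitted for brevity):

```python
# Hedged sketch of a deformation-field dynamic NeRF: warp (x, t) into a
# canonical frame, then query a standard static NeRF there.
import torch
import torch.nn as nn

class DeformationField(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3))  # predicted offset into the canonical frame

    def forward(self, x, t):
        """x: (N, 3) sample points, t: (N, 1) timestamps -> canonical points."""
        return x + self.mlp(torch.cat([x, t], dim=-1))

deform = DeformationField()
x_canonical = deform(torch.rand(1024, 3), torch.rand(1024, 1))
# density, color = canonical_nerf(x_canonical, view_dir)  # as in static NeRF
```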
In this work, we introduce a novel method for calculating the 6DoF pose of an object from a single RGB-D image. Unlike existing methods that either directly predict object poses or rely on sparse keypoints for pose recovery, our approach addresses this challenging task using dense correspondence, i.e., we regress the object coordinates for each visible pixel. Our method leverages existing object detection methods. We incorporate a re-projection mechanism to adjust the camera's intrinsic matrix to accommodate cropping in RGB-D images. Moreover, we transform the 3D object coordinates into a residual representation, which effectively reduces the output space and yields superior performance. We conducted extensive experiments to validate the efficacy of our approach for 6D pose estimation. Our approach outperforms most previous methods, especially in occlusion scenarios, and demonstrates notable improvements over state-of-the-art methods. Our code is available at this https URL.
https://arxiv.org/abs/2405.08483
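The re-projection mechanism can be made concrete: cropping a detection at (x0, y0) and resizing the patch by a scale s is equivalent to shifting the principal point and scaling the focal lengths of the intrinsic matrix. A minimal sketch of that adjustment (variable names are ours):

```python
# Sketch: adjust camera intrinsics so that they remain valid for a cropped
# and resized RGB-D patch around a detected object.
import numpy as np

def adjust_intrinsics(K, x0, y0, s):
    """K: 3x3 intrinsics; crop origin (x0, y0); resize scale s."""
    K_new = K.copy()
    K_new[0, 0] *= s                   # fx scales with the resize
    K_new[1, 1] *= s                   # fy scales with the resize
    K_new[0, 2] = s * (K[0, 2] - x0)   # cx shifts by the crop, then scales
    K_new[1, 2] = s * (K[1, 2] - y0)   # cy shifts by the crop, then scales
    return K_new
```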
Image matching remains challenging in scenes with large viewpoint or illumination changes, or with low texture. In this paper, we propose a Transformer-based pseudo-3D image matching method. It upgrades the 2D features extracted from the source image to 3D features with the help of a reference image, and matches them to the 2D features extracted from the destination image through coarse-to-fine 3D matching. Our key discovery is that by introducing the reference image, the fine points of the source image are screened and their feature descriptors are further enriched from 2D to 3D, which improves match performance with the destination image. Experimental results on multiple datasets show that the proposed method achieves state-of-the-art performance on homography estimation, pose estimation, and visual localization, especially in challenging scenes.
https://arxiv.org/abs/2405.08434
Current architectures for video understanding mainly build upon 3D convolutional blocks or 2D convolutions with additional operations for temporal modeling. However, these methods all regard the temporal axis as a separate dimension of the video sequence, which requires large computation and memory budgets and thus limits their usage on mobile devices. In this paper, we propose to squeeze the time axis of a video sequence into the channel dimension and present a lightweight video recognition network, termed SqueezeTime, for mobile video understanding. To enhance the temporal modeling capability of the proposed network, we design a Channel-Time Learning (CTL) block to capture the temporal dynamics of the sequence. This module has two complementary branches: one learns temporal importance, while the other, equipped with temporal position restoring capability, enhances inter-temporal object modeling. The proposed SqueezeTime is lightweight and fast, with high accuracy for mobile video understanding. Extensive experiments on various video recognition and action detection benchmarks, i.e., Kinetics400, Kinetics600, HMDB51, AVA2.1, and THUMOS14, demonstrate the superiority of our model. For example, our SqueezeTime achieves a +1.2% accuracy and +80% GPU throughput gain on Kinetics400 over prior methods. Codes are publicly available at this https URL and this https URL.
https://arxiv.org/abs/2405.08344
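The core squeeze is a plain reshape that folds the temporal axis into channels, after which cheap 2D convolutions see all frames at once; the CTL block itself is not reproduced here. A minimal sketch:

```python
# Sketch of the time-to-channel squeeze: (B, C, T, H, W) -> (B, C*T, H, W),
# letting ordinary 2D convolutions process the whole clip in one pass.
import torch
import torch.nn as nn

def squeeze_time(video):
    b, c, t, h, w = video.shape
    return video.reshape(b, c * t, h, w)

video = torch.randn(2, 3, 16, 224, 224)             # batch of 16-frame RGB clips
x = squeeze_time(video)                              # (2, 48, 224, 224)
stem = nn.Conv2d(48, 64, kernel_size=3, padding=1)   # plain 2D conv over C*T
y = stem(x)
```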
Perivascular spaces (PVSs) form a central component of the brain's waste clearance system, the glymphatic system. These structures are visible on MRI images, and their morphology is associated with aging and neurological disease. Manual quantification of PVS is time-consuming and subjective. Numerous deep learning methods for PVS segmentation have been developed; however, the majority have been developed and evaluated on homogeneous datasets and high-resolution scans, perhaps limiting their applicability to the wide range of image qualities acquired in clinical and research settings. In this work, we train an nnUNet, a top-performing biomedical image segmentation algorithm, on a heterogeneous training sample of manually segmented MRI images of a range of different qualities and resolutions from 6 different datasets. These are compared to publicly available deep learning methods for 3D segmentation of PVS. The resulting model, PINGU (Perivascular space Identification Nnunet for Generalised Usage), achieved voxel- and cluster-level Dice scores of 0.50 (SD=0.15) and 0.63 (0.17) in the white matter (WM), and 0.54 (0.11) and 0.66 (0.17) in the basal ganglia (BG). Performance on data from unseen sites was substantially lower for both PINGU (0.20-0.38 (WM, voxel), 0.29-0.58 (WM, cluster), 0.22-0.36 (BG, voxel), 0.46-0.60 (BG, cluster)) and the publicly available algorithms (0.18-0.30 (WM, voxel), 0.29-0.38 (WM, cluster), 0.10-0.20 (BG, voxel), 0.15-0.37 (BG, cluster)), but PINGU strongly outperformed the publicly available algorithms, particularly in the BG. Finally, training PINGU on manual segmentations from a single site with homogeneous scan properties gave marginally lower performance on internal cross-validation, but in some cases higher performance on external validation. PINGU stands out as a broad-use PVS segmentation tool, with particular strength in the BG, an area of PVS related to vascular disease and pathology.
https://arxiv.org/abs/2405.08337
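On the two evaluation granularities reported above: voxel-level Dice compares masks directly, while a cluster-level score counts connected components that overlap between prediction and ground truth. The matching rule below is one plausible reading, not necessarily the paper's exact definition:

```python
# Hedged sketch of voxel-level Dice and a cluster-level (connected-component)
# F1-style Dice for 3D PVS masks.
import numpy as np
from scipy import ndimage

def voxel_dice(pred, gt):
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + 1e-8)

def cluster_dice(pred, gt):
    pred_lab, n_pred = ndimage.label(pred)
    gt_lab, n_gt = ndimage.label(gt)
    if n_pred == 0 or n_gt == 0:
        return 0.0
    hit_pred = np.setdiff1d(np.unique(pred_lab[gt > 0]), [0]).size
    hit_gt = np.setdiff1d(np.unique(gt_lab[pred > 0]), [0]).size
    precision, recall = hit_pred / n_pred, hit_gt / n_gt
    return 2 * precision * recall / (precision + recall + 1e-8)
```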
Point cloud filtering is a fundamental 3D vision task, which aims to remove noise while recovering the underlying clean surfaces. State-of-the-art methods remove noise by moving noisy points along stochastic trajectories to the clean surfaces. These methods often require regularization within the training objective and/or during post-processing to ensure fidelity. In this paper, we introduce StraightPCF, a new deep learning based method for point cloud filtering. It works by moving noisy points along straight paths, thus reducing discretization errors while ensuring faster convergence to the clean surfaces. We model noisy patches as intermediate states between high-noise patch variants and their clean counterparts, and design the VelocityModule to infer a constant flow velocity from the former to the latter. This constant flow leads to straight filtering trajectories. In addition, we introduce a DistanceModule that scales the straight trajectory using an estimated distance scalar to attain convergence near the clean surface. Our network is lightweight, with only ~530K parameters, 17% of those of IterativePFN (a recent point cloud filtering network). Extensive experiments on both synthetic and real-world data show our method achieves state-of-the-art results. Our method also produces well-distributed filtered points without the need for regularization. The implementation code can be found at: this https URL.
https://arxiv.org/abs/2405.08322
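The straight-path idea can be sketched directly from the abstract: a VelocityModule-style network predicts a constant flow direction per point and a DistanceModule-style network predicts how far to travel along it, so each noisy point moves along a single straight segment. The stand-in networks below are toy placeholders, not the paper's modules:

```python
# Hedged sketch of one straight filtering displacement; velocity_net and
# distance_net are toy stand-ins for VelocityModule/DistanceModule.
import torch

def straight_filter_step(points, velocity_net, distance_net):
    """points: (N, 3) noisy point cloud -> (N, 3) filtered points."""
    v = velocity_net(points)                       # constant flow direction
    v = v / (v.norm(dim=-1, keepdim=True) + 1e-8)  # unit velocity
    d = distance_net(points)                       # (N, 1) distance scalar
    return points + d * v                          # one straight segment

velocity_net = lambda p: -p                            # toy: flow toward origin
distance_net = lambda p: 0.1 * torch.ones(len(p), 1)   # toy: fixed distance
filtered = straight_filter_step(torch.randn(100, 3), velocity_net, distance_net)
```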
We present Infinite Texture, a method for generating arbitrarily large texture images from a text prompt. Our approach fine-tunes a diffusion model on a single texture, and learns to embed that statistical distribution in the output domain of the model. We seed this fine-tuning process with a sample texture patch, which can be optionally generated from a text-to-image model like DALL-E 2. At generation time, our fine-tuned diffusion model is used through a score aggregation strategy to generate output texture images of arbitrary resolution on a single GPU. We compare synthesized textures from our method to existing work in patch-based and deep learning texture synthesis methods. We also showcase two applications of our generated textures in 3D rendering and texture transfer.
https://arxiv.org/abs/2405.08210
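A hedged sketch of score aggregation for arbitrary-resolution synthesis: at each denoising step, the fine-tuned model's predictions on overlapping tiles are averaged back into the full canvas (in the spirit of MultiDiffusion-style tiling). Tile size, stride, and the denoiser interface are placeholders, and the canvas is assumed to be covered evenly:

```python
# Hedged sketch: average overlapping tile predictions into one large canvas,
# so a fixed-resolution denoiser can drive arbitrary-resolution synthesis.
import torch

def aggregate_tiles(canvas, denoise_tile, tile=64, stride=32):
    """canvas: (C, H, W) current noisy image -> aggregated prediction."""
    c, h, w = canvas.shape
    out = torch.zeros_like(canvas)
    count = torch.zeros(1, h, w)
    for y in range(0, h - tile + 1, stride):
        for x in range(0, w - tile + 1, stride):
            patch = canvas[:, y:y + tile, x:x + tile]
            out[:, y:y + tile, x:x + tile] += denoise_tile(patch)
            count[:, y:y + tile, x:x + tile] += 1
    return out / count.clamp(min=1)  # average where tiles overlap
```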
Automatically generating diverse and high-quality 3D assets is a fundamental yet challenging task in 3D computer vision. Despite extensive efforts in 3D generation, existing optimization-based approaches struggle to produce large-scale 3D assets efficiently. Meanwhile, feed-forward methods often focus on generating only a single category or a few categories, limiting their generalizability. Therefore, we introduce a diffusion-based feed-forward framework to address these challenges with a single model. To handle the large diversity and complexity in geometry and texture across categories efficiently, we 1) adopt an improved triplane representation to guarantee efficiency; 2) introduce a 3D-aware transformer to aggregate generalized 3D knowledge with specialized 3D features; and 3) devise a 3D-aware encoder/decoder to enhance the generalized 3D knowledge. Building upon our 3D-aware Diffusion model with TransFormer, DiffTF, we propose a stronger version for 3D generation, i.e., DiffTF++. It boils down to two parts: a multi-view reconstruction loss and triplane refinement. Specifically, we utilize the multi-view reconstruction loss to fine-tune the diffusion model and triplane decoder, thereby avoiding the negative influence of reconstruction errors and improving texture synthesis. By eliminating the mismatch between the two stages, the generative performance is enhanced, especially in texture. Additionally, a 3D-aware refinement process is introduced to filter out artifacts and refine triplanes, resulting in the generation of more intricate and reasonable details. Extensive experiments on ShapeNet and OmniObject3D convincingly demonstrate the effectiveness of our proposed modules and our state-of-the-art 3D object generation performance, with large diversity, rich semantics, and high quality.
https://arxiv.org/abs/2405.08055
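The triplane representation at the heart of DiffTF can be sketched compactly: a 3D point is projected onto the XY, XZ, and YZ feature planes, and the three bilinearly sampled features are summed. A minimal lookup (plane ordering and summation follow common triplane practice, not necessarily the paper's exact variant):

```python
# Hedged sketch of triplane feature lookup for points in [-1, 1]^3.
import torch
import torch.nn.functional as F

def sample_triplane(planes, pts):
    """planes: (3, C, R, R) feature planes; pts: (N, 3) -> (N, C) features."""
    coords = torch.stack([pts[:, [0, 1]],   # XY plane
                          pts[:, [0, 2]],   # XZ plane
                          pts[:, [1, 2]]])  # YZ plane          (3, N, 2)
    grid = coords.unsqueeze(2)                               # (3, N, 1, 2)
    feats = F.grid_sample(planes, grid, align_corners=True)  # (3, C, N, 1)
    return feats.squeeze(-1).sum(dim=0).T                    # (N, C)

planes = torch.randn(3, 32, 64, 64)
features = sample_triplane(planes, torch.rand(1024, 3) * 2 - 1)
```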
As humans, we aspire to create media content that is both freely willed and readily controlled. Thanks to the prominent development of generative techniques, we can now easily utilize 2D diffusion methods to synthesize images controlled by a raw sketch or designated human poses, and even progressively edit/regenerate local regions with masked inpainting. However, similar workflows in 3D modeling tasks are still unavailable due to the lack of controllability and efficiency in 3D generation. In this paper, we present a novel controllable and interactive 3D assets modeling framework, named Coin3D. Coin3D allows users to control the 3D generation using a coarse geometry proxy assembled from basic shapes, and introduces an interactive generation workflow to support seamless local part editing while delivering responsive 3D object previewing within a few seconds. To this end, we develop several techniques, including the 3D adapter that applies volumetric coarse shape control to the diffusion model, a proxy-bounded editing strategy for precise part editing, a progressive volume cache to support responsive preview, and volume-SDS to ensure consistent mesh reconstruction. Extensive experiments on interactive generation and editing of diverse shape proxies demonstrate that our method achieves superior controllability and flexibility in the 3D assets generation task.
https://arxiv.org/abs/2405.08054
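The coarse geometry proxy can be illustrated as an occupancy volume assembled from basic shapes, the kind of volumetric condition a 3D adapter could consume; the exact proxy encoding in Coin3D may differ:

```python
# Hedged sketch: rasterize spheres and boxes into a coarse occupancy volume
# as an illustrative stand-in for a basic-shape geometry proxy.
import numpy as np

def proxy_volume(spheres, boxes, res=32):
    """spheres: [(center, radius)]; boxes: [(min_corner, max_corner)] in [0,1]^3."""
    g = (np.indices((res, res, res)).T.reshape(-1, 3) + 0.5) / res  # voxel centers
    occ = np.zeros(len(g), dtype=bool)
    for c, r in spheres:
        occ |= np.linalg.norm(g - np.asarray(c), axis=1) <= r
    for lo, hi in boxes:
        occ |= np.all((g >= np.asarray(lo)) & (g <= np.asarray(hi)), axis=1)
    return occ.reshape(res, res, res)

vol = proxy_volume([([0.5, 0.5, 0.7], 0.2)],
                   [([0.3, 0.3, 0.1], [0.7, 0.7, 0.5])])
```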