With the advent of image super-resolution (SR) algorithms, how to evaluate the quality of generated SR images has become an urgent task. Although full-reference methods perform well in SR image quality assessment (SR-IQA), their reliance on high-resolution (HR) images limits their practical applicability. Leveraging the reconstruction information that is actually available, such as the low-resolution (LR) image and the scale factor, is a promising way to enhance assessment performance when no HR reference exists. In this letter, we attempt to evaluate the perceptual quality and reconstruction fidelity of SR images by taking LR images and scale factors into account. Specifically, we propose a novel dual-branch reduced-reference SR-IQA network, i.e., Perception- and Fidelity-aware SR-IQA (PFIQA). The perception-aware branch evaluates the perceptual quality of SR images by combining the global modeling of Vision Transformers (ViT) with the local relation modeling of ResNet, and incorporates the scale factor to enable comprehensive visual perception. Meanwhile, the fidelity-aware branch assesses the reconstruction fidelity between LR and SR images through their visual features. The combination of the two branches aligns well with the human visual system, enabling a comprehensive evaluation of SR images. Experimental results indicate that our PFIQA outperforms current state-of-the-art models across three widely used SR-IQA benchmarks. Notably, PFIQA excels in assessing the quality of real-world SR images.
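To make the dual-branch idea concrete, here is a minimal sketch of a reduced-reference SR-IQA scorer. It is an approximation under stated assumptions: a small CNN stands in for the paper's ViT/ResNet backbones, the scale factor enters through a learned embedding added to pooled perception features, and the fidelity branch compares SR features against features of the bicubically upsampled LR image; none of these specifics are taken from PFIQA itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualBranchSRIQA(nn.Module):
    """Hypothetical dual-branch (perception + fidelity) SR-IQA sketch."""
    def __init__(self, c=64):
        super().__init__()
        # Stand-in feature extractor (PFIQA uses ViT global + ResNet local features).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, c, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(c, c, 3, stride=2, padding=1), nn.ReLU())
        self.scale_embed = nn.Linear(1, c)       # embed the SR scale factor
        self.fidelity = nn.Conv2d(2 * c, c, 1)   # fuse SR and upsampled-LR features
        self.head = nn.Linear(2 * c, 1)          # regress the quality score

    def forward(self, sr, lr, scale):            # sr: (B,3,H,W), lr: (B,3,h,w), scale: (B,1)
        f_sr = self.cnn(sr)
        f_lr = self.cnn(F.interpolate(lr, size=sr.shape[-2:],
                                      mode="bicubic", align_corners=False))
        percep = f_sr.mean(dim=(2, 3)) + self.scale_embed(scale)          # perception branch
        fid = self.fidelity(torch.cat([f_sr, f_lr], 1)).mean(dim=(2, 3))  # fidelity branch
        return self.head(torch.cat([percep, fid], dim=1))

model = DualBranchSRIQA()
score = model(torch.rand(2, 3, 128, 128), torch.rand(2, 3, 32, 32),
              torch.tensor([[4.0], [4.0]]))      # x4 SR pair -> (2, 1) scores
```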
https://arxiv.org/abs/2405.09472
3D content creation plays a vital role in various applications, such as gaming, robotics simulation, and virtual reality. However, the process is labor-intensive and time-consuming, requiring skilled designers to invest considerable effort in creating a single 3D asset. To address this challenge, text-to-3D generation technologies have emerged as a promising solution for automating 3D creation. Leveraging the success of large vision-language models, these techniques aim to generate 3D content based on textual descriptions. Despite recent advancements in this area, existing solutions still face significant limitations in terms of generation quality and efficiency. In this survey, we conduct an in-depth investigation of the latest text-to-3D creation methods. We provide a comprehensive background on text-to-3D creation, including discussions of the datasets employed in training and the evaluation metrics used to assess the quality of generated 3D models. Then, we delve into the various 3D representations that serve as the foundation for the 3D generation process. Furthermore, we present a thorough comparison of the rapidly growing literature on generative pipelines, categorizing them into feedforward generators, optimization-based generation, and view reconstruction approaches. By examining the strengths and weaknesses of these methods, we aim to shed light on their respective capabilities and limitations. Lastly, we point out several promising avenues for future research. With this survey, we hope to inspire researchers to further explore the potential of open-vocabulary, text-conditioned 3D content creation.
https://arxiv.org/abs/2405.09431
The multi-scale receptive field and the large kernel attention (LKA) module have been shown to significantly improve performance in lightweight image super-resolution. However, existing lightweight super-resolution (SR) methods seldom pay attention to designing efficient building blocks with multi-scale receptive fields for local modeling, and their LKA modules face a quadratic increase in computational and memory footprints as the convolutional kernel size increases. To address the first issue, we propose multi-scale blueprint separable convolutions (MBSConv) as a highly efficient building block with a multi-scale receptive field; it focuses on learning multi-scale information, which is a vital component of discriminative representations. As for the second issue, we revisit the key properties of LKA and find that the adjacent direct interaction of local information and long-distance dependencies is crucial for remarkable performance. Taking this into account, and in order to mitigate the complexity of LKA, we propose a large coordinate kernel attention (LCKA) module, which decomposes the 2D convolutional kernels of the depth-wise convolutional layers in LKA into horizontal and vertical 1D kernels. LCKA enables the adjacent direct interaction of local information and long-distance dependencies not only in the horizontal direction but also in the vertical one. Besides, LCKA allows the direct use of extremely large kernels in the depth-wise convolutional layers to capture more contextual information, which helps to significantly improve reconstruction performance while incurring lower computational complexity and memory footprints. Integrating MBSConv and LCKA, we propose a large coordinate kernel attention network (LCAN).
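The 1D decomposition at the heart of LCKA is easy to sketch. The block below follows the common LKA layout (a local depthwise convolution, a dilated depthwise convolution, and a pointwise convolution, whose output multiplicatively reweights the input), with each 2D depthwise kernel split into a horizontal (1×k) and a vertical (k×1) kernel; the specific kernel sizes and dilation are assumptions borrowed from the standard LKA configuration, not necessarily the paper's.

```python
import torch
import torch.nn as nn

class LCKA(nn.Module):
    """Sketch: LKA with its 2D depthwise kernels decomposed into 1D pairs."""
    def __init__(self, dim, k_local=5, k_dil=7, dilation=3):
        super().__init__()
        # Decomposition cuts per-layer cost from O(k^2) to O(2k) per channel.
        self.local_h = nn.Conv2d(dim, dim, (1, k_local), padding=(0, k_local // 2), groups=dim)
        self.local_v = nn.Conv2d(dim, dim, (k_local, 1), padding=(k_local // 2, 0), groups=dim)
        pad = dilation * (k_dil // 2)
        self.dil_h = nn.Conv2d(dim, dim, (1, k_dil), padding=(0, pad),
                               dilation=dilation, groups=dim)
        self.dil_v = nn.Conv2d(dim, dim, (k_dil, 1), padding=(pad, 0),
                               dilation=dilation, groups=dim)
        self.point = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        attn = self.local_v(self.local_h(x))   # adjacent local interaction, H then V
        attn = self.dil_v(self.dil_h(attn))    # long-distance dependencies, H then V
        return x * self.point(attn)            # attention reweights the input, as in LKA
```

Because both 1D kernels are depthwise, very large k stays cheap in parameters and memory, which is what allows LCKA to enlarge its receptive field further.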
https://arxiv.org/abs/2405.09353
Anomaly detection and localization without any manual annotations or prior knowledge is a challenging task in the unsupervised learning setting. Existing works achieve excellent performance in anomaly detection, but with complex networks or cumbersome pipelines. To address this issue, this paper explores a simple but effective architecture for anomaly detection. It consists of a well-pretrained encoder that extracts hierarchical feature representations and a decoder that reconstructs these intermediate features from the encoder. In particular, it requires neither data augmentation nor anomalous images for training. Anomalies are detected when the decoder fails to reconstruct the features well, and the errors of hierarchical feature reconstruction are then aggregated into an anomaly map to achieve anomaly localization. Comparing the differences between encoder and decoder features leads to more accurate and robust localization results than the single-feature or pixel-by-pixel comparisons used in conventional works. Experimental results show that the proposed method outperforms state-of-the-art methods on the MNIST, Fashion-MNIST, CIFAR-10, and MVTec Anomaly Detection datasets in both anomaly detection and localization.
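A sketch of the scoring step may help: per-level reconstruction errors between encoder and decoder features are upsampled to image resolution and averaged into a single anomaly map. The cosine-distance error and mean aggregation below are plausible choices, not necessarily the paper's exact ones.

```python
import torch
import torch.nn.functional as F

def anomaly_map(enc_feats, dec_feats, out_size):
    """enc_feats, dec_feats: lists of (B, C, H, W) tensors, one pair per level."""
    amap = 0.0
    for fe, fd in zip(enc_feats, dec_feats):
        # 1 - cosine similarity per location: large where the decoder fails
        # to reconstruct the encoder feature, i.e., at likely anomalies.
        err = 1.0 - F.cosine_similarity(fe, fd, dim=1, eps=1e-6)      # (B, H, W)
        amap = amap + F.interpolate(err.unsqueeze(1), size=out_size,
                                    mode="bilinear", align_corners=False)
    return amap / len(enc_feats)   # aggregated hierarchical reconstruction error
```

Image-level detection then reduces to thresholding a statistic of the map, e.g. `amap.flatten(1).max(dim=1).values`.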
https://arxiv.org/abs/2405.09148
Haze severely degrades the visual quality of remote sensing images and hampers the performance of automotive navigation, intelligent monitoring, and urban management. The emerging denoising diffusion probabilistic model (DDPM) exhibits significant potential for dense haze removal with its strong generation ability. Since remote sensing images contain extensive small-scale texture structures, it is important to effectively restore image details from hazy images. However, current DDPM practice fails to preserve image details and color fidelity well, limiting its dehazing capacity for remote sensing images. In this paper, we propose a novel unified Fourier-aware diffusion model for remote sensing image dehazing, termed RSHazeDiff. From a new perspective, RSHazeDiff explores the conditional DDPM to improve image quality in dense hazy scenarios, and it makes three key contributions. First, RSHazeDiff refines the training phase of the diffusion process by performing noise estimation and reconstruction constraints in a coarse-to-fine fashion, thus remedying the unpleasant results caused by the simple noise estimation constraint in DDPM. Second, by taking frequency information as important prior knowledge during the iterative sampling steps, RSHazeDiff preserves more texture details and color fidelity in dehazed images. Third, we design a global compensated learning module that uses the Fourier transform to capture the global dependency features of input images, which effectively mitigates boundary artifacts when processing fixed-size patches. Experiments on both synthetic and real-world benchmarks validate the favorable performance of RSHazeDiff over multiple state-of-the-art methods. Source code will be released at this https URL.
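The global-compensation idea rests on a basic property of the Fourier transform: every frequency coefficient depends on all spatial positions, so a frequency-domain operation mixes information globally and is less prone to seams between fixed-size patches. A minimal sketch, with a learnable per-frequency filter as an assumed (not paper-specified) parameterization:

```python
import torch
import torch.nn as nn

class GlobalFourierCompensation(nn.Module):
    """Sketch: global mixing via a learnable filter in the rFFT domain."""
    def __init__(self, channels, h, w):
        super().__init__()
        # One complex coefficient per channel and frequency bin of an HxW input.
        self.filt = nn.Parameter(torch.ones(channels, h, w // 2 + 1,
                                            dtype=torch.cfloat))

    def forward(self, x):                        # x: (B, C, H, W), real-valued
        spec = torch.fft.rfft2(x, norm="ortho")  # each bin sees every pixel
        spec = spec * self.filt                  # learned global compensation
        return torch.fft.irfft2(spec, s=x.shape[-2:], norm="ortho")
```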
https://arxiv.org/abs/2405.09083
In this article, we explore the potential of using latent diffusion models, a family of powerful generative models, for the task of reconstructing naturalistic music from electroencephalogram (EEG) recordings. Unlike simpler music with limited timbres, such as MIDI-generated tunes or monophonic pieces, the focus here is on intricate music featuring a diverse array of instruments, voices, and effects, rich in harmonics and timbre. This study represents an initial foray into achieving high-quality general music reconstruction using non-invasive EEG data, employing an end-to-end training approach directly on raw data without the need for manual pre-processing and channel selection. We train our models on the public NMED-T dataset and perform quantitative evaluation with proposed neural-embedding-based metrics. We additionally perform song classification based on the generated tracks. Our work contributes to the ongoing research in neural decoding and brain-computer interfaces, offering insights into the feasibility of using EEG data for complex auditory information reconstruction.
https://arxiv.org/abs/2405.09062
Depth estimation plays a crucial role in various tasks within endoscopic surgery, including navigation, surface reconstruction, and augmented reality visualization. Despite the significant achievements of foundation models in vision tasks, including depth estimation, their direct application to the medical domain often results in suboptimal performance. This highlights the need for efficient methods to adapt these models to endoscopic depth estimation. We propose Endoscopic Depth Any Camera (EndoDAC), an efficient self-supervised depth estimation framework that adapts foundation models to endoscopic scenes. Specifically, we develop Dynamic Vector-Based Low-Rank Adaptation (DV-LoRA) and employ Convolutional Neck blocks to tailor the foundation model to the surgical domain, using remarkably few trainable parameters. Given that camera information is not always accessible, we also introduce a self-supervised adaptation strategy that estimates the camera intrinsics using the pose encoder. Our framework can be trained solely on monocular surgical videos from any camera, ensuring minimal training costs. Experiments demonstrate that our approach obtains superior performance even with fewer training epochs and without access to the ground-truth camera intrinsics. Code is available at this https URL.
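Parameter-efficient adaptation of this kind can be sketched with a LoRA-style layer: the frozen foundation-model weight is augmented with a trainable low-rank update. The per-rank scaling vector below is a guess at what "dynamic vector-based" might look like; EndoDAC's actual DV-LoRA may differ.

```python
import torch
import torch.nn as nn

class DVLoRALinear(nn.Module):
    """Sketch: frozen linear layer + trainable low-rank update with a rank-wise scale."""
    def __init__(self, base: nn.Linear, rank=4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)            # foundation model stays frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no-op at start
        self.scale = nn.Parameter(torch.ones(rank))   # hypothetical "dynamic vector"

    def forward(self, x):
        return self.base(x) + ((x @ self.A.t()) * self.scale) @ self.B.t()
```

Only `A`, `B`, and `scale` receive gradients, so the trainable-parameter count stays a tiny fraction of the backbone's.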
https://arxiv.org/abs/2405.08672
Neural Radiance Field (NeRF) is a novel implicit method for achieving high-resolution 3D reconstruction and representation. Since the first NeRF paper was proposed, NeRF has gained strong momentum and is booming in the 3D modeling, representation, and reconstruction areas. However, the original work and most follow-up NeRF research are static, which weakens their practical applicability. Therefore, more researchers have become interested in and focused on dynamic NeRF, which is more feasible and useful in practical applications and situations. Compared with static NeRF, implementing dynamic NeRF is more difficult and complex, but dynamic NeRF has greater future potential and even forms the basis of editable NeRF. In this review, we give a detailed and thorough account of the development and key implementation principles of dynamic NeRF. Our analysis of the main principles and development covers work from 2021 to 2023, including most dynamic NeRF projects. Moreover, with colorful, specially designed figures and tables, we provide a detailed comparison and analysis of the features of various dynamic NeRF methods. We also analyze and discuss the key methods for implementing a dynamic NeRF. The body of referenced papers is large, and the statements and comparisons are multidimensional. By reading this review, the whole development history and most of the main design methods and principles of dynamic NeRF can be easily understood.
https://arxiv.org/abs/2405.08609
Stereo image super-resolution (SR) refers to the reconstruction of a high-resolution (HR) image from a pair of low-resolution (LR) images, as typically captured by a dual-camera device. To enhance the quality of SR images, most previous studies focused on increasing the number and size of feature maps and introducing complex and computationally intensive structures, resulting in models with high computational complexity. Here, we propose a simple yet efficient stereo image SR model called NAFRSSR, which is modified from the previous state-of-the-art model NAFSSR by introducing recursive connections and lightweighting the constituent modules. Our NAFRSSR model is composed of nonlinear activation free and group convolution-based blocks (NAFGCBlocks) and depth-separated stereo cross attention modules (DSSCAMs). The NAFGCBlock improves feature extraction and reduces the number of parameters by removing the simple channel attention mechanism from NAFBlock and using group convolution. The DSSCAM enhances feature fusion and reduces the number of parameters by replacing the 1x1 pointwise convolution in SCAM with a weight-shared 3x3 depthwise convolution. Besides, we propose to incorporate a trainable edge detection operator into NAFRSSR to further improve model performance. Four variants of NAFRSSR with different sizes, namely NAFRSSR-Mobile (NAFRSSR-M), NAFRSSR-Tiny (NAFRSSR-T), NAFRSSR-Super (NAFRSSR-S), and NAFRSSR-Base (NAFRSSR-B), are designed, and they all exhibit fewer parameters, higher PSNR/SSIM, and faster speed than previous state-of-the-art models. In particular, to the best of our knowledge, NAFRSSR-M is the lightest (0.28M parameters) and fastest (50 ms inference time) model achieving an average PSNR/SSIM as high as 24.657 dB/0.7622 on the benchmark datasets. Codes and models will be released at this https URL.
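The DSSCAM modification can be pictured as follows: a NAFSSR-style cross-view attention along the width (epipolar) axis, but with a single weight-shared 3x3 depthwise convolution projecting both views instead of per-view 1x1 pointwise convolutions. The attention layout and residual wiring below are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DSSCAMSketch(nn.Module):
    """Sketch: depth-separated stereo cross attention with a shared depthwise projection."""
    def __init__(self, c):
        super().__init__()
        self.proj = nn.Conv2d(c, c, 3, padding=1, groups=c)   # shared by both views
        self.scale = c ** -0.5

    def forward(self, xl, xr):                         # each (B, C, H, W)
        ql = self.proj(xl).permute(0, 2, 3, 1)         # (B, H, W, C)
        qr = self.proj(xr).permute(0, 2, 3, 1)
        attn = torch.matmul(ql, qr.transpose(-1, -2)) * self.scale   # (B, H, W, W)
        vl = torch.matmul(torch.softmax(attn, dim=-1), qr)           # right -> left
        vr = torch.matmul(torch.softmax(attn.transpose(-1, -2), dim=-1), ql)
        return xl + vl.permute(0, 3, 1, 2), xr + vr.permute(0, 3, 1, 2)
```

A depthwise 3x3 projection costs 9 weights per channel, versus c weights per channel for a 1x1 pointwise layer, which is where the parameter saving comes from.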
https://arxiv.org/abs/2405.08423
Generating diverse and high-quality 3D assets automatically poses a fundamental yet challenging task in 3D computer vision. Despite extensive efforts in 3D generation, existing optimization-based approaches struggle to produce large-scale 3D assets efficiently. Meanwhile, feed-forward methods often focus on generating only a single category or a few categories, limiting their generalizability. Therefore, we introduce a diffusion-based feed-forward framework to address these challenges with a single model. To handle the large diversity and complexity in geometry and texture across categories efficiently, we 1) adopt an improved triplane representation to guarantee efficiency; 2) introduce a 3D-aware transformer to aggregate generalized 3D knowledge with specialized 3D features; and 3) devise a 3D-aware encoder/decoder to enhance the generalized 3D knowledge. Building upon our 3D-aware Diffusion model with TransFormer, DiffTF, we propose a stronger version for 3D generation, i.e., DiffTF++. It boils down to two parts: a multi-view reconstruction loss and triplane refinement. Specifically, we utilize the multi-view reconstruction loss to fine-tune the diffusion model and triplane decoder, thereby avoiding the negative influence caused by reconstruction errors and improving texture synthesis. By eliminating the mismatch between the two stages, the generative performance is enhanced, especially in texture. Additionally, a 3D-aware refinement process is introduced to filter out artifacts and refine the triplanes, resulting in the generation of more intricate and reasonable details. Extensive experiments on ShapeNet and OmniObject3D convincingly demonstrate the effectiveness of our proposed modules and the state-of-the-art 3D object generation performance, with large diversity, rich semantics, and high quality.
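Triplanes keep generation efficient because a 3D point's feature is assembled from three 2D lookups rather than a dense 3D grid. A minimal sampling sketch (plane resolution, coordinate convention, and sum-fusion are assumptions):

```python
import torch
import torch.nn.functional as F

def sample_triplane(planes, pts):
    """planes: (3, C, R, R) feature planes (XY, XZ, YZ); pts: (N, 3) in [-1, 1]."""
    projections = [pts[:, [0, 1]], pts[:, [0, 2]], pts[:, [1, 2]]]
    feat = 0.0
    for plane, uv in zip(planes, projections):
        grid = uv.view(1, -1, 1, 2)                             # (1, N, 1, 2)
        f = F.grid_sample(plane.unsqueeze(0), grid,
                          mode="bilinear", align_corners=True)  # (1, C, N, 1)
        feat = feat + f.view(plane.shape[0], -1).t()            # accumulate (N, C)
    return feat   # per-point features, ready for a small decoder MLP
```

Memory grows as O(3R^2) instead of O(R^3), which is what makes a single feed-forward model across many categories tractable.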
https://arxiv.org/abs/2405.08055
As humans, we aspire to create media content that is both freely willed and readily controlled. Thanks to the prominent development of generative techniques, we can now easily utilize 2D diffusion methods to synthesize images controlled by a raw sketch or designated human poses, and even progressively edit/regenerate local regions with masked inpainting. However, similar workflows in 3D modeling tasks are still unavailable due to the lack of controllability and efficiency in 3D generation. In this paper, we present a novel controllable and interactive 3D assets modeling framework, named Coin3D. Coin3D allows users to control the 3D generation using a coarse geometry proxy assembled from basic shapes, and introduces an interactive generation workflow that supports seamless local part editing while delivering responsive 3D object previews within a few seconds. To this end, we develop several techniques, including a 3D adapter that applies volumetric coarse shape control to the diffusion model, a proxy-bounded editing strategy for precise part editing, a progressive volume cache to support responsive previewing, and volume-SDS to ensure consistent mesh reconstruction. Extensive experiments on interactive generation and editing of diverse shape proxies demonstrate that our method achieves superior controllability and flexibility in the 3D asset generation task.
https://arxiv.org/abs/2405.08054
Achieving expressive 3D motion reconstruction and automatic generation for isolated sign words can be challenging, due to the lack of real-world 3D sign-word data, the complex nuances of signing motions, and the cross-modal understanding of sign language semantics. To address these challenges, we introduce SignAvatar, a framework capable of both word-level sign language reconstruction and generation. SignAvatar employs a transformer-based conditional variational autoencoder architecture, effectively establishing relationships across different semantic modalities. Additionally, this approach incorporates a curriculum learning strategy to enhance the model's robustness and generalization, resulting in more realistic motions. Furthermore, we contribute the ASL3DWord dataset, composed of 3D joint rotation data for the body, hands, and face, for unique sign words. We demonstrate the effectiveness of SignAvatar through extensive experiments, showcasing its superior reconstruction and automatic generation capabilities. The code and dataset are available on the project page.
https://arxiv.org/abs/2405.07974
Place recognition is the foundation for enabling autonomous systems to achieve independent decision-making and safe operation. It is also crucial in tasks such as loop closure detection and global localization within SLAM. Previous methods utilize mundane point cloud representations as input, and deep learning-based LiDAR Place Recognition (LPR) approaches employ different point cloud image inputs with convolutional neural networks (CNNs) or transformer architectures. However, the recently proposed Mamba deep learning model, combined with state space models (SSMs), holds great potential for long sequence modeling. Therefore, we developed OverlapMamba, a novel network for place recognition that represents input range views (RVs) as sequences. In a novel way, we employ a stochastic reconstruction approach to build shift state space models, compressing the visual representation. Evaluated on three different public datasets, our method effectively detects loop closures, showing robustness even when traversing previously visited locations from different directions. Relying on raw range view inputs, it outperforms typical LiDAR and multi-view combination methods in time complexity and speed, indicating strong place recognition capabilities and real-time efficiency.
https://arxiv.org/abs/2405.07966
We present SceneFactory, a workflow-centric and unified framework for incremental scene modeling that conveniently supports a wide range of applications, such as (unposed and/or uncalibrated) multi-view depth estimation, LiDAR completion, (dense) RGB-D/RGB-L/Mono/Depth-only reconstruction, and SLAM. The workflow-centric design uses multiple blocks as the basis for building different production lines. The supported applications, i.e., productions, avoid redundancy in their designs; thus, the focus is on each block itself for independent expansion. To support all input combinations, our implementation consists of four building blocks in SceneFactory: (1) Mono-SLAM, (2) depth estimation, (3) flexion, and (4) scene reconstruction. Furthermore, we propose an unposed & uncalibrated multi-view depth estimation model (U2-MVD) to estimate dense geometry. U2-MVD exploits dense bundle adjustment to solve for poses, intrinsics, and inverse depth. A semantic-aware ScaleCov step is then introduced to complete the multi-view depth. Relying on U2-MVD, SceneFactory both supports user-friendly 3D creation (with just images) and bridges the applications of dense RGB-D and dense Mono. For high-quality surface and color reconstruction, we propose dual-purpose Multi-resolutional Neural Points (DM-NPs) for the first surface-accessible Surface Color Field design, where we introduce Improved Point Rasterization (IPR) for point-cloud-based surface queries. We implement and experiment with SceneFactory to demonstrate its broad practicability and high flexibility. Its quality also competes with or exceeds that of tightly coupled state-of-the-art approaches in all tasks. We contribute the code to the community (this https URL).
https://arxiv.org/abs/2405.07847
Based on the principles of information theory, measure theory, and theoretical computer science, we introduce a univariate signal deconvolution method with a wide range of applications in coding theory, particularly in zero-knowledge one-way communication channels, such as deciphering messages from unknown generating sources about which no prior knowledge is available and to which no return message can be sent. Our method for reconstructing a multidimensional space from an arbitrary received signal is proven to be agnostic with respect to the encoding-decoding scheme, computation model, programming language, formal theory, the computable (or semi-computable) method of approximation to algorithmic complexity, and any arbitrarily chosen (computable) probability measure of the events. The method derives from the principles of an approach to Artificial General Intelligence capable of building a general-purpose model of models independent of any arbitrarily assumed prior probability distribution. We argue that this optimal and universal method of decoding non-random data has applications to signal processing, causal deconvolution, topological and geometric properties encoding, cryptography, and bio- and technosignature detection.
https://arxiv.org/abs/2405.07803
Mainstream approaches to spectral reconstruction (SR) primarily focus on designing Convolution- and Transformer-based architectures. However, CNN methods often face challenges in handling long-range dependencies, whereas Transformers are constrained by computational efficiency limitations. Recent breakthroughs in state-space models (e.g., Mamba) have attracted significant attention due to their near-linear computational efficiency and superior performance, prompting our investigation into their potential for the SR problem. To this end, we propose the Gradient-guided Mamba for Spectral Reconstruction from RGB Images, dubbed GMSR-Net. GMSR-Net is a lightweight model characterized by a global receptive field and linear computational complexity. Its core comprises multiple stacked Gradient Mamba (GM) blocks, each featuring a tri-branch structure. In addition to benefiting from the efficient global feature representation of the Mamba block, we innovatively introduce spatial gradient attention and spectral gradient attention to guide the reconstruction of spatial and spectral cues. GMSR-Net demonstrates a significant accuracy-efficiency trade-off, achieving state-of-the-art performance while markedly reducing the number of parameters and computational burden. Compared to existing approaches, GMSR-Net slashes parameters and FLOPS by substantial margins of 10 times and 20 times, respectively. Code is available at this https URL.
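Gradient guidance can be sketched as an attention gate computed from finite differences of the feature map; the gating form below (sigmoid over a 1x1 convolution of stacked H/W gradients) is an assumption, not GMSR-Net's exact design.

```python
import torch
import torch.nn as nn

class SpatialGradientAttention(nn.Module):
    """Sketch: reweight features by a gate derived from their spatial gradients."""
    def __init__(self, c):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(2 * c, c, 1), nn.Sigmoid())

    def forward(self, x):                                    # x: (B, C, H, W)
        gh = torch.diff(x, dim=2, append=x[:, :, -1:, :])    # vertical differences
        gw = torch.diff(x, dim=3, append=x[:, :, :, -1:])    # horizontal differences
        return x * self.gate(torch.cat([gh, gw], dim=1))     # edge regions get emphasis
```

A spectral variant would take differences along the channel (band) axis instead, steering reconstruction of inter-band structure.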
https://arxiv.org/abs/2405.07777
Federated learning (FL) is a novel collaborative machine learning framework designed to preserve privacy while enabling the creation of robust models. This paradigm addresses a growing need for data security by allowing multiple participants to contribute to a model without exposing their individual datasets. A pivotal issue within this framework, however, concerns the fair and accurate attribution of contributions from the various participants to the creation of the joint global model. Incorrect contribution distribution can erode trust among participants, result in inequitable compensation, and ultimately diminish the willingness of parties to engage or actively contribute to the federation. While several methods for remunerating participants have been proposed, little attention has been given to analyzing the stability of these methods when evaluating contributions, which is critical to ensuring the long-term viability and fairness of FL systems. In this paper, we analyse this stability through the calculation of contributions by gradient-based model reconstruction techniques with Shapley values. Our investigation reveals that Shapley values fail to reflect baseline contributions, especially when employing different aggregation techniques. To address this issue, we extend established aggregation techniques by introducing FedRandom, which is designed to sample contributions in a more equitable and distributed manner. We demonstrate that this approach not only serves as a viable aggregation technique but also significantly improves the accuracy of contribution assessment compared to traditional methods. Our results suggest that FedRandom enhances the overall fairness and stability of the federated learning system, making it a superior choice for federations with a limited number of participants.
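For a small federation, the Shapley attribution being analyzed can be computed exactly: each participant's value is its average marginal contribution over all subsets of the others. The utility `v` below is any subset-to-score function, e.g. validation accuracy of the model aggregated from that subset; the toy additive utility is only a sanity check (for it, the Shapley values equal the per-client gains).

```python
from itertools import combinations
from math import factorial

def shapley_values(participants, v):
    """Exact Shapley values; exponential in len(participants)."""
    n = len(participants)
    phi = {p: 0.0 for p in participants}
    for p in participants:
        rest = [q for q in participants if q != p]
        for r in range(n):
            for s in combinations(rest, r):
                w = factorial(r) * factorial(n - r - 1) / factorial(n)
                phi[p] += w * (v(set(s) | {p}) - v(set(s)))   # marginal contribution
    return phi

gains = {"a": 0.5, "b": 0.3, "c": 0.2}                        # toy additive utility
print(shapley_values(list(gains), lambda s: sum(gains[p] for p in s)))
# -> {'a': 0.5, 'b': 0.3, 'c': 0.2}
```

The instability studied in the paper arises when `v` is estimated from gradient-based model reconstructions rather than evaluated exactly.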
https://arxiv.org/abs/2405.08044
Collaborative path planning for robot swarms in complex, unknown environments without external positioning is a challenging problem. It requires robots to find safe directions based on real-time environmental observations and to efficiently transfer and fuse these observations within the swarm. This study presents a filtering method based on the Fast Fourier Transform (FFT) to address these two issues. We treat the sensors' environmental observations as a digital sampling process. We then design two different types of filters: one for safe direction extraction and one for the compression and reconstruction of environmental data. The reconstructed data is mapped to the probabilistic domain, achieving efficient fusion of swarm observations and planning decisions. The computation time is only on the order of microseconds, and the data transmitted over the communication system is at the bit level. The performance of our algorithm in sensor data processing was validated in real-world experiments, and its effectiveness in swarm path optimization was demonstrated through extensive simulations.
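The compression/reconstruction filter can be sketched in a few lines: treat a range scan as a sampled signal, transmit only its lowest-frequency FFT coefficients, and invert on the receiving robot. The number of retained coefficients and the low-pass choice are assumptions about the paper's filter design.

```python
import numpy as np

def compress(scan, k):
    """Keep the k lowest rFFT coefficients of a 1-D observation."""
    return np.fft.rfft(scan)[:k]           # k complex values to transmit

def reconstruct(coeffs, n):
    """Rebuild an n-sample scan from the transmitted coefficients."""
    spec = np.zeros(n // 2 + 1, dtype=complex)
    spec[:len(coeffs)] = coeffs
    return np.fft.irfft(spec, n=n)

scan = np.random.rand(360)                 # toy 1-degree-resolution range scan
approx = reconstruct(compress(scan, 16), len(scan))
print(np.abs(scan - approx).mean())        # low-pass reconstruction error
```

Sixteen complex coefficients per scan is a tiny payload compared with the raw observation, which is how the transmitted data stays small.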
https://arxiv.org/abs/2405.07687
Current multi-modality driving frameworks normally fuse representations by applying attention between single-modality branches. However, existing networks still limit driving performance, as the image and LiDAR branches are independent and lack a unified observation representation. Thus, this paper proposes MaskFuser, which tokenizes various modalities into a unified semantic feature space and provides a joint representation for further behavior cloning in driving contexts. Given the unified token representation, MaskFuser is the first work to introduce cross-modality masked auto-encoder training. The masked training enhances the fusion representation via reconstruction of masked tokens. Architecturally, a hybrid-fusion network is proposed to combine the advantages of both early and late fusion: in the early fusion stage, modalities are fused by performing monotonic-to-BEV translation attention between branches; late fusion is performed by tokenizing the modalities into a unified token space with shared encoding. MaskFuser reaches a driving score of 49.05 and a route completion of 92.85% on the CARLA LongSet6 benchmark evaluation, improving over the best previous baseline by 1.74 and 3.21%, respectively. The introduced masked fusion also increases driving stability under damaged sensory inputs: MaskFuser outperforms the best previous baselines on driving score by 6.55 (27.8%), 1.53 (13.8%), and 1.57 (30.9%) under sensory masking ratios of 25%, 50%, and 75%, respectively.
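The masked-training objective can be sketched generically: concatenate image and LiDAR tokens into one sequence, replace a random subset with a learned mask token, and train the fusion encoder to reconstruct the originals. The masking ratio, mask-token mechanics, and MSE loss are assumptions in the style of masked auto-encoders, not MaskFuser's exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def masked_fusion_loss(img_tokens, lidar_tokens, encoder, mask_token, ratio=0.5):
    """img_tokens: (B, N_i, D); lidar_tokens: (B, N_l, D); encoder: (B, N, D) -> (B, N, D)."""
    tokens = torch.cat([img_tokens, lidar_tokens], dim=1)        # unified token space
    B, N, D = tokens.shape
    mask = torch.rand(B, N, device=tokens.device) < ratio        # tokens to corrupt
    corrupted = torch.where(mask.unsqueeze(-1),
                            mask_token.expand(B, N, D), tokens)
    recon = encoder(corrupted)
    return F.mse_loss(recon[mask], tokens[mask])                 # reconstruct masked tokens

# Toy usage: a single transformer layer as the fusion encoder, D = 32.
enc = nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True)
mask_token = nn.Parameter(torch.zeros(1, 1, 32))
loss = masked_fusion_loss(torch.rand(2, 10, 32), torch.rand(2, 6, 32), enc, mask_token)
```

Training against masked tokens is also consistent with the robustness to damaged sensory inputs reported above: the encoder learns to infer one modality's tokens from the other's.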
https://arxiv.org/abs/2405.07573
Tattoos have been used effectively as soft biometrics to assist law enforcement in the identification of offenders and victims, as they contain discriminative information and are a useful indicator for locating members of a criminal gang or organisation. Due to various privacy issues in the acquisition of images containing tattoos, only a limited number of databases exist. This lack of databases has delayed the development of new methods for effectively retrieving a potential suspect's tattoo images from a candidate gallery. To mitigate this issue, in our work, we use an unsupervised generative approach to create a balanced database consisting of 28,550 semi-synthetic images of tattooed subjects from 571 tattoo categories. Further, we introduce a novel Tattoo Template Reconstruction Network (TattTRN), which learns to map an input tattoo sample to its respective tattoo template to enhance the distinguishing attributes of the final feature embedding. Experimental results with real data, i.e., the WebTattoo and BIVTatt databases, demonstrate the soundness of the presented approach: an accuracy of up to 99% is achieved when checking at most the first 20 entries of the candidate list.
https://arxiv.org/abs/2405.07571