In light of the widespread application of Automatic Speech Recognition (ASR) systems, their security has received more attention than ever before, primarily due to the susceptibility of the underlying Deep Neural Networks. Previous studies have shown that surreptitiously crafted adversarial perturbations can manipulate speech recognition systems into producing malicious commands. These attack methods mostly add noise perturbations under $\ell_p$ norm constraints, inevitably leaving behind artifacts of manual modification. Recent research has alleviated this limitation by manipulating style vectors to synthesize adversarial examples based on Text-to-Speech (TTS) synthesized audio. However, style modifications driven purely by optimization objectives significantly reduce the controllability and editability of audio styles. In this paper, we propose an attack on ASR systems based on user-customized style transfer. We first test the Style Transfer Attack (STA), which combines style transfer and adversarial attack in sequential order. Then, as an improvement, we propose an iterative Style Code Attack (SCA) to maintain audio quality. Experimental results show that our method can meet the need for user-customized styles and achieve an attack success rate of 82%, while preserving sound naturalness, as confirmed by our user study.
https://arxiv.org/abs/2405.09470
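The $\ell_p$-bounded noise attacks this abstract contrasts with can be sketched as a projected-gradient step on a waveform. This is a minimal illustration with a placeholder gradient; a real attack would backpropagate through the ASR model's loss on the target transcription, and all names and step sizes here are illustrative:

```python
import numpy as np

def linf_attack_step(x, grad, x_orig, eps=0.01, alpha=0.001):
    """One projected-gradient step: move along the sign of the loss
    gradient, then clip back into the L-infinity ball around the original audio."""
    x_adv = x + alpha * np.sign(grad)
    return np.clip(x_adv, x_orig - eps, x_orig + eps)

# Toy waveform and a made-up gradient; a real attack would backpropagate
# through the ASR model's loss on the malicious target transcription.
rng = np.random.default_rng(0)
x_orig = rng.uniform(-1, 1, size=16000)
x = x_orig.copy()
for _ in range(50):
    grad = rng.normal(size=x.shape)   # placeholder gradient
    x = linf_attack_step(x, grad, x_orig)

max_pert = float(np.abs(x - x_orig).max())
```

The projection step is what enforces the $\ell_\infty$ constraint: no matter how many iterations run, the perturbation never exceeds `eps`, which is exactly the kind of bounded but detectable artifact the style-transfer attacks above try to avoid.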
We present StyleMamba, an efficient image style transfer framework that translates text prompts into corresponding visual styles while preserving the content integrity of the original images. Existing text-guided stylization methods require hundreds of training iterations and considerable computing resources. To speed up the process, we propose a conditional State Space Model for efficient text-driven image style transfer, dubbed StyleMamba, that sequentially aligns the image features to the target text prompts. To enhance the local and global style consistency between text and image, we propose masked and second-order directional losses to optimize the stylization direction, which reduces the training iterations by a factor of 5 and the inference time by a factor of 3. Extensive experiments and qualitative evaluation confirm the robust and superior stylization performance of our method compared to the existing baselines.
https://arxiv.org/abs/2405.05027
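The directional losses mentioned above build on the standard text-image direction-alignment objective. A minimal sketch of that base objective might look like the following (the masked and second-order variants are StyleMamba's refinements and are not shown; feature vectors are random stand-ins for CLIP-style embeddings):

```python
import numpy as np

def directional_loss(f_img_src, f_img_sty, f_txt_src, f_txt_sty):
    """1 minus the cosine similarity between the image-feature shift and the
    text-feature shift -- the common 'directional' alignment objective that
    masked and second-order variants refine."""
    d_img = f_img_sty - f_img_src
    d_txt = f_txt_sty - f_txt_src
    cos = d_img @ d_txt / (np.linalg.norm(d_img) * np.linalg.norm(d_txt) + 1e-8)
    return 1.0 - cos

rng = np.random.default_rng(1)
f_img_src = rng.normal(size=512)
d = rng.normal(size=512)
# Image features that move exactly along the text direction give ~zero loss,
# while moving against the text direction is maximally penalized.
loss_aligned = directional_loss(f_img_src, f_img_src + d, np.zeros(512), d)
loss_opposed = directional_loss(f_img_src, f_img_src - d, np.zeros(512), d)
```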
Long text understanding is important yet challenging for natural language processing. A long article or document usually contains many redundant words that are not pertinent to its gist and can sometimes be regarded as noise. With recent advances in abstractive summarization, we propose our \emph{Gist Detector} to leverage the gist detection ability of a summarization model and integrate the extracted gist into downstream models to enhance their long text understanding ability. Specifically, Gist Detector first learns the gist detection knowledge distilled from a summarization model, and then produces gist-aware representations to augment downstream models. We evaluate our method on three different tasks: long document classification, distantly supervised open-domain question answering, and non-parallel text style transfer. The experimental results show that our method can significantly improve the performance of baseline models on all tasks.
https://arxiv.org/abs/2405.04955
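The distill-then-augment idea can be sketched abstractly: a student learns the teacher's per-token gist distribution, and the learned gist weights pool token embeddings into a gist-aware representation. All function names here are hypothetical and this is not the paper's implementation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def gist_aware_representation(token_embs, gist_logits):
    """Pool token embeddings with gist weights -- the per-token importance
    the student distils from the summarization teacher."""
    return softmax(gist_logits) @ token_embs      # (seq, dim) -> (dim,)

def distill_loss(student_logits, teacher_logits):
    """KL(teacher || student) over the gist distributions, a standard
    distillation target."""
    p, q = softmax(teacher_logits), softmax(student_logits)
    return float(np.sum(p * np.log(p / (q + 1e-12) + 1e-12)))

rng = np.random.default_rng(2)
embs = rng.normal(size=(8, 16))            # 8 tokens, 16-dim embeddings
logits = rng.normal(size=8)
rep = gist_aware_representation(embs, logits)
zero_loss = distill_loss(logits, logits)   # identical distributions -> ~0
```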
Motion style transfer is a significant research direction in multimedia applications. It enables the rapid switching of different styles of the same motion for virtual digital humans, thus vastly increasing the diversity and realism of movements. It is widely applied in multimedia scenarios such as movies, games, and the Metaverse. However, most current work in this field adopts GANs, which may suffer from instability and convergence issues, making the final generated motion sequence somewhat chaotic and unable to reflect a highly realistic and natural style. To address these problems, we treat style motion as a condition and propose the Style Motion Conditioned Diffusion (SMCD) framework for the first time, which can more comprehensively learn the style features of motion. Moreover, we apply the Mamba model for the first time in the motion style transfer field, introducing the Motion Style Mamba (MSM) module to handle longer motion sequences. Thirdly, for the SMCD framework, we propose a Diffusion-based Content Consistency Loss and a Content Consistency Loss to assist the overall framework's training. Finally, we conduct extensive experiments. The results reveal that our method surpasses state-of-the-art methods in both qualitative and quantitative comparisons and is capable of generating more realistic motion sequences.
https://arxiv.org/abs/2405.02844
This paper handles the problem of converting real pictures into traditional Chinese ink-wash paintings, i.e., Chinese ink-wash painting style transfer. Though this problem could be addressed by a wide range of image-to-image translation models, a notable issue with all these methods is that the original image content details can easily be erased or corrupted by the transfer of ink-wash style elements. To solve or ameliorate this issue, we propose to incorporate saliency detection into the unpaired image-to-image translation framework to regularize the content information of the generated paintings. The saliency map is utilized for content regularization from two aspects, both explicitly and implicitly: (i) we propose a saliency IOU (SIOU) loss to explicitly regularize saliency consistency before and after stylization; (ii) we propose saliency adaptive normalization (SANorm), which implicitly enhances the content integrity of the generated paintings by injecting saliency information into the generator network to guide painting generation. Besides, we also propose a saliency-attended discriminator network which harnesses the saliency mask to focus generative adversarial attention onto salient image regions, contributing to a finer ink-wash stylization effect for the salient objects of images. Qualitative and quantitative experiments consistently demonstrate the superiority of our model over related advanced methods for Chinese ink-wash painting style transfer.
https://arxiv.org/abs/2404.15743
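A soft-IoU version of the SIOU idea, penalizing disagreement between saliency maps before and after stylization, might be sketched as follows (an assumed formulation for illustration, not the paper's exact loss):

```python
import numpy as np

def siou_loss(sal_before, sal_after, eps=1e-8):
    """Soft IoU between the saliency map of the photo and that of the
    stylized painting; minimizing (1 - IoU) keeps salient content in place."""
    inter = np.minimum(sal_before, sal_after).sum()
    union = np.maximum(sal_before, sal_after).sum()
    return 1.0 - inter / (union + eps)

sal = np.zeros((4, 4)); sal[1:3, 1:3] = 1.0
loss_same = siou_loss(sal, sal)                       # identical maps -> ~0
loss_diff = siou_loss(sal, np.roll(sal, 2, axis=0))   # shifted saliency -> ~1
```

Using element-wise min/max instead of a hard threshold keeps the loss differentiable, so it can regularize the generator directly during training.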
Creating artistic 3D scenes can be time-consuming and requires specialized knowledge. To address this, recent works such as ARF use a radiance field-based approach with style constraints to generate 3D scenes that resemble a style image provided by the user. However, these methods lack fine-grained control over the resulting scenes. In this paper, we introduce Controllable Artistic Radiance Fields (CoARF), a novel algorithm for controllable 3D scene stylization. CoARF enables style transfer for specified objects, compositional 3D style transfer, and semantic-aware style transfer. We achieve controllability using segmentation masks with different label-dependent loss functions. We also propose a semantic-aware nearest neighbor matching algorithm to improve the style transfer quality. Our extensive experiments demonstrate that CoARF provides user-specified controllability of style transfer and superior style transfer quality with more precise feature matching.
https://arxiv.org/abs/2404.14967
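Label-restricted nearest-neighbor matching of the kind described can be sketched as follows. This is a hypothetical simplification on point features; the paper's semantic-aware matching operates on deep feature maps of a radiance field:

```python
import numpy as np

def semantic_nn_match(content_feats, content_labels, style_feats, style_labels):
    """For each content feature, find the nearest style feature that shares
    its semantic label, falling back to the global nearest neighbour when
    the label is absent from the style set."""
    matched = np.empty_like(content_feats)
    for i, (f, lab) in enumerate(zip(content_feats, content_labels)):
        mask = style_labels == lab
        pool = style_feats[mask] if mask.any() else style_feats
        d = np.linalg.norm(pool - f, axis=1)
        matched[i] = pool[d.argmin()]
    return matched

content = np.array([[0.0, 0.0], [5.0, 5.0]])
c_labels = np.array([0, 1])
style = np.array([[0.1, 0.1], [4.0, 4.0], [9.0, 9.0]])
s_labels = np.array([1, 0, 1])
m = semantic_nn_match(content, c_labels, style, s_labels)
# The first content feature matches [4, 4] (same label 0), even though
# [0.1, 0.1] is globally closer -- that is the semantic restriction at work.
```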
Previous studies on music style transfer have mainly focused on one-to-one style conversion, which is relatively limited. When considering the conversion between multiple styles, previous methods required designing multiple modes to disentangle the complex style of the music, resulting in large computational costs and slow audio generation. The existing music style transfer methods generate spectrograms with artifacts, leading to significant noise in the generated audio. To address these issues, this study proposes a music style transfer framework based on diffusion models (DM) and uses spectrogram-based methods to achieve multi-to-multi music style transfer. The GuideDiff method is used to restore spectrograms to high-fidelity audio, accelerating audio generation speed and reducing noise in the generated audio. Experimental results show that our model has good performance in multi-mode music style transfer compared to the baseline and can generate high-quality audio in real-time on consumer-grade GPUs.
https://arxiv.org/abs/2404.14771
This paper presents a novel contribution to the field of regional style transfer. Existing methods often suffer from the drawback of applying style homogeneously across the entire image, leading to stylistic inconsistencies or distorted foreground objects when applied to images with foreground elements such as person figures. To address this limitation, we propose a new approach that leverages a segmentation network to precisely isolate foreground objects within the input image. Subsequently, style transfer is applied exclusively to the background region. The isolated foreground objects are then carefully reintegrated into the style-transferred background. To enhance the visual coherence between foreground and background, a color transfer step is employed on the foreground elements prior to their reincorporation. Finally, we utilize feathering techniques to achieve a seamless amalgamation of foreground and background, resulting in a visually unified and aesthetically pleasing final composition. Extensive evaluations demonstrate that our proposed approach yields significantly more natural stylistic transformations compared to conventional methods.
https://arxiv.org/abs/2404.13880
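The color-transfer and feathering steps of the pipeline can be sketched with Reinhard-style statistics matching and a box-blurred mask. This is illustrative only; the paper's exact operators may differ:

```python
import numpy as np

def reinhard_color_transfer(fg, bg):
    """Match the foreground's per-channel mean/std to the stylized
    background (Reinhard-style statistics transfer, used to harmonise colours)."""
    out = (fg - fg.mean(axis=(0, 1))) / (fg.std(axis=(0, 1)) + 1e-8)
    return out * bg.std(axis=(0, 1)) + bg.mean(axis=(0, 1))

def feather_composite(fg, bg, mask, radius=1):
    """Blend foreground onto background with a softened (feathered) mask,
    approximated here by a separable box blur of the binary mask."""
    k = 2 * radius + 1
    soft = mask.astype(float)
    for ax in (0, 1):
        soft = sum(np.roll(soft, s, axis=ax) for s in range(-radius, radius + 1)) / k
    return soft[..., None] * fg + (1 - soft[..., None]) * bg

rng = np.random.default_rng(3)
bg = rng.uniform(0, 1, size=(8, 8, 3))    # stand-in stylized background
fg = rng.uniform(0, 1, size=(8, 8, 3))    # stand-in foreground crop
mask = np.zeros((8, 8)); mask[2:6, 2:6] = 1
out = feather_composite(reinhard_color_transfer(fg, bg), bg, mask)
```

The soft mask ramps from 1 inside the foreground to 0 outside over `radius` pixels, which is what removes the hard cut-out edge when the foreground is pasted back.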
Arbitrary style transfer attracts widespread attention in research and boasts numerous practical applications. The existing methods, which either employ cross-attention to incorporate deep style attributes into content attributes or use adaptive normalization to adjust content features, fail to generate high-quality stylized images. In this paper, we introduce an innovative technique to improve the quality of stylized images. Firstly, we propose Style Consistency Instance Normalization (SCIN), a method to refine the alignment between content and style features. In addition, we have developed an Instance-based Contrastive Learning (ICL) approach designed to understand the relationships among various styles, thereby enhancing the quality of the resulting stylized images. Recognizing that VGG networks are more adept at extracting classification features and are less well suited to capturing style features, we have also introduced the Perception Encoder (PE) to capture style features. Extensive experiments demonstrate that our proposed method generates high-quality stylized images and effectively prevents artifacts compared with the existing state-of-the-art methods.
https://arxiv.org/abs/2404.13584
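SCIN's exact formulation is not given in the abstract; the adaptive-instance-normalization baseline that this family of methods refines looks like the following:

```python
import numpy as np

def adain(content_feat, style_feat, eps=1e-5):
    """Adaptive instance normalization: re-scale the normalized content
    features with the style features' channel-wise statistics. Shown as a
    baseline for the kind of content/style alignment SCIN refines."""
    c_mean = content_feat.mean(axis=(1, 2), keepdims=True)
    c_std = content_feat.std(axis=(1, 2), keepdims=True)
    s_mean = style_feat.mean(axis=(1, 2), keepdims=True)
    s_std = style_feat.std(axis=(1, 2), keepdims=True)
    return (content_feat - c_mean) / (c_std + eps) * s_std + s_mean

rng = np.random.default_rng(4)
c = rng.normal(size=(16, 8, 8))                      # (channels, H, W)
s = rng.normal(loc=3.0, scale=2.0, size=(16, 8, 8))  # style features
out = adain(c, s)
# Per-channel statistics of `out` now match those of the style features.
```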
Transforming two-dimensional (2D) images into three-dimensional (3D) volumes is a well-known yet challenging problem for the computer vision community. In the medical domain, a few previous studies attempted to convert two or more input radiographs into computed tomography (CT) volumes. Building on their efforts, we introduce a diffusion model-based technology that can rotate the anatomical content of any input radiograph in 3D space, potentially enabling the visualization of the entire anatomical content of the radiograph from any viewpoint in 3D. Similar to previous studies, we used CT volumes to create Digitally Reconstructed Radiographs (DRRs) as the training data for our model. However, we addressed two significant limitations encountered in previous studies: 1. We utilized conditional diffusion models with classifier-free guidance instead of Generative Adversarial Networks (GANs) to achieve higher mode coverage and improved output image quality, with the only trade-off being slower inference time, which is often less critical in medical applications; and 2. We demonstrated that the unreliable output of style transfer deep learning (DL) models such as Cycle-GAN, used to transfer the style of actual radiographs to DRRs, can be replaced with a simple yet effective training transformation that randomly changes the pixel intensity histograms of the input and ground-truth imaging data during training. This transformation makes the diffusion model agnostic to any distribution variations of the input data pixel intensity, enabling the reliable training of a DL model on input DRRs and the application of the exact same model to conventional radiographs (or DRRs) during inference.
https://arxiv.org/abs/2404.13000
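The training transformation described above, randomly warping the pixel-intensity histogram identically for input and ground truth, can be sketched as follows. This is one plausible instantiation (random gamma curve plus random linear rescale), not necessarily the paper's exact transformation:

```python
import numpy as np

def random_intensity_remap(img, rng):
    """Randomly warp the pixel-intensity histogram: a random gamma curve
    followed by a random linear rescale. Applying the same remap to input
    and ground truth during training makes the model agnostic to
    intensity-distribution shifts between DRRs and real radiographs."""
    gamma = rng.uniform(0.5, 2.0)
    lo, hi = np.sort(rng.uniform(0.0, 1.0, size=2))
    warped = np.clip(img, 0.0, 1.0) ** gamma     # nonlinear histogram warp
    return lo + (hi - lo) * warped               # random linear rescale

rng = np.random.default_rng(5)
drr = rng.uniform(0, 1, size=(64, 64))           # stand-in for a DRR image
aug = random_intensity_remap(drr, rng)
```

Both steps are monotone in intensity, so anatomical ordering of pixel brightness is preserved while the histogram itself is randomized between samples.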
Artistic style transfer aims to transfer the learned artistic style onto an arbitrary content image, generating artistic stylized images. Existing generative adversarial network-based methods fail to generate highly realistic stylized images and always introduce obvious artifacts and disharmonious patterns. Recently, large-scale pre-trained diffusion models have opened up a new way of generating highly realistic artistic stylized images. However, diffusion model-based methods generally fail to preserve the content structure of input content images well, introducing unwanted content structures and style patterns. To address the above problems, we propose a novel pre-trained diffusion-based artistic style transfer method, called LSAST, which can generate highly realistic artistic stylized images while preserving the content structure of input content images well, without introducing obvious artifacts or disharmonious style patterns. Specifically, we introduce a Step-aware and Layer-aware Prompt Space, a set of learnable prompts, which can learn the style information from the collection of artworks and dynamically adjust the input images' content structure and style pattern. To train our prompt space, we propose a novel inversion method, called Step-aware and Layer-aware Prompt Inversion, which allows the prompt space to learn the style information of the artworks collection. In addition, we inject a pre-trained conditional branch of ControlNet into our LSAST, which further improves our framework's ability to maintain content structure. Extensive experiments demonstrate that our proposed method can generate more highly realistic artistic stylized images than the state-of-the-art artistic style transfer methods.
https://arxiv.org/abs/2404.11474
This research paper proposes a novel methodology for image-to-image style transfer on objects utilizing a single deep convolutional neural network. The proposed approach leverages the You Only Look Once version 8 (YOLOv8) segmentation model and the backbone neural network of YOLOv8 for style transfer. The primary objective is to enhance the visual appeal of objects in images by seamlessly transferring artistic styles while preserving the original object characteristics. The novelty of the proposed approach lies in combining segmentation and style transfer in a single deep convolutional neural network. This approach eliminates the need for multiple stages or models, resulting in simpler training and deployment of the model for practical applications. The results of this approach are shown on two content images by applying different style images. The paper also demonstrates the ability to apply style transfer to multiple objects in the same image.
https://arxiv.org/abs/2404.09461
In this work, we target the task of text-driven style transfer in the context of text-to-image (T2I) diffusion models. The main challenge is consistent structure preservation while enabling effective style transfer effects. Past approaches in this field directly concatenate the content and style prompts for prompt-level style injection, leading to unavoidable structure distortions. In this work, we propose a novel solution to the text-driven style transfer task, namely Adaptive Style Incorporation (ASI), to achieve fine-grained feature-level style incorporation. It consists of Siamese Cross-Attention (SiCA), which decouples the single-track cross-attention into a dual-track structure to obtain separate content and style features, and the Adaptive Content-Style Blending (AdaBlending) module, which couples the content and style information in a structure-consistent manner. Experimentally, our method exhibits much better performance in both structure preservation and stylized effects.
https://arxiv.org/abs/2404.06835
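The dual-track attention plus blending idea can be sketched as two independent cross-attentions whose outputs are mixed. Here the mix uses a fixed alpha, whereas ASI's AdaBlending is adaptive; all shapes are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(q, k, v):
    """Plain single-head cross-attention."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def siamese_cross_attention(q, k_c, v_c, k_s, v_s, alpha=0.5):
    """Run content and style through separate attention tracks, then blend.
    A fixed-alpha stand-in for the adaptive AdaBlending module."""
    return alpha * cross_attention(q, k_c, v_c) + (1 - alpha) * cross_attention(q, k_s, v_s)

rng = np.random.default_rng(6)
q = rng.normal(size=(4, 8))                           # 4 queries, dim 8
k_c, v_c = rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
k_s, v_s = rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
out = siamese_cross_attention(q, k_c, v_c, k_s, v_s)
```

Keeping the two tracks separate until the blend is what prevents the style keys/values from distorting the attention pattern that preserves content structure.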
Recently, a surge of 3D style transfer methods has been proposed that leverage the scene reconstruction power of a pre-trained neural radiance field (NeRF). To successfully stylize a scene this way, one must first reconstruct a photo-realistic radiance field from collected images of the scene. However, when only sparse input views are available, pre-trained few-shot NeRFs often suffer from high-frequency artifacts, generated as a by-product of the high-frequency details used to improve reconstruction quality. Is it possible to generate more faithful stylized scenes from sparse inputs by directly optimizing an encoding-based scene representation with the target style? In this paper, we consider the stylization of sparse-view scenes in terms of disentangling content semantics and style textures. We propose a coarse-to-fine sparse-view scene stylization framework, in which a novel hierarchical encoding-based neural representation is designed to generate high-quality stylized scenes directly from implicit scene representations. We also propose a new optimization strategy with content strength annealing to achieve realistic stylization and better content preservation. Extensive experiments demonstrate that our method can achieve high-quality stylization of sparse-view scenes and outperforms fine-tuning-based baselines in terms of stylization quality and efficiency.
https://arxiv.org/abs/2404.05236
With the rapid development of XR, 3D generation and editing are becoming more and more important; among these, stylization is an important tool for 3D appearance editing. It can achieve consistent 3D artistic stylization given a single reference style image and is thus a user-friendly editing approach. However, recent NeRF-based 3D stylization methods face efficiency issues that affect the actual user experience, and their implicit nature limits their ability to transfer geometric pattern styles. Additionally, the ability for artists to exert flexible control over stylized scenes is considered highly desirable, fostering an environment conducive to creative exploration. In this paper, we introduce StylizedGS, a 3D neural style transfer framework with adaptable control over perceptual factors, based on the 3D Gaussian Splatting (3DGS) representation. 3DGS brings the benefit of high efficiency. We propose a GS filter to eliminate floaters in the reconstruction that would otherwise degrade the stylization effects. A nearest-neighbor-based style loss is then introduced to achieve stylization by fine-tuning the geometry and color parameters of the 3DGS, while a depth preservation loss with other regularizations is proposed to prevent tampering with the geometry content. Moreover, facilitated by specially designed losses, StylizedGS enables users to control color, stylized scale, and regions during stylization, offering customized capabilities. Our method attains high-quality stylization results characterized by faithful brushstrokes and geometric consistency with flexible controls. Extensive experiments across various scenes and styles demonstrate the effectiveness and efficiency of our method in terms of both stylization quality and inference FPS.
https://arxiv.org/abs/2404.05220
We propose a novel approach to improve the reproducibility of neuroimaging results by converting statistic maps across different functional MRI pipelines. We make the assumption that pipelines can be considered as a style component of data and propose to use different generative models, among which, Diffusion Models (DM) to convert data between pipelines. We design a new DM-based unsupervised multi-domain image-to-image transition framework and constrain the generation of 3D fMRI statistic maps using the latent space of an auxiliary classifier that distinguishes statistic maps from different pipelines. We extend traditional sampling techniques used in DM to improve the transition performance. Our experiments demonstrate that our proposed methods are successful: pipelines can indeed be transferred, providing an important source of data augmentation for future medical studies.
https://arxiv.org/abs/2404.03703
Style transfer is a promising approach to close the sim-to-real gap in medical endoscopy. Rendering realistic endoscopic videos by traversing pre-operative scans (such as MRI or CT) can generate realistic simulations as well as ground truth camera poses and depth maps. Although image-to-image (I2I) translation models such as CycleGAN perform well, they are unsuitable for video-to-video synthesis due to the lack of temporal consistency, resulting in artifacts between frames. We propose MeshBrush, a neural mesh stylization method to synthesize temporally consistent videos with differentiable rendering. MeshBrush uses the underlying geometry of patient imaging data while leveraging existing I2I methods. With learned per-vertex textures, the stylized mesh guarantees consistency while producing high-fidelity outputs. We demonstrate that mesh stylization is a promising approach for creating realistic simulations for downstream tasks such as training and preoperative planning. Although our method is tested and designed for ureteroscopy, its components are transferable to general endoscopic and laparoscopic procedures.
https://arxiv.org/abs/2404.02999
Foundation models have emerged as pivotal tools, tackling many complex tasks through pre-training on vast datasets and subsequent fine-tuning for specific applications. The Segment Anything Model is one of the first and most well-known foundation models for computer vision segmentation tasks. This work presents a multi-faceted red-teaming analysis that tests the Segment Anything Model against challenging tasks: (1) We analyze the impact of style transfer on segmentation masks, demonstrating that applying adverse weather conditions and raindrops to dashboard images of city roads significantly distorts generated masks. (2) We focus on assessing whether the model can be used for attacks on privacy, such as recognizing celebrities' faces, and show that the model possesses some undesired knowledge in this task. (3) Finally, we check how robust the model is to adversarial attacks on segmentation masks under text prompts. We not only show the effectiveness of popular white-box attacks and resistance to black-box attacks but also introduce a novel approach - Focused Iterative Gradient Attack (FIGA) that combines white-box approaches to construct an efficient attack resulting in a smaller number of modified pixels. All of our testing methods and analyses indicate a need for enhanced safety measures in foundation models for image segmentation.
https://arxiv.org/abs/2404.02067
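An attack that restricts modifications to few pixels, in the spirit of FIGA, can be sketched as one gradient step applied only at the top-k gradient magnitudes. This is a hypothetical simplification with a random placeholder gradient, not the paper's algorithm:

```python
import numpy as np

def focused_gradient_step(x, grad, k=10, alpha=0.1):
    """One step of a focused iterative gradient attack: perturb only the k
    pixels with the largest gradient magnitude, keeping the number of
    modified pixels small."""
    flat = np.abs(grad).ravel()
    idx = np.argpartition(flat, -k)[-k:]          # indices of the top-k gradients
    step = np.zeros_like(x).ravel()
    step[idx] = alpha * np.sign(grad.ravel()[idx])
    return np.clip(x + step.reshape(x.shape), 0.0, 1.0)

rng = np.random.default_rng(7)
img = rng.uniform(0.2, 0.8, size=(16, 16))
grad = rng.normal(size=(16, 16))   # placeholder; a real attack backprops
                                   # through the segmentation model's loss
adv = focused_gradient_step(img, grad, k=10, alpha=0.1)
n_changed = int((adv != img).sum())
```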
Text detoxification is a textual style transfer (TST) task in which a text is paraphrased from a toxic surface form, e.g. featuring rude words, to the neutral register. Recently, text detoxification methods have found application in various tasks, such as the detoxification of Large Language Models (LLMs) (Leong et al., 2023; He et al., 2024; Tang et al., 2023) and combating toxic speech in social networks (Deng et al., 2023; Mun et al., 2023; Agarwal et al., 2023). All these applications are extremely important for ensuring safe communication in modern digital worlds. However, the previous approaches to parallel text detoxification corpus collection -- ParaDetox (Logacheva et al., 2022) and APPDIA (Atwell et al., 2022) -- were explored only in a monolingual setup. In this work, we aim to extend the ParaDetox pipeline to multiple languages, presenting MultiParaDetox to automate parallel detoxification corpus collection for potentially any language. We then experiment with different text detoxification models -- from unsupervised baselines to LLMs and models fine-tuned on the presented parallel corpora -- showing the great benefit of a parallel corpus for obtaining state-of-the-art text detoxification models for any language.
https://arxiv.org/abs/2404.02037
Our paper addresses the complex task of transferring a hairstyle from a reference image to an input photo for virtual hair try-on. This task is challenging due to the need to adapt to various photo poses, the sensitivity of hairstyles, and the lack of objective metrics. Current state-of-the-art hairstyle transfer methods use an optimization process for different parts of the approach, making them inexcusably slow. At the same time, faster encoder-based models are of very low quality because they either operate in StyleGAN's W+ space or use other low-dimensional image generators. Additionally, both approaches struggle with hairstyle transfer when the source pose differs greatly from the target pose, because they either don't consider the pose at all or deal with it inefficiently. In our paper, we present the HairFast model, which uniquely solves these problems and achieves high resolution, near real-time performance, and superior reconstruction compared to optimization-based methods. Our solution includes a new architecture operating in the FS latent space of StyleGAN, an enhanced inpainting approach, improved encoders for better alignment and color transfer, and a new encoder for post-processing. The effectiveness of our approach is demonstrated on realism metrics after random hairstyle transfer and on reconstruction when the original hairstyle is transferred. In the most difficult scenario of transferring both the shape and color of a hairstyle from different images, our method performs in less than a second on an Nvidia V100. Our code is available at this https URL.
https://arxiv.org/abs/2404.01094