As humans, we aspire to create media content that is both freely willed and readily controlled. Thanks to the prominent development of generative techniques, we can now easily use 2D diffusion methods to synthesize images controlled by raw sketches or designated human poses, and even progressively edit/regenerate local regions with masked inpainting. However, similar workflows in 3D modeling tasks are still unavailable due to the lack of controllability and efficiency in 3D generation. In this paper, we present a novel controllable and interactive 3D asset modeling framework, named Coin3D. Coin3D allows users to control the 3D generation using a coarse geometry proxy assembled from basic shapes, and introduces an interactive generation workflow that supports seamless local part editing while delivering responsive 3D object previews within a few seconds. To this end, we develop several techniques, including a 3D adapter that applies volumetric coarse-shape control to the diffusion model, a proxy-bounded editing strategy for precise part editing, a progressive volume cache to support responsive previews, and volume-SDS to ensure consistent mesh reconstruction. Extensive experiments of interactive generation and editing on diverse shape proxies demonstrate that our method achieves superior controllability and flexibility in the 3D asset generation task.
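To make the volumetric control idea concrete, here is a minimal sketch (not the authors' code) of how a 3D adapter could voxelize a coarse proxy and add its features to a diffusion UNet block; the module name ProxyAdapter3D, the 32³ grid, and the feature dimensions are illustrative assumptions.

```python
# A hedged sketch of volumetric coarse-shape control, assuming a simple
# occupancy-grid proxy and an additive residual on one UNet feature map.
import torch
import torch.nn as nn

class ProxyAdapter3D(nn.Module):
    """Encodes a voxelized proxy and projects it to a 2D feature residual."""
    def __init__(self, feat_dim: int = 320, vox_res: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(              # coarse 3D feature extractor
            nn.Conv3d(1, 16, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv3d(16, 32, 3, stride=2, padding=1), nn.SiLU(),
        )
        self.to_residual = nn.Conv2d(32 * (vox_res // 4), feat_dim, 1)

    def forward(self, proxy_voxels: torch.Tensor, unet_feat: torch.Tensor) -> torch.Tensor:
        # proxy_voxels: (B, 1, D, H, W) occupancy grid assembled from basic shapes
        vol = self.encoder(proxy_voxels)           # (B, 32, D/4, H/4, W/4)
        b, c, d, h, w = vol.shape
        planes = vol.reshape(b, c * d, h, w)       # collapse depth into channels
        residual = self.to_residual(planes)
        residual = nn.functional.interpolate(residual, size=unet_feat.shape[-2:])
        return unet_feat + residual                # additive control signal

adapter = ProxyAdapter3D()
voxels = torch.rand(1, 1, 32, 32, 32)              # coarse proxy occupancy
feat = torch.rand(1, 320, 64, 64)                  # an intermediate UNet feature map
print(adapter(voxels, feat).shape)                 # torch.Size([1, 320, 64, 64])
```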
https://arxiv.org/abs/2405.08054
Humans are remarkable in their ability to navigate without metric information. We can read abstract 2D maps, such as floor-plans or hand-drawn sketches, and use them to navigate in unseen, rich 3D environments, without requiring prior traversals to map out these scenes in detail. We posit that this is enabled by the ability to represent the environment abstractly as interconnected navigational behaviours, e.g., "follow the corridor" or "turn right", while avoiding detailed, accurate spatial information at the metric level. We introduce the Scene Action Map (SAM), a behavioural topological graph, and propose a learnable map-reading method, which parses a variety of 2D maps into SAMs. Map-reading extracts salient information about navigational behaviours from the overlooked wealth of pre-existing, abstract and inaccurate maps, ranging from floor-plans to sketches. We evaluate the performance of SAMs for navigation by building and deploying a behavioural navigation stack on a quadrupedal robot. Videos and more information are available at: this https URL.
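A toy sketch of a Scene Action Map as a behavioural topological graph, using networkx; the node and behaviour names are invented for illustration, not taken from the paper.

```python
# A minimal SAM-style graph: nodes are salient places, edges carry
# navigational behaviours rather than metric poses (illustrative labels).
import networkx as nx

sam = nx.DiGraph()
sam.add_edge("lobby", "corridor_junction", behaviour="follow_corridor")
sam.add_edge("corridor_junction", "lab_door", behaviour="turn_right")
sam.add_edge("corridor_junction", "stairwell", behaviour="turn_left")
sam.add_edge("lab_door", "lab", behaviour="enter_door")

# Planning over the SAM yields a sequence of behaviours, not metric waypoints.
route = nx.shortest_path(sam, "lobby", "lab")
behaviours = [sam.edges[u, v]["behaviour"] for u, v in zip(route, route[1:])]
print(behaviours)  # ['follow_corridor', 'turn_right', 'enter_door']
```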
https://arxiv.org/abs/2405.07948
Hand-drawn cartoon animation employs sketches and flat-color segments to create the illusion of motion. While recent advancements like CLIP, SVD, and Sora show impressive results in understanding and generating natural video by scaling large models with extensive datasets, they are not as effective for cartoons. Through our empirical experiments, we argue that this ineffectiveness stems from a notable bias in hand-drawn cartoons that diverges from the distribution of natural videos. Can we harness the success of the scaling paradigm to benefit cartoon research? Unfortunately, until now, there has not been a sizable cartoon dataset available for exploration. In this research, we propose the Sakuga-42M Dataset, the first large-scale cartoon animation dataset. Sakuga-42M comprises 42 million keyframes covering various artistic styles, regions, and years, with comprehensive semantic annotations including video-text description pairs, anime tags, content taxonomies, etc. We pioneer the benefits of such a large-scale cartoon dataset on comprehension and generation tasks by finetuning contemporary foundation models like Video CLIP, Video Mamba, and SVD, achieving outstanding performance on cartoon-related tasks. Our motivation is to introduce large-scale training to cartoon research and foster generalization and robustness in future cartoon applications. Dataset, Code, and Pretrained Models will be publicly available.
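For a sense of what a comprehensively annotated clip could look like, here is a hypothetical record schema; the field names are assumptions based on the abstract, not the released format.

```python
# An illustrative per-clip record for a Sakuga-42M style dataset; all field
# names and values are placeholders, not the actual schema.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class CartoonClipRecord:
    clip_id: str
    keyframe_paths: List[str]
    caption: str                      # video-text description pair
    anime_tags: List[str] = field(default_factory=list)
    taxonomy: str = ""                # assumed content-taxonomy label
    year: Optional[int] = None
    region: str = ""

record = CartoonClipRecord(
    clip_id="sample_0001",
    keyframe_paths=["frames/0001_000.png", "frames/0001_012.png"],
    caption="a hand-drawn character runs across a rooftop at dusk",
    anime_tags=["running", "rooftop"],
    taxonomy="action",
    year=1998,
    region="JP",
)
print(record.caption)
```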
https://arxiv.org/abs/2405.07425
In Re-identification (ReID), recent advancements yield noteworthy progress in both unimodal and cross-modal retrieval tasks. However, the challenge persists in developing a unified framework that can effectively handle varying multimodal data, including RGB, infrared, sketches, and textual information. Additionally, the emergence of large-scale models shows promising performance in various vision tasks, but a foundation model for ReID is still missing. In response to these challenges, a novel multimodal learning paradigm for ReID is introduced, referred to as All-in-One (AIO), which harnesses a frozen pre-trained big model as an encoder, enabling effective multimodal retrieval without additional fine-tuning. The diverse multimodal data in AIO are seamlessly tokenized into a unified space, allowing the modality-shared frozen encoder to extract identity-consistent features comprehensively across all modalities. Furthermore, a meticulously crafted ensemble of cross-modality heads is designed to guide the learning trajectory. AIO is the first framework to perform all-in-one ReID, encompassing four commonly used modalities. Experiments on cross-modal and multimodal ReID reveal that AIO not only adeptly handles various modal data but also excels in challenging contexts, showcasing exceptional performance in zero-shot and domain generalization scenarios.
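A hedged sketch of the all-in-one recipe: each modality is tokenized into a shared space and passed through a frozen encoder; the tokenizer design, dimensions, and pooling are assumptions, and the tiny transformer below merely stands in for the frozen pre-trained big model.

```python
# Illustrative modality-specific tokenizers feeding a frozen shared encoder.
import torch
import torch.nn as nn

class ModalityTokenizer(nn.Module):
    def __init__(self, in_ch: int, dim: int = 768, patch: int = 16):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                      # (B, C, H, W) -> (B, N, dim)
        return self.proj(x).flatten(2).transpose(1, 2)

tokenizers = nn.ModuleDict({
    "rgb": ModalityTokenizer(3), "infrared": ModalityTokenizer(1),
    "sketch": ModalityTokenizer(1),
})
frozen_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=2)
for p in frozen_encoder.parameters():          # the big model stays frozen
    p.requires_grad_(False)

def extract_id_feature(modality: str, image: torch.Tensor) -> torch.Tensor:
    tokens = tokenizers[modality](image)       # modality-specific tokenization
    feats = frozen_encoder(tokens)             # modality-shared frozen encoder
    return feats.mean(dim=1)                   # pooled identity feature

print(extract_id_feature("sketch", torch.rand(2, 1, 256, 128)).shape)  # (2, 768)
```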
https://arxiv.org/abs/2405.04741
Deep learning-based sketch-to-clothing image generation provides initial designs and inspiration in the fashion design process. However, clothing generation from freehand drawing is challenging due to the sparse and ambiguous information in the drawn sketches. Current generation models may have difficulty generating detailed texture information. In this work, we propose TexControl, a sketch-based fashion generation framework that uses a two-stage pipeline to generate the fashion image corresponding to the sketch input. First, we adopt ControlNet to generate the fashion image from the sketch while keeping the image outline stable. Then, we use an image-to-image method to optimize the detailed textures of the generated images and obtain the final results. The evaluation results show that TexControl can generate fashion images with high-quality, fine-grained textures.
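A minimal two-stage pipeline in the same spirit, built from off-the-shelf diffusers components rather than the authors' implementation; the model IDs, input paths, and the 0.4 refinement strength are illustrative choices.

```python
# Stage 1: sketch-conditioned ControlNet generation; Stage 2: img2img texture
# refinement. A sketch of the workflow, not TexControl itself.
from diffusers import (ControlNetModel, StableDiffusionControlNetPipeline,
                       StableDiffusionImg2ImgPipeline)
from PIL import Image
import torch

sketch = Image.open("fashion_sketch.png").convert("RGB")   # assumed input file
prompt = "a knitted sweater with detailed wool texture, studio photo"

# Stage 1: keep the garment outline stable with a scribble/sketch ControlNet.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-scribble", torch_dtype=torch.float16)
stage1 = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")
coarse = stage1(prompt, image=sketch, num_inference_steps=30).images[0]

# Stage 2: image-to-image pass to enrich fine-grained texture while keeping
# the stage-1 layout (low strength = small deviation from the input).
stage2 = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
final = stage2(prompt, image=coarse, strength=0.4).images[0]
final.save("fashion_result.png")
```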
https://arxiv.org/abs/2405.04675
Crafting effective prompts for code generation or editing with Large Language Models (LLMs) is not an easy task. Particularly, the absence of immediate, stable feedback during prompt crafting hinders effective interaction, as users are left to mentally imagine possible outcomes until the code is generated. In response, we introduce Language-Oriented Code Sketching, an interactive approach that provides instant, incremental feedback in the form of code sketches (i.e., incomplete code outlines) during prompt crafting. This approach converts a prompt into a code sketch by leveraging the inherent linguistic structures within the prompt and applying classic natural language processing techniques. The sketch then serves as an intermediate placeholder that not only previews the intended code structure but also guides the LLM towards the desired code, thereby enhancing human-LLM interaction. We conclude by discussing the approach's applicability and future plans.
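A deliberately simple illustration of the prompt-to-sketch idea using plain string processing; the splitting heuristics below are our own stand-in for the classic NLP techniques the paper applies.

```python
# Map the linguistic structure of a prompt onto an incomplete code outline
# that a user could see while still typing (toy heuristics, not the paper's parser).
import re

def prompt_to_code_sketch(prompt: str) -> str:
    # Treat commas, "and", and "then" as separators between intended steps.
    steps = re.split(r",|\band\b|\bthen\b", prompt.lower())
    lines = ["def main():"]
    for step in filter(None, (s.strip() for s in steps)):
        fn = re.sub(r"[^a-z0-9]+", "_", step).strip("_")
        lines.append(f"    {fn}()   # TODO: {step}")
    return "\n".join(lines)

print(prompt_to_code_sketch(
    "load the csv file, drop empty rows and plot sales per month"))
# def main():
#     load_the_csv_file()   # TODO: load the csv file
#     drop_empty_rows()   # TODO: drop empty rows
#     plot_sales_per_month()   # TODO: plot sales per month
```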
https://arxiv.org/abs/2405.03998
Cross-modality distillation arises as an important topic for data modalities containing limited knowledge, such as depth maps and high-quality sketches. Such techniques are of great importance, especially for memory- and privacy-restricted scenarios where labeled training data is generally unavailable. To solve the problem, existing label-free methods leverage a few pairwise unlabeled data to distill the knowledge by aligning features or statistics between the source and target modalities. For instance, one typically aims to minimize the L2 distance or contrastive loss between the learned features of pairs of samples in the source (e.g. image) and the target (e.g. sketch) modalities. However, most algorithms in this domain only focus on the experimental results but lack theoretical insight. To bridge the gap between the theory and practical methods of cross-modality distillation, we first formulate a general framework of cross-modality contrastive distillation (CMCD), built upon contrastive learning that leverages both positive and negative correspondence, towards a better distillation of generalizable features. Furthermore, we establish a thorough convergence analysis that reveals that the distance between source and target modalities significantly impacts the test error on downstream tasks within the target modality, which is also validated by the empirical results. Extensive experimental results show that our algorithm outperforms existing algorithms consistently by a margin of 2-3% across diverse modalities and tasks, covering modalities of image, sketch, depth map, and audio and tasks of recognition and segmentation.
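A hedged sketch of a cross-modality contrastive distillation objective in this spirit: paired source/target features are positives and all other pairs in the batch are negatives; the temperature and the symmetric form are illustrative assumptions, not the paper's exact loss.

```python
# Symmetric InfoNCE between source-modality (teacher) and target-modality
# (student) features; matched pairs on the diagonal are the positives.
import torch
import torch.nn.functional as F

def contrastive_distill_loss(src_feat, tgt_feat, temperature: float = 0.07):
    # src_feat: features from the source modality (e.g. image teacher)
    # tgt_feat: features from the target modality (e.g. sketch student)
    src = F.normalize(src_feat, dim=-1)
    tgt = F.normalize(tgt_feat, dim=-1)
    logits = tgt @ src.t() / temperature          # (B, B) similarity matrix
    labels = torch.arange(src.size(0), device=src.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

loss = contrastive_distill_loss(torch.randn(8, 256), torch.randn(8, 256))
print(float(loss))
```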
https://arxiv.org/abs/2405.03355
Federated learning can train models without directly providing local data to the server. However, the frequent updating of the local model brings the problem of large communication overhead. Recently, researchers have improved the communication efficiency of federated learning mainly through model compression. But they ignore two problems: 1) the network state of each client changes dynamically; 2) the network state differs across clients. Clients with poor bandwidth update their local models slowly, which leads to low efficiency. To address this challenge, we propose a communication-efficient federated learning algorithm with adaptive compression under dynamic bandwidth (called AdapComFL). Concretely, each client performs bandwidth awareness and bandwidth prediction. Then, each client adaptively compresses its local model via an improved sketch mechanism based on its predicted bandwidth. Further, the server aggregates the received sketched models of different sizes. To verify the effectiveness of the proposed method, the experiments are based on real bandwidth data collected from a network topology we built, and benchmark datasets obtained from open repositories. We show the performance of the AdapComFL algorithm and compare it with existing algorithms. The experimental results show that AdapComFL achieves more efficient communication as well as competitive accuracy compared to existing algorithms.
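An illustrative sketch of bandwidth-adaptive compression using a count sketch, where the client chooses the sketch width from its predicted bandwidth; the linear sizing rule and the count-sketch choice are assumptions, not the paper's exact mechanism.

```python
# Compress a flattened model update into a count sketch whose width scales
# with the client's predicted bandwidth (toy sizing rule).
import numpy as np

def count_sketch(vec: np.ndarray, width: int, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    buckets = rng.integers(0, width, size=vec.size)   # hash each coordinate
    signs = rng.choice([-1.0, 1.0], size=vec.size)    # random signs
    sketch = np.zeros(width)
    np.add.at(sketch, buckets, signs * vec)
    return sketch

def sketch_width_for(predicted_mbps: float, full_dim: int) -> int:
    # Toy rule: scale the sketch linearly with bandwidth, capped at full size.
    return max(64, min(full_dim, int(full_dim * predicted_mbps / 100.0)))

local_update = np.random.randn(10_000)                # flattened model delta
width = sketch_width_for(predicted_mbps=12.0, full_dim=local_update.size)
compressed = count_sketch(local_update, width)
print(width, compressed.shape)                        # 1200 (1200,)
```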
https://arxiv.org/abs/2405.03248
We present SketchGPT, a flexible framework that employs a sequence-to-sequence autoregressive model for sketch generation and completion, and an interpretation case study for sketch recognition. By mapping complex sketches into simplified sequences of abstract primitives, our approach significantly streamlines the input for autoregressive modeling. SketchGPT leverages the next-token prediction objective to understand sketch patterns, facilitating the creation and completion of drawings and also categorizing them accurately. This proposed sketch representation strategy helps overcome existing challenges of autoregressive modeling for continuous stroke data, enabling smoother model training and competitive performance. Our findings, supported by qualitative and quantitative comparisons with the existing state of the art and a comprehensive human evaluation study, demonstrate SketchGPT's capability to generate a diverse variety of drawings. The code and pretrained models will be released on our official GitHub.
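A toy illustration of mapping continuous strokes to a short sequence of abstract primitives that an autoregressive model could consume; the eight-direction vocabulary is an invented stand-in for the paper's primitives.

```python
# Quantize stroke segments into discrete direction primitives.
import math

PRIMITIVES = ["E", "NE", "N", "NW", "W", "SW", "S", "SE"]  # move directions

def strokes_to_tokens(points):
    """points: list of (x, y); emits one direction primitive per segment."""
    tokens = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        angle = math.atan2(y1 - y0, x1 - x0)
        idx = int(round(angle / (math.pi / 4))) % 8
        tokens.append(PRIMITIVES[idx])
    return tokens

# A square drawn stroke by stroke becomes a compact, discrete sequence that a
# next-token model can complete or classify.
square = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
print(strokes_to_tokens(square))  # ['E', 'N', 'W', 'S']
```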
https://arxiv.org/abs/2405.03099
We are witnessing a revolution in conditional image synthesis with the recent success of large-scale text-to-image generation methods. This success also opens up new opportunities in controlling the generation and editing process using multi-modal input. While spatial control using cues such as depth, sketch, and other images has attracted a lot of research, we argue that another equally effective modality is audio, since sound and sight are two main components of human perception. Hence, we propose a method to enable audio-conditioning in large-scale image diffusion models. Our method first maps features obtained from audio clips to tokens that can be injected into the diffusion model in a fashion similar to text tokens. We introduce additional audio-image cross-attention layers, which we finetune while freezing the weights of the original layers of the diffusion model. In addition to audio-conditioned image generation, our method can also be utilized in conjunction with diffusion-based editing methods to enable audio-conditioned image editing. We demonstrate our method on a wide range of audio and image datasets. We perform extensive comparisons with recent methods and show favorable performance.
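A hedged sketch of the adapter idea: audio features are projected to tokens and attended to by a new cross-attention layer while the original diffusion weights stay frozen; the dimensions, zero-initialized gate, and module name are assumptions.

```python
# Trainable audio-image cross-attention bolted onto a frozen UNet block.
import torch
import torch.nn as nn

class AudioCrossAttentionAdapter(nn.Module):
    def __init__(self, dim: int = 320, audio_dim: int = 512, heads: int = 8):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, dim)     # audio feats -> tokens
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))        # starts as a no-op

    def forward(self, image_tokens, audio_feats):
        # image_tokens: (B, N, dim) from a frozen UNet block
        # audio_feats:  (B, T, audio_dim) from an audio encoder
        audio_tokens = self.audio_proj(audio_feats)
        attended, _ = self.attn(image_tokens, audio_tokens, audio_tokens)
        return image_tokens + self.gate.tanh() * attended

adapter = AudioCrossAttentionAdapter()
out = adapter(torch.rand(2, 4096, 320), torch.rand(2, 50, 512))
print(out.shape)  # torch.Size([2, 4096, 320])
```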
https://arxiv.org/abs/2405.00878
App developers use the Graphical User Interface (GUI) of other apps as an important source of inspiration to design and improve their own apps. In recent years, research has suggested various approaches to retrieve GUI designs that fit a certain text query from screenshot datasets acquired through automated GUI exploration. However, such text-to-GUI retrieval approaches only leverage the textual information of the GUI elements in the screenshots, neglecting visual information such as icons or background images. In addition, the retrieved screenshots are not steered by app developers and often lack important app features, e.g. UI pages that require user authentication. To overcome these limitations, this paper proposes GUing, a GUI search engine based on a vision-language model called UIClip, which we trained specifically for the app GUI domain. For this, we first collected app introduction images from Google Play, which usually display the most representative screenshots selected and often captioned (i.e. labeled) by app vendors. Then, we developed an automated pipeline to classify, crop, and extract the captions from these images. This finally results in a large dataset which we share with this paper: including 303k app screenshots, out of which 135k have captions. We used this dataset to train a novel vision-language model, which is, to the best of our knowledge, the first of its kind in GUI retrieval. We evaluated our approach on various datasets from related work and in a manual experiment. The results demonstrate that our model outperforms previous approaches in text-to-GUI retrieval, achieving a Recall@10 of up to 0.69 and a HIT@10 of 0.91. We also explored the performance of UIClip for other GUI tasks, including GUI classification and Sketch-to-GUI retrieval, with encouraging results.
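A minimal text-to-GUI retrieval sketch using a generic CLIP checkpoint as a stand-in for UIClip; the screenshot paths and the checkpoint choice are assumptions for illustration.

```python
# Rank screenshots by cosine similarity between text-query and image embeddings.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

screens = [Image.open(p) for p in ["login.png", "checkout.png", "settings.png"]]
query = "screen where the user signs in with email and password"

inputs = processor(text=[query], images=screens, return_tensors="pt", padding=True)
with torch.no_grad():
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
img = img / img.norm(dim=-1, keepdim=True)
txt = txt / txt.norm(dim=-1, keepdim=True)
ranking = (txt @ img.t()).squeeze(0).argsort(descending=True)
print(ranking)  # indices of screenshots, best match first
```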
https://arxiv.org/abs/2405.00145
Imbuing machines with the ability to talk has been a longtime pursuit of artificial intelligence (AI) research. From the very beginning, the community has not only aimed to synthesise high-fidelity speech that accurately conveys the semantic meaning of an utterance, but also to colour it with inflections that cover the same range of affective expressions that humans are capable of. After many years of research, it appears that we are on the cusp of achieving this when it comes to single, isolated utterances. This unveils an abundance of potential avenues to explore when it comes to combining these single utterances with the aim of synthesising more complex, longer-term behaviours. In the present chapter, we outline the methodological advances that brought us so far and sketch out the ongoing efforts to reach that coveted next level of artificial expressivity. We also discuss the societal implications coupled with rapidly advancing expressive speech synthesis (ESS) technology and highlight ways to mitigate those risks and ensure the alignment of ESS capabilities with ethical norms.
https://arxiv.org/abs/2404.19363
Sketch-based image retrieval (SBIR) associates hand-drawn sketches with their corresponding realistic images. In this study, we aim to tackle two major challenges of this task simultaneously: i) zero-shot, dealing with unseen categories, and ii) fine-grained, referring to intra-category instance-level retrieval. Our key innovation lies in the realization that solely addressing this cross-category and fine-grained recognition task from the generalization perspective may be inadequate since the knowledge accumulated from limited seen categories might not be fully valuable or transferable to unseen target categories. Inspired by this, in this work, we propose a dual-modal prompting CLIP (DP-CLIP) network, in which an adaptive prompting strategy is designed. Specifically, to facilitate the adaptation of our DP-CLIP toward unpredictable target categories, we employ a set of images within the target category and the textual category label to respectively construct a set of category-adaptive prompt tokens and channel scales. By integrating the generated guidance, DP-CLIP could gain valuable category-centric insights, efficiently adapting to novel categories and capturing unique discriminative clues for effective retrieval within each target category. With these designs, our DP-CLIP outperforms the state-of-the-art fine-grained zero-shot SBIR method by 7.3% in Acc.@1 on the Sketchy dataset. Meanwhile, in the other two category-level zero-shot SBIR benchmarks, our method also achieves promising performance.
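A rough sketch of how category-adaptive prompt tokens and channel scales might be built from a handful of support images; the module names, shapes, and mean-pooled context are our assumptions based on the abstract, not released code.

```python
# Build prompt tokens and per-channel gates from support-image features.
import torch
import torch.nn as nn

class CategoryAdapter(nn.Module):
    def __init__(self, feat_dim: int = 512, n_prompts: int = 4):
        super().__init__()
        self.to_prompts = nn.Linear(feat_dim, n_prompts * feat_dim)
        self.to_scales = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.Sigmoid())

    def forward(self, support_feats: torch.Tensor):
        # support_feats: (K, feat_dim) features of K support images of a category
        ctx = support_feats.mean(dim=0)                     # category context
        prompts = self.to_prompts(ctx).view(-1, support_feats.size(-1))
        scales = self.to_scales(ctx)                        # per-channel gates
        return prompts, scales

adapter = CategoryAdapter()
prompts, scales = adapter(torch.randn(5, 512))
print(prompts.shape, scales.shape)   # torch.Size([4, 512]) torch.Size([512])
```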
https://arxiv.org/abs/2404.18695
Generating face images with specific gaze information has attracted considerable attention. Existing approaches typically input gaze values directly for face generation, which is unnatural and requires annotated gaze datasets for training, thereby limiting their application. In this paper, we present a novel gaze-controllable face generation task. Our approach takes textual descriptions of human gaze and head behavior as input and generates corresponding face images. Our work first introduces a text-of-gaze dataset containing over 90k text descriptions spanning a dense distribution of gaze and head poses. We further propose a gaze-controllable text-to-face method. Our method contains a sketch-conditioned face diffusion module and a model-based sketch diffusion module. We define a face sketch based on facial landmarks and an eye segmentation map. The face diffusion module generates face images from the face sketch, and the sketch diffusion module employs a 3D face model to generate the face sketch from the text description. Experiments on the FFHQ dataset show the effectiveness of our method. We will release our dataset and code for future research.
https://arxiv.org/abs/2404.17486
The rapid evolution of the fashion industry increasingly intersects with technological advancements, particularly through the integration of generative AI. This study introduces a novel generative pipeline designed to transform the fashion design process by employing latent diffusion models. Utilizing ControlNet and LoRA fine-tuning, our approach generates high-quality images from multimodal inputs such as text and sketches. We leverage and enhance state-of-the-art virtual try-on datasets, including Multimodal Dress Code and VITON-HD, by integrating sketch data. Our evaluation, utilizing metrics like FID, CLIP Score, and KID, demonstrates that our model significantly outperforms traditional stable diffusion models. The results not only highlight the effectiveness of our model in generating fashion-appropriate outputs but also underscore the potential of diffusion models in revolutionizing fashion design workflows. This research paves the way for more interactive, personalized, and technologically enriched methodologies in fashion design and representation, bridging the gap between creative vision and practical application.
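As a small evaluation sketch, an FID score of the kind reported here can be computed with torchmetrics; the dummy tensors and the small feature setting below are placeholders, not the study's protocol.

```python
# Compare real try-on images against generated ones with FID.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# feature=64 keeps this toy example stable with few samples; real evaluations
# typically use feature=2048 and thousands of images.
fid = FrechetInceptionDistance(feature=64)

# uint8 images in (N, 3, H, W); in practice these would come from the try-on
# dataset (real) and the ControlNet/LoRA pipeline (generated).
real_images = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)
generated_images = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)

fid.update(real_images, real=True)
fid.update(generated_images, real=False)
print(f"FID: {fid.compute():.2f}")
```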
https://arxiv.org/abs/2404.18591
Geometry- and appearance-controlled full-body human image generation is an interesting but challenging task. Existing solutions are either unconditional or dependent on coarse conditions (e.g., pose, text), thus lacking explicit geometry and appearance control of body and garment. Sketching offers such editing ability and has been adopted in various sketch-based face generation and editing solutions. However, directly adapting sketch-based face generation to full-body generation often fails to produce high-fidelity and diverse results due to the high complexity and diversity in the pose, body shape, and garment shape and texture. Recent geometrically controllable diffusion-based methods mainly rely on prompts to generate appearance and it is hard to balance the realism and the faithfulness of their results to the sketch when the input is coarse. This work presents Sketch2Human, the first system for controllable full-body human image generation guided by a semantic sketch (for geometry control) and a reference image (for appearance control). Our solution is based on the latent space of StyleGAN-Human with inverted geometry and appearance latent codes as input. Specifically, we present a sketch encoder trained with a large synthetic dataset sampled from StyleGAN-Human's latent space and directly supervised by sketches rather than real images. Considering the entangled information of partial geometry and texture in StyleGAN-Human and the absence of disentangled datasets, we design a novel training scheme that creates geometry-preserved and appearance-transferred training data to tune a generator to achieve disentangled geometry and appearance control. Although our method is trained with synthetic data, it can handle hand-drawn sketches as well. Qualitative and quantitative evaluations demonstrate the superior performance of our method to state-of-the-art methods.
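A simplified sketch of combining an inverted geometry code and an appearance code in a StyleGAN-style W+ space; the 18-layer codes and the split index are illustrative assumptions about the mechanism, not released code.

```python
# Coarse layers take geometry from the sketch-inverted code, fine layers take
# appearance from the reference-inverted code.
import torch

def mix_codes(w_geometry: torch.Tensor, w_appearance: torch.Tensor,
              split: int = 8) -> torch.Tensor:
    """w_*: (18, 512) per-layer latent codes; early layers control structure."""
    mixed = w_appearance.clone()
    mixed[:split] = w_geometry[:split]      # keep pose/shape from the sketch
    return mixed

w_from_sketch = torch.randn(18, 512)        # inverted from the semantic sketch
w_from_reference = torch.randn(18, 512)     # inverted from the reference image
print(mix_codes(w_from_sketch, w_from_reference).shape)  # torch.Size([18, 512])
```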
https://arxiv.org/abs/2404.15889
In this work, we study the task of sketch-guided image inpainting. Unlike the well-explored natural language-guided image inpainting, which excels in capturing semantic details, the relatively less-studied sketch-guided inpainting offers greater user control in specifying the shape and pose of the object to be inpainted. As one of the early solutions to this task, we introduce a novel partial discrete diffusion process (PDDP). The forward pass of the PDDP corrupts the masked regions of the image and the backward pass reconstructs these masked regions conditioned on hand-drawn sketches using our proposed sketch-guided bi-directional transformer. The proposed novel transformer module accepts two inputs -- the image containing the masked region to be inpainted and the query sketch to model the reverse diffusion process. This strategy effectively addresses the domain gap between sketches and natural images, thereby enhancing the quality of inpainting results. In the absence of a large-scale dataset specific to this task, we synthesize a dataset from MS-COCO to train and extensively evaluate our proposed framework against various competent approaches in the literature. The qualitative and quantitative results and user studies establish that the proposed method inpaints realistic objects that fit the context in terms of the visual appearance of the provided sketch. To aid further research, we have made our code publicly available at this https URL.
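A toy sketch of the "partial" forward corruption: only tokens inside the user mask are replaced by an absorbing state, at a rate that grows with the timestep; the token ids and linear schedule are assumptions.

```python
# Corrupt only the user-masked region of a discrete token grid.
import torch

MASK_TOKEN = 8191                      # assumed id of the absorbing state

def partial_forward_corrupt(tokens, region_mask, t, T):
    """tokens: (B, N) discrete image tokens; region_mask: (B, N) bool,
    True inside the area to inpaint; t/T controls the corruption probability."""
    corrupt_prob = t / T
    noise = torch.rand_like(tokens, dtype=torch.float)
    corrupt = region_mask & (noise < corrupt_prob)   # never touch the context
    return torch.where(corrupt, torch.full_like(tokens, MASK_TOKEN), tokens)

tokens = torch.randint(0, 8192, (1, 16))
region = torch.zeros(1, 16, dtype=torch.bool)
region[:, 6:12] = True                               # user-masked region only
print(partial_forward_corrupt(tokens, region, t=80, T=100))
```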
https://arxiv.org/abs/2404.11949
A positive margin may result in an increased risk of local recurrence after breast-conserving surgery for any malignant tumour. To reduce the number of positive margins, it would help to offer the surgeon real-time intra-operative information on the presence of positive resection margins. This study aims to design an intra-operative tumour margin evaluation scheme using specimen mammography in breast-conserving surgery. A total of 30 cases were evaluated and compared with contours manually determined by experienced physicians and with pathology reports. The proposed method utilizes image thresholding to extract regions of interest and then applies a deep learning model, SegNet, to segment the tumour tissue. The width of the normal-tissue margin surrounding the tumour is evaluated as the result. The desired margin size around the tumour was set to 10 mm. The smallest average difference from the manually sketched margin was 6.53 mm ± 5.84 mm. In all cases, the SegNet architecture was used to obtain the tissue specimen boundary and the tumour contour, respectively. The simulation results indicated that this technology is helpful in discriminating positive from negative margins in the intra-operative setting. The proposed scheme is intended as a potential procedure for an intra-operative measurement system. The experimental results reveal that deep learning techniques can produce results that are consistent with pathology reports.
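A rough sketch of the margin-measurement step once a specimen mask and a tumour mask are available, using a distance transform; the synthetic masks and the pixel calibration are placeholders, not the study's data or code.

```python
# Estimate the margin as the smallest distance from the tumour region to the
# specimen boundary, converted to millimetres.
import cv2
import numpy as np

# Synthetic placeholder masks; in practice these come from thresholding the
# specimen mammogram and from the SegNet tumour segmentation.
specimen_mask = np.zeros((400, 400), np.uint8)
cv2.circle(specimen_mask, (200, 200), 180, 255, -1)        # specimen region
tumour_mask = np.zeros((400, 400), np.uint8)
cv2.circle(tumour_mask, (250, 200), 60, 255, -1)           # tumour region
mm_per_pixel = 0.1                                          # assumed calibration

# Distance of every specimen pixel to the specimen boundary.
dist_to_boundary = cv2.distanceTransform(
    (specimen_mask > 0).astype(np.uint8), cv2.DIST_L2, 5)
margin_px = dist_to_boundary[tumour_mask > 0].min()        # closest tumour pixel
margin_mm = float(margin_px) * mm_per_pixel
print(f"estimated margin width: {margin_mm:.1f} mm (target >= 10 mm)")
```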
https://arxiv.org/abs/2404.10600
Our goal is to build embodied agents that can learn inductively generalizable spatial concepts in a continual manner, e.g., constructing a tower of a given height. Existing work suffers from certain limitations: (a) (Liang et al., 2023) and its multi-modal extensions rely heavily on prior knowledge and are not grounded in the demonstrations; (b) (Liu et al., 2023) lacks the ability to generalize due to its purely neural approach. A key challenge is to achieve a fine balance between symbolic representations, which have the capability to generalize, and neural representations, which are physically grounded. In response, we propose a neuro-symbolic approach that expresses inductive concepts as symbolic compositions over grounded neural concepts. Our key insight is to decompose the concept learning problem into the following steps: 1) Sketch: obtain a programmatic representation of the given instruction; 2) Plan: perform model-based RL over the sequence of grounded neural action concepts to learn a grounded plan; 3) Generalize: abstract out a generic (lifted) Python program to facilitate generalizability. Continual learning is achieved by interspersing the learning of grounded neural concepts with higher-level symbolic constructs. Our experiments demonstrate that our approach significantly outperforms existing baselines in terms of its ability to learn novel concepts and generalize inductively.
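An illustrative "lifted" program of the kind step 3 describes (our example, not the paper's): the inductive concept "tower of height n" becomes a symbolic loop over grounded action handles.

```python
# A height-generic program over grounded, learned action policies (stubbed here).
def build_tower(height: int, blocks, pick, place_on):
    """pick/place_on are assumed handles to grounded neural action concepts."""
    base = blocks[0]
    pick(base)
    place_on(base, target=None)            # put the first block on the table
    for i in range(1, height):             # the loop is what generalizes
        pick(blocks[i])
        place_on(blocks[i], target=blocks[i - 1])

# A stub environment shows the program is executable and height-generic.
log = []
build_tower(3, blocks=["b0", "b1", "b2"],
            pick=lambda b: log.append(f"pick {b}"),
            place_on=lambda b, target: log.append(f"place {b} on {target}"))
print(log)
```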
https://arxiv.org/abs/2404.07774
Human visual imagination usually begins with analogies or rough sketches. For example, given an image of a girl playing guitar before a building, one may analogously imagine how it would look if Iron Man were playing guitar before a pyramid in Egypt. Nonetheless, the visual condition may not be precisely aligned with the imagined result indicated by the text prompt, and existing layout-controllable text-to-image (T2I) generation models are prone to producing degraded results with obvious artifacts. To address this issue, we present a novel T2I generation method dubbed SmartControl, which is designed to modify the rough visual conditions to adapt to the text prompt. The key idea of our SmartControl is to relax the visual condition on the areas that conflict with the text prompts. Specifically, a Control Scale Predictor (CSP) is designed to identify the conflict regions and predict the local control scales, while a dataset with text prompts and rough visual conditions is constructed for training the CSP. It is worth noting that, even with a limited number (e.g., 1,000~2,000) of training samples, our SmartControl can generalize well to unseen objects. Extensive experiments on four typical visual condition types clearly show the efficacy of our SmartControl against state-of-the-art methods. Source code, pre-trained models, and datasets are available at this https URL.
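A hedged sketch of the core mechanism: a small predictor outputs a spatial control-scale map that down-weights the control residual where the rough condition conflicts with the prompt; the layer sizes and fusion form are illustrative assumptions.

```python
# Predict a per-pixel scale in [0, 1] and apply it to the control residual.
import torch
import torch.nn as nn

class ControlScalePredictor(nn.Module):
    def __init__(self, unet_ch: int = 320, ctrl_ch: int = 320):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(unet_ch + ctrl_ch, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, 1, 3, padding=1), nn.Sigmoid())   # scale in [0, 1]

    def forward(self, unet_feat, control_residual):
        scale = self.net(torch.cat([unet_feat, control_residual], dim=1))
        # Conflicting regions get a small scale, relaxing the visual condition.
        return unet_feat + scale * control_residual

csp = ControlScalePredictor()
out = csp(torch.rand(1, 320, 64, 64), torch.rand(1, 320, 64, 64))
print(out.shape)  # torch.Size([1, 320, 64, 64])
```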
https://arxiv.org/abs/2404.06451