With the benefit of deep learning techniques, recent research has made significant progress in image compression artifact reduction. Despite their improved performance, prevailing methods only learn a mapping from the compressed image to the original one and ignore the intrinsic attributes of the given compressed images, which greatly harms the performance of downstream parsing tasks. Different from these methods, we propose to decouple the intrinsic attributes into two complementary features for artifact reduction, i.e., compression-insensitive features that regularize the high-level semantic representations during training, and compression-sensitive features that are aware of the compression degree. To achieve this, we first employ adversarial training to regularize the compressed and original encoded features so as to retain high-level semantics, and we then develop a compression quality-aware feature encoder for the compression-sensitive features. Based on these dual complementary features, we propose a Dual Awareness Guidance Network (DAGN) that uses them as transformation guidance during the decoding phase. In DAGN, we develop a cross-feature fusion module that maintains the consistency of compression-insensitive features by fusing them into the artifact reduction baseline. Our method achieves an average PSNR gain of 2.06 dB on BSD500, outperforming state-of-the-art methods, and requires only 29.7 ms to process one image. Experimental results on LIVE1 and LIU4K further demonstrate the efficiency, effectiveness, and superiority of the proposed method in terms of quantitative metrics, visual quality, and downstream machine vision tasks.
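As a rough illustration of the decoding-phase guidance described above, the sketch below (PyTorch) shows one way dual features could condition a restoration decoder: a residual fusion of compression-insensitive features plus a feature-wise transform driven by compression-sensitive features. The module names, shapes, and fusion rules are illustrative assumptions, not the DAGN implementation.

```python
# Minimal sketch of dual-feature guidance for artifact reduction.
# All module names and shapes are illustrative, not the DAGN code.
import torch
import torch.nn as nn

class CrossFeatureFusion(nn.Module):
    """Fuses compression-insensitive features into decoder activations."""
    def __init__(self, channels):
        super().__init__()
        self.proj = nn.Conv2d(channels * 2, channels, kernel_size=1)

    def forward(self, decoder_feat, insensitive_feat):
        fused = torch.cat([decoder_feat, insensitive_feat], dim=1)
        return decoder_feat + self.proj(fused)  # residual fusion

class QualityAwareGuidance(nn.Module):
    """Predicts per-channel scale/shift from compression-sensitive features."""
    def __init__(self, channels):
        super().__init__()
        self.to_scale_shift = nn.Conv2d(channels, channels * 2, kernel_size=1)

    def forward(self, decoder_feat, sensitive_feat):
        scale, shift = self.to_scale_shift(sensitive_feat).chunk(2, dim=1)
        return decoder_feat * (1 + scale) + shift  # feature-wise transform

# usage: guide one decoder stage with both feature types
x = torch.randn(2, 64, 32, 32)        # decoder activations
f_ins = torch.randn(2, 64, 32, 32)    # compression-insensitive features
f_sen = torch.randn(2, 64, 32, 32)    # compression-sensitive features
x = CrossFeatureFusion(64)(x, f_ins)
x = QualityAwareGuidance(64)(x, f_sen)
```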
https://arxiv.org/abs/2405.09291
The task of generating dance from music is crucial, yet current methods, which mainly produce joint sequences, lead to outputs that lack intuitiveness and complicate data collection due to the necessity for precise joint annotations. We introduce a Dance Any Beat Diffusion model, namely DabFusion, that employs music as a conditional input to directly create dance videos from still images, utilizing conditional image-to-video generation principles. This approach pioneers the use of music as a conditioning factor in image-to-video synthesis. Our method unfolds in two stages: training an auto-encoder to predict latent optical flow between reference and driving frames, eliminating the need for joint annotation, and training a U-Net-based diffusion model to produce these latent optical flows guided by music rhythm encoded by CLAP. Although capable of producing high-quality dance videos, the baseline model struggles with rhythm alignment. We enhance the model by adding beat information, improving synchronization. We introduce a 2D motion-music alignment score (2D-MM Align) for quantitative assessment. Evaluated on the AIST++ dataset, our enhanced model shows marked improvements in 2D-MM Align score and established metrics. Video results can be found on our project page: this https URL.
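For concreteness, here is a minimal sketch of the second-stage objective as the abstract describes it: a U-Net denoises latent optical flow conditioned on a music embedding. The function signature and the epsilon-prediction objective are assumptions standing in for the paper's exact setup; `music_emb` plays the role of the CLAP encoding.

```python
# Sketch of a music-conditioned diffusion training step on latent optical
# flow. `unet` is a placeholder for the paper's U-Net.
import torch
import torch.nn.functional as F

def flow_diffusion_loss(unet, flow_latent, music_emb, alphas_cumprod):
    """flow_latent: (B, C, H, W) latent optical flow; music_emb: (B, E)."""
    b = flow_latent.size(0)
    t = torch.randint(0, len(alphas_cumprod), (b,))          # random timesteps
    a = alphas_cumprod[t].view(b, 1, 1, 1)
    noise = torch.randn_like(flow_latent)
    noisy = a.sqrt() * flow_latent + (1 - a).sqrt() * noise  # forward process
    pred = unet(noisy, t, cond=music_emb)                    # music-conditioned
    return F.mse_loss(pred, noise)                           # epsilon objective
```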
https://arxiv.org/abs/2405.09266
It remains a challenge to effectively control the emotion rendering in text-to-speech (TTS) synthesis. Prior studies have primarily focused on learning a global prosodic representation at the utterance level, which strongly correlates with linguistic prosody. Our goal is to construct a hierarchical emotion distribution (ED) that effectively encapsulates intensity variations of emotions at various levels of granularity, encompassing phonemes, words, and utterances. During TTS training, the hierarchical ED is extracted from the ground-truth audio and guides the predictor to establish a connection between emotional and linguistic prosody. At run-time inference, the TTS model generates emotional speech and, at the same time, provides quantitative control of emotion over the speech constituents. Both objective and subjective evaluations validate the effectiveness of the proposed framework in terms of emotion prediction and control.
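One plausible layout for such a hierarchical ED is a matrix of per-phoneme emotion intensities pooled up to word and utterance level. The emotion inventory and mean pooling below are assumptions for illustration, not the paper's specification.

```python
# Sketch of a hierarchical emotion distribution data structure.
import numpy as np

EMOTIONS = ["neutral", "happy", "sad", "angry"]  # assumed inventory

def hierarchical_ed(phoneme_intensities, word_spans):
    """phoneme_intensities: (num_phonemes, len(EMOTIONS)) values in [0, 1];
    word_spans: list of (start, end) phoneme index ranges, one per word."""
    phon = np.asarray(phoneme_intensities, dtype=float)
    word = np.stack([phon[s:e].mean(axis=0) for s, e in word_spans])
    utt = phon.mean(axis=0)
    return {"phoneme": phon, "word": word, "utterance": utt}

ed = hierarchical_ed(np.random.rand(6, 4), [(0, 3), (3, 6)])
print(ed["word"].shape, ed["utterance"].shape)  # (2, 4) (4,)
```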
https://arxiv.org/abs/2405.09171
In this article, we explore the potential of using latent diffusion models, a family of powerful generative models, for the task of reconstructing naturalistic music from electroencephalogram (EEG) recordings. Unlike simpler music with limited timbres, such as MIDI-generated tunes or monophonic pieces, the focus here is on intricate music featuring a diverse array of instruments, voices, and effects, rich in harmonics and timbre. This study represents an initial foray into achieving high-quality general music reconstruction from non-invasive EEG data, employing an end-to-end training approach directly on raw data without the need for manual pre-processing and channel selection. We train our models on the public NMED-T dataset and perform quantitative evaluation with proposed neural-embedding-based metrics. We additionally perform song classification based on the generated tracks. Our work contributes to the ongoing research in neural decoding and brain-computer interfaces, offering insights into the feasibility of using EEG data for complex auditory information reconstruction.
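As a sketch of what a neural-embedding-based metric can look like, the snippet below computes a Fréchet-style distance between the embedding distributions of reference and reconstructed audio. The embedding encoder and this specific distance are assumptions, not necessarily the metrics proposed in the paper.

```python
# Fréchet distance between Gaussians fit to two sets of audio embeddings.
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(mu1, cov1, mu2, cov2):
    covmean = sqrtm(cov1 @ cov2).real
    return float(((mu1 - mu2) ** 2).sum() + np.trace(cov1 + cov2 - 2 * covmean))

def embedding_score(real_embs, fake_embs):
    """real_embs, fake_embs: (N, D) embeddings from a pretrained audio encoder."""
    mu_r, cov_r = real_embs.mean(0), np.cov(real_embs, rowvar=False)
    mu_f, cov_f = fake_embs.mean(0), np.cov(fake_embs, rowvar=False)
    return frechet_distance(mu_r, cov_r, mu_f, cov_f)
```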
https://arxiv.org/abs/2405.09062
Recent advancements in deep learning for 3D models have propelled breakthroughs in generation, detection, and scene understanding. However, the effectiveness of these algorithms hinges on large training datasets. We address the challenge by introducing Efficient 3D Seam Carving (E3SC), a novel 3D model augmentation method based on seam carving, which progressively deforms only part of the input model while ensuring the overall semantics are unchanged. Experiments show that our approach is capable of producing diverse and high-quality augmented 3D shapes across various types and styles of input models, achieving considerable improvements over previous methods. Quantitative evaluations demonstrate that our method effectively enhances the novelty and quality of shapes generated by other subsequent 3D generation algorithms.
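For background, the sketch below shows the classic 2D dynamic-programming seam search (Avidan and Shamir) that seam carving builds on; E3SC's 3D, semantics-preserving variant is substantially more involved and is not reproduced here.

```python
# Classic vertical-seam search over a 2D energy map.
import numpy as np

def min_energy_seam(energy):
    """Return per-row column indices of the minimum-energy vertical seam."""
    h, w = energy.shape
    cost = energy.astype(float).copy()
    for i in range(1, h):                        # accumulate minimal path cost
        for j in range(w):
            lo, hi = max(j - 1, 0), min(j + 2, w)
            cost[i, j] += cost[i - 1, lo:hi].min()
    seam = [int(cost[-1].argmin())]              # backtrack from the bottom row
    for i in range(h - 2, -1, -1):
        j = seam[-1]
        lo, hi = max(j - 1, 0), min(j + 2, w)
        seam.append(lo + int(cost[i, lo:hi].argmin()))
    return seam[::-1]
```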
https://arxiv.org/abs/2405.09050
Transformer-based long context generative models are powering emerging AI applications such as hour-long video understanding and project-level coding agents. However, deploying long context transformers (e.g., 100K to 10M tokens) is prohibitively expensive compared to shorter context (e.g., 4K tokens) model variants. Reducing the cost of long-context transformers has become a pressing research and engineering challenge since 2024. This work describes a concurrent programming framework for quantitatively analyzing the efficiency challenges of serving multiple long-context requests under a limited GPU high-bandwidth memory (HBM) budget. We give a detailed analysis of how all additional computational costs, compared to 4K context, trace back to a single source: the large size of the KV cache. We use a 34B GPT-3.5-level model with 50K context on A100 NVLink as a running example and describe how its large KV cache causes four types of deployment challenges: (1) prefilling long inputs takes much more compute time and GPU memory than short inputs; (2) after prefilling, the large KV cache residing on the GPU HBM significantly restricts the number of concurrent users that can be served; (3) during decoding, repeatedly reading the KV cache from HBM to the SMs largely increases latency; (4) when the KV cache overflows GPU memory, swapping it from HBM to DDR causes significant context-switching latency. We use this framework to analyze existing works and identify possibilities of combining them to build end-to-end systems. Overall, this work offers a foundational framework for analyzing long context transformer deployment and identifies directions for making 1M-context inference as cheap as 4K.
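To make the "single source" concrete, a back-of-the-envelope KV-cache sizing helper is sketched below. The 34B model's layer and head dimensions are assumed typical values for that scale, not figures from the paper.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    # 2x for keys and values, stored per layer per token
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

# e.g. a 34B-class model: 48 layers, 8 KV heads (GQA), head_dim 128, fp16
per_user = kv_cache_bytes(48, 8, 128, seq_len=50_000, batch=1)
print(f"{per_user / 2**30:.1f} GiB per 50K-token user")  # ~9.2 GiB
print(f"{kv_cache_bytes(48, 8, 128, 4_096, 1) / 2**30:.2f} GiB at 4K")  # ~0.75 GiB
```

Under these assumptions, each 50K-context user ties up roughly an order of magnitude more HBM than a 4K-context user, which is what caps concurrency and drives the swapping costs listed above.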
https://arxiv.org/abs/2405.08944
The increasing complexity of Artificial Intelligence models poses challenges to interpretability, particularly in the healthcare sector. This study investigates the relationship between deep learning model complexity and Explainable AI (XAI) efficacy, utilizing four ResNet architectures (ResNet-18, 34, 50, 101). Through methodical experimentation on 4,369 lung X-ray images of COVID-19-infected and healthy patients, the research evaluates the models' classification performance and the relevance of the corresponding XAI explanations with respect to the ground-truth disease masks. Results indicate that increased model complexity is associated with decreased classification accuracy and AUC-ROC scores (ResNet-18: 98.4%, 0.997; ResNet-101: 95.9%, 0.988). Notably, in eleven of the twelve statistical tests performed, no statistically significant differences in the XAI quantitative metrics (Relevance Rank Accuracy and the proposed Positive Attribution Ratio) occurred across the trained models. These results suggest that increased model complexity does not consistently lead to higher performance or more relevant explanations of models' decision-making processes.
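For reference, the snippet below sketches mask-based XAI scoring: Relevance Rank Accuracy follows its standard definition, while the Positive Attribution Ratio shown is one plausible reading (the share of positive attribution mass falling inside the ground-truth mask), not necessarily the paper's exact formulation.

```python
import numpy as np

def relevance_rank_accuracy(attribution, mask):
    """Fraction of the top-K attributed pixels (K = |mask|) inside the mask."""
    k = int(mask.sum())
    top_k = np.argsort(attribution.ravel())[-k:]
    return float(mask.ravel()[top_k].mean())

def positive_attribution_ratio(attribution, mask):
    """Share of positive attribution mass inside the mask (assumed reading)."""
    pos = np.clip(attribution, 0, None)
    return float(pos[mask.astype(bool)].sum() / max(pos.sum(), 1e-12))
```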
https://arxiv.org/abs/2405.08658
Domain generalization aims to develop models that are robust to distribution shifts. Existing methods focus on learning invariance across domains to enhance model robustness, and data augmentation has been widely used to learn invariant predictors, with most methods performing augmentation in the input space. However, augmentation in the input space offers limited diversity, whereas augmentation in the feature space is more versatile and has shown promising results. Nonetheless, feature semantics is seldom considered, and existing feature augmentation methods suffer from a limited variety of augmented features. We decompose features into class-generic, class-specific, domain-generic, and domain-specific components. We propose a cross-domain feature augmentation method named XDomainMix that enables us to increase sample diversity while emphasizing the learning of invariant representations to achieve domain generalization. Experiments on widely used benchmark datasets demonstrate that our proposed method is able to achieve state-of-the-art performance. Quantitative analysis indicates that our feature augmentation approach facilitates the learning of effective models that are invariant across different domains.
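As a minimal illustration of cross-domain feature augmentation, the sketch below mixes channel statistics across domains (a MixStyle-like operation); XDomainMix's four-way class/domain decomposition is finer-grained than this, so treat it only as the general mechanism.

```python
import torch

def cross_domain_mix(feat_a, feat_b, lam=0.5):
    """Re-style (B, C, H, W) features from domain A with statistics of domain B."""
    mu_a, sig_a = feat_a.mean((2, 3), keepdim=True), feat_a.std((2, 3), keepdim=True)
    mu_b, sig_b = feat_b.mean((2, 3), keepdim=True), feat_b.std((2, 3), keepdim=True)
    mu = lam * mu_a + (1 - lam) * mu_b          # interpolated channel statistics
    sig = lam * sig_a + (1 - lam) * sig_b
    return ((feat_a - mu_a) / (sig_a + 1e-6)) * sig + mu
```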
https://arxiv.org/abs/2405.08586
This paper addresses the problem of pathological lung segmentation, a significant challenge in medical image analysis, particularly pronounced in cases of peripheral opacities (severe fibrosis and consolidation) because of the textural similarity between lung tissue and surrounding areas. To overcome these challenges, this paper emphasizes the use of CycleGAN for unpaired image-to-image translation, in order to provide an augmentation method able to generate fake pathological images matching an existing ground truth. Although previous studies have employed CycleGAN, they often neglect the challenge of shape deformation, which is crucial for accurate medical image segmentation. Our work introduces an innovative strategy that incorporates additional loss functions. Specifically, it proposes an L1 loss based on the lung surrounding, whose shape is constrained to remain unchanged at the transition from the healthy to the pathological domain. The lung surrounding is derived from the ground-truth lung masks available in the healthy domain. Furthermore, preprocessing steps, such as cropping based on rib/vertebra locations, are applied to refine the input to the CycleGAN, ensuring that the network focuses on the lung region. This is essential to avoid extraneous biases, such as the zoom-effect bias, which can divert attention from the main task. The method is applied to enhance the lung segmentation process in a semi-supervised manner, employing a U-Net model trained with on-the-fly data augmentation that incorporates synthetic pathological tissue generated by the CycleGAN model. Preliminary results from this research demonstrate significant qualitative and quantitative improvements, setting a new benchmark in the field of pathological lung segmentation. Our code is available at this https URL
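The shape-preservation term might look like the sketch below: an L1 penalty tying the lung surrounding before and after the healthy-to-pathological translation, with the surrounding derived from the ground-truth lung mask. Restricting the penalty to a dilated band, and the band width itself, are assumptions.

```python
import torch
import torch.nn.functional as F

def surrounding_l1_loss(real_healthy, fake_pathological, lung_mask, band=15):
    """lung_mask: (B, 1, H, W), 1 inside the lungs. The penalty applies to a
    band just outside the lungs (odd `band` keeps the spatial size unchanged)."""
    dilated = F.max_pool2d(lung_mask, band, stride=1, padding=band // 2)
    surround = dilated * (1.0 - lung_mask)       # band around the lung contour
    return (surround * (real_healthy - fake_pathological).abs()).mean()
```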
https://arxiv.org/abs/2405.08556
This paper presents BARKPLUG V.2, a Large Language Model (LLM)-based chatbot system built using Retrieval Augmented Generation (RAG) pipelines to enhance the user experience and access to information within academic settings. The objective of BARKPLUG V.2 is to provide information to users about various campus resources, including academic departments, programs, campus facilities, and student resources, in an interactive fashion. Our system leverages university data as an external data corpus and ingests it into our RAG pipelines for domain-specific question-answering tasks. Using Mississippi State University as a case study, we evaluate the effectiveness of our system in generating accurate and pertinent responses with quantitative measures, employing frameworks such as Retrieval Augmented Generation Assessment (RAGAS). Furthermore, we evaluate the usability of this system via subjective satisfaction surveys using the System Usability Scale (SUS). Our system demonstrates strong quantitative performance, with a mean RAGAS score of 0.96, and a positive user experience, as validated by usability assessments.
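A bare-bones RAG skeleton of the kind described is sketched below; `embed` and `generate` are placeholders for an embedding model and an LLM, not BARKPLUG's actual components.

```python
import numpy as np

def retrieve(query, doc_texts, doc_embs, embed, k=3):
    q = embed(query)                             # embed the user question
    scores = doc_embs @ q / (np.linalg.norm(doc_embs, axis=1) * np.linalg.norm(q))
    return [doc_texts[i] for i in np.argsort(scores)[-k:][::-1]]

def rag_answer(query, doc_texts, doc_embs, embed, generate):
    context = "\n\n".join(retrieve(query, doc_texts, doc_embs, embed))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)                      # LLM answers over retrieved docs
```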
https://arxiv.org/abs/2405.08120
In recent years, diffusion models (DMs) have become a popular method for generating synthetic data. By producing samples of higher quality, they quickly surpassed generative adversarial networks (GANs) and became the current state-of-the-art method in generative modeling. However, their potential has not yet been exploited in radar, where the lack of available training data is a long-standing problem. In this work, a specific type of DM, the denoising diffusion probabilistic model (DDPM), is adapted to the SAR domain. We investigate the network choice and specific diffusion parameters for conditional and unconditional SAR image generation. In our experiments, we show that DDPM qualitatively and quantitatively outperforms state-of-the-art GAN-based methods for SAR image generation. Finally, we show that DDPM benefits from pretraining on large-scale clutter data, generating SAR images of even higher quality.
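For reference, a minimal DDPM ancestral sampling loop (following Ho et al.) is sketched below; the denoising network and noise schedule are placeholders rather than the paper's SAR-specific configuration.

```python
import torch

@torch.no_grad()
def ddpm_sample(model, shape, betas):
    """betas: (T,) noise schedule; model(x, t) predicts the added noise."""
    alphas = 1.0 - betas
    abar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                       # start from pure noise
    for t in reversed(range(len(betas))):
        eps = model(x, torch.full((shape[0],), t))
        mean = (x - betas[t] / (1 - abar[t]).sqrt() * eps) / alphas[t].sqrt()
        x = mean + betas[t].sqrt() * torch.randn_like(x) if t > 0 else mean
    return x
```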
https://arxiv.org/abs/2405.07776
Expanding a dictionary of pre-selected keywords is crucial for tasks in information retrieval, such as database query and online data collection. Here we propose Local Graph-based Dictionary Expansion (LGDE), a method that uses tools from manifold learning and network science for the data-driven discovery of keywords starting from a seed dictionary. At the heart of LGDE lies the creation of a word similarity graph derived from word embeddings and the application of local community detection based on graph diffusion to discover semantic neighbourhoods of pre-defined seed keywords. The diffusion in the local graph manifold allows the exploration of the complex nonlinear geometry of word embeddings and can capture word similarities based on paths of semantic association. We validate our method on a corpus of hate speech-related posts from Reddit and Gab and show that LGDE enriches the list of keywords and achieves significantly better performance than threshold methods based on direct word similarities. We further demonstrate the potential of our method through a real-world use case from communication science, where LGDE is evaluated quantitatively on data collected and analysed by domain experts by expanding a conspiracy-related dictionary.
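The core loop might be sketched as follows: diffuse probability mass from the seed words over a word-similarity graph and keep the highest-scoring neighbours. The random-walk-with-restart diffusion and the ranking cutoff below are simplifications of LGDE's local community detection.

```python
import numpy as np

def expand_seeds(emb, vocab, seeds, steps=3, restart=0.5, top_n=10):
    """emb: (V, D) unit-norm word embeddings; vocab: list of V words."""
    sims = np.clip(emb @ emb.T, 0, None)         # keep positive similarities
    np.fill_diagonal(sims, 0)
    P = sims / (sims.sum(axis=1, keepdims=True) + 1e-12)  # transition matrix
    p0 = np.isin(vocab, list(seeds)).astype(float)
    p0 /= p0.sum()
    p = p0.copy()
    for _ in range(steps):                       # random walk with restart
        p = restart * p0 + (1 - restart) * (p @ P)
    ranked = [vocab[i] for i in np.argsort(-p) if vocab[i] not in seeds]
    return ranked[:top_n]
```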
https://arxiv.org/abs/2405.07764
Monocular 3D object detection aims for precise 3D localization and identification of objects from a single-view image. Despite its recent progress, it often struggles while handling pervasive object occlusions that tend to complicate and degrade the prediction of object dimensions, depths, and orientations. We design MonoMAE, a monocular 3D detector inspired by Masked Autoencoders that addresses the object occlusion issue by masking and reconstructing objects in the feature space. MonoMAE consists of two novel designs. The first is depth-aware masking that selectively masks certain parts of non-occluded object queries in the feature space for simulating occluded object queries for network training. It masks non-occluded object queries by balancing the masked and preserved query portions adaptively according to the depth information. The second is lightweight query completion that works with the depth-aware masking to learn to reconstruct and complete the masked object queries. With the proposed object occlusion and completion, MonoMAE learns enriched 3D representations that achieve superior monocular 3D detection performance qualitatively and quantitatively for both occluded and non-occluded objects. Additionally, MonoMAE learns generalizable representations that can work well in new domains.
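Depth-aware masking could be realized roughly as below, with the mask ratio varying with normalized depth; the ratio schedule and zero-filling are illustrative assumptions, not MonoMAE's exact design.

```python
import torch

def depth_aware_mask(queries, depths, min_ratio=0.3, max_ratio=0.7):
    """queries: (N, D) non-occluded object queries; depths: (N,) in metres."""
    d = (depths - depths.min()) / (depths.max() - depths.min() + 1e-6)
    ratios = min_ratio + (max_ratio - min_ratio) * d   # assumed: deeper -> mask more
    mask = torch.rand_like(queries) < ratios.unsqueeze(1)
    return queries.masked_fill(mask, 0.0), mask        # simulated occlusion
```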
https://arxiv.org/abs/2405.07696
Modern information querying systems are progressively incorporating multimodal inputs like vision and audio. However, the integration of gaze -- a modality deeply linked to user intent and increasingly accessible via gaze-tracking wearables -- remains underexplored. This paper introduces a novel gaze-facilitated information querying paradigm, named G-VOILA, which synergizes users' gaze, visual field, and voice-based natural language queries to facilitate a more intuitive querying process. In a user-enactment study involving 21 participants in 3 daily scenarios (p = 21, scene = 3), we revealed the ambiguity in users' query language and a gaze-voice coordination pattern in users' natural query behaviors with G-VOILA. Based on the quantitative and qualitative findings, we developed a design framework for the G-VOILA paradigm, which effectively integrates the gaze data with the in-situ querying context. Then we implemented a G-VOILA proof-of-concept using cutting-edge deep learning techniques. A follow-up user study (p = 16, scene = 2) demonstrates its effectiveness by achieving both higher objective score and subjective score, compared to a baseline without gaze data. We further conducted interviews and provided insights for future gaze-facilitated information querying systems.
https://arxiv.org/abs/2405.07652
Unveiling the real appearance of retouched faces to prevent malicious users from committing deceptive advertising and economic fraud has become an increasing concern in the era of the digital economy. This article makes the first attempt to investigate the face retouching reversal (FRR) problem. We first collect an FRR dataset, named deepFRR, which contains 50,000 StyleGAN-generated high-resolution (1024×1024) facial images and their counterparts retouched by a commercial online API. To the best of our knowledge, deepFRR is the first FRR dataset tailored for training deep FRR models. We then propose a novel diffusion-based FRR approach (FRRffusion) for the FRR task. FRRffusion consists of a coarse-to-fine two-stage network: a diffusion-based Facial Morpho-Architectonic Restorer (FMAR) generates the basic contours of low-resolution faces in the first stage, while a Transformer-based Hyperrealistic Facial Detail Generator (HFDG) creates high-resolution facial details in the second stage. Tested on deepFRR, FRRffusion surpasses the GP-UNIT and Stable Diffusion methods by a large margin on four widely used quantitative metrics. In particular, in a qualitative evaluation with 85 subjects, the de-retouched images produced by FRRffusion are visually much closer to the raw face images than both the retouched face images and those restored by GP-UNIT and Stable Diffusion. These results validate the efficacy of our work, bridging the gap between FRR and generic image restoration tasks. The dataset and code are available at this https URL.
https://arxiv.org/abs/2405.07582
The quantitative analysis of political ideological positions is a difficult task. In the past, the literature has focused on politicians' parliamentary voting data, party manifestos, and parliamentary speech to estimate political disagreement and polarization in various political systems. However, previous methods of quantitative political analysis share a common challenge: the amount of data available for analysis. They also frequently focus on more general analyses of politics, such as the overall polarization of a parliament or party-wide ideological positions. In this paper, we present a method to analyze the ideological positions of individual parliamentary representatives by leveraging the latent knowledge of LLMs. The method allows us to evaluate the stance of politicians on an axis of our choice, letting us flexibly measure their stance with regard to any topic or controversy of interest. We achieve this by using a fine-tuned BERT classifier to extract opinion-based sentences from the speeches of representatives and projecting the average BERT embedding of each representative onto a pair of reference seeds. These reference seeds are either manually chosen representatives known to have opposing views on a particular topic, or sentences generated with OpenAI's GPT-4 model. We created the sentences by prompting GPT-4 to generate a speech that would come from a politician defending a particular position.
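The projection step can be made concrete as follows: score each representative by projecting their mean opinion-sentence embedding onto the axis between the two reference seeds. The normalization to roughly [-1, 1] is an assumption for readability, not the paper's exact scaling.

```python
import numpy as np

def stance_score(rep_embedding, seed_pro, seed_con):
    """Scalar in roughly [-1, 1]: +1 near `seed_pro`, -1 near `seed_con`."""
    axis = seed_pro - seed_con
    midpoint = (seed_pro + seed_con) / 2
    return float((rep_embedding - midpoint) @ axis / (axis @ axis) * 2)

# usage: the representative embedding averages their opinion sentences
pro, con = np.random.randn(768), np.random.randn(768)
rep = np.mean(np.random.randn(40, 768), axis=0)    # 40 opinion sentences
print(stance_score(rep, pro, con))
```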
https://arxiv.org/abs/2405.07320
Remote sensing images captured by different platforms exhibit significant disparities in spatial resolution. Large-scale-factor super-resolution (SR) algorithms are vital for maximizing the utilization of low-resolution (LR) satellite data captured from orbit. However, existing methods struggle to recover SR images with clear textures and correct ground objects. We introduce a novel framework, the Semantic Guided Diffusion Model (SGDM), designed for large-scale-factor remote sensing image super-resolution. The framework exploits a pre-trained generative model as a prior to generate perceptually plausible SR images. We further enhance the reconstruction by incorporating vector maps, which carry structural and semantic cues. Moreover, pixel-level inconsistencies in paired remote sensing images, stemming from sensor-specific imaging characteristics, may hinder the convergence of the model and the diversity of generated results. To address this problem, we propose to extract the sensor-specific imaging characteristics and model their distribution, allowing diverse SR images to be generated based on imaging characteristics provided by reference images or sampled from the imaging-characteristic probability distribution. To validate and evaluate our approach, we create the Cross-Modal Super-Resolution Dataset (CMSRD). Qualitative and quantitative experiments on CMSRD showcase the superiority and broad applicability of our method. Experimental results on downstream vision tasks also demonstrate the utility of the generated SR images. The dataset and code will be publicly available at this https URL
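The imaging-characteristic sampling idea might be sketched as fitting a simple distribution over extracted per-sensor characteristic vectors and sampling new ones to condition generation; the characteristic extractor and the Gaussian model below are assumptions.

```python
import numpy as np

class CharacteristicPrior:
    """Gaussian over sensor-characteristic vectors (illustrative choice)."""
    def fit(self, char_vectors):                 # (N, D) extracted vectors
        self.mu = char_vectors.mean(axis=0)
        self.cov = np.cov(char_vectors, rowvar=False)
        return self

    def sample(self, n=1, rng=None):
        rng = rng or np.random.default_rng()
        return rng.multivariate_normal(self.mu, self.cov, size=n)
```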
https://arxiv.org/abs/2405.07044
Existing subject-driven text-to-image generation models suffer from tedious fine-tuning steps and struggle to maintain both text-image alignment and subject fidelity. When generating compositional subjects, they often encounter problems such as object missing and attribute mixing, where some subjects in the input prompt are not generated or their attributes are incorrectly combined. To address these limitations, we propose a subject-driven generation framework and introduce training-free guidance that intervenes in the generative process at inference time. This approach strengthens the attention map, allowing precise attribute binding and feature injection for each subject. Notably, our method exhibits exceptional zero-shot generation ability, especially in the challenging task of compositional generation. Furthermore, we propose a novel metric, GroundingScore, to thoroughly evaluate subject alignment. The quantitative results obtained serve as compelling evidence of the effectiveness of our proposed method. The code will be released soon.
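One way such training-free attention guidance can work is sketched below: boost the cross-attention probabilities on each subject's tokens and re-normalize, so every subject claims a spatial region. The scaling rule is illustrative, not the paper's exact intervention.

```python
import torch

def strengthen_attention(attn, subject_token_ids, boost=1.5):
    """attn: (heads, pixels, tokens) cross-attention; boosted then re-normalized."""
    attn = attn.clone()
    attn[..., subject_token_ids] *= boost        # emphasize subject tokens
    return attn / attn.sum(dim=-1, keepdim=True)
```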
https://arxiv.org/abs/2405.06948
We propose a novel edge-assisted multi-user collaborative augmented reality (CAR) framework for large indoor environments. In collaborative augmented reality, the data communication that synchronizes virtual objects incurs heavy network traffic and high network latency. Due to drift, CAR applications without continuous data communication for coordinate-system alignment suffer from virtual-object inconsistency. In addition, synchronization messages for online virtual-object updates incur high latency as the number of collaborating devices increases. To solve these problems, we implement a CAR framework, called eCAR, which utilizes edge computing to continuously match devices' coordinate systems with less network traffic. Furthermore, we extend the co-visibility graph of the edge server to maintain the spatial-temporal consistency of virtual objects across neighboring devices by synchronizing a local graph. We evaluate the system quantitatively and qualitatively on a public dataset and in a physical indoor environment. eCAR communicates coordinate-system-alignment data between the edge server and devices with less network traffic and latency, and its synchronization algorithms host and resolve virtual objects quickly and accurately. The proposed system continuously aligns the coordinate systems of multiple devices in a large indoor environment and shares augmented reality content. Through our system, users interact with virtual objects and share augmented reality experiences with neighboring users.
https://arxiv.org/abs/2405.06872
Ensuring that AI systems reliably and robustly avoid harmful or dangerous behaviours is a crucial challenge, especially for AI systems with a high degree of autonomy and general intelligence, or systems used in safety-critical contexts. In this paper, we will introduce and define a family of approaches to AI safety, which we will refer to as guaranteed safe (GS) AI. The core feature of these approaches is that they aim to produce AI systems which are equipped with high-assurance quantitative safety guarantees. This is achieved by the interplay of three core components: a world model (which provides a mathematical description of how the AI system affects the outside world), a safety specification (which is a mathematical description of what effects are acceptable), and a verifier (which provides an auditable proof certificate that the AI satisfies the safety specification relative to the world model). We outline a number of approaches for creating each of these three core components, describe the main technical challenges, and suggest a number of potential solutions to them. We also argue for the necessity of this approach to AI safety, and for the inadequacy of the main alternative approaches.
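The three-component decomposition can be stated as interfaces; the Python protocols below are purely illustrative types, not a formalization from the paper.

```python
from typing import Protocol

class WorldModel(Protocol):
    def predict(self, state, action): ...          # distribution over next states

class SafetySpec(Protocol):
    def acceptable(self, trajectory) -> bool: ...  # are these effects allowed?

class Verifier(Protocol):
    def certify(self, policy, world_model: WorldModel, spec: SafetySpec):
        """Return an auditable proof certificate, or None if unverifiable."""
        ...
```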
https://arxiv.org/abs/2405.06624