Effective communication is paramount for the inclusion of deaf individuals in society. However, persistent communication barriers due to limited Sign Language (SL) knowledge hinder their full participation. In this context, Sign Language Recognition (SLR) systems have been developed to improve communication between signing and non-signing individuals. In particular, the problem of recognizing isolated signs (Isolated Sign Language Recognition, ISLR) is of great relevance to the development of vision-based SL search engines, learning tools, and translation systems. This work proposes an ISLR approach in which body, hand, and facial landmarks are extracted over time and encoded as 2-D images. These images are processed by a convolutional neural network, which maps the visual-temporal information to a sign label. Experimental results demonstrate that our method surpasses the state of the art on two widely recognized datasets in Brazilian Sign Language (LIBRAS), the primary focus of this study. In addition to being more accurate, our method is more time-efficient and easier to train, owing to its simpler network architecture and its reliance solely on RGB data as input.
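The landmark-to-image encoding described above can be sketched as follows. This is a minimal illustration under assumed conventions (time along the rows, landmarks along the columns, x/y coordinates as channels, min-max normalization, fixed-height padding), not the paper's exact scheme:

```python
import numpy as np

def landmarks_to_image(seq, out_size=64):
    """Encode a (T, K, 2) landmark sequence as a pseudo-image.

    Rows index time, columns index landmarks; the x/y coordinates,
    min-max normalized to [0, 255], fill two channels.  The layout and
    normalization here are illustrative assumptions.
    """
    T, K, _ = seq.shape
    lo, hi = seq.min(), seq.max()
    norm = (seq - lo) / (hi - lo + 1e-8)          # scale to [0, 1]
    img = (norm * 255).astype(np.uint8)           # (T, K, 2)
    # Pad or crop the time axis so every sign yields an image of
    # identical size for the CNN.
    if T < out_size:
        img = np.pad(img, ((0, out_size - T), (0, 0), (0, 0)))
    else:
        img = img[:out_size]
    return img

# 30 frames, 21 hand landmarks, (x, y) each
demo = np.random.rand(30, 21, 2)
print(landmarks_to_image(demo).shape)  # (64, 21, 2)
```

The resulting fixed-size array can be fed to any standard 2-D CNN classifier.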
https://arxiv.org/abs/2404.19148
The misuse of deepfake technology by malicious actors poses a potential threat to nations, societies, and individuals. However, existing deepfake detection methods primarily rely on cues present in uncompressed videos, such as noise characteristics, local textures, or frequency statistics. When applied to compressed videos, these methods suffer a drop in detection performance and are less suitable for real-world scenarios. In this paper, we propose a deepfake video detection method based on 3D spatiotemporal trajectories. Specifically, we utilize a robust 3D model to construct spatiotemporal motion features, integrating feature details from both 2D and 3D frames to mitigate the influence of large head rotation angles or insufficient lighting within frames. Furthermore, we separate facial expressions from head movements and design a sequential analysis method based on phase-space motion trajectories to explore the feature differences between genuine and fake faces in deepfake videos. We conduct extensive experiments to validate the performance of our proposed method on several compressed deepfake benchmarks. The robustness of the designed features is verified by confirming that the distribution of facial landmarks remains consistent before and after video compression. Our method yields satisfactory results and showcases its potential for practical applications.
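A standard building block for phase-space trajectory analysis of a motion signal is time-delay embedding. The sketch below shows the generic technique on a toy head-motion signal; the embedding dimension, delay, and choice of signal are assumptions, not the paper's exact features:

```python
import numpy as np

def delay_embed(signal, dim=3, tau=2):
    """Time-delay (phase-space) embedding of a 1-D motion signal.

    Returns an (N, dim) trajectory where row i is
    [x(i), x(i+tau), ..., x(i+(dim-1)*tau)].
    """
    n = len(signal) - (dim - 1) * tau
    return np.stack([signal[i * tau : i * tau + n] for i in range(dim)], axis=1)

x = np.sin(np.linspace(0, 4 * np.pi, 100))   # toy head-yaw signal over 100 frames
traj = delay_embed(x, dim=3, tau=5)
print(traj.shape)  # (90, 3)
```

Sequential statistics (e.g., trajectory smoothness or recurrence) can then be computed on `traj` to compare genuine and fake faces.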
https://arxiv.org/abs/2404.18149
Generating face images with specific gaze information has attracted considerable attention. Existing approaches typically input gaze values directly for face generation, which is unnatural and requires annotated gaze datasets for training, thereby limiting their application. In this paper, we present a novel gaze-controllable face generation task. Our approach takes as input textual descriptions of human gaze and head behavior and generates corresponding face images. Our work first introduces a text-of-gaze dataset containing over 90k text descriptions spanning a dense distribution of gaze and head poses. We further propose a gaze-controllable text-to-face method comprising a sketch-conditioned face diffusion module and a model-based sketch diffusion module. We define a face sketch based on facial landmarks and an eye segmentation map. The face diffusion module generates face images from the face sketch, and the sketch diffusion module employs a 3D face model to generate the face sketch from the text description. Experiments on the FFHQ dataset show the effectiveness of our method. We will release our dataset and code for future research.
https://arxiv.org/abs/2404.17486
In this paper, we present a novel benchmark for Emotion Recognition using facial landmarks extracted from realistic news videos. Traditional methods relying on RGB images are resource-intensive, whereas our approach with Facial Landmark Emotion Recognition (FLER) offers a simplified yet effective alternative. By leveraging Graph Neural Networks (GNNs) to analyze the geometric and spatial relationships of facial landmarks, our method enhances the understanding and accuracy of emotion recognition. We discuss the advancements and challenges in deep learning techniques for emotion recognition, particularly focusing on Graph Neural Networks (GNNs) and Transformers. Our experimental results demonstrate the viability and potential of our dataset as a benchmark, setting a new direction for future research in emotion recognition technologies. The codes and models are at: this https URL
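Landmark-based emotion recognition with a GNN starts from a graph over the landmark coordinates. The sketch below builds a k-nearest-neighbor adjacency and runs one normalized graph-convolution step; the k-NN construction and random weights are illustrative assumptions (FLER may instead use a fixed anatomical edge list):

```python
import numpy as np

def knn_adjacency(pts, k=3):
    """Symmetric k-NN adjacency over 2-D landmark coordinates."""
    d = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)                  # no self-neighbors
    A = np.zeros((len(pts), len(pts)))
    for i, row in enumerate(d):
        A[i, np.argsort(row)[:k]] = 1.0
    return np.maximum(A, A.T)                    # symmetrize

def gcn_layer(A, X, W):
    """One GCN propagation step: ReLU(D^-1/2 (A+I) D^-1/2 X W)."""
    A_hat = A + np.eye(len(A))
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(1)))
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W, 0)

pts = np.random.rand(68, 2)                      # 68 facial landmarks
A = knn_adjacency(pts)
H = gcn_layer(A, pts, np.random.rand(2, 16))     # coordinates as node features
print(H.shape)  # (68, 16)
```

Stacking such layers and pooling node features yields a per-face embedding for emotion classification.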
https://arxiv.org/abs/2404.13493
With the advent of social media, fun selfie filters have come into tremendous mainstream use, affecting the functioning of facial biometric systems as well as image recognition systems. These filters range from beautification and Augmented Reality (AR)-based filters to filters that modify facial landmarks. Hence, there is a need to assess the impact of such filters on the performance of existing face recognition systems. Existing solutions are limited in that they focus mostly on beautification filters, whereas AR-based filters and filters that distort facial key points have recently come into vogue and render faces highly unrecognizable, even to the naked eye. Moreover, the filters previously considered are mostly obsolete, with limited variations. To mitigate these limitations, we perform a holistic impact analysis of the latest filters and propose a user recognition model for filtered images. We utilize a benchmark dataset for baseline images and apply the latest filters over them to generate a beautified/filtered dataset. Next, we introduce FaceFilterNet, a model for recognizing beautified users. Within this framework, we also use our model to infer various attributes of the person, including age, gender, and ethnicity. In addition, we present a filter-wise impact analysis on face recognition, age estimation, and gender and ethnicity prediction. The proposed method affirms the efficacy of our dataset with an accuracy of 87.25% and optimal accuracy for facial attribute analysis.
https://arxiv.org/abs/2404.08277
The domain of computer vision has experienced significant advancements in facial-landmark detection, becoming increasingly essential across various applications such as augmented reality, facial recognition, and emotion analysis. Unlike object detection or semantic segmentation, which focus on identifying objects and outlining boundaries, facial-landmark detection aims to precisely locate and track critical facial features. However, deploying deep learning-based facial-landmark detection models on embedded systems with limited computational resources poses challenges due to the complexity of facial features, especially in dynamic settings. Additionally, ensuring robustness across diverse ethnicities and expressions presents further obstacles. Existing datasets often lack comprehensive representation of facial nuances, particularly within populations like those in Taiwan. This paper introduces a novel approach to address these challenges through the development of a knowledge distillation method. By transferring knowledge from larger models to smaller ones, we aim to create lightweight yet powerful deep learning models tailored specifically for facial-landmark detection tasks. Our goal is to design models capable of accurately locating facial landmarks under varying conditions, including diverse expressions, orientations, and lighting environments. The ultimate objective is to achieve high accuracy and real-time performance suitable for deployment on embedded systems. This method was successfully implemented and achieved a top 6th place finish out of 165 participants in the IEEE ICME 2024 PAIR competition.
https://arxiv.org/abs/2404.06029
Automated data labeling techniques are crucial for accelerating the development of deep learning models, particularly in complex medical imaging applications. However, ensuring accuracy and efficiency remains challenging. This paper presents iterative refinement strategies for automated data labeling in facial landmark diagnosis to enhance accuracy and efficiency for deep learning models in medical applications, including dermatology, plastic surgery, and ophthalmology. Leveraging feedback mechanisms and advanced algorithms, our approach iteratively refines initial labels, reducing reliance on manual intervention while improving label quality. Through empirical evaluation and case studies, we demonstrate the effectiveness of our proposed strategies in deep learning tasks across medical imaging domains. Our results highlight the importance of iterative refinement in automated data labeling to enhance the capabilities of deep learning systems in medical imaging applications.
https://arxiv.org/abs/2404.05348
The conditional text-to-image diffusion models have garnered significant attention in recent years. However, the precision of these models is often compromised, mainly for two reasons: ambiguous condition inputs and inadequate condition guidance from a single denoising loss. To address these challenges, we introduce two innovative solutions. Firstly, we propose a Spatial Guidance Injector (SGI), which enhances conditional detail by encoding text inputs with precise annotation information. This method directly tackles the issue of ambiguous control inputs by providing clear, annotated guidance to the model. Secondly, to overcome the issue of limited conditional supervision, we introduce a Diffusion Consistency Loss (DCL), which applies supervision to the denoised latent code at any given time step. This encourages consistency between the latent code at each time step and the input signal, thereby enhancing the robustness and accuracy of the output. The combination of SGI and DCL yields our Effective Controllable Network (ECNet), a more accurate controllable end-to-end text-to-image generation framework with more precise conditioning input and stronger controllable supervision. We validate our approach through extensive experiments on generation under various conditions, such as human body skeletons, facial landmarks, and sketches of general objects. The results consistently demonstrate that our method significantly enhances the controllability and robustness of the generated images, outperforming existing state-of-the-art controllable text-to-image models.
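The idea of supervising the denoised latent at any timestep can be sketched with the standard DDPM identity z_t = sqrt(a_bar)*x0 + sqrt(1-a_bar)*eps: recover a predicted x0 from the noise estimate and penalize its distance to the input signal. This is a minimal numpy sketch of the DCL concept under that assumed parameterization, not ECNet's exact loss:

```python
import numpy as np

def predicted_x0(z_t, eps_pred, alpha_bar_t):
    """Recover the denoised latent x0 from the noisy latent z_t and
    the network's noise prediction via the DDPM forward identity."""
    return (z_t - np.sqrt(1 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)

def diffusion_consistency_loss(z_t, eps_pred, alpha_bar_t, x0_target):
    """MSE between the timestep-t denoised latent and the input signal."""
    return float(np.mean((predicted_x0(z_t, eps_pred, alpha_bar_t) - x0_target) ** 2))

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))               # clean latent (input signal)
eps = rng.standard_normal((4, 8))              # true noise
a_bar = 0.7                                    # cumulative noise schedule at t
z_t = np.sqrt(a_bar) * x0 + np.sqrt(1 - a_bar) * eps
# With a perfect noise prediction the loss is (numerically) zero.
print(diffusion_consistency_loss(z_t, eps, a_bar, x0))
```

In training, `eps_pred` would come from the denoising network, and this term would be added to the usual noise-prediction loss.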
https://arxiv.org/abs/2403.18417
In this study, we propose AniPortrait, a novel framework for generating high-quality animation driven by audio and a reference portrait image. Our methodology is divided into two stages. Initially, we extract 3D intermediate representations from audio and project them into a sequence of 2D facial landmarks. Subsequently, we employ a robust diffusion model, coupled with a motion module, to convert the landmark sequence into photorealistic and temporally consistent portrait animation. Experimental results demonstrate the superiority of AniPortrait in terms of facial naturalness, pose diversity, and visual quality, thereby offering an enhanced perceptual experience. Moreover, our methodology exhibits considerable potential in terms of flexibility and controllability, which can be effectively applied in areas such as facial motion editing or face reenactment. We release code and model weights at this https URL
https://arxiv.org/abs/2403.17694
Engagement in virtual learning is crucial to a variety of factors, including learner satisfaction, performance, and compliance with learning programs, but measuring it is a challenging task. There is therefore considerable interest in utilizing artificial intelligence and affective computing to measure engagement in natural settings as well as at scale. This paper introduces a novel, privacy-preserving method for engagement measurement from videos. It uses facial landmarks, which carry no personally identifiable information, extracted from videos via the MediaPipe deep learning solution. The extracted facial landmarks are fed to a Spatial-Temporal Graph Convolutional Network (ST-GCN) to output the engagement level of the learner in the video. To integrate the ordinal nature of the engagement variable into the training process, the ST-GCNs undergo training in a novel ordinal learning framework based on transfer learning. Experimental results on two video-based student engagement measurement datasets show the superiority of the proposed method over previous methods, improving the state of the art on the EngageNet dataset by 3.1% in four-class engagement level classification accuracy and on the Online Student Engagement dataset by 1.5% in binary engagement classification accuracy. The relatively lightweight ST-GCN and its integration with the real-time MediaPipe solution make the proposed approach deployable on virtual learning platforms for measuring engagement in real time.
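Before an ST-GCN can consume the landmarks, each video must be arranged as a (channels, time, vertices) tensor. The sketch below shows one plausible preprocessing step under assumed conventions (the (C, T, V) layout common in public ST-GCN code, per-sequence centering and scale normalization); the paper's exact preprocessing may differ:

```python
import numpy as np

def to_stgcn_input(frames):
    """Arrange per-frame MediaPipe-style landmarks (T, V, C) into the
    (C, T, V) layout an ST-GCN expects, normalized per sequence."""
    x = np.asarray(frames, dtype=np.float32)       # (T, V, C)
    x -= x.mean(axis=(0, 1), keepdims=True)        # center each channel
    x /= (x.std() + 1e-8)                          # scale-normalize
    return np.transpose(x, (2, 0, 1))              # (C, T, V)

seq = np.random.rand(150, 478, 3)   # 150 frames, 478 face-mesh points, xyz
print(to_stgcn_input(seq).shape)    # (3, 150, 478)
```

Because only coordinates (not pixels) leave the device, the privacy-preserving property of the pipeline is retained at this stage.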
https://arxiv.org/abs/2403.17175
We propose X-Portrait, an innovative conditional diffusion model tailored for generating expressive and temporally coherent portrait animation. Specifically, given a single portrait as the appearance reference, we aim to animate it with motion derived from a driving video, capturing both highly dynamic and subtle facial expressions along with wide-range head movements. At its core, we leverage the generative prior of a pre-trained diffusion model as the rendering backbone, while achieving fine-grained head pose and expression control with novel controlling signals within the framework of ControlNet. In contrast to conventional coarse explicit controls such as facial landmarks, our motion control module learns to interpret the dynamics directly from the original driving RGB inputs. Motion accuracy is further enhanced with a patch-based local control module that strengthens attention to small-scale nuances such as eyeball positions. Notably, to mitigate identity leakage from the driving signals, we train our motion control modules with scaling-augmented cross-identity images, ensuring maximal disentanglement from the appearance reference modules. Experimental results demonstrate the universal effectiveness of X-Portrait across a diverse range of facial portraits and expressive driving sequences, and showcase its proficiency in generating captivating portrait animations with consistently maintained identity characteristics.
https://arxiv.org/abs/2403.15931
In this paper, we address the challenge of making ViT models more robust to unseen affine transformations. Such robustness becomes useful in various recognition tasks, such as face recognition, when image alignment failures occur. We propose a novel method called KP-RPE, which leverages key points (e.g., facial landmarks) to make ViT more resilient to scale, translation, and pose variations. We begin with the observation that Relative Position Encoding (RPE) is a good way to bring affine transform generalization to ViTs. RPE, however, can only inject the model with the prior knowledge that nearby pixels are more important than far pixels. Keypoint RPE (KP-RPE) is an extension of this principle, where the significance of pixels is dictated not solely by their proximity but also by their relative positions to specific keypoints within the image. By anchoring the significance of pixels around keypoints, the model can more effectively retain spatial relationships, even when those relationships are disrupted by affine transformations. We show the merit of KP-RPE in face and gait recognition. Experimental results demonstrate its effectiveness in improving face recognition performance on low-quality images, particularly where alignment is prone to failure. Code and pre-trained models are available.
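The keypoint-conditioned positional signal KP-RPE builds on can be illustrated by computing, for every ViT patch, its offset to every keypoint. The grid convention, normalized coordinates, and example landmarks below are assumptions; the learned mapping from these offsets to attention biases is not shown:

```python
import numpy as np

def keypoint_relative_offsets(grid_size, keypoints):
    """Offsets of every patch center to every keypoint.

    Returns (P, K, 2), where P = grid_size**2 patches and K keypoints
    are given in normalized [0, 1] image coordinates.
    """
    ticks = (np.arange(grid_size) + 0.5) / grid_size          # patch centers
    cy, cx = np.meshgrid(ticks, ticks, indexing="ij")
    centers = np.stack([cx.ravel(), cy.ravel()], axis=1)      # (P, 2)
    return centers[:, None, :] - np.asarray(keypoints)[None]  # (P, K, 2)

# 14x14 patch grid, 5 hypothetical landmarks (eyes, nose, mouth corners)
kps = np.array([[0.3, 0.35], [0.7, 0.35], [0.5, 0.55],
                [0.35, 0.75], [0.65, 0.75]])
off = keypoint_relative_offsets(14, kps)
print(off.shape)  # (196, 5, 2)
```

Because the offsets move together with the detected keypoints, they stay meaningful under scale, translation, and pose changes, unlike a fixed pixel-grid RPE.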
https://arxiv.org/abs/2403.14852
We propose Semantic Facial Feature Control (SeFFeC) - a novel method for fine-grained face shape editing. Our method enables the manipulation of human-understandable, semantic face features, such as nose length or mouth width, which are defined by different groups of facial landmarks. In contrast to existing methods, the use of facial landmarks enables precise measurement of the facial features, which in turn enables training SeFFeC without any manually annotated labels. SeFFeC consists of a transformer-based encoder network that takes a latent vector of a pre-trained generative model and a facial feature embedding as input, and learns to modify the latent vector to perform the desired face edit operation. To ensure that the desired feature measurement is changed towards the target value without altering uncorrelated features, we introduce a novel semantic face feature loss. Qualitative and quantitative results show that SeFFeC enables precise and fine-grained control of 23 facial features, some of which could not previously be controlled by other methods, without requiring manual annotations. Unlike existing methods, SeFFeC also provides deterministic control over the exact values of the facial features and more localised and disentangled face edits.
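A semantic face feature in this sense is just a measurement derived from a landmark group. The sketch below computes one such feature; the 68-point indices and the inter-ocular normalization are hypothetical illustrations, not SeFFeC's actual feature definitions:

```python
import numpy as np

# Hypothetical mouth-corner indices in a 68-point annotation scheme.
MOUTH_LEFT, MOUTH_RIGHT = 48, 54

def mouth_width(landmarks, inter_ocular):
    """Mouth width normalized by inter-ocular distance, so the
    measurement is comparable across face scales."""
    w = np.linalg.norm(landmarks[MOUTH_RIGHT] - landmarks[MOUTH_LEFT])
    return w / inter_ocular

lm = np.zeros((68, 2))
lm[MOUTH_LEFT] = [0.40, 0.70]
lm[MOUTH_RIGHT] = [0.60, 0.70]
print(round(mouth_width(lm, inter_ocular=0.25), 3))  # 0.8
```

Because such measurements are computed automatically from detected landmarks, they can serve as free training targets, which is what removes the need for manual annotation.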
https://arxiv.org/abs/2403.13972
With the comprehensive research conducted on various face analysis tasks, there is growing interest among researchers in developing a unified approach to face perception. Existing methods mainly discuss unified representation and training, which lack task extensibility and application efficiency. To tackle this issue, we focus on a unified model structure, exploring a face generalist model. As an intuitive design, Naive Faceptor enables tasks with the same output shape and granularity to share the structural design of a standardized output head, achieving improved task extensibility. Furthermore, Faceptor is proposed to adopt a well-designed single-encoder dual-decoder architecture, allowing task-specific queries to represent newly introduced semantics. This design enhances the unification of the model structure while improving application efficiency in terms of storage overhead. Additionally, we introduce Layer-Attention into Faceptor, enabling the model to adaptively select features from optimal layers to perform the desired tasks. Through joint training on 13 face perception datasets, Faceptor achieves exceptional performance in facial landmark localization, face parsing, age estimation, expression recognition, binary attribute classification, and face recognition, achieving or surpassing specialized methods in most tasks. Our training framework can also be applied to auxiliary supervised learning, significantly improving performance in data-sparse tasks such as age estimation and expression recognition. The code and models will be made publicly available at this https URL.
https://arxiv.org/abs/2403.09500
In this work we focus on learning facial representations that can be adapted to train effective face recognition models, particularly in the absence of labels. Firstly, compared with existing labelled face datasets, a vastly larger magnitude of unlabeled faces exists in the real world. We explore a learning strategy for these unlabeled facial images through self-supervised pretraining to transfer generalized face recognition performance. Moreover, motivated by one recent finding, namely that the face saliency area is critical for face recognition, instead of constructing pretraining augmentations from randomly cropped image blocks, we utilize patches localized by extracted facial landmarks. This enables our method, namely LAndmark-based Facial Self-supervised learning (LAFS), to learn key representations that are more critical for face recognition. We also incorporate two landmark-specific augmentations which introduce more diversity of landmark information to further regularize the learning. With the learned landmark-based facial representations, we further adapt them for face recognition with regularization that mitigates variations in landmark positions. Our method achieves significant improvement over the state of the art on multiple face recognition benchmarks, especially in more challenging few-shot scenarios.
https://arxiv.org/abs/2403.08161
Multimodal deep learning methods capture synergistic features from multiple modalities and have the potential to improve accuracy for stress detection compared to unimodal methods. However, this accuracy gain typically comes at high computational cost due to the high-dimensional feature spaces, especially for intermediate fusion. Dimensionality reduction is one way to optimize multimodal learning by simplifying data and making the features more amenable to processing and analysis, thereby reducing computational complexity. This paper introduces an intermediate multimodal fusion network with manifold-learning-based dimensionality reduction. The multimodal network generates independent representations from biometric signals and facial landmarks through a 1D-CNN and a 2D-CNN. Finally, these features are fused and fed to another 1D-CNN layer, followed by a fully connected dense layer. We compared various dimensionality reduction techniques for different variations of unimodal and multimodal networks. We observe that intermediate-level fusion with the Multi-Dimensional Scaling (MDS) manifold method showed promising results, with an accuracy of 96.00% in a Leave-One-Subject-Out Cross-Validation (LOSO-CV) paradigm, over other dimensionality reduction methods. MDS had the highest computational cost among the manifold learning methods. However, while outperforming other networks, it reduced the computational cost of the proposed networks by 25% compared to six well-known conventional feature selection methods used in the preprocessing step.
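Classical MDS itself is compact enough to sketch from scratch: double-center the squared distance matrix and keep the top eigenvectors. This is a generic illustration of the technique; the paper may instead use metric MDS as implemented in a library such as scikit-learn:

```python
import numpy as np

def classical_mds(X, n_components=2):
    """Classical MDS: double-center the squared Euclidean distance
    matrix and embed via its top eigenvectors."""
    D2 = np.square(np.linalg.norm(X[:, None] - X[None, :], axis=-1))
    n = len(X)
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ D2 @ J                        # Gram matrix
    w, V = np.linalg.eigh(B)                     # ascending eigenvalues
    idx = np.argsort(w)[::-1][:n_components]     # keep the largest
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0))

# 100 samples of a 64-D fused feature vector reduced to 2-D
Z = classical_mds(np.random.rand(100, 64), n_components=2)
print(Z.shape)  # (100, 2)
```

In the fusion pipeline described above, such a reduction would be applied to the per-modality (or fused) features before the final CNN and dense layers, shrinking the dimensionality they must process.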
https://arxiv.org/abs/2403.08077
In this paper, we consider a novel and practical case of talking face video generation. Specifically, we focus on scenarios involving multi-people interactions, where the talking context, such as an audience or surroundings, is present. In these situations, video generation should take the context into consideration in order to generate video content naturally aligned with the driving audio and spatially coherent with the context. To achieve this, we provide a two-stage, cross-modal controllable video generation pipeline, taking facial landmarks as an explicit and compact control signal to bridge the driving audio, talking context, and generated videos. Inside this pipeline, we devise a 3D video diffusion model, allowing for efficient control of both spatial conditions (landmarks and context video) and the audio condition for temporally coherent generation. The experimental results verify the advantage of the proposed method over other baselines in terms of audio-video synchronization, video fidelity, and frame consistency.
https://arxiv.org/abs/2402.18092
In this work, we tackle the challenge of enhancing the realism and expressiveness in talking head video generation by focusing on the dynamic and nuanced relationship between audio cues and facial movements. We identify the limitations of traditional techniques, which often fail to capture the full spectrum of human expressions and the uniqueness of individual facial styles. To address these issues, we propose EMO, a novel framework that utilizes a direct audio-to-video synthesis approach, bypassing the need for intermediate 3D models or facial landmarks. Our method ensures seamless frame transitions and consistent identity preservation throughout the video, resulting in highly expressive and lifelike animations. Experimental results demonstrate that EMO is able to produce not only convincing speaking videos but also singing videos in various styles, significantly outperforming existing state-of-the-art methodologies in terms of expressiveness and realism.
https://arxiv.org/abs/2402.17485
Video conferencing has attracted much more attention recently. High fidelity and low bandwidth are two major objectives of video compression for video conferencing applications. Most pioneering methods rely on classic video compression codecs without high-level feature embedding and thus cannot reach extremely low bandwidth. Recent works instead employ model-based neural compression to acquire ultra-low bitrates using sparse representations of each frame, such as facial landmark information, but these approaches cannot maintain high fidelity due to 2D image-based warping. In this paper, we propose a novel low-bandwidth neural compression approach for high-fidelity portrait video conferencing that uses implicit radiance fields to achieve both major objectives. We leverage dynamic neural radiance fields to reconstruct a high-fidelity talking head with expression features, which are represented as frame substitutions for transmission. The overall system employs a deep model to encode expression features at the sender and reconstructs the portrait at the receiver, with volume rendering as the decoder, for ultra-low bandwidth. In particular, owing to the characteristics of the neural-radiance-field-based model, our compression approach is resolution-agnostic: the low bandwidth achieved by our approach is independent of video resolution, while fidelity is maintained for higher-resolution reconstruction. Experimental results demonstrate that our novel framework can (1) construct ultra-low-bandwidth video conferencing, (2) maintain high-fidelity portraits, and (3) achieve better performance on high-resolution video compression than previous works.
https://arxiv.org/abs/2402.16599
In real-world environments, background noise significantly degrades the intelligibility and clarity of human speech. Audio-visual speech enhancement (AVSE) attempts to restore speech quality, but existing methods often fall short, particularly in dynamic noise conditions. This study investigates the inclusion of emotion as a novel contextual cue within AVSE, hypothesizing that incorporating emotional understanding can improve speech enhancement performance. We propose a novel emotion-aware AVSE system that leverages both auditory and visual information. It extracts emotional features from the facial landmarks of the speaker and fuses them with corresponding audio and visual modalities. This enriched data serves as input to a deep UNet-based encoder-decoder network, specifically designed to orchestrate the fusion of multimodal information enhanced with emotion. The network iteratively refines the enhanced speech representation through an encoder-decoder architecture, guided by perceptually-inspired loss functions for joint learning and optimization. We train and evaluate the model on the CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) dataset, a rich repository of audio-visual recordings with annotated emotions. Our comprehensive evaluation demonstrates the effectiveness of emotion as a contextual cue for AVSE. By integrating emotional features, the proposed system achieves significant improvements in both objective and subjective assessments of speech quality and intelligibility, especially in challenging noise environments. Compared to baseline AVSE and audio-only speech enhancement systems, our approach exhibits a noticeable increase in PESQ and STOI, indicating higher perceptual quality and intelligibility. Large-scale listening tests corroborate these findings, suggesting improved human understanding of enhanced speech.
https://arxiv.org/abs/2402.16394