In recent years, deep learning has greatly streamlined the process of generating realistic fake face images. Aware of the dangers, researchers have developed various tools to spot these counterfeits. Yet none has asked the fundamental question: which digital manipulations make a real photographic face image fake, while others do not? In this paper, we put face forgery in a semantic context and define computational methods that alter semantic face attributes beyond human discrimination thresholds as the sources of face forgery. Guided by this new definition, we construct a large face forgery image dataset in which each image is associated with a set of labels organized in a hierarchical graph. Our dataset enables two new testing protocols that probe the generalization of face forgery detectors. Moreover, we propose a semantics-oriented face forgery detection method that captures label relations and prioritizes the primary task (\ie, real or fake face detection). We show that the proposed dataset, used as a test set, successfully exposes the weaknesses of current detectors and, used as a training set, consistently improves their generalizability. Additionally, we demonstrate the superiority of our semantics-oriented method over traditional binary and multi-class classification-based detectors.
https://arxiv.org/abs/2405.08487
In the ever-changing world of technology, continuous authentication and comprehensive access management are essential during user interactions with a device. Split Learning (SL) and Federated Learning (FL) have recently emerged as promising technologies for training a decentralized Machine Learning (ML) model. With the increasing use of smartphones and Internet of Things (IoT) devices, these distributed technologies enable users with limited resources to complete neural network model training with server assistance and collaboratively combine knowledge between different nodes. In this study, we propose combining these technologies to address the continuous authentication challenge while protecting user privacy and limiting device resource usage. However, the model's training is slowed due to SL sequential training and resource differences between IoT devices with different specifications. Therefore, we use a cluster-based approach to group devices with similar capabilities to mitigate the impact of slow devices while filtering out the devices incapable of training the model. In addition, we address the efficiency and robustness of training ML models by using SL and FL techniques to train the clients simultaneously while analyzing the overhead burden of the process. Following clustering, we select the best set of clients to participate in training through a Genetic Algorithm (GA) optimized on a carefully designed list of objectives. The performance of our proposed framework is compared to baseline methods, and the advantages are demonstrated using a real-life UMDAA-02-FD face detection dataset. The results show that CRSFL, our proposed approach, maintains high accuracy and reduces the overhead burden in continuous authentication scenarios while preserving user privacy.
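The abstract above selects the training cohort with a Genetic Algorithm over a list of objectives it does not spell out, so the following is only a minimal, generic GA sketch for size-k client-subset selection; the fitness function, operators, and hyperparameters (`generations`, `pop_size`, the 0.2 mutation rate) are all assumptions, not CRSFL's actual design.

```python
import random

def ga_select_clients(clients, k, fitness, generations=50, pop_size=20, seed=0):
    """Evolve a size-k subset of client ids that maximizes `fitness(subset)`.

    A generic GA sketch (truncation selection, union crossover, swap
    mutation); the paper's actual objective list is not specified here.
    """
    rng = random.Random(seed)
    pop = [rng.sample(clients, k) for _ in range(pop_size)]  # random subsets
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]  # elitist truncation selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            pool = list(dict.fromkeys(a + b))  # union of parent genes
            child = rng.sample(pool, k)        # crossover: keep k distinct
            if rng.random() < 0.2:             # mutation: swap in a new client
                unused = [c for c in clients if c not in child]
                if unused:
                    child[rng.randrange(k)] = rng.choice(unused)
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)
```

In a real deployment the fitness would combine the paper's objectives (accuracy proxy, device resources, overhead); here any callable scoring a subset works.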
https://arxiv.org/abs/2405.07174
Incorporating human-perceptual intelligence into model training has been shown to increase the generalization capability of models in several difficult biometric tasks, such as presentation attack detection (PAD) and detection of synthetic samples. After the initial collection phase, human visual saliency (e.g., eye-tracking data or handwritten annotations) can be integrated into model training through attention mechanisms, augmented training samples, or human perception-related components of loss functions. Despite these successes, a vital but seemingly neglected aspect of any saliency-based training is the level of saliency granularity (e.g., bounding boxes, single saliency maps, or saliency aggregated from multiple subjects) necessary to balance the full benefits of human saliency against the cost of its collection. In this paper, we explore several different levels of saliency granularity and demonstrate that increased generalization capabilities for PAD and synthetic face detection can be achieved by using simple yet effective saliency post-processing techniques across several different CNNs.
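One of the granularity levels named above is saliency aggregated from multiple subjects. A minimal sketch of such post-processing, assuming per-subject maps are same-sized grids normalized to [0, 1] (the averaging-plus-threshold rule is an illustrative choice, not necessarily the paper's):

```python
def aggregate_saliency(maps, threshold=0.5):
    """Average saliency maps from multiple subjects, then binarize.

    `maps` is a list of same-sized 2-D lists with values in [0, 1].
    Returns (mean_map, binary_map): one simple way to coarsen granularity
    from per-subject maps to a single consensus map.
    """
    n = len(maps)
    rows, cols = len(maps[0]), len(maps[0][0])
    mean_map = [[sum(m[r][c] for m in maps) / n for c in range(cols)]
                for r in range(rows)]
    binary_map = [[1 if v >= threshold else 0 for v in row] for row in mean_map]
    return mean_map, binary_map
```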
https://arxiv.org/abs/2405.00650
The rapid development of diffusion models has triggered diverse applications. Identity-preserving text-to-image generation (ID-T2I) in particular has received significant attention due to its wide range of application scenarios, such as AI portraits and advertising. While existing ID-T2I methods have demonstrated impressive results, several key challenges remain: (1) it is hard to maintain the identity characteristics of reference portraits accurately, (2) the generated images lack aesthetic appeal, especially while enforcing identity retention, and (3) existing methods cannot be compatible with LoRA-based and Adapter-based methods simultaneously. To address these issues, we present \textbf{ID-Aligner}, a general feedback learning framework to enhance ID-T2I performance. To address the loss of identity features, we introduce identity consistency reward fine-tuning, which utilizes feedback from face detection and recognition models to improve identity preservation in generated images. Furthermore, we propose identity aesthetic reward fine-tuning, leveraging rewards from human-annotated preference data and automatically constructed feedback on character structure generation to provide aesthetic tuning signals. Thanks to its universal feedback fine-tuning framework, our method can be readily applied to both LoRA and Adapter models, achieving consistent performance gains. Extensive experiments on the SD1.5 and SDXL diffusion models validate the effectiveness of our approach. \textbf{Project Page: \url{this https URL}}
https://arxiv.org/abs/2404.15449
This paper introduces a new dispersed Haar-like filter for efficient face detection. The basic idea for finding the filter is maximising between-class variance while minimising within-class variance. The proposed filters can be considered an optimal configuration of dispersed Haar-like filters, i.e., filters with disjoint black and white parts.
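The between-class versus within-class variance criterion above is essentially a Fisher score over a filter's responses on face and non-face samples. A minimal sketch, assuming the filter response for an image is already computed (e.g., as mean of white-part pixels minus mean of black-part pixels):

```python
def fisher_score(responses_pos, responses_neg):
    """Between-class over within-class variance of one filter's responses.

    The optimal dispersed Haar-like configuration is the one whose
    responses maximize this ratio across face (pos) and non-face (neg)
    samples.
    """
    def mean(xs):
        return sum(xs) / len(xs)

    def var(xs, mu):
        return sum((x - mu) ** 2 for x in xs) / len(xs)

    mu_p, mu_n = mean(responses_pos), mean(responses_neg)
    between = (mu_p - mu_n) ** 2
    within = var(responses_pos, mu_p) + var(responses_neg, mu_n)
    return between / within if within > 0 else float("inf")
```

Searching over candidate black/white pixel configurations and keeping the highest-scoring one is then a direct instantiation of the selection idea.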
https://arxiv.org/abs/2404.10476
In this paper, we propose a physics-inspired contrastive learning paradigm for low-light enhancement, called PIE. PIE primarily addresses three issues: (i) To resolve the problem of existing learning-based methods often training an LLE model with strict pixel-correspondence image pairs, we eliminate the need for pixel-correspondence paired training data and instead train with unpaired images. (ii) To address the disregard for negative samples and the inadequacy of their generation in existing methods, we incorporate physics-inspired contrastive learning for LLE and design the Bag of Curves (BoC) method to generate more reasonable negative samples that closely adhere to the underlying physical imaging principle. (iii) To overcome the reliance on semantic ground truths in existing methods, we propose an unsupervised regional segmentation module, ensuring regional brightness consistency while eliminating the dependency on semantic ground truths. Overall, the proposed PIE can effectively learn from unpaired positive/negative samples and smoothly realize non-semantic regional enhancement, which is clearly different from existing LLE efforts. Besides the novel architecture of PIE, we explore the gain of PIE on downstream tasks such as semantic segmentation and face detection. Training on readily available open data and extensive experiments demonstrate that our method surpasses the state-of-the-art LLE models over six independent cross-scenes datasets. PIE runs fast with reasonable GFLOPs at test time, making it easy to use on mobile devices.
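The abstract does not define the exact curves in Bag of Curves, so the following is a purely hypothetical sketch of the idea of generating physically plausible negatives: power-law (gamma) tone curves mimic over- and under-exposure, keeping synthetic negatives on the imaging manifold rather than using arbitrary corruptions.

```python
def gamma_curve(pixels, gamma):
    """Apply a power-law (gamma) tone curve to pixel values in [0, 1]."""
    return [p ** gamma for p in pixels]

def bag_of_curves_negatives(pixels, gammas=(0.25, 0.5, 2.0, 4.0)):
    """Hypothetical stand-in for BoC negative-sample generation.

    Each gamma mimics a different exposure curve: gamma < 1 brightens
    (over-exposed negative), gamma > 1 darkens (under-exposed negative).
    """
    return [gamma_curve(pixels, g) for g in gammas]
```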
https://arxiv.org/abs/2404.04586
We propose VLOGGER, a method for audio-driven human video generation from a single input image of a person, which builds on the success of recent generative diffusion models. Our method consists of 1) a stochastic human-to-3d-motion diffusion model, and 2) a novel diffusion-based architecture that augments text-to-image models with both spatial and temporal controls. This supports the generation of high quality video of variable length, easily controllable through high-level representations of human faces and bodies. In contrast to previous work, our method does not require training for each person, does not rely on face detection and cropping, generates the complete image (not just the face or the lips), and considers a broad spectrum of scenarios (e.g. visible torso or diverse subject identities) that are critical to correctly synthesize humans who communicate. We also curate MENTOR, a new and diverse dataset with 3d pose and expression annotations, one order of magnitude larger than previous ones (800,000 identities) and with dynamic gestures, on which we train and ablate our main technical contributions. VLOGGER outperforms state-of-the-art methods in three public benchmarks, considering image quality, identity preservation and temporal consistency while also generating upper-body gestures. We analyze the performance of VLOGGER with respect to multiple diversity metrics, showing that our architectural choices and the use of MENTOR benefit training a fair and unbiased model at scale. Finally we show applications in video editing and personalization.
https://arxiv.org/abs/2403.08764
This technical report presents a diffusion model based framework for face swapping between two portrait images. The basic framework consists of three components, i.e., IP-Adapter, ControlNet, and Stable Diffusion's inpainting pipeline, for face feature encoding, multi-conditional generation, and face inpainting, respectively. In addition, we introduce facial guidance optimization and CodeFormer-based blending to further improve the generation quality. Specifically, we adopt a recent lightweight customization method (i.e., DreamBooth-LoRA) to guarantee identity consistency by 1) using a rare identifier "sks" to represent the source identity, and 2) injecting the image features of the source portrait into each cross-attention layer like the text features. We then resort to the strong inpainting ability of Stable Diffusion, using the canny edge image and face detection annotation of the target portrait as conditions to guide ControlNet's generation and align the source portrait with the target portrait. To further correct face alignment, we add a facial guidance loss to optimize the text embedding during sample generation.
https://arxiv.org/abs/2403.01108
Movement disorders are typically diagnosed by consensus-based expert evaluation of clinically acquired patient videos. However, such broad sharing of patient videos poses risks to patient privacy. Face blurring can be used to de-identify videos, but this process is often manual and time-consuming. Available automated face blurring techniques are subject to either excessive, inconsistent, or insufficient facial blurring - all of which can be disastrous for video assessment and patient privacy. Furthermore, assessing movement disorders in these videos is often subjective. The extraction of quantifiable kinematic features can help inform movement disorder assessment in these videos, but existing methods to do this are prone to errors when using pre-blurred videos. We have developed open-source software called SecurePose that achieves both reliable face blurring and automated kinematic extraction in patient videos recorded in a clinic setting using an iPad. SecurePose extracts kinematics using a pose estimation method (OpenPose), tracks and uniquely identifies all individuals in the video, identifies the patient, and performs face blurring. The software was validated on gait videos recorded during outpatient clinic visits of 116 children with cerebral palsy. The validation involved assessing the intermediate steps of kinematics extraction and comparing face blurring against manual blurring (ground truth). Moreover, when SecurePose was compared with six selected existing methods, it outperformed the other methods in automated face detection and achieved ceiling accuracy in 91.08% less time than a robust manual face blurring method. Furthermore, ten experienced researchers found SecurePose easy to learn and use, as evidenced by the System Usability Scale. The results of this work validated the performance and usability of SecurePose on clinically recorded gait videos for face blurring and kinematics extraction.
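To make the de-identification step above concrete, here is a minimal sketch of blurring a detected face region in a grayscale image represented as a 2-D list. It uses a simple box blur with radius `k`; a production tool like the one described would operate on real video frames with a stronger (e.g., Gaussian) blur, so treat the kernel choice and data layout as illustrative assumptions.

```python
def blur_region(img, top, left, height, width, k=1):
    """Box-blur a rectangular face region of a 2-D grayscale image.

    Each pixel inside the box becomes the mean of its (2k+1)x(2k+1)
    neighborhood, clipped to the image bounds; pixels outside the box
    are left untouched.
    """
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for r in range(top, min(top + height, h)):
        for c in range(left, min(left + width, w)):
            vals = [img[rr][cc]
                    for rr in range(max(0, r - k), min(h, r + k + 1))
                    for cc in range(max(0, c - k), min(w, c + k + 1))]
            out[r][c] = sum(vals) / len(vals)
    return out
```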
https://arxiv.org/abs/2402.14143
The majority of computer vision applications that handle images featuring humans use face detection as a core component. Despite extensive research on the topic, face detection still has open issues, and its accuracy and speed can still be improved. This review paper presents the progress made in this area as well as the substantial issues that still need to be tackled, and provides research directions that can be taken up as research projects in the field of face detection.
https://arxiv.org/abs/2402.03796
Detecting glass regions is a challenging task due to the ambiguity of their transparency and reflection properties. Transparent glass surfaces share the visual appearance of both arbitrary transmitted background scenes and reflected objects, and thus have no fixed patterns. Recent visual foundation models, trained on vast amounts of data, have demonstrated stunning performance in image perception and image generation. To segment glass surfaces with higher accuracy, we make full use of two visual foundation models: Segment Anything (SAM) and Stable Diffusion. Specifically, we devise a simple glass surface segmentor named GEM, which consists only of a SAM backbone, a simple feature pyramid, a discerning query selection module, and a mask decoder. The discerning query selection adaptively identifies glass surface features and assigns them as initialized queries in the mask decoder. We also propose a synthetic but photorealistic large-scale glass surface detection dataset, dubbed S-GSD, generated via a diffusion model at four different scales, containing 1x, 5x, 10x, and 20x of the original real data size. This dataset is a feasible source for transfer learning. The scale of synthetic data has a positive impact on transfer learning, although the improvement gradually saturates as the amount of data increases. Extensive experiments demonstrate that GEM achieves a new state-of-the-art on the GSD-S validation set (IoU +2.1%). Codes and datasets are available at: this https URL.
https://arxiv.org/abs/2401.15282
The goal of this paper is automatic character-aware subtitle generation. Given a video and a minimal amount of metadata, we propose an audio-visual method that generates a full transcript of the dialogue, with precise speech timestamps and the speaking character identified. The key idea is to first use audio-visual cues to select a set of high-precision audio exemplars for each character, and then use these exemplars to classify all speech segments by speaker identity. Notably, the method does not require face detection or tracking. We evaluate the method on a variety of TV sitcoms, including Seinfeld, Frasier, and Scrubs. We envision this system being useful for the automatic generation of subtitles to improve the accessibility of the vast amount of videos available on modern streaming services. Project page: \url{this https URL}
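The exemplar-based classification step above can be sketched as nearest-exemplar matching. The use of fixed-length audio embeddings and cosine similarity is an assumption for illustration; the paper's actual features and scoring rule may differ.

```python
def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def classify_segment(segment_emb, exemplars):
    """Assign a speech segment to the character with the closest exemplar.

    `exemplars` maps character name -> list of exemplar embeddings; each
    segment is scored by its best cosine match among a character's
    exemplars.
    """
    return max(exemplars,
               key=lambda ch: max(cosine(segment_emb, e) for e in exemplars[ch]))
```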
https://arxiv.org/abs/2401.12039
The auto-management of vehicle entrance and parking in any organization is a complex challenge encompassing record-keeping, efficiency, and security concerns. Manual methods for tracking vehicles and finding parking spaces are slow and time-consuming. To solve this problem, we have utilized state-of-the-art deep learning models to automate the process of vehicle entrance and parking for any organization. To ensure security, our system integrates vehicle detection, license plate verification, and face detection and recognition models to verify that the person and vehicle are registered with the organization. We trained multiple deep learning models for vehicle detection, license plate detection, and face detection and recognition; the YOLOv8n model outperformed all the others. Furthermore, license plate recognition is facilitated by Google's Tesseract-OCR engine. By integrating these technologies, the system offers efficient vehicle detection, precise identification, streamlined record keeping, and optimized parking slot allocation in buildings, thereby enhancing convenience, accuracy, and security. Future research opportunities lie in fine-tuning system performance for a wide range of real-world applications.
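Raw OCR output from an engine like Tesseract typically contains whitespace, punctuation, and mixed case, so systems like the one above usually normalize it before matching against a registration database. A minimal, hypothetical post-processing step (the actual plate format rules are region-specific and not given in the abstract):

```python
import re

def normalize_plate(ocr_text):
    """Clean raw OCR output into a canonical alphanumeric plate string.

    Uppercases the text and drops everything that is not A-Z or 0-9;
    a simplistic rule standing in for region-specific plate validation.
    """
    return re.sub(r"[^A-Z0-9]", "", ocr_text.upper())
```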
https://arxiv.org/abs/2312.02699
Face detectors are becoming a crucial component of many applications, including surveillance, that often have to run on edge devices with limited processing power and memory. Therefore, there's a pressing demand for compact face detection models that can function efficiently across resource-constrained devices. Over recent years, network pruning techniques have attracted a lot of attention from researchers. These methods haven't been well examined in the context of face detectors, despite their expanding popularity. In this paper, we implement filter pruning on two already small and compact face detectors, named EXTD (Extremely Tiny Face Detector) and EResFD (Efficient ResNet Face Detector). The main pruning algorithm that we utilize is Filter Pruning via Geometric Median (FPGM), combined with the Soft Filter Pruning (SFP) iterative procedure. We also apply L1 Norm pruning, as a baseline to compare with the proposed approach. The experimental evaluation on the WIDER FACE dataset indicates that the proposed approach has the potential to further reduce the model size of already lightweight face detectors, with limited accuracy loss, or even with small accuracy gain for low pruning rates.
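The FPGM criterion named above prunes the filters closest to the geometric median of a layer, i.e., the most replaceable ones. A minimal sketch using the summed pairwise L2 distance as the ranking score (filters with the smallest total distance sit nearest the median); the flattened-weight representation and hard index selection are simplifications of the layer-wise, soft (SFP) procedure the abstract describes.

```python
def fpgm_prune_indices(filters, prune_ratio):
    """Rank conv filters by total L2 distance to all other filters (FPGM).

    `filters` is a list of flattened weight vectors; returns the indices
    of the filters to prune (those nearest the geometric median), which
    SFP would zero out and allow to recover between epochs.
    """
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

    scores = [sum(dist(f, g) for g in filters) for f in filters]
    n_prune = int(len(filters) * prune_ratio)
    order = sorted(range(len(filters)), key=lambda i: scores[i])
    return sorted(order[:n_prune])
```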
https://arxiv.org/abs/2311.16613
Human eye gaze estimation is an important cognitive ingredient for successful human-robot interaction, enabling the robot to read and predict human behavior. We approach this problem using artificial neural networks and build a modular system that estimates gaze from separately cropped eyes, taking advantage of existing well-functioning components for face detection (RetinaFace) and head pose estimation (6DRepNet). Our proposed method does not require any special hardware or infrared filters but uses a standard notebook built-in RGB camera, as is common for appearance-based methods. Using the MetaHuman tool, we also generated a large synthetic dataset of more than 57,000 human faces and made it publicly available. Including this dataset (with eye gaze and head pose information) on top of the standard Columbia Gaze dataset when training the model led to better accuracy, with a mean average error below two degrees in the eye pitch and yaw directions, which compares favourably to related methods. We also verified the feasibility of our model by preliminary testing in a real-world setting using the built-in 4K camera in the NICO semi-humanoid robot's eye.
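The reported sub-two-degree error in pitch and yaw can be illustrated with a simple per-axis mean absolute error over gaze predictions; the exact averaging the paper uses is not spelled out in the abstract, so this is one plausible reading of the metric.

```python
def mean_angular_error(pred, true):
    """Per-axis mean absolute error, in degrees, of gaze predictions.

    `pred` and `true` are lists of (pitch_deg, yaw_deg) pairs; returns
    the (pitch_error, yaw_error) pair the abstract's "below two degrees"
    claim would be checked against.
    """
    n = len(pred)
    pitch = sum(abs(p[0] - t[0]) for p, t in zip(pred, true)) / n
    yaw = sum(abs(p[1] - t[1]) for p, t in zip(pred, true)) / n
    return pitch, yaw
```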
https://arxiv.org/abs/2311.14175
In response to the global COVID-19 pandemic, there has been a critical demand for protective measures, with face masks emerging as a primary safeguard. The approach involves a two-fold strategy: first, detecting the presence of faces, and second, identifying masks on those faces. This project utilizes deep learning to create a model that can detect face masks in real-time streaming video as well as in images. Face detection, a facet of object detection, finds applications in diverse fields such as security, biometrics, and law enforcement. Various detector systems have been developed and deployed worldwide, with convolutional neural networks chosen for their superior accuracy and speed in object detection. Experimental results attest to the model's excellent accuracy on test data. The primary focus of this research is to enhance security, particularly in sensitive areas. The paper proposes a rapid image pre-processing method with masks centred on faces. Employing feature extraction and a convolutional neural network, the system classifies and detects individuals wearing masks. The research unfolds in three stages: image pre-processing, image cropping, and image classification, which collectively contribute to the identification of masked faces. Continuous surveillance through webcams or CCTV cameras ensures constant monitoring, triggering a security alert if a person is detected without a mask.
https://arxiv.org/abs/2311.10408
Two difficulties make low-light image enhancement a challenging task: first, it needs to consider not only luminance restoration but also image contrast, image denoising, and color distortion simultaneously; second, the effectiveness of existing low-light enhancement methods depends on paired or unpaired training data, resulting in poor generalization performance. To solve these difficult problems, we propose in this paper a new learning-based zero-shot low-light enhancement method built on Retinex decomposition, called ZERRINNet. To this end, we first designed the N-Net network, together with a noise loss term, to denoise the original low-light image by estimating its noise. Moreover, RI-Net is used to estimate the reflection component and illumination component, and to address color distortion and contrast, we use a texture loss term and a segmented smoothing loss to constrain the reflection and illumination components. Finally, our method is a zero-reference enhancement method that is not affected by the training data of paired and unpaired datasets, so its generalization performance is greatly improved; in the paper, we effectively validate it with a homemade real-life low-light dataset and additionally with advanced vision tasks, such as face detection, target recognition, and instance segmentation. We conducted comparative experiments on a large number of public datasets, and the results show that the performance of our method is competitive with current state-of-the-art methods. The code is available at: this https URL
https://arxiv.org/abs/2311.02995
Despite significant research on lightweight deep neural networks (DNNs) designed for edge devices, the current face detectors do not fully meet the requirements for "intelligent" CMOS image sensors (iCISs) integrated with embedded DNNs. These sensors are essential in various practical applications, such as energy-efficient mobile phones and surveillance systems with always-on capabilities. One noteworthy limitation is the absence of suitable face detectors for the always-on scenario, a crucial aspect of image sensor-level applications. These detectors must operate directly with sensor RAW data before the image signal processor (ISP) takes over. This gap poses a significant challenge in achieving optimal performance in such scenarios. Further research and development are necessary to bridge this gap and fully leverage the potential of iCIS applications. In this study, we aim to bridge the gap by exploring extremely low-bit lightweight face detectors, focusing on the always-on face detection scenario for mobile image sensor applications. To achieve this, our proposed model utilizes sensor-aware synthetic RAW inputs, simulating always-on face detection processed "before" the ISP chain. Our approach employs ternary (-1, 0, 1) weights for potential implementations in image sensors, resulting in a relatively simple network architecture with shallow layers and extremely low-bitwidth. Our method demonstrates reasonable face detection performance and excellent efficiency in simulation studies, offering promising possibilities for practical always-on face detectors in real-world applications.
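The ternary (-1, 0, 1) weights mentioned above can be produced by a simple magnitude-threshold quantizer. The threshold rule below (delta = 0.7 x mean absolute weight, as in Ternary Weight Networks) is an assumption for illustration; the paper's exact quantizer is not given in the abstract.

```python
def ternarize(weights, delta_scale=0.7):
    """Quantize a list of float weights to {-1, 0, 1}.

    Weights with magnitude at most delta = delta_scale * mean(|w|) are
    zeroed; the rest keep only their sign. Such weights need no
    multiplier, which suits sensor-side, pre-ISP inference.
    """
    delta = delta_scale * sum(abs(w) for w in weights) / len(weights)
    return [0 if abs(w) <= delta else (1 if w > 0 else -1) for w in weights]
```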
https://arxiv.org/abs/2311.01001
Wearing a mask is one of the important measures to prevent infectious diseases. However, it is difficult to detect people's mask-wearing situation in public places with high traffic flow. To address the above problem, this paper proposes a mask-wearing face detection model based on YOLOv5l. Firstly, Multi-Head Attentional Self-Convolution not only improves the convergence speed of the model but also enhances the accuracy of the model detection. Secondly, the introduction of Swin Transformer Block is able to extract more useful feature information, enhance the detection ability of small targets, and improve the overall accuracy of the model. Our designed I-CBAM module can improve target detection accuracy. In addition, using enhanced feature fusion enables the model to better adapt to object detection tasks of different scales. In the experimentation on the MASK dataset, the results show that the model proposed in this paper achieved a 1.1% improvement in mAP(0.5) and a 1.3% improvement in mAP(0.5:0.95) compared to the YOLOv5l model. Our proposed method significantly enhances the detection capability of mask-wearing.
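The mAP(0.5) and mAP(0.5:0.95) figures above are built on intersection-over-union between predicted and ground-truth boxes: a detection counts as a true positive at mAP(0.5) when its IoU with a matched ground-truth box is at least 0.5. A minimal IoU sketch for axis-aligned `(x1, y1, x2, y2)` boxes:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes.

    Returns 0.0 when the boxes do not overlap; 1.0 for identical boxes.
    """
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```

mAP(0.5:0.95) then averages the AP over IoU thresholds from 0.5 to 0.95 in steps of 0.05.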
https://arxiv.org/abs/2310.10245
Practical video analytics systems deployed in bandwidth-constrained environments, such as autonomous vehicles, perform computer vision tasks like face detection and recognition. In an end-to-end face analytics system, inputs are first compressed using popular video codecs like HEVC and then passed onto modules that perform face detection, alignment, and recognition sequentially. Typically, the modules of these systems are evaluated independently using task-specific imbalanced datasets that can misconstrue performance estimates. In this paper, we perform a thorough end-to-end evaluation of a face analytics system using a driving-specific dataset, which enables meaningful interpretations. We demonstrate how independent task evaluations, dataset imbalances, and inconsistent annotations can lead to incorrect system performance estimates. We propose strategies to create balanced evaluation subsets of our dataset and to make its annotations consistent across multiple analytics tasks and scenarios. We then evaluate the end-to-end system performance sequentially to account for task interdependencies. Our experiments show that our approach provides consistent, accurate, and interpretable estimates of the system's performance, which is critical for real-world applications.
https://arxiv.org/abs/2310.06945