Despite remarkable advancements, mainstream gaze estimation techniques, particularly appearance-based methods, often suffer from performance degradation in uncontrolled environments due to variations in illumination and individual facial attributes. Existing domain adaptation strategies, limited by their need for target domain samples, may fall short in real-world applications. This letter introduces Branch-out Auxiliary Regularization (BAR), an innovative method designed to boost gaze estimation's generalization capabilities without requiring direct access to target domain data. Specifically, BAR integrates two auxiliary consistency regularization branches: one that uses augmented samples to counteract environmental variations, and another that aligns gaze directions with positive source domain samples to encourage the learning of consistent gaze features. These auxiliary pathways strengthen the core network and are integrated in a smooth, plug-and-play manner, facilitating easy adaptation to various other models. Comprehensive experimental evaluations on four cross-dataset tasks demonstrate the superiority of our approach.
https://arxiv.org/abs/2405.01439
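The abstract does not spell out BAR's losses, so the following is only a minimal sketch of the general idea behind the first auxiliary branch: penalizing disagreement between predictions on a sample and its augmented copy. The `toy_model` and `jitter` augmentation are hypothetical stand-ins, not the paper's components.

```python
def consistency_loss(model, x, augment, weight=1.0):
    """Penalize disagreement between the gaze predicted on an image and on
    its augmented copy; a generic stand-in for an auxiliary branch."""
    y = model(x)
    y_aug = model(augment(x))
    # squared L2 distance between the two (pitch, yaw) predictions
    return weight * sum((a - b) ** 2 for a, b in zip(y, y_aug))

# Toy estimator: "gaze" derived from mean pixel intensity (illustration only).
def toy_model(pixels):
    m = sum(pixels) / len(pixels)
    return (0.5 * m, -0.25 * m)

# Brightness jitter as a stand-in for an environmental augmentation.
def jitter(pixels):
    return [p + 0.1 for p in pixels]

x = [0.2, 0.4, 0.6]
loss = consistency_loss(toy_model, x, jitter)
```

In a real setup the augmentation would be photometric (illumination, blur, color) and this penalty would be added to the supervised gaze loss with a weighting coefficient.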
Gaze is an essential prompt for analyzing human behavior and attention. Recently, there has been an increasing interest in determining gaze direction from facial videos. However, video gaze estimation faces significant challenges, such as understanding the dynamic evolution of gaze in video sequences, dealing with static backgrounds, and adapting to variations in illumination. To address these challenges, we propose a simple and novel deep learning model designed to estimate gaze from videos, incorporating a specialized attention module. Our method employs a spatial attention mechanism that tracks spatial dynamics within videos. This technique enables accurate gaze direction prediction through a temporal sequence model, adeptly transforming spatial observations into temporal insights, thereby significantly improving gaze estimation accuracy. Additionally, our approach integrates Gaussian processes to include individual-specific traits, facilitating the personalization of our model with just a few labeled samples. Experimental results confirm the efficacy of the proposed approach, demonstrating its success in both within-dataset and cross-dataset settings. Specifically, our proposed approach achieves state-of-the-art performance on the Gaze360 dataset, improving by $2.5^\circ$ without personalization. Further, by personalizing the model with just three samples, we achieved an additional improvement of $0.8^\circ$. The code and pre-trained models are available at \url{this https URL}.
https://arxiv.org/abs/2404.05215
This paper tackles the problem of passive gaze estimation using both event and frame data. Because individuals' physiological structures differ inherently, it is intractable to estimate gaze accurately from a given state alone. Thus, we reformulate gaze estimation as the quantification of state transitions from the current state to several prior registered anchor states. Technically, we propose a two-stage learning-based gaze estimation framework that divides the whole gaze estimation process into a coarse-to-fine pipeline of anchor state selection and final gaze localization. Moreover, to improve generalization ability, we align a group of local experts with a student network, where a novel denoising distillation algorithm is introduced that uses the denoising diffusion technique to iteratively remove the inherent noise of event data. Extensive experiments demonstrate the effectiveness of the proposed method, which surpasses state-of-the-art methods by a large margin of 15$\%$. The code will be publicly available at this https URL.
https://arxiv.org/abs/2404.00548
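Read as pseudocode, the coarse-to-fine formulation above can be sketched as follows. The anchors, features, and `refine` regressor here are toy stand-ins; the actual method learns both stages from event and frame data.

```python
def estimate_gaze(feature, anchors, refine):
    """Two-stage sketch: (1) coarse - select the nearest registered anchor
    state; (2) fine - predict a residual offset from that anchor to the final
    gaze. `anchors` maps an anchor feature to its (pitch, yaw) in degrees."""
    # Stage 1: nearest anchor by squared feature distance.
    key = min(anchors, key=lambda a: sum((f - g) ** 2 for f, g in zip(feature, a)))
    pitch, yaw = anchors[key]
    # Stage 2: residual correction relative to the chosen anchor.
    dp, dy = refine(feature, key)
    return (pitch + dp, yaw + dy)

anchors = {
    (0.0, 0.0): (0.0, 0.0),    # looking straight ahead
    (1.0, 0.0): (0.0, 30.0),   # looking right
    (0.0, 1.0): (-20.0, 0.0),  # looking down
}

# Toy refinement: scale the feature gap into a small angular offset.
def refine(feature, key):
    return (5.0 * (feature[1] - key[1]), 5.0 * (feature[0] - key[0]))

gaze = estimate_gaze((0.9, 0.1), anchors, refine)
```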
Driver's eye gaze holds a wealth of cognitive and intentional cues crucial for intelligent vehicles. Despite its significance, research on in-vehicle gaze estimation remains limited due to the scarcity of comprehensive and well-annotated datasets in real driving scenarios. In this paper, we present three novel elements to advance in-vehicle gaze research. First, we introduce IVGaze, a pioneering dataset capturing in-vehicle gaze, collected from 125 subjects and covering a large range of gaze and head poses within vehicles. Conventional gaze collection systems are inadequate for in-vehicle use, so for this dataset we propose a new vision-based solution for in-vehicle gaze collection, introducing a refined gaze target calibration method to tackle annotation challenges. Second, our research focuses on in-vehicle gaze estimation leveraging IVGaze. In-vehicle face images often suffer from low resolution, prompting our introduction of a gaze pyramid transformer that leverages transformer-based multilevel feature integration. Expanding upon this, we introduce the dual-stream gaze pyramid transformer (GazeDPTR). Employing perspective transformation, we rotate virtual cameras to normalize images and utilize the camera pose to merge normalized and original images for accurate gaze estimation. GazeDPTR shows state-of-the-art performance on the IVGaze dataset. Third, we explore a novel strategy for gaze zone classification by extending GazeDPTR: we newly define a foundational tri-plane and project gaze onto its planes. Leveraging both positional features from the projection points and visual attributes from the images, we achieve superior performance compared to relying solely on visual features, substantiating the advantage of gaze estimation. Our project is available at https://yihua.zone/work/ivgaze.
https://arxiv.org/abs/2403.15664
Gaze estimation methods often experience significant performance degradation when evaluated across different domains, due to the domain gap between the testing and training data. Existing methods try to address this issue with various domain generalization approaches, but with little success because of the limited diversity of gaze datasets in, e.g., appearance, wearables, and image quality. To overcome these limitations, we propose a novel framework called CLIP-Gaze that utilizes a pre-trained vision-language model to leverage its transferable knowledge. Our framework is the first to leverage a vision-and-language cross-modality approach for the gaze estimation task. Specifically, we extract gaze-relevant features by pushing them away from gaze-irrelevant features, which can be flexibly constructed via language descriptions. To learn more suitable prompts, we propose a personalized context optimization method for text prompt tuning. Furthermore, we utilize the relationships among gaze samples to refine the distribution of gaze-relevant features, thereby improving the generalization capability of the gaze estimation model. Extensive experiments demonstrate the excellent performance of CLIP-Gaze over existing methods on four cross-domain evaluations.
https://arxiv.org/abs/2403.05124
Latest gaze estimation methods require large-scale training data but their collection and exchange pose significant privacy risks. We propose PrivatEyes - the first privacy-enhancing training approach for appearance-based gaze estimation based on federated learning (FL) and secure multi-party computation (MPC). PrivatEyes enables training gaze estimators on multiple local datasets across different users and server-based secure aggregation of the individual estimators' updates. PrivatEyes guarantees that individual gaze data remains private even if a majority of the aggregating servers is malicious. We also introduce a new data leakage attack DualView that shows that PrivatEyes limits the leakage of private training data more effectively than previous approaches. Evaluations on the MPIIGaze, MPIIFaceGaze, GazeCapture, and NVGaze datasets further show that the improved privacy does not lead to a lower gaze estimation accuracy or substantially higher computational costs - both of which are on par with its non-secure counterparts.
https://arxiv.org/abs/2402.18970
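As an illustration of the MPC building block behind server-based secure aggregation, the sketch below uses plain additive secret sharing: each client splits its update into random shares that sum to the update, so no single server ever sees an individual client's gradients. This is a generic textbook scheme under stated assumptions, not PrivatEyes' actual protocol.

```python
import random

def share(update, n_servers, rng):
    """Split one client's model update into n additive shares that sum to the
    update; any n-1 shares alone reveal nothing about it."""
    shares = [[rng.uniform(-1, 1) for _ in update] for _ in range(n_servers - 1)]
    last = [u - sum(col) for u, col in zip(update, zip(*shares))]
    return shares + [last]

def aggregate(vectors):
    """Element-wise sum of a list of equal-length vectors."""
    return [sum(vals) for vals in zip(*vectors)]

rng = random.Random(0)
updates = [[0.1, -0.2], [0.3, 0.4]]              # two clients' gradient updates
per_client = [share(u, 3, rng) for u in updates]  # 3 shares per client
# Server i holds per_client[c][i] for every client c and sums what it sees.
server_sums = [aggregate([per_client[c][i] for c in range(2)]) for i in range(3)]
# Combining the per-server sums recovers only the aggregate update.
total = aggregate(server_sums)
```

Because recombination happens only after per-server summation, the servers jointly learn the aggregate `[0.4, 0.2]` but never an individual client's update.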
Gaze object prediction aims to predict the location and category of the object that is watched by a human. Previous gaze object prediction works use CNN-based object detectors to predict the object's location. However, we find that Transformer-based object detectors can predict more accurate object locations for dense objects in retail scenarios. Moreover, the long-distance modeling capability of the Transformer can help to build relationships between the human head and the gaze object, which is important for the GOP task. To this end, this paper introduces the Transformer into the field of gaze object prediction and proposes an end-to-end Transformer-based gaze object prediction method named TransGOP. Specifically, TransGOP uses an off-the-shelf Transformer-based object detector to detect the location of objects and designs a Transformer-based gaze autoencoder in the gaze regressor to establish long-distance gaze relationships. Moreover, to improve gaze heatmap regression, we propose an object-to-gaze cross-attention mechanism to let the queries of the gaze autoencoder learn the global-memory position knowledge from the object detector. Finally, to make the whole framework end-to-end trainable, we propose a Gaze Box loss to jointly optimize the object detector and gaze regressor by enhancing the gaze heatmap energy in the box of the gaze object. Extensive experiments on the GOO-Synth and GOO-Real datasets demonstrate that our TransGOP achieves state-of-the-art performance on all tracks, i.e., object detection, gaze estimation, and gaze object prediction. Our code will be available at this https URL.
https://arxiv.org/abs/2402.13578
Gaze estimation, the task of predicting where an individual is looking, is a critical task with direct applications in areas such as human-computer interaction and virtual reality. Estimating gaze direction in unconstrained environments is difficult due to the many factors that can obscure the face and eye regions. In this work we propose CrossGaze, a strong baseline for gaze estimation that leverages recent developments in computer vision architectures and attention-based modules. Unlike previous approaches, our method does not require a specialised architecture, utilizing already-established models that we integrate into our architecture and adapt for the task of 3D gaze estimation. This approach allows for seamless updates to the architecture, as any module can be replaced with a more powerful feature extractor. On the Gaze360 benchmark, our model surpasses several state-of-the-art methods, achieving a mean angular error of 9.94 degrees. Our proposed model serves as a strong foundation for future research and development in gaze estimation, paving the way for practical and accurate gaze prediction in real-world scenarios.
https://arxiv.org/abs/2402.08316
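The mean angular error reported above (9.94 degrees on Gaze360) is the standard gaze metric: the angle between the predicted and ground-truth 3D gaze vectors. A minimal version, assuming one common pitch/yaw-to-vector convention (datasets differ in their axis conventions, so treat this mapping as an assumption):

```python
import math

def gaze_vector(pitch, yaw):
    """Convert pitch/yaw (radians) into a 3D unit gaze vector, using one
    common convention (an assumption; check your dataset's definition)."""
    return (-math.cos(pitch) * math.sin(yaw),
            -math.sin(pitch),
            -math.cos(pitch) * math.cos(yaw))

def angular_error_deg(pred, true):
    """Angle in degrees between predicted and ground-truth gaze directions."""
    dot = sum(p * t for p, t in zip(gaze_vector(*pred), gaze_vector(*true)))
    dot = max(-1.0, min(1.0, dot))  # clamp against rounding error
    return math.degrees(math.acos(dot))

# A 10-degree yaw offset yields a 10-degree angular error.
err = angular_error_deg((0.0, 0.0), (0.0, math.radians(10)))
```

Averaging `angular_error_deg` over a test set gives the mean angular error that gaze papers report.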
Advances in face swapping have enabled the automatic generation of highly realistic faces. Yet face swaps are perceived differently than real faces, with key differences in viewer behavior surrounding the eyes. Face swapping algorithms generally place no emphasis on the eyes, relying on pixel or feature matching losses that consider the entire face to guide the training process. We further investigate viewer perception of face swaps, focusing our analysis on the presence of an uncanny valley effect. We additionally propose a novel loss equation for the training of face swapping models, leveraging a pretrained gaze estimation network to directly improve representation of the eyes. We confirm that viewed face swaps do elicit uncanny responses from viewers. Our proposed improvements significantly reduce viewing angle errors between face swaps and their source material. Our method additionally reduces the prevalence of the eyes as a deciding factor when viewers perform deepfake detection tasks. Our findings have implications for face swapping for special effects, as digital avatars, as privacy mechanisms, and more; negative responses from users could limit effectiveness in said applications. Our gaze improvements are a first step towards alleviating negative viewer perceptions via a targeted approach.
https://arxiv.org/abs/2402.03188
Driver's gaze information can be crucial in driving research because of its relation to driver attention. Particularly, the inclusion of gaze data in driving simulators broadens the scope of research studies, as they can relate drivers' gaze patterns to their features and performance. In this paper, we present two gaze region estimation modules integrated in a driving simulator. One uses the 3D Kinect device and the other uses the virtual reality Oculus Rift device. The modules are able to detect which of the seven regions into which the driving scene was divided a driver is gazing at in every processed frame of the route. Four methods were implemented and compared for gaze estimation, all of which learn the relation between gaze displacement and head movement. Two are simpler, based on points that try to capture this relation, and two are based on classifiers such as MLP and SVM. Experiments were carried out with 12 users who drove through the same scenario twice, each time with a different visualization display: first with a big screen and later with Oculus Rift. On the whole, Oculus Rift outperformed Kinect as the best hardware for gaze estimation. The Oculus-based gaze region estimation method with the highest performance achieved an accuracy of 97.94%. The information provided by the Oculus Rift module enriches the driving simulator data and makes a multimodal driving performance analysis possible, in addition to the immersion and realism obtained with the virtual reality experience provided by Oculus.
https://arxiv.org/abs/2402.05248
In this research, we present SLYKLatent, a novel approach for enhancing gaze estimation by addressing appearance instability challenges in datasets due to aleatoric uncertainties, covariate shifts, and test domain generalization. SLYKLatent utilizes Self-Supervised Learning for initial training with facial expression datasets, followed by refinement with a patch-based tri-branch network and an inverse explained-variance-weighted training loss function. Our evaluation on benchmark datasets achieves an 8.7% improvement on Gaze360, rivals top MPIIFaceGaze results, and leads on a subset of ETH-XGaze by 13%, surpassing existing methods by significant margins. Adaptability tests on RAF-DB and AffectNet show 86.4% and 60.9% accuracies, respectively. Ablation studies confirm the effectiveness of SLYKLatent's novel components. This approach has strong potential in human-robot interaction.
https://arxiv.org/abs/2402.01555
Recently, appearance-based gaze estimation has been attracting attention in computer vision, and remarkable improvements have been achieved using various deep learning techniques. Despite such progress, most methods aim to infer gaze vectors from images directly, which causes overfitting to person-specific appearance factors. In this paper, we address these challenges and propose a novel framework: Stochastic subject-wise Adversarial gaZE learning (SAZE), which trains a network to generalize over the appearance of subjects. We design a Face generalization Network (Fgen-Net) using a face-to-gaze encoder, a face identity classifier, and a proposed adversarial loss. The proposed loss generalizes face appearance factors so that the identity classifier infers a uniform probability distribution. In addition, Fgen-Net is trained by a learning mechanism that optimizes the network by reselecting a subset of subjects at every training step to avoid overfitting. Our experimental results verify the robustness of the method in that it yields state-of-the-art performance, achieving 3.89 and 4.42 on the MPIIGaze and EyeDiap datasets, respectively. Furthermore, we demonstrate the positive generalization effect by conducting further experiments using face images in different styles generated by a generative model.
https://arxiv.org/abs/2401.13865
Despite the recent remarkable achievements in gaze estimation, efficient and accurate personalization of gaze estimation without labels is a practical problem but rarely touched on in the literature. To achieve efficient personalization, we take inspiration from the recent advances in Natural Language Processing (NLP) by updating a negligible number of parameters, "prompts", at test time. Specifically, the prompt is attached without perturbing the original network and can contain less than 1% of a ResNet-18's parameters. Our experiments show the high efficiency of the prompt tuning approach: it can be 10 times faster in terms of adaptation speed than the compared methods. However, it is non-trivial to update the prompt for personalized gaze estimation without labels. At test time, it is essential to ensure that minimizing a particular unsupervised loss leads to the goal of minimizing gaze estimation error. To address this difficulty, we propose to meta-learn the prompt to ensure that its updates align with this goal. Our experiments show that the meta-learned prompt can be effectively adapted even with a simple symmetry loss. In addition, we experiment on four cross-dataset validations to show the remarkable advantages of the proposed method.
https://arxiv.org/abs/2401.01577
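A "symmetry loss" for gaze is commonly taken to mean that a horizontally flipped face should yield a mirrored (negated) yaw. Below is a toy sketch of test-time prompt tuning against such a loss, with a hypothetical additive prompt and a finite-difference gradient standing in for backpropagation; none of this is the paper's actual architecture.

```python
def symmetry_loss(model, x, prompt):
    """Unsupervised test-time objective: the yaw predicted on a horizontally
    flipped image should be the negation of the yaw on the original."""
    yaw = model(x, prompt)
    yaw_flipped = model(list(reversed(x)), prompt)
    return (yaw + yaw_flipped) ** 2

# Toy estimator: yaw from a left/right intensity imbalance, plus a learned
# prompt acting as an additive bias (a stand-in for real prompt parameters).
def toy_model(pixels, prompt):
    half = len(pixels) // 2
    return (sum(pixels[:half]) - sum(pixels[half:])) + prompt

def tune_prompt(model, x, prompt, lr=0.1, steps=50, eps=1e-4):
    """Update only the prompt by finite-difference gradient descent on the
    symmetry loss; the model itself stays frozen."""
    for _ in range(steps):
        grad = (symmetry_loss(model, x, prompt + eps)
                - symmetry_loss(model, x, prompt - eps)) / (2 * eps)
        prompt -= lr * grad
    return prompt

x = [0.9, 0.8, 0.1, 0.2]          # a "face" whose flip negates the toy yaw
prompt = tune_prompt(toy_model, x, prompt=1.0)
```

Here the loss is exactly `(2 * prompt)**2`, so the prompt bias is driven to zero while every other parameter stays untouched — the same division of labor the paper describes.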
Introduction: In the realm of human-computer interaction and behavioral research, accurate real-time gaze estimation is critical. Traditional methods often rely on expensive equipment or large datasets, which are impractical in many scenarios. This paper introduces a novel, geometry-based approach to address these challenges, utilizing consumer-grade hardware for broader applicability. Methods: We leverage novel face landmark detection neural networks capable of fast inference on consumer-grade chips to generate accurate and stable 3D landmarks of the face and iris. From these, we derive a small set of geometry-based descriptors, forming an 8-dimensional manifold representing the eye and head movements. These descriptors are then used to formulate linear equations for predicting eye-gaze direction. Results: Our approach demonstrates the ability to predict gaze with an angular error of less than 1.9 degrees, rivaling state-of-the-art systems while operating in real-time and requiring negligible computational resources. Conclusion: The developed method marks a significant step forward in gaze estimation technology, offering a highly accurate, efficient, and accessible alternative to traditional systems. It opens up new possibilities for real-time applications in diverse fields, from gaming to psychological research.
https://arxiv.org/abs/2401.00406
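The linear-equation step can be illustrated with an ordinary least-squares fit from a geometric descriptor to a gaze angle. The paper's descriptors form an 8-dimensional manifold; this 1-D closed-form version with made-up calibration numbers is only a sketch of the idea.

```python
def fit_linear(descriptors, angles):
    """Closed-form least-squares fit of angle = a * descriptor + b."""
    n = len(descriptors)
    mx = sum(descriptors) / n
    my = sum(angles) / n
    cov = sum((d - mx) * (y - my) for d, y in zip(descriptors, angles))
    var = sum((d - mx) ** 2 for d in descriptors)
    a = cov / var
    return a, my - a * mx

# Hypothetical calibration pairs: iris-offset descriptor vs. measured yaw (deg).
d = [-0.2, -0.1, 0.0, 0.1, 0.2]
yaw = [-16.0, -8.0, 0.0, 8.0, 16.0]
a, b = fit_linear(d, yaw)
pred = a * 0.05 + b  # predict yaw for a new descriptor value
```

With 8 descriptors the same idea becomes a multivariate least-squares problem, but prediction remains a single dot product per frame, which is why the approach runs in real time on consumer hardware.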
Over the past decade, visual gaze estimation has garnered growing attention within the research community, thanks to its wide-ranging application scenarios. While existing estimation approaches have achieved remarkable success in enhancing prediction accuracy, they primarily infer gaze directions from single-image signals and discard the huge potential of text guidance. Notably, visual-language collaboration has been extensively explored across a range of visual tasks, such as image synthesis and manipulation, leveraging the remarkable transferability of the large-scale Contrastive Language-Image Pre-training (CLIP) model. Nevertheless, existing gaze estimation approaches ignore the rich semantic cues conveyed by linguistic signals and the priors in CLIP feature space, thereby suffering performance setbacks. To close this gap, we delve deeply into the text-eye collaboration protocol and introduce a novel gaze estimation framework, referred to as GazeCLIP. Specifically, we intricately design a linguistic description generator to produce text signals with coarse directional cues. Additionally, a CLIP-based backbone that excels in characterizing text-eye pairs for gaze estimation is presented. This is followed by a fine-grained multi-modal fusion module aimed at modeling the interrelationships between heterogeneous inputs. Extensive experiments on three challenging datasets demonstrate the superiority of the proposed GazeCLIP, which surpasses previous approaches and achieves state-of-the-art estimation accuracy.
https://arxiv.org/abs/2401.00260
Gaze estimation has become a subject of growing interest in recent research. Most of the current methods rely on single-view facial images as input. Yet, it is hard for these approaches to handle large head angles, leading to potential inaccuracies in the estimation. To address this issue, adding a second-view camera can help better capture eye appearance. However, existing multi-view methods have two limitations. 1) They require multi-view annotations for training, which are expensive. 2) More importantly, during testing, the exact positions of the multiple cameras must be known and match those used in training, which limits the application scenario. To address these challenges, we propose a novel 1-view-to-2-views (1-to-2 views) adaptation solution in this paper, the Unsupervised 1-to-2 Views Adaptation framework for Gaze estimation (UVAGaze). Our method adapts a traditional single-view gaze estimator for flexibly placed dual cameras. Here, the "flexibly" means we place the dual cameras in arbitrary places regardless of the training data, without knowing their extrinsic parameters. Specifically, the UVAGaze builds a dual-view mutual supervision adaptation strategy, which takes advantage of the intrinsic consistency of gaze directions between both views. In this way, our method can not only benefit from common single-view pre-training, but also achieve more advanced dual-view gaze estimation. The experimental results show that a single-view estimator, when adapted for dual views, can achieve much higher accuracy, especially in cross-dataset settings, with a substantial improvement of 47.0%. Project page: this https URL.
https://arxiv.org/abs/2312.15644
Human eye gaze estimation is an important cognitive ingredient for successful human-robot interaction, enabling the robot to read and predict human behavior. We approach this problem using artificial neural networks and build a modular system estimating gaze from separately cropped eyes, taking advantage of existing well-functioning components for face detection (RetinaFace) and head pose estimation (6DRepNet). Our proposed method does not require any special hardware or infrared filters but uses a standard notebook built-in RGB camera, as is common in appearance-based methods. Using the MetaHuman tool, we also generated a large synthetic dataset of more than 57,000 human faces and made it publicly available. Including this dataset (with eye gaze and head pose information) on top of the standard Columbia Gaze dataset in training the model led to better accuracy, with a mean average error below two degrees in the eye pitch and yaw directions, which compares favourably to related methods. We also verified the feasibility of our model by preliminary testing in a real-world setting, using the built-in 4K camera in the NICO semi-humanoid robot's eye.
https://arxiv.org/abs/2311.14175
Gaze estimation is a valuable technology with numerous applications in fields such as human-computer interaction, virtual reality, and medicine. This report presents the implementation of a gaze estimation system using the Sony Spresense microcontroller board and explores its performance in latency, MAC/cycle, and power consumption. The report also provides insights into the system's architecture, including the gaze estimation model used. Additionally, a demonstration of the system is presented, showcasing its functionality and performance. Our lightweight model, TinyTrackerS, is a mere 169 KB in size, uses 85.8k parameters, and runs on the Spresense platform at 3 FPS.
https://arxiv.org/abs/2308.12313
The advent of foundation models signals a new era in artificial intelligence. The Segment Anything Model (SAM) is the first foundation model for image segmentation. In this study, we evaluate SAM's ability to segment features from eye images recorded in virtual reality setups. The increasing requirement for annotated eye-image datasets presents a significant opportunity for SAM to redefine the landscape of data annotation in gaze estimation. Our investigation centers on SAM's zero-shot learning abilities and the effectiveness of prompts like bounding boxes or point clicks. Our results are consistent with studies in other domains, demonstrating that SAM's segmentation effectiveness can be on par with specialized models depending on the feature, with prompts improving its performance, evidenced by an IoU of 93.34% for pupil segmentation in one dataset. Foundation models like SAM could revolutionize gaze estimation by enabling quick and easy image segmentation, reducing reliance on specialized models and extensive manual annotation.
https://arxiv.org/abs/2311.08077
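The IoU figure quoted above is the usual segmentation metric: intersection over union of the predicted and ground-truth masks. A minimal version over flattened binary masks:

```python
def iou(mask_a, mask_b):
    """Intersection-over-union of two binary masks given as flattened 0/1
    lists; the metric behind per-feature segmentation scores such as the
    93.34% pupil result."""
    inter = sum(a & b for a, b in zip(mask_a, mask_b))
    union = sum(a | b for a, b in zip(mask_a, mask_b))
    return inter / union if union else 1.0  # two empty masks agree perfectly

pred  = [0, 1, 1, 1, 0, 0]   # predicted pupil pixels (toy 1-D example)
truth = [0, 0, 1, 1, 1, 0]   # annotated pupil pixels
score = iou(pred, truth)
```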
DeepFake detection is pivotal in personal privacy and public safety. With the iterative advancement of DeepFake techniques, high-quality forged videos and images are becoming increasingly deceptive. Prior research has seen numerous attempts by scholars to incorporate biometric features into the field of DeepFake detection. However, traditional biometric-based approaches tend to segregate biometric features from general ones and freeze the biometric feature extractor. These approaches resulted in the exclusion of valuable general features, potentially leading to a performance decline and, consequently, a failure to fully exploit the potential of biometric information in assisting DeepFake detection. Moreover, insufficient attention has been dedicated to scrutinizing gaze authenticity within the realm of DeepFake detection in recent years. In this paper, we introduce GazeForensics, an innovative DeepFake detection method that utilizes gaze representation obtained from a 3D gaze estimation model to regularize the corresponding representation within our DeepFake detection model, while concurrently integrating general features to further enhance the performance of our model. Experiment results reveal that our proposed GazeForensics outperforms the current state-of-the-art methods.
https://arxiv.org/abs/2311.07075