Co-speech gesturing is an important modality in conversation, providing context and social cues. In character animation, appropriate and synchronised gestures add realism and can make interactive agents more engaging. Historically, methods for automatically generating gestures were predominantly audio-driven, exploiting the prosodic and speech-related content that is encoded in the audio signal. In this paper we instead experiment with LLM features extracted from text using LLAMA2 for gesture generation. We compare against audio features and explore combining the two modalities in both objective tests and a user study. Surprisingly, our results show that LLAMA2 features on their own perform significantly better than audio features, and that including both modalities yields no significant difference relative to using LLAMA2 features in isolation. We demonstrate that the LLAMA2-based model can generate both beat and semantic gestures without any audio input, suggesting LLMs can provide rich encodings that are well suited for gesture generation.
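To make "LLM features extracted from text" concrete, below is a minimal sketch of pulling per-token hidden states from a LLaMA-2 checkpoint with Hugging Face transformers; the checkpoint name, layer choice, and downstream use are illustrative assumptions, not the paper's exact feature-extraction setup. A gesture generator would then consume these token-level vectors (aligned to the motion frame rate) as its conditioning input.

    # Sketch: extracting per-token text features from a LLaMA-2 checkpoint to
    # condition a gesture generator. Checkpoint name, layer index, and the use of
    # the features are assumptions for illustration.
    import torch
    from transformers import AutoTokenizer, AutoModel

    MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint (access-gated on the Hub)

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModel.from_pretrained(MODEL_NAME, torch_dtype=torch.float16)
    model.eval()

    def llm_features(transcript: str, layer: int = -1) -> torch.Tensor:
        """Return one hidden-state vector per token of the speech transcript."""
        inputs = tokenizer(transcript, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs, output_hidden_states=True)
        # hidden_states is a tuple of (batch, tokens, dim) tensors, one per layer.
        return outputs.hidden_states[layer].squeeze(0)

    feats = llm_features("so then I turned around and waved at them")
    print(feats.shape)  # (num_tokens, hidden_dim); hidden_dim is 4096 for the 7B model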
https://arxiv.org/abs/2405.08042
With the rapid advancement of technologies such as virtual reality, augmented reality, and gesture control, users expect interactions with computer interfaces to be more natural and intuitive. Existing visual algorithms often struggle to accomplish advanced human-computer interaction tasks, necessitating accurate and reliable absolute spatial prediction methods. Moreover, dealing with complex scenes and occlusions in monocular images poses entirely new challenges. This study proposes a network model that performs parallel processing of root-relative grids and root recovery tasks. The model enables the recovery of 3D hand meshes in camera space from monocular RGB images. To facilitate end-to-end training, we utilize an implicit learning approach for 2D heatmaps, enhancing the compatibility of 2D cues across different subtasks. We incorporate the Inception concept into a spectral graph convolutional network to explore the root-relative mesh, and integrate it with a locally detailed, globally attentive method designed for root recovery. This approach improves the model's predictive performance in complex environments and self-occluded scenes. Through evaluation on the large-scale hand dataset FreiHAND, we demonstrate that our proposed model is comparable with state-of-the-art models. This study contributes to the advancement of techniques for accurate and reliable absolute spatial prediction in various human-computer interaction applications.
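The "implicit learning approach for 2D heatmaps" is commonly realised with a differentiable soft-argmax: the heatmaps are never supervised directly, but are converted to coordinates so that gradients from the downstream mesh and root losses shape them end to end. A minimal sketch of that operator follows, as an assumption about the general technique rather than the paper's exact formulation.

    # Sketch: differentiable soft-argmax over predicted 2D heatmaps, a standard way
    # to learn heatmaps implicitly from downstream (mesh / keypoint) losses only.
    import torch

    def soft_argmax_2d(heatmaps: torch.Tensor) -> torch.Tensor:
        """heatmaps: (batch, joints, H, W) raw scores -> (batch, joints, 2) xy coordinates."""
        b, j, h, w = heatmaps.shape
        probs = torch.softmax(heatmaps.reshape(b, j, -1), dim=-1).reshape(b, j, h, w)
        xs = torch.linspace(0, 1, w, device=heatmaps.device)
        ys = torch.linspace(0, 1, h, device=heatmaps.device)
        x = (probs.sum(dim=2) * xs).sum(dim=-1)  # marginalise rows, expectation over columns
        y = (probs.sum(dim=3) * ys).sum(dim=-1)  # marginalise columns, expectation over rows
        return torch.stack([x, y], dim=-1)       # normalised image coordinates in [0, 1]

    coords = soft_argmax_2d(torch.randn(2, 21, 64, 64))
    print(coords.shape)  # torch.Size([2, 21, 2])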
https://arxiv.org/abs/2405.07167
Sign language serves as a non-vocal means of communication, transmitting information and significance through gestures, facial expressions, and bodily movements. The majority of current approaches for sign language recognition (SLR) and translation rely on RGB video inputs, which are vulnerable to fluctuations in the background. Employing a keypoint-based strategy not only mitigates the effects of background alterations but also substantially diminishes the computational demands of the model. Nevertheless, contemporary keypoint-based methodologies fail to fully harness the implicit knowledge embedded in keypoint sequences. To tackle this challenge, our inspiration is derived from the human cognition mechanism, which discerns sign language by analyzing the interplay between gesture configurations and supplementary elements. We propose a multi-stream keypoint attention network to model a sequence of keypoints produced by a readily available keypoint estimator. In order to facilitate interaction across multiple streams, we investigate diverse methodologies such as keypoint fusion strategies, head fusion, and self-distillation. The resulting framework is denoted MSKA-SLR, and is expanded into a sign language translation (SLT) model through the straightforward addition of an extra translation network. We carry out comprehensive experiments on well-known benchmarks such as Phoenix-2014, Phoenix-2014T, and CSL-Daily to showcase the efficacy of our methodology. Notably, we attain a new state-of-the-art performance on the Phoenix-2014T sign language translation task. The code and models can be accessed at: this https URL.
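One common way to realise self-distillation across streams is to treat the fused (ensemble) head as the teacher and pull each stream head toward it with a temperature-scaled KL term. The sketch below follows that generic recipe; the temperature, weighting, and head structure are assumptions, not the paper's exact loss.

    # Sketch: self-distillation across keypoint streams. Each stream head is pulled
    # toward the fused (ensemble) prediction; temperature and weighting are assumptions.
    import torch
    import torch.nn.functional as F

    def self_distillation_loss(stream_logits, fused_logits, labels, temperature=4.0):
        loss = F.cross_entropy(fused_logits, labels)
        teacher = F.softmax(fused_logits.detach() / temperature, dim=-1)
        for logits in stream_logits:
            loss = loss + F.cross_entropy(logits, labels)
            loss = loss + temperature ** 2 * F.kl_div(
                F.log_softmax(logits / temperature, dim=-1), teacher, reduction="batchmean")
        return loss

    streams = [torch.randn(8, 60) for _ in range(3)]  # three keypoint-stream heads, 60 classes
    fused = torch.randn(8, 60)
    labels = torch.randint(0, 60, (8,))
    print(self_distillation_loss(streams, fused, labels).item())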
https://arxiv.org/abs/2405.05672
Sensors and Artificial Intelligence (AI) have revolutionized the analysis of human movement, but the scarcity of specific samples presents a significant challenge in training intelligent systems, particularly in the context of diagnosing neurodegenerative diseases. This study investigates the feasibility of utilizing robot-collected data to train classification systems traditionally trained with human-collected data. As a proof of concept, we recorded a database of numeric characters using an ABB robotic arm and an Apple Watch. We compare the classification performance of systems trained on human-recorded and robot-recorded data. Our primary objective is to determine whether numeric characters written by humans wearing a smartwatch can be accurately identified using robotic movement as training data. The findings of this study offer valuable insights into the feasibility of using robot-collected data for training classification systems. This research holds broad implications across various domains that require reliable identification, particularly in scenarios where access to human-specific data is limited.
https://arxiv.org/abs/2405.04241
Psychological studies have shown that Micro Gestures (MG) are closely linked to human emotions. MG-based emotion understanding has attracted much attention because it allows for emotion understanding through nonverbal body gestures without relying on identity information (e.g., facial and electrocardiogram data). Therefore, it is essential to recognize MG effectively for advanced emotion understanding. However, existing Micro Gesture Recognition (MGR) methods utilize only a single modality (e.g., RGB or skeleton) while overlooking crucial textual information. In this letter, we propose a simple but effective visual-text contrastive learning solution that utilizes text information for MGR. In addition, instead of using handcrafted prompts for visual-text contrastive learning, we propose a novel module called Adaptive prompting to generate context-aware prompts. The experimental results show that the proposed method achieves state-of-the-art performance on two public datasets. Furthermore, based on an empirical study utilizing the results of MGR for emotion understanding, we demonstrate that using the textual results of MGR significantly improves performance by 6%+ compared to directly using video as input.
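For context, the visual-text contrastive objective in such work is typically a symmetric, CLIP-style InfoNCE loss between paired clip embeddings and prompt embeddings. The sketch below shows that generic form; it is not claimed to be the paper's exact loss or its adaptive prompting module.

    # Sketch: symmetric visual-text contrastive (InfoNCE) loss over a batch of paired
    # micro-gesture clip embeddings and prompt embeddings. Generic CLIP-style form.
    import torch
    import torch.nn.functional as F

    def visual_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
        v = F.normalize(video_emb, dim=-1)
        t = F.normalize(text_emb, dim=-1)
        logits = v @ t.T / temperature                   # (batch, batch) similarity matrix
        targets = torch.arange(len(v), device=v.device)  # matched pairs lie on the diagonal
        return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

    loss = visual_text_contrastive_loss(torch.randn(16, 256), torch.randn(16, 256))
    print(loss.item())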
https://arxiv.org/abs/2405.01885
Although humans engaged in face-to-face conversation simultaneously communicate both verbally and non-verbally, methods for joint and unified synthesis of speech audio and co-speech 3D gesture motion from text are a new and emerging field. These technologies hold great promise for more human-like, efficient, expressive, and robust synthetic communication, but are currently held back by the lack of suitably large datasets, as existing methods are trained on parallel data from all constituent modalities. Inspired by student-teacher methods, we propose a straightforward solution to the data shortage, by simply synthesising additional training material. Specifically, we use unimodal synthesis models trained on large datasets to create multimodal (but synthetic) parallel training data, and then pre-train a joint synthesis model on that material. In addition, we propose a new synthesis architecture that adds better and more controllable prosody modelling to the state-of-the-art method in the field. Our results confirm that pre-training on large amounts of synthetic data improves the quality of both the speech and the motion synthesised by the multimodal model, with the proposed architecture yielding further benefits when pre-trained on the synthetic data. See this https URL for example output.
https://arxiv.org/abs/2404.19622
Cued Speech (CS) is an advanced visual phonetic encoding system that integrates lip reading with hand codings, enabling people with hearing impairments to communicate efficiently. CS video generation aims to produce the specific lip and gesture movements of CS from audio or text inputs. The main challenge is that, given limited CS data, we strive to simultaneously generate fine-grained hand and finger movements as well as lip movements, while the two kinds of movements need to be asynchronously aligned. Existing CS generation methods are fragile and prone to poor performance due to template-based statistical models and careful hand-crafted pre-processing to fit the models. Therefore, we propose a novel Gloss-prompted Diffusion-based CS Gesture generation framework (called GlossDiff). Specifically, to integrate additional linguistic-rule knowledge into the model, we first introduce a bridging instruction called Gloss, an automatically generated descriptive text that establishes a direct and more delicate semantic connection between spoken language and CS gestures. Moreover, we are the first to suggest that rhythm is an important paralinguistic feature for CS that improves communication efficacy. Therefore, we propose a novel Audio-driven Rhythmic Module (ARM) to learn rhythm that matches the audio speech. Moreover, in this work, we design, record, and publish the first Chinese CS dataset with four CS cuers. Extensive experiments demonstrate that our method quantitatively and qualitatively outperforms current state-of-the-art (SOTA) methods. We release the code and data at this https URL.
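At its core, a gloss-prompted diffusion generator can be trained with the standard noise-prediction objective, with a gloss embedding supplied as the conditioning input. The sketch below shows that generic DDPM-style training step under assumed tensor shapes and schedule; it omits the rhythm module and the paper's other specifics.

    # Sketch: one denoising-diffusion training step for gesture sequences conditioned on
    # a gloss embedding (generic DDPM noise-prediction objective). The noise schedule,
    # shapes, and the stand-in denoiser are assumptions; the rhythm module is omitted.
    import torch
    import torch.nn.functional as F

    def diffusion_training_step(denoiser, gestures, gloss_emb, alphas_cumprod):
        """gestures: (B, T, D) pose sequences; gloss_emb: (B, E) text-condition vectors."""
        B = gestures.shape[0]
        t = torch.randint(0, len(alphas_cumprod), (B,))
        a_bar = alphas_cumprod[t].view(B, 1, 1)
        noise = torch.randn_like(gestures)
        noisy = a_bar.sqrt() * gestures + (1 - a_bar).sqrt() * noise  # forward (noising) process
        pred = denoiser(noisy, t, gloss_emb)                          # model predicts the noise
        return F.mse_loss(pred, noise)

    dummy_denoiser = lambda noisy, t, cond: torch.zeros_like(noisy)   # stand-in for a smoke test
    alphas = torch.linspace(0.9999, 0.98, 1000).cumprod(dim=0)
    loss = diffusion_training_step(dummy_denoiser, torch.randn(4, 60, 165), torch.randn(4, 512), alphas)
    print(loss.item())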
https://arxiv.org/abs/2404.19277
The "meaning" of an iconic gesture is conditioned on its informational evaluation. Only informational evaluation lifts a gesture to a quasi-linguistic level that can interact with verbal content. Interaction is either vacuous or regimented by usual lexicon-driven inferences. Informational evaluation is spelled out as extended exemplification (extemplification) in terms of perceptual classification of a gesture's visual iconic model. The iconic model is derived from Frege/Montague-like truth-functional evaluation of a gesture's form within spatially extended domains. We further argue that the perceptual classification of instances of visual communication requires a notion of meaning different from Frege/Montague frameworks. Therefore, a heuristic for gesture interpretation is provided that can guide the working semanticist. In sum, an iconic gesture semantics is introduced which covers the full range from kinematic gesture representations over model-theoretic evaluation to inferential interpretation in dynamic semantic frameworks.
https://arxiv.org/abs/2404.18708
Event-based sensors are well suited for real-time processing due to their fast response times and encoding of the sensory data as successive temporal differences. These and other valuable properties, such as a high dynamic range, are suppressed when the data is converted to a frame-based format. However, most current methods either collapse events into frames or cannot scale up when processing the event data directly event-by-event. In this work, we address the key challenges of scaling up event-by-event modeling of the long event streams emitted by such sensors, which is a particularly relevant problem for neuromorphic computing. While prior methods can process up to a few thousand time steps, our model, based on modern recurrent deep state-space models, scales to event streams of millions of events for both training and inference. We leverage their stable parameterization for learning long-range dependencies, parallelizability along the sequence dimension, and their ability to integrate asynchronous events effectively to scale them up to long event streams. We further augment these with novel event-centric techniques enabling our model to match or beat the state-of-the-art performance on several event stream benchmarks. In the Spiking Speech Commands task, we improve state-of-the-art by a large margin of 6.6% to 87.1%. On the DVS128-Gestures dataset, we achieve competitive results without using frames or convolutional neural networks. Our work demonstrates, for the first time, that it is possible to use fully event-based processing with purely recurrent networks to achieve state-of-the-art task performance in several event-based benchmarks.
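One way a recurrent deep state-space model can integrate asynchronous events is to raise a per-channel continuous-time decay to the inter-event interval before injecting each event. The toy sketch below illustrates that idea event-by-event with assumed parameters; the actual architecture, parameterization, and parallel scan are not reproduced here.

    # Sketch: a diagonal linear state-space recurrence applied event-by-event. The
    # continuous-time decay is raised to the inter-event gap, one way such models can
    # integrate asynchronous events; all parameters here are toy values.
    import numpy as np

    def run_ssm(event_feats, event_times, log_decay, B):
        """event_feats: (T, d_in); event_times: (T,) seconds; returns final state (d_state,)."""
        state = np.zeros(B.shape[0])
        prev_t = event_times[0]
        for x, t in zip(event_feats, event_times):
            dt = max(t - prev_t, 0.0)
            state = np.exp(log_decay * dt) * state + B @ x  # decay over the gap, then inject input
            prev_t = t
        return state

    rng = np.random.default_rng(0)
    feats = rng.normal(size=(1000, 4))                 # 1000 events, 4 input channels
    times = np.sort(rng.uniform(0.0, 1.0, size=1000))  # asynchronous timestamps
    state = run_ssm(feats, times, log_decay=-rng.uniform(1, 10, 16), B=rng.normal(size=(16, 4)))
    print(state.shape)  # (16,)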
https://arxiv.org/abs/2404.18508
In this study, we propose the first hardware implementation of a context-based recurrent spiking neural network (RSNN), emphasizing the integration of dual information streams within neocortical pyramidal neurons, specifically the Context-Dependent Leaky Integrate-and-Fire (CLIF) neuron model, an essential element of the RSNN. We present a quantized version of the CLIF neuron (qCLIF), developed through a hardware-software codesign approach utilizing the sparse activity of the RSNN. Implemented in a 45 nm technology node, the qCLIF is compact (900 um^2) and achieves a high accuracy of 90% on the DVS gesture classification dataset despite 8-bit quantization. Our analysis spans network configurations from 10 to 200 qCLIF neurons, supporting up to 82k synapses within a 1.86 mm^2 footprint, demonstrating scalability and efficiency.
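To make the arithmetic concrete, here is a software sketch of an 8-bit fixed-point leaky integrate-and-fire update with a separate context input. It only illustrates the style of computation a qCLIF-like unit performs; the constants, leak, reset scheme, and the full CLIF dynamics are assumptions rather than the paper's hardware design.

    # Sketch: an 8-bit fixed-point leaky integrate-and-fire update with basal and
    # context inputs, illustrating the kind of arithmetic a quantized neuron performs.
    # Constants, leak, and reset scheme are illustrative, not the paper's design.
    import numpy as np

    V_MAX, V_MIN = 127, -128   # signed 8-bit membrane potential
    THRESHOLD = 64
    LEAK_SHIFT = 4             # leak by (v >> 4) each step; shifts are cheap in hardware

    def qlif_step(v, basal_in, context_in):
        """One timestep: returns (new_potential, spike)."""
        v = v - (v >> LEAK_SHIFT)              # leak
        v = v + basal_in + context_in          # integrate both input streams
        v = int(np.clip(v, V_MIN, V_MAX))      # saturate to 8 bits
        if v >= THRESHOLD:
            return 0, 1                        # fire and reset
        return v, 0

    v, spikes = 0, []
    for _ in range(20):
        v, s = qlif_step(v, basal_in=20, context_in=5)
        spikes.append(s)
    print(spikes)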
https://arxiv.org/abs/2404.18066
Clustering of motion trajectories is highly relevant for human-robot interactions as it allows the anticipation of human motions, fast reaction to those, as well as the recognition of explicit gestures. Further, it allows automated analysis of recorded motion data. Many clustering algorithms for trajectories build upon distance metrics that are based on pointwise Euclidean distances. However, our work indicates that focusing on salient characteristics is often sufficient. We present a novel distance measure for motion plans consisting of state and control trajectories that is based on a compressed representation built from their main features. This approach allows a flexible choice of feature classes relevant to the respective task. The distance measure is used in agglomerative hierarchical clustering. We compare our method with the widely used dynamic time warping algorithm on test sets of motion plans for the Furuta pendulum and the Manutec robot arm and on real-world data from a human motion dataset. The proposed method demonstrates slight advantages in clustering and strong advantages in runtime, especially for long trajectories.
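The overall recipe, compressing each trajectory into a fixed-length vector of salient features, computing pairwise distances on those vectors, and running agglomerative hierarchical clustering, can be sketched with SciPy as below. The particular features are illustrative stand-ins, not the paper's feature classes.

    # Sketch: compress each trajectory to a small feature vector, then run
    # agglomerative hierarchical clustering. The feature choices are illustrative.
    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import pdist

    def trajectory_features(traj):
        """traj: (T, d) states -> fixed-length descriptor (start, end, range, mean speed)."""
        speed = np.linalg.norm(np.diff(traj, axis=0), axis=1).mean()
        return np.concatenate([traj[0], traj[-1], traj.max(0) - traj.min(0), [speed]])

    rng = np.random.default_rng(0)
    trajs = [rng.normal(size=(rng.integers(50, 200), 2)).cumsum(axis=0) for _ in range(30)]
    X = np.stack([trajectory_features(t) for t in trajs])

    Z = linkage(pdist(X), method="average")          # agglomerative hierarchical clustering
    labels = fcluster(Z, t=4, criterion="maxclust")  # cut the dendrogram into 4 clusters
    print(labels)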
https://arxiv.org/abs/2404.17269
Gestures are inherent to human interaction and often complement speech in face-to-face communication, forming a multimodal communication system. An important task in gesture analysis is detecting a gesture's beginning and end. Research on automatic gesture detection has primarily focused on visual and kinematic information to detect a limited set of isolated or silent gestures with low variability, neglecting the integration of speech and vision signals to detect gestures that co-occur with speech. This work addresses this gap by focusing on co-speech gesture detection, emphasising the synchrony between speech and co-speech hand gestures. We address three main challenges: the variability of gesture forms, the temporal misalignment between gesture and speech onsets, and differences in sampling rate between modalities. We investigate extended speech time windows and employ separate backbone models for each modality to address the temporal misalignment and sampling rate differences. We utilize Transformer encoders in cross-modal and early fusion techniques to effectively align and integrate speech and skeletal sequences. The study results show that combining visual and speech information significantly enhances gesture detection performance. Our findings indicate that expanding the speech buffer beyond visual time segments improves performance and that multimodal integration using cross-modal and early fusion techniques outperforms baseline methods using unimodal and late fusion methods. Additionally, we find a correlation between the models' gesture prediction confidence and low-level speech frequency features potentially associated with gestures. Overall, the study provides a better understanding and detection methods for co-speech gestures, facilitating the analysis of multimodal communication.
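A minimal early-fusion variant looks like the sketch below: per-modality linear projections (standing in for the separate backbone models) bring speech and skeleton sequences to a shared width, the token sequences are concatenated, and a Transformer encoder attends across both modalities. The dimensions, depth, and pooling are illustrative assumptions, not the paper's configuration.

    # Sketch: early fusion of speech and skeleton sequences with a Transformer encoder.
    # Linear projections stand in for the per-modality backbone models.
    import torch
    import torch.nn as nn

    class EarlyFusionDetector(nn.Module):
        def __init__(self, speech_dim=40, skel_dim=54, d_model=128, n_classes=2):
            super().__init__()
            self.speech_proj = nn.Linear(speech_dim, d_model)
            self.skel_proj = nn.Linear(skel_dim, d_model)
            layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)
            self.head = nn.Linear(d_model, n_classes)

        def forward(self, speech, skeleton):
            # speech: (B, T_audio, speech_dim); skeleton: (B, T_video, skel_dim)
            tokens = torch.cat([self.speech_proj(speech), self.skel_proj(skeleton)], dim=1)
            fused = self.encoder(tokens)         # self-attention across both modalities
            return self.head(fused.mean(dim=1))  # gesture / no-gesture logits

    model = EarlyFusionDetector()
    logits = model(torch.randn(8, 100, 40), torch.randn(8, 25, 54))
    print(logits.shape)  # torch.Size([8, 2])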
https://arxiv.org/abs/2404.14952
Millimeter wave radar is gaining traction recently as a promising modality for enabling pervasive and privacy-preserving gesture recognition. However, the lack of rich and fine-grained radar datasets hinders progress in developing generalized deep learning models for gesture recognition across various user postures (e.g., standing, sitting), positions, and scenes. To remedy this, we design a software pipeline that exploits abundant 2D videos to generate realistic radar data, but it needs to address the challenge of simulating diversified and fine-grained reflection properties of user gestures. To this end, we design G3R with three key components: (i) a gesture reflection point generator expands the arm's skeleton points to form human reflection points; (ii) a signal simulation model simulates the multipath reflection and attenuation of radar signals to output the human intensity map; (iii) an encoder-decoder model combines a sampling module and a fitting module to address the differences in number and distribution of points between generated and real-world radar data for generating realistic radar data. We implement and evaluate G3R using 2D videos from public data sources and self-collected real-world radar data, demonstrating its superiority over other state-of-the-art approaches for gesture recognition.
https://arxiv.org/abs/2404.14934
Current electromyography (EMG) pattern recognition (PR) models have been shown to generalize poorly in unconstrained environments, setting back their adoption in applications such as hand gesture control. This problem is often due to limited training data, exacerbated by the use of supervised classification frameworks that are known to be suboptimal in such settings. In this work, we propose a shift to deep metric-based meta-learning in EMG PR to supervise the creation of meaningful and interpretable representations. We use a Siamese Deep Convolutional Neural Network (SDCNN) and contrastive triplet loss to learn an EMG feature embedding space that captures the distribution of the different classes. A nearest-centroid approach is subsequently employed for inference, relying on how closely a test sample aligns with the established data distributions. We derive a robust class proximity-based confidence estimator that leads to a better rejection of incorrect decisions, i.e. false positives, especially when operating beyond the training data domain. We show our approach's efficacy by testing the trained SDCNN's predictions and confidence estimations on unseen data, both in and out of the training domain. The evaluation metrics include the accuracy-rejection curve and the Kullback-Leibler divergence between the confidence distributions of accurate and inaccurate predictions. Outperforming comparable models on both metrics, our results demonstrate that the proposed meta-learning approach improves the classifier's precision in active decisions (after rejection), thus leading to better generalization and applicability.
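The inference side, nearest-centroid classification in the learned embedding space plus a proximity-based confidence used to reject uncertain decisions, can be sketched as follows. The softmax over negative distances is an illustrative stand-in for the paper's confidence estimator.

    # Sketch: nearest-centroid inference in a learned EMG embedding space, with a
    # proximity-based confidence used to reject uncertain predictions. The softmax over
    # negative distances is an illustrative confidence, not the paper's exact estimator.
    import numpy as np

    def class_centroids(embeddings, labels):
        return np.stack([embeddings[labels == c].mean(axis=0) for c in np.unique(labels)])

    def predict_with_rejection(emb, centroids, threshold=0.6):
        dists = np.linalg.norm(centroids - emb, axis=1)
        conf = np.exp(-dists) / np.exp(-dists).sum()  # proximity-based confidence
        best = int(dists.argmin())
        if conf[best] < threshold:
            return -1, float(conf[best])              # reject: no active decision
        return best, float(conf[best])

    rng = np.random.default_rng(0)
    train_emb = rng.normal(size=(600, 32)) + np.repeat(np.eye(6, 32) * 5, 100, axis=0)
    train_lbl = np.repeat(np.arange(6), 100)
    query = rng.normal(size=32) + np.eye(6, 32)[2] * 5
    print(predict_with_rejection(query, class_centroids(train_emb, train_lbl)))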
https://arxiv.org/abs/2404.15360
Gesture recognition based on surface electromyography (sEMG) has been gaining importance in many 3D interactive scenes. However, sEMG is easily influenced by various forms of noise in real-world environments, leading to challenges in providing long-term stable interactions through sEMG. Existing methods often struggle to enhance model noise resilience through various predefined data augmentation techniques. In this work, we revisit the problem from a short-term enhancement perspective to improve precision and robustness against various common noisy scenarios, using learnable denoising that exploits intrinsic sEMG pattern information and sliding-window attention. We propose a Short Term Enhancement Module (STEM) which can be easily integrated with various models. STEM offers several benefits: 1) learnable denoising, enabling noise reduction without manual data augmentation; 2) scalability, being adaptable to various models; and 3) cost-effectiveness, achieving short-term enhancement through minimal weight sharing in an efficient attention mechanism. In particular, we incorporate STEM into a transformer, creating the Short Term Enhanced Transformer (STET). Compared with the best-competing approaches, the impact of noise on STET is reduced by more than 20%. We also report promising results on both classification and regression datasets and demonstrate that STEM generalizes across different gesture recognition tasks.
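To illustrate the sliding-window idea, below is a toy self-attention layer in which each sEMG frame attends only to a short local window of preceding frames. The window size, masking scheme, and single-head formulation are assumptions for illustration, not STEM's actual design.

    # Sketch: sliding-window self-attention over an sEMG sequence. Each frame attends
    # only to a short local window of past frames; sizes and masking are illustrative.
    import torch
    import torch.nn.functional as F

    def sliding_window_attention(x, qkv, window=8):
        """x: (B, T, D); qkv: Linear(D, 3*D). Frames attend to the last `window` frames."""
        B, T, D = x.shape
        q, k, v = qkv(x).chunk(3, dim=-1)
        scores = q @ k.transpose(1, 2) / D ** 0.5  # (B, T, T) attention scores
        idx = torch.arange(T)
        mask = (idx[None, :] > idx[:, None]) | (idx[:, None] - idx[None, :] >= window)
        scores = scores.masked_fill(mask, float("-inf"))
        return F.softmax(scores, dim=-1) @ v       # locally denoised short-term features

    x = torch.randn(4, 200, 64)                    # 4 recordings, 200 frames, 64 channels
    qkv = torch.nn.Linear(64, 3 * 64)
    print(sliding_window_attention(x, qkv).shape)  # torch.Size([4, 200, 64])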
https://arxiv.org/abs/2404.11213
The Indian classical dance-drama Kathakali has a set of hand gestures called Mudras, which form the fundamental units of all its dance moves and postures. Recognizing the depicted mudra is therefore one of the first steps in its digital processing. This work treats the problem as a 24-class classification task and proposes a vector-similarity-based approach using pose estimation, eliminating the need for further training or fine-tuning. The approach overcomes the challenge of data scarcity that limits the application of AI in similar domains. The method attains 92% accuracy, which is comparable to or better than other model-training-based works in the domain, with the added advantage that it can still work with data sizes as small as 1 or 5 samples at slightly reduced performance. Working with images, videos, and even real-time streams is possible. The system can work with hand-cropped or full-body images alike. We have developed and made public a dataset for Kathakali mudra recognition as part of this work.
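The training-free core, normalising detected hand keypoints and matching them to stored per-class reference vectors by cosine similarity, can be sketched as below. The 21-point hand layout and the off-the-shelf pose estimator supplying the keypoints are assumptions, and the paper's exact normalisation and similarity measure may differ.

    # Sketch: training-free mudra classification by cosine similarity between a
    # flattened, normalised hand-keypoint vector and stored per-class references.
    # The pose estimator producing the 21 keypoints is assumed and not shown.
    import numpy as np

    def normalise_keypoints(kps):
        """kps: (21, 2) hand keypoints -> translation- and scale-invariant unit vector."""
        kps = kps - kps.mean(axis=0)
        return (kps / (np.linalg.norm(kps) + 1e-8)).ravel()

    def classify(query, references):
        q = normalise_keypoints(query)
        scores = {name: max(float(q @ normalise_keypoints(r)) for r in refs)
                  for name, refs in references.items()}
        return max(scores, key=scores.get)

    rng = np.random.default_rng(0)
    refs = {f"mudra_{i}": [rng.uniform(size=(21, 2))] for i in range(24)}  # one sample per class
    query = refs["mudra_3"][0] + rng.normal(scale=0.01, size=(21, 2))      # noisy repeat of class 3
    print(classify(query, refs))  # expected: mudra_3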
https://arxiv.org/abs/2404.11205
Human communication is multi-modal; e.g., face-to-face interaction involves auditory signals (speech) and visual signals (face movements and hand gestures). Hence, it is essential to exploit multiple modalities when designing machine learning-based facial expression recognition systems. In addition, given the ever-growing quantities of video data that capture human facial expressions, such systems should utilize raw unlabeled videos without requiring expensive annotations. Therefore, in this work, we employ a multitask multi-modal self-supervised learning method for facial expression recognition from in-the-wild video data. Our model combines three self-supervised objective functions: first, a multi-modal contrastive loss that pulls diverse data modalities of the same video together in the representation space; second, a multi-modal clustering loss that preserves the semantic structure of input data in the representation space; and finally, a multi-modal data reconstruction loss. We conduct a comprehensive study of this multi-modal multi-task self-supervised learning method on three facial expression recognition benchmarks. To that end, we examine the performance of learning through different combinations of self-supervised tasks on the facial expression recognition downstream task. Our model ConCluGen outperforms several multi-modal self-supervised and fully supervised baselines on the CMU-MOSEI dataset. Our results generally show that multi-modal self-supervision tasks offer large performance gains for challenging tasks such as facial expression recognition, while also reducing the amount of manual annotations required. We release our pre-trained models as well as our source code publicly.
https://arxiv.org/abs/2404.10904
Object recognition, commonly performed by a camera, is a fundamental requirement for robots to complete complex tasks. Some tasks require recognizing objects far from the robot's camera. A challenging example is Ultra-Range Gesture Recognition (URGR) in human-robot interaction, where the user exhibits directive gestures at a distance of up to 25 m from the robot. However, training a model to recognize hardly visible objects located at ultra-range requires the exhaustive collection of a significant amount of labeled samples. The generation of synthetic training datasets is a recent solution to the lack of real-world data, but it cannot properly replicate the realistic visual characteristics of distant objects in images. In this letter, we propose the Diffusion in Ultra-Range (DUR) framework, based on a diffusion model, to generate labeled images of distant objects in various scenes. The DUR generator receives a desired distance and class (e.g., gesture) and outputs a corresponding synthetic image. We apply DUR to train a URGR model on directive gestures in which fine details of the gesturing hand are challenging to distinguish. DUR is compared to other types of generative models, showing superiority both in fidelity and in recognition success rate when training a URGR model. More importantly, training a DUR model on a limited amount of real data and then using it to generate synthetic data for training a URGR model outperforms directly training the URGR model on real data. The synthetic-data-based URGR model is also demonstrated in gesture-based direction of a ground robot.
https://arxiv.org/abs/2404.09846
Addressing the challenge of a digital assistant capable of executing a wide array of user tasks, our research focuses on the realm of instruction-based mobile device control. We leverage recent advancements in large language models (LLMs) and present a visual language model (VLM) that can fulfill diverse tasks on mobile devices. Our model functions by interacting solely with the user interface (UI). It uses the visual input from the device screen and mimics human-like interactions, encompassing gestures such as tapping and swiping. This generality in the input and output space allows our agent to interact with any application on the device. Unlike previous methods, our model operates not only on a single screen image but on vision-language sentences created from sequences of past screenshots along with corresponding actions. Evaluating our method on the challenging Android in the Wild benchmark demonstrates its promising efficacy and potential.
https://arxiv.org/abs/2404.08755
Intention-based Human-Robot Interaction (HRI) systems allow robots to perceive and interpret user actions to proactively interact with humans and adapt to their behavior. Therefore, intention prediction is pivotal in creating a natural interactive collaboration between humans and robots. In this paper, we examine the use of Large Language Models (LLMs) for inferring human intention during a collaborative object categorization task with a physical robot. We introduce a hierarchical approach for interpreting user non-verbal cues, like hand gestures, body poses, and facial expressions and combining them with environment states and user verbal cues captured using an existing Automatic Speech Recognition (ASR) system. Our evaluation demonstrates the potential of LLMs to interpret non-verbal cues and to combine them with their context-understanding capabilities and real-world knowledge to support intention prediction during human-robot interaction.
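As an illustration of how such a hierarchical combination of cues might reach the LLM, the sketch below folds perceived non-verbal cues, environment state, and the ASR transcript into a single intention-prediction query. The prompt wording and cue vocabulary are illustrative assumptions, not the paper's prompts.

    # Sketch: folding perceived non-verbal cues, environment state, and the ASR
    # transcript into a single intention-prediction query for an LLM. The prompt
    # wording and the example cue values are illustrative assumptions.
    def build_intention_prompt(gesture, pose, expression, environment, utterance):
        return (
            "You assist a robot in a collaborative object-categorization task.\n"
            f"Observed hand gesture: {gesture}\n"
            f"Observed body pose: {pose}\n"
            f"Observed facial expression: {expression}\n"
            f"Objects in the workspace: {environment}\n"
            f'User said (ASR): "{utterance}"\n'
            "What does the user most likely intend the robot to do next? "
            "Answer with one short action phrase."
        )

    prompt = build_intention_prompt(
        gesture="pointing at the red mug",
        pose="leaning toward the robot",
        expression="neutral",
        environment="red mug, green bottle, sorting bin",
        utterance="put that one away",
    )
    print(prompt)  # send to any chat-completion LLM endpoint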
https://arxiv.org/abs/2404.08424