Recent years have witnessed the rapid development of short videos, which usually contain both visual and audio modalities. Background music is important to short videos and can significantly influence viewers' emotions. However, at present, the background music of a short video is generally chosen by the video producer, and automatic music recommendation methods for short videos are lacking. This paper introduces MVBind, an innovative Music-Video embedding space Binding model for cross-modal retrieval. MVBind operates as a self-supervised approach, acquiring inherent knowledge of inter-modal relationships directly from data without the need for manual annotations. Additionally, to compensate for the lack of a corresponding music-visual pair dataset for short videos, we construct a dataset, SVM-10K (Short Video with Music-10K), which mainly consists of meticulously selected short videos. On this dataset, MVBind shows significantly improved performance compared to other baseline methods. The constructed dataset and code will be released to facilitate future research.
https://arxiv.org/abs/2405.09286
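The abstract does not spell out MVBind's training objective, but binding two embedding spaces in a self-supervised way is commonly done with a symmetric InfoNCE (CLIP-style) loss over paired clips. The sketch below is illustrative only; the encoder outputs, embedding dimension, and temperature are assumptions, not the authors' configuration.

```python
# Illustrative sketch: binding video and music embeddings with a symmetric
# InfoNCE loss; MVBind's actual encoders and objective may differ.
import torch
import torch.nn.functional as F

def binding_loss(video_emb, music_emb, temperature=0.07):
    """video_emb, music_emb: (batch, dim) embeddings of paired clips."""
    v = F.normalize(video_emb, dim=-1)
    m = F.normalize(music_emb, dim=-1)
    logits = v @ m.t() / temperature                  # pairwise similarities
    targets = torch.arange(v.size(0), device=v.device)
    # Matched video-music pairs lie on the diagonal; all others act as negatives.
    loss_v2m = F.cross_entropy(logits, targets)
    loss_m2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2m + loss_m2v)

# Example usage with random features standing in for encoder outputs.
loss = binding_loss(torch.randn(8, 512), torch.randn(8, 512))
```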
It remains a challenge to effectively control the emotion rendering in text-to-speech (TTS) synthesis. Prior studies have primarily focused on learning a global prosodic representation at the utterance level, which strongly correlates with linguistic prosody. Our goal is to construct a hierarchical emotion distribution (ED) that effectively encapsulates intensity variations of emotions at various levels of granularity, encompassing phonemes, words, and utterances. During TTS training, the hierarchical ED is extracted from the ground-truth audio and guides the predictor to establish a connection between emotional and linguistic prosody. At run-time inference, the TTS model generates emotional speech and, at the same time, provides quantitative control of emotion over the speech constituents. Both objective and subjective evaluations validate the effectiveness of the proposed framework in terms of emotion prediction and control.
https://arxiv.org/abs/2405.09171
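As a rough illustration of a hierarchical emotion distribution, the sketch below pools phoneme-level emotion-intensity vectors into word- and utterance-level summaries by averaging. The paper extracts its hierarchy from ground-truth audio with a learned predictor; the pooling rule, array shapes, and boundaries here are assumptions.

```python
# Illustrative sketch: pooling phoneme-level emotion intensities into a
# word- and utterance-level hierarchy. The paper's extractor may differ.
import numpy as np

def hierarchical_ed(phoneme_scores, word_boundaries):
    """phoneme_scores: (num_phonemes, num_emotions) intensities in [0, 1].
    word_boundaries: list of (start, end) phoneme index ranges per word."""
    phoneme_level = np.asarray(phoneme_scores)
    word_level = np.stack([phoneme_level[s:e].mean(axis=0)
                           for s, e in word_boundaries])
    utterance_level = phoneme_level.mean(axis=0)
    return phoneme_level, word_level, utterance_level

# Toy example: 5 phonemes, 3 emotion classes, two words.
phon = np.random.rand(5, 3)
p_ed, w_ed, u_ed = hierarchical_ed(phon, [(0, 2), (2, 5)])
```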
"How does the person in the bounding box feel?" Achieving human-level recognition of the apparent emotion of a person in real world situations remains an unsolved task in computer vision. Facial expressions are not enough: body pose, contextual knowledge, and commonsense reasoning all contribute to how humans perform this emotional theory of mind task. In this paper, we examine two major approaches enabled by recent large vision language models: 1) image captioning followed by a language-only LLM, and 2) vision language models, under zero-shot and fine-tuned setups. We evaluate the methods on the Emotions in Context (EMOTIC) dataset and demonstrate that a vision language model, fine-tuned even on a small dataset, can significantly outperform traditional baselines. The results of this work aim to help robots and agents perform emotionally sensitive decision-making and interaction in the future.
https://arxiv.org/abs/2405.08992
Large language models (LLMs) are transforming human-computer interaction and conceptions of artificial intelligence (AI) with their impressive capacities for conversing and reasoning in natural language. There is growing interest in whether LLMs have theory of mind (ToM); the ability to reason about the mental and emotional states of others that is core to human social intelligence. As LLMs are integrated into the fabric of our personal, professional and social lives and given greater agency to make decisions with real-world consequences, there is a critical need to understand how they can be aligned with human values. ToM seems to be a promising direction of inquiry in this regard. Following the literature on the role and impacts of human ToM, this paper identifies key areas in which LLM ToM will show up in human:LLM interactions at individual and group levels, and what opportunities and risks for alignment are raised in each. On the individual level, the paper considers how LLM ToM might manifest in goal specification, conversational adaptation, empathy and anthropomorphism. On the group level, it considers how LLM ToM might facilitate collective alignment, cooperation or competition, and moral judgement-making. The paper lays out a broad spectrum of potential implications and suggests the most pressing areas for future research.
https://arxiv.org/abs/2405.08154
This study introduces a novel Supervised Info-enhanced Contrastive Learning framework for EEG-based Emotion Recognition (SI-CLEER). SI-CLEER employs multi-granularity contrastive learning to create robust EEG contextual representations, potentially improving emotion recognition effectiveness. Unlike existing methods solely guided by classification loss, we propose a joint learning model combining self-supervised contrastive learning loss and supervised classification loss. This model optimizes both loss functions, capturing subtle EEG signal differences specific to emotion detection. Extensive experiments demonstrate SI-CLEER's robustness and superior accuracy on the SEED dataset compared to state-of-the-art methods. Furthermore, we analyze electrode performance, highlighting the significance of central frontal and temporal brain region EEGs in emotion detection. This study offers a universally applicable approach with potential benefits for diverse EEG classification tasks.
https://arxiv.org/abs/2405.07260
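A minimal sketch of the joint objective described above: a supervised cross-entropy term plus a self-supervised contrastive term (NT-Xent over two augmented views). The weighting factor, temperature, and the multi-granularity details of SI-CLEER are not given in the abstract, so they are placeholders here.

```python
# Illustrative joint objective: supervised cross-entropy plus an NT-Xent
# contrastive term over two augmented views; SI-CLEER's exact losses may differ.
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """z1, z2: (batch, dim) projections of two augmented views of the same EEG."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=-1)
    sim = z @ z.t() / temperature
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float('-inf'))        # drop self-similarity
    # Each sample's positive is its counterpart in the other view.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

def joint_loss(logits, labels, z1, z2, alpha=0.5):
    return F.cross_entropy(logits, labels) + alpha * nt_xent(z1, z2)
```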
Most earlier investigations on talking face generation have focused on the synchronization of lip motion and speech content. However, human head pose and facial emotions are equally important characteristics of natural human faces. While audio-driven talking face generation has seen notable advancements, existing methods either overlook facial emotions or are limited to specific individuals and cannot be applied to arbitrary subjects. In this paper, we propose a one-shot Talking Head Generation framework (SPEAK) that distinguishes itself from general Talking Face Generation by enabling emotional and postural control. Specifically, we introduce the Inter-Reconstructed Feature Disentanglement (IRFD) method to decouple human facial features into three latent spaces. We then design a face editing module that modifies speech content and facial latent codes into a single latent space. Subsequently, we present a novel generator that employs modified latent codes derived from the editing module to regulate emotional expression, head poses, and speech content in synthesizing facial animations. Extensive trials demonstrate that our method can generate a realistic talking head with coordinated lip motions, authentic facial emotions, and smooth head movements. The demo video is available at the anonymous link: https://anonymous.4open.science/r/SPEAK-F56E
https://arxiv.org/abs/2405.07257
Social robotics researchers are increasingly interested in multi-party trained conversational agents. With a growing demand for real-world evaluations, our study presents Large Language Models (LLMs) deployed in a month-long live show at the Edinburgh Festival Fringe. This case study investigates human improvisers co-creating with conversational agents in a professional theatre setting. We explore the technical capabilities and constraints of on-the-spot multi-party dialogue, providing comprehensive insights from both audience and performer experiences with AI on stage. Our human-in-the-loop methodology underlines the challenges of these LLMs in generating context-relevant responses, stressing the user interface's crucial role. Audience feedback indicates an evolving interest in AI-driven live entertainment, direct human-AI interaction, and a diverse range of expectations about AI's conversational competence and utility as a creativity support tool. Human performers express immense enthusiasm and varied satisfaction, and evolving public opinion highlights mixed emotions about AI's role in the arts.
https://arxiv.org/abs/2405.07111
This research develops advanced methodologies for Large Language Models (LLMs) to better manage linguistic behaviors related to emotions and ethics. We introduce DIKE, an adversarial framework that enhances the LLMs' ability to internalize and reflect global human values, adapting to varied cultural contexts to promote transparency and trust among users. The methodology involves detailed modeling of emotions, classification of linguistic behaviors, and implementation of ethical guardrails. Our innovative approaches include mapping emotions and behaviors using self-supervised learning techniques, refining these guardrails through adversarial reviews, and systematically adjusting outputs to ensure ethical alignment. This framework establishes a robust foundation for AI systems to operate with ethical integrity and cultural sensitivity, paving the way for more responsible and context-aware AI interactions.
https://arxiv.org/abs/2405.07076
Code-mixing is a well-studied linguistic phenomenon that occurs when two or more languages are mixed in text or speech. Several studies have been conducted on building datasets and performing downstream NLP tasks on code-mixed data. Although it is not uncommon to observe code-mixing of three or more languages, most available datasets in this domain contain code-mixed data from only two languages. In this paper, we introduce EmoMix-3L, a novel multi-label emotion detection dataset containing code-mixed data from three different languages. We experiment with several models on EmoMix-3L and we report that MuRIL outperforms other models on this dataset.
https://arxiv.org/abs/2405.06922
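A hedged sketch of the strongest reported baseline: fine-tuning MuRIL for multi-label emotion detection with Hugging Face transformers. The label set, example text, and hyperparameters are placeholders; EmoMix-3L's actual schema is not specified in the abstract.

```python
# Illustrative multi-label fine-tuning setup with MuRIL; the emotion labels and
# the code-mixed example below are placeholders, not EmoMix-3L's real schema.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

emotions = ["anger", "joy", "sadness", "fear", "surprise"]   # placeholder labels
tok = AutoTokenizer.from_pretrained("google/muril-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "google/muril-base-cased",
    num_labels=len(emotions),
    problem_type="multi_label_classification",   # uses BCE-with-logits loss
)

batch = tok(["ami khub happy achi aaj"], return_tensors="pt", padding=True)
labels = torch.tensor([[0., 1., 0., 0., 0.]])    # multi-hot targets
out = model(**batch, labels=labels)
out.loss.backward()                              # one training step's gradients
```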
This study explores the application of recurrent neural networks to recognize emotions conveyed in music, aiming to enhance music recommendation systems and support therapeutic interventions by tailoring music to fit listeners' emotional states. We utilize Russell's Emotion Quadrant to categorize music into four distinct emotional regions and develop models capable of accurately predicting these categories. Our approach involves extracting a comprehensive set of audio features using Librosa and applying various recurrent neural network architectures, including standard RNNs, Bidirectional RNNs, and Long Short-Term Memory (LSTM) networks. Initial experiments are conducted using a dataset of 900 audio clips, labeled according to the emotional quadrants. We compare the performance of our neural network models against a set of baseline classifiers and analyze their effectiveness in capturing the temporal dynamics inherent in musical expression. The results indicate that simpler RNN architectures may perform comparably to, or even better than, more complex models, particularly on smaller datasets. We also apply these experiments to larger datasets: one augmented from our original dataset and the other drawn from other sources. This research not only enhances our understanding of the emotional impact of music but also demonstrates the potential of neural networks in creating more personalized and emotionally resonant music recommendation and therapy systems.
https://arxiv.org/abs/2405.06747
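A minimal sketch of the described pipeline: extract audio features with Librosa and classify the four Russell quadrants with a small LSTM. Feature choices, dimensions, and hyperparameters are assumptions rather than the authors' exact setup.

```python
# Illustrative pipeline: Librosa features for a clip, then a small LSTM
# classifier over the four Russell quadrants; hyperparameters are placeholders.
import librosa
import numpy as np
import torch
import torch.nn as nn

def clip_features(path):
    y, sr = librosa.load(path, sr=22050)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)
    feats = np.concatenate([mfcc, chroma], axis=0)        # (32, frames)
    return torch.tensor(feats.T, dtype=torch.float32)     # (frames, 32)

class QuadrantLSTM(nn.Module):
    def __init__(self, in_dim=32, hidden=64, classes=4):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, classes)

    def forward(self, x):                    # x: (batch, frames, in_dim)
        _, (h, _) = self.lstm(x)
        return self.head(h[-1])              # logits over the 4 quadrants
```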
Emotions guide our decision making process and yet have been little explored in practical ethical decision making scenarios. In this challenge, we explore emotions and how they can influence ethical decision making in a home robot context: which fetch requests should a robot execute, and why or why not? We discuss, in particular, two aspects of emotion: (1) somatic markers: objects to be retrieved are tagged as negative (dangerous, e.g., knives, or mind-altering, e.g., medicine with overdose potential), providing a quick heuristic for where to focus attention to avoid the classic Frame Problem of artificial intelligence; (2) emotion inference: users' valence and arousal levels are taken into account in defining how and when a robot should respond to a human's requests, e.g., to carefully consider giving dangerous items to users experiencing intense emotions. Our emotion-based approach builds a foundation for the primary consideration of Safety, and is complemented by policies that support overriding based on Context (e.g. age of user, allergies) and Privacy (e.g. administrator settings). Transparency is another key aspect of our solution. Our solution is defined using behaviour trees, towards an implementable design that can provide reasoning information in real-time.
https://arxiv.org/abs/2405.06543
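A toy sketch of the behaviour-tree logic outlined above: a sequence of condition checks over a somatic-marker tag and the user's inferred valence and arousal before a fetch is executed. Tag names, thresholds, and node structure are illustrative assumptions, not the challenge entry's implementation.

```python
# Minimal behaviour-tree-style sketch of the fetch-request policy described
# above; tags, thresholds, and node structure are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Request:
    item_tag: str        # e.g. "neutral", "dangerous", "mind_altering"
    user_arousal: float  # 0..1, from emotion inference
    user_valence: float  # -1..1

def check_somatic_marker(req):
    # Negative tags act as a quick heuristic for where to focus attention.
    return req.item_tag == "neutral"

def check_user_state(req):
    # Withhold flagged items when the user shows intense negative emotion.
    return not (req.user_arousal > 0.7 and req.user_valence < 0.0)

def fetch_policy(req):
    """Sequence node: all conditions must succeed before the fetch action runs."""
    for condition in (check_somatic_marker, check_user_state):
        if not condition(req):
            return "refuse_and_explain"      # transparency: give a reason
    return "execute_fetch"

print(fetch_policy(Request("dangerous", user_arousal=0.9, user_valence=-0.8)))
```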
This paper describes our participation in SemEval 2024 Task 3, which focused on Multimodal Emotion Cause Analysis in Conversations. We developed an early prototype for an end-to-end system that uses graph-based methods from dependency parsing to identify causal emotion relations in multi-party conversations. Our model comprises a neural transformer-based encoder for contextualizing multimodal conversation data and a graph-based decoder for generating the adjacency matrix scores of the causal graph. We ranked 7th out of 15 valid and official submissions for Subtask 1, using textual inputs only. We also discuss our participation in Subtask 2 during post-evaluation using multi-modal inputs.
https://arxiv.org/abs/2405.06483
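A hedged sketch of a graph-based decoder of the kind described: a biaffine scorer over contextualized utterance encodings that produces adjacency-matrix scores for candidate cause-emotion links. The dimensions and scoring form are assumptions, not the submitted system.

```python
# Illustrative graph-based decoder: a biaffine scorer over utterance encodings
# that yields an adjacency matrix of cause -> emotion links; dimensions are
# placeholders, not the authors' configuration.
import torch
import torch.nn as nn

class BiaffineDecoder(nn.Module):
    def __init__(self, enc_dim=768, hidden=256):
        super().__init__()
        self.cause = nn.Linear(enc_dim, hidden)
        self.emotion = nn.Linear(enc_dim, hidden)
        self.bilinear = nn.Parameter(torch.randn(hidden, hidden) * 0.02)

    def forward(self, h):                        # h: (num_utterances, enc_dim)
        c = torch.tanh(self.cause(h))            # candidate cause utterances
        e = torch.tanh(self.emotion(h))          # candidate emotion utterances
        return c @ self.bilinear @ e.t()         # (N, N) adjacency scores

scores = BiaffineDecoder()(torch.randn(6, 768))  # a 6-utterance conversation
adjacency = (scores.sigmoid() > 0.5).int()       # thresholded causal links
```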
This study investigates the cognitive plausibility of a pretrained multimodal model, CLIP, in recognizing emotions evoked by abstract visual art. We employ a dataset comprising images with associated emotion labels and textual rationales of these labels provided by human annotators. We perform linguistic analyses of rationales, zero-shot emotion classification of images and rationales, apply similarity-based prediction of emotion, and investigate color-emotion associations. The relatively low, yet above baseline, accuracy in recognizing emotion for abstract images and rationales suggests that CLIP decodes emotional complexities in a manner not well aligned with human cognitive processes. Furthermore, we explore color-emotion interactions in images and rationales. Expected color-emotion associations, such as red relating to anger, are identified in images and texts annotated with emotion labels by both humans and CLIP, with the latter showing even stronger interactions. Our results highlight the disparity between human processing and machine processing when connecting image features and emotions.
https://arxiv.org/abs/2405.06319
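A minimal sketch of zero-shot emotion classification with CLIP, in the spirit of the study, via the Hugging Face implementation. The prompt template, emotion label set, image path, and checkpoint are assumptions; the paper's exact prompts may differ.

```python
# Illustrative zero-shot emotion scoring with CLIP; labels, prompt template,
# and the image path are assumptions, not the study's exact configuration.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

labels = ["anger", "fear", "sadness", "contentment", "excitement"]
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("abstract_artwork.jpg")       # placeholder image file
prompts = [f"an abstract painting that evokes {e}" for e in labels]
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))      # per-emotion probabilities
```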
This paper presents the development of a novel ethical reasoning framework for robots. "Robots Can Feel" is the first system for robots that utilizes a combination of logic and human-like emotion simulation to make decisions in morally complex situations akin to humans. The key feature of the approach is the management of the Emotion Weight Coefficient, a customizable parameter that assigns the role of emotions in robot decision-making. The system aims to serve as a tool that can equip robots of any form and purpose with ethical behavior close to human standards. Apart from the platform, the system is independent of the choice of base model. During the evaluation, the system was tested on eight state-of-the-art large language models (LLMs). This list included both commercial and open-source models developed by various companies and countries. The research demonstrated that regardless of the model choice, the Emotion Weight Coefficient influences the robot's decisions similarly. According to an ANOVA analysis, the use of different Emotion Weight Coefficients influenced the final decision in a range of situations, such as in a request for a dietary violation (F(4, 35) = 11.2, p = 0.0001) and in an animal compassion situation (F(4, 35) = 8.5441, p = 0.0001). A demonstration code repository is provided at: this https URL
https://arxiv.org/abs/2405.05824
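For readers unfamiliar with the reported statistics, the sketch below runs a one-way ANOVA over decisions made under five Emotion Weight Coefficient settings with eight trials each, which yields the same degrees of freedom, F(4, 35). The data are random placeholders, not the paper's measurements.

```python
# Illustrative one-way ANOVA across five Emotion Weight Coefficient settings
# (8 trials each gives df = 4, 35); the data are random placeholders.
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
# One decision score per trial for each coefficient level.
groups = [rng.normal(loc=mu, scale=1.0, size=8) for mu in (0, 1, 2, 3, 4)]
f_stat, p_value = f_oneway(*groups)
print(f"F(4, 35) = {f_stat:.2f}, p = {p_value:.4f}")
```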
Qualitative analysis is a challenging, yet crucial aspect of advancing research in the field of Human-Computer Interaction (HCI). Recent studies show that large language models (LLMs) can perform qualitative coding within existing schemes, but their potential for collaborative human-LLM discovery and new insight generation in qualitative analysis is still underexplored. To bridge this gap and advance qualitative analysis by harnessing the power of LLMs, we propose CHALET, a novel methodology that leverages the human-LLM collaboration paradigm to facilitate conceptualization and empower qualitative research. The CHALET approach involves LLM-supported data collection, performing both human and LLM deductive coding to identify disagreements, and performing collaborative inductive coding on these disagreement cases to derive new conceptual insights. We validated the effectiveness of CHALET through its application to the attribution model of mental-illness stigma, uncovering implicit stigmatization themes on cognitive, emotional and behavioral dimensions. We discuss the implications for future research, methodology, and the transdisciplinary opportunities CHALET presents for the HCI community and beyond.
https://arxiv.org/abs/2405.05758
We evaluated 3 systems (ELIZA, GPT-3.5 and GPT-4) in a randomized, controlled, and preregistered Turing test. Human participants had a 5 minute conversation with either a human or an AI, and judged whether or not they thought their interlocutor was human. GPT-4 was judged to be a human 54% of the time, outperforming ELIZA (22%) but lagging behind actual humans (67%). The results provide the first robust empirical demonstration that any artificial system passes an interactive 2-player Turing test. The results have implications for debates around machine intelligence and, more urgently, suggest that deception by current AI systems may go undetected. Analysis of participants' strategies and reasoning suggests that stylistic and socio-emotional factors play a larger role in passing the Turing test than traditional notions of intelligence.
https://arxiv.org/abs/2405.08007
Emotion recognition is an important part of affective computing. Extracting emotional cues from human gaits yields benefits such as natural interaction, a nonintrusive nature, and remote detection. Recently, the introduction of self-supervised learning techniques offers a practical solution to the issues arising from the scarcity of labeled data in the field of gait-based emotion recognition. However, due to the limited diversity of gaits and the incompleteness of feature representations for skeletons, the existing contrastive learning methods are usually inefficient for the acquisition of gait emotions. In this paper, we propose a contrastive learning framework utilizing selective strong augmentation (SSA) for self-supervised gait-based emotion representation, which aims to derive effective representations from limited labeled gait data. First, we propose an SSA method for the gait emotion recognition task, which includes upper body jitter and random spatiotemporal mask. The goal of SSA is to generate more diverse and targeted positive samples and prompt the model to learn more distinctive and robust feature representations. Then, we design a complementary feature fusion network (CFFN) that facilitates the integration of cross-domain information to acquire topological structural and global adaptive features. Finally, we implement the distributional divergence minimization loss to supervise the representation learning of the generally and strongly augmented queries. Our approach is validated on the Emotion-Gait (E-Gait) and Emilya datasets and outperforms the state-of-the-art methods under different evaluation protocols.
https://arxiv.org/abs/2405.04900
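A rough sketch of the two strong augmentations named above, applied to a skeleton sequence of shape (frames, joints, 3): upper-body joint jitter and a random spatiotemporal mask. Joint indices, noise scale, and mask ratios are assumptions, not the paper's settings.

```python
# Illustrative versions of the two strong augmentations (upper body jitter and
# random spatiotemporal mask) on a (frames, joints, 3) skeleton sequence;
# joint indices, noise scale, and mask ratios are assumptions.
import numpy as np

UPPER_BODY = [2, 3, 4, 5, 6, 7, 8]        # placeholder upper-body joint indices

def upper_body_jitter(seq, sigma=0.02):
    out = seq.copy()
    out[:, UPPER_BODY, :] += np.random.normal(0.0, sigma,
                                              out[:, UPPER_BODY, :].shape)
    return out

def random_spatiotemporal_mask(seq, t_ratio=0.2, j_ratio=0.2):
    out = seq.copy()
    T, J, _ = out.shape
    t_idx = np.random.choice(T, int(T * t_ratio), replace=False)
    j_idx = np.random.choice(J, int(J * j_ratio), replace=False)
    out[t_idx[:, None], j_idx, :] = 0.0    # zero out selected frame/joint pairs
    return out

aug = random_spatiotemporal_mask(upper_body_jitter(np.random.rand(60, 16, 3)))
```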
Recent social media posts on the cholera outbreak in Hammanskraal have highlighted the diverse range of emotions people experienced in response to such an event. The extent of people's opinions varies greatly depending on their level of knowledge and information about the disease. The documented research about cholera lacks investigations into the classification of emotions. This study aims to examine the emotions expressed in social media posts about cholera. A dataset of 23,000 posts was extracted and pre-processed. The Python Natural Language Toolkit (NLTK) sentiment analyzer library was applied to determine the emotional significance of each text. Additionally, Machine Learning (ML) models were applied for emotion classification, including Long Short-Term Memory (LSTM), Logistic Regression, Decision Trees, and the Bidirectional Encoder Representations from Transformers (BERT) model. The results of this study demonstrated that LSTM achieved the highest accuracy of 75%. Emotion classification presents a promising tool for gaining a deeper understanding of the impact of cholera on society. The findings of this study might contribute to the development of effective interventions in public health strategies.
https://arxiv.org/abs/2405.04897
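A minimal sketch of the NLTK sentiment step described above, using the VADER SentimentIntensityAnalyzer to score a post before downstream classification. The example post and compound-score thresholds are illustrative assumptions.

```python
# Illustrative use of NLTK's VADER sentiment analyzer to score a post; the
# example text and the +/-0.05 compound thresholds are assumptions.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

post = "The water in Hammanskraal is still not safe, this is heartbreaking."
scores = sia.polarity_scores(post)          # neg / neu / pos / compound scores
label = ("negative" if scores["compound"] <= -0.05
         else "positive" if scores["compound"] >= 0.05 else "neutral")
print(scores, label)
```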
Agents represent one of the most emerging applications of Large Language Models (LLMs) and Generative AI, with their effectiveness hinging on multimodal capabilities to navigate complex user environments. Conversational Health Agents (CHAs), a prime example of this, are redefining healthcare by offering nuanced support that transcends textual analysis to incorporate emotional intelligence. This paper introduces an LLM-based CHA engineered for rich, multimodal dialogue, especially in the realm of mental health support. It adeptly interprets and responds to users' emotional states by analyzing multimodal cues, thus delivering contextually aware and empathetically resonant verbal responses. Our implementation leverages the versatile openCHA framework, and our comprehensive evaluation involves neutral prompts expressed in diverse emotional tones: sadness, anger, and joy. We evaluate the consistency and repeatability of the planning capability of the proposed CHA. Furthermore, human evaluators critique the CHA's empathic delivery, with findings revealing a striking concordance between the CHA's outputs and evaluators' assessments. These results affirm the indispensable role of vocal (soon multimodal) emotion recognition in strengthening the empathetic connection built by CHAs, cementing their place at the forefront of interactive, compassionate digital health solutions.
https://arxiv.org/abs/2405.04777
Numerous methods have been proposed to detect, estimate, and analyze properties of people in images, including the estimation of 3D pose, shape, contact, human-object interaction, emotion, and more. Each of these methods works in isolation instead of synergistically. Here we address this problem and build a language-driven human understanding system -- ChatHuman, which combines and integrates the skills of many different methods. To do so, we finetune a Large Language Model (LLM) to select and use a wide variety of existing tools in response to user inputs. In doing so, ChatHuman is able to combine information from multiple tools to solve problems more accurately than the individual tools themselves and to leverage tool output to improve its ability to reason about humans. The novel features of ChatHuman include leveraging academic publications to guide the application of 3D human-related tools, employing a retrieval-augmented generation model to generate in-context-learning examples for handling new tools, and discriminating and integrating tool results to enhance 3D human understanding. Our experiments show that ChatHuman outperforms existing models in both tool selection accuracy and performance across multiple 3D human-related tasks. ChatHuman is a step towards consolidating diverse methods for human analysis into a single, powerful, system for 3D human reasoning.
https://arxiv.org/abs/2405.04533