Using large language models (LLMs) for educational applications such as dialogue-based teaching is a hot topic. Effective teaching, however, requires teachers to adapt the difficulty of content and explanations to the education level of their students. Even the best LLMs today struggle to do this well. If we want to improve LLMs on this adaptation task, we need to be able to measure adaptation success reliably. However, current Static metrics for text difficulty, such as the Flesch-Kincaid Reading Ease score, are known to be crude and brittle. We therefore introduce and evaluate a new set of Prompt-based metrics for text difficulty. Informed by a user study, we design Prompt-based metrics as inputs for LLMs, leveraging LLMs' general language understanding capabilities to capture more abstract and complex features than Static metrics can. Regression experiments show that adding our Prompt-based metrics significantly improves text difficulty classification over Static metrics alone. Our results demonstrate the promise of using LLMs to evaluate text adaptation to different education levels.
https://arxiv.org/abs/2405.09482
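Static readability metrics like the Flesch Reading Ease score mentioned above can be computed from surface counts alone, which is exactly why they miss more abstract features. A minimal sketch, with a crude vowel-group heuristic standing in for the dictionary-based syllable counts that real readability tools use:

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: one syllable per vowel group; real tools use dictionaries.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    # Flesch Reading Ease: higher scores mean easier text (90+ is very easy).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))
```

Short monosyllabic sentences score above 100 while dense academic prose can go strongly negative; the score sees only sentence and word lengths, which is the brittleness the abstract refers to.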
We present Hunyuan-DiT, a text-to-image diffusion transformer with fine-grained understanding of both English and Chinese. To construct Hunyuan-DiT, we carefully design the transformer structure, text encoder, and positional encoding. We also build from scratch a whole data pipeline to update and evaluate data for iterative model optimization. For fine-grained language understanding, we train a Multimodal Large Language Model to refine the captions of the images. Finally, Hunyuan-DiT can perform multi-turn multimodal dialogue with users, generating and refining images according to the context. Through our holistic human evaluation protocol with more than 50 professional human evaluators, Hunyuan-DiT sets a new state-of-the-art in Chinese-to-image generation compared with other open-source models. Code and pretrained models are publicly available at this http URL
https://arxiv.org/abs/2405.08748
Conversation requires a substantial amount of coordination between dialogue participants, from managing turn taking to negotiating mutual understanding. Part of this coordination effort surfaces as the reuse of linguistic behaviour across speakers, a process often referred to as alignment. While the presence of linguistic alignment is well documented in the literature, several questions remain open, including the extent to which patterns of reuse across speakers have an impact on the emergence of labelling conventions for novel referents. In this study, we put forward a methodology for automatically detecting shared lemmatised constructions -- expressions with a common lexical core used by both speakers within a dialogue -- and apply it to a referential communication corpus where participants aim to identify novel objects for which no established labels exist. Our analyses uncover the usage patterns of shared constructions in interaction and reveal that features such as their frequency and the amount of different constructions used for a referent are associated with the degree of object labelling convergence the participants exhibit after social interaction. More generally, the present study shows that automatically detected shared constructions offer a useful level of analysis to investigate the dynamics of reference negotiation in dialogue.
https://arxiv.org/abs/2405.08546
Integrated Speech and Large Language Models (SLMs) that can follow speech instructions and generate relevant text responses have gained popularity lately. However, the safety and robustness of these models remain largely unclear. In this work, we investigate the potential vulnerabilities of such instruction-following speech-language models to adversarial attacks and jailbreaking. Specifically, we design algorithms that can generate adversarial examples to jailbreak SLMs in both white-box and black-box attack settings without human involvement. Additionally, we propose countermeasures to thwart such jailbreaking attacks. Our models, trained on dialog data with speech instructions, achieve state-of-the-art performance on the spoken question-answering task, scoring over 80% on both safety and helpfulness metrics. Despite safety guardrails, experiments on jailbreaking demonstrate the vulnerability of SLMs to adversarial perturbations and transfer attacks, with average attack success rates of 90% and 10%, respectively, when evaluated on a dataset of carefully designed harmful questions spanning 12 different toxic categories. However, we demonstrate that our proposed countermeasures reduce the attack success significantly.
https://arxiv.org/abs/2405.08317
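The white-box setting above presumes gradient access to the model. As a toy illustration only (not the paper's algorithm), an FGSM-style step on a hypothetical linear safety scorer shows how a small L-infinity-bounded perturbation can lower a safety score:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_perturb(x, w, eps):
    # For a linear scorer sigma(w @ x), the input gradient is proportional
    # to w, so stepping against sign(w) lowers the score as much as any
    # perturbation within an L-infinity budget of eps.
    return x - eps * np.sign(w)

rng = np.random.default_rng(0)
w = rng.normal(size=16)        # hypothetical safety-classifier weights (white-box)
x = rng.normal(size=16)        # hypothetical input features
x_adv = fgsm_perturb(x, w, eps=0.5)
clean, attacked = sigmoid(w @ x), sigmoid(w @ x_adv)
```

Real attacks on SLMs operate over audio features and deep networks, but the principle is the same: follow the input gradient within a small perturbation budget.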
Diagnosing and managing a patient is a complex, sequential decision-making process that requires physicians to obtain information -- such as which tests to perform -- and to act upon it. Recent advances in artificial intelligence (AI) and large language models (LLMs) promise to profoundly impact clinical care. However, current evaluation schemes rely too heavily on static medical question-answering benchmarks, falling short of the interactive decision-making required in real-life clinical work. Here, we present AgentClinic: a multimodal benchmark to evaluate LLMs in their ability to operate as agents in simulated clinical environments. In our benchmark, the doctor agent must uncover the patient's diagnosis through dialogue and active data collection. We present two open benchmarks: a multimodal image and dialogue environment, AgentClinic-NEJM, and a dialogue-only environment, AgentClinic-MedQA. We embed cognitive and implicit biases in both patient and doctor agents to emulate realistic interactions between biased agents. We find that introducing bias leads to large reductions in the diagnostic accuracy of the doctor agents, as well as reduced compliance, confidence, and follow-up consultation willingness in patient agents. Evaluating a suite of state-of-the-art LLMs, we find that several models that excel in benchmarks like MedQA perform poorly in AgentClinic-MedQA. We find that the LLM used in the patient agent is an important factor for performance in the AgentClinic benchmark. We show that both too few and too many interactions reduce the diagnostic accuracy of doctor agents. The code and data for this work are publicly available at this https URL.
https://arxiv.org/abs/2405.07960
Social robotics researchers are increasingly interested in multi-party trained conversational agents. With a growing demand for real-world evaluations, our study presents Large Language Models (LLMs) deployed in a month-long live show at the Edinburgh Festival Fringe. This case study investigates human improvisers co-creating with conversational agents in a professional theatre setting. We explore the technical capabilities and constraints of on-the-spot multi-party dialogue, providing comprehensive insights from both audience and performer experiences with AI on stage. Our human-in-the-loop methodology underlines the challenges these LLMs face in generating context-relevant responses, stressing the user interface's crucial role. Audience feedback indicates an evolving interest in AI-driven live entertainment and direct human-AI interaction, along with a diverse range of expectations about AI's conversational competence and its utility as a creativity-support tool. Human performers express immense enthusiasm but varied satisfaction, and evolving public opinion highlights mixed emotions about AI's role in the arts.
https://arxiv.org/abs/2405.07111
As the conversation around using geoengineering to combat climate change intensifies, it is imperative to engage the public and deeply understand their perspectives on geoengineering research, development, and potential deployment. Through a comprehensive data-driven investigation, this paper explores the types of news that captivate public interest in geoengineering. We delved into 30,773 English-language news articles from the BBC and the New York Times, combined with Google Trends data spanning 2018 to 2022, to explore how public interest in geoengineering fluctuates in response to news coverage of broader climate issues. Using BERT-based topic modeling, sentiment analysis, and time-series regression models, we found that positive sentiment in energy-related news serves as a good predictor of heightened public interest in geoengineering, a trend that persists over time. Our findings suggest that public engagement with geoengineering and climate action is not uniform, with some topics being more potent in shaping interest over time, such as climate news related to energy, disasters, and politics. Understanding these patterns is crucial for scientists, policymakers, and educators aiming to craft effective strategies for engaging with the public and fostering dialogue around emerging climate technologies.
https://arxiv.org/abs/2405.07010
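The time-series regression idea above can be sketched as an ordinary least-squares fit of public interest on lagged news sentiment; the single-lag form and variable names here are illustrative assumptions, not the paper's exact model:

```python
import numpy as np

def lagged_ols(interest, sentiment, lag=1):
    """Fit interest[t] = a + b * sentiment[t - lag] by ordinary least squares."""
    y = np.asarray(interest, dtype=float)[lag:]
    x = np.asarray(sentiment, dtype=float)[:len(sentiment) - lag]
    X = np.column_stack([np.ones_like(x), x])   # intercept + lagged predictor
    (a, b), *_ = np.linalg.lstsq(X, y, rcond=None)
    return a, b
```

A significantly positive slope `b` would correspond to the finding that positive sentiment in energy news predicts heightened interest in geoengineering at later time steps.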
Large language models (LLMs) like ChatGPT have shown significant advancements across diverse natural language understanding (NLU) tasks, including intelligent dialogue and autonomous agents. Yet, in the absence of widely acknowledged testing mechanisms, the question of whether LLMs are stochastic parrots or genuinely comprehend the world remains open, fostering numerous studies and sparking heated debates. Prevailing research mainly focuses on surface-level NLU, neglecting fine-grained explorations. However, such explorations are crucial for understanding LLMs' unique comprehension mechanisms, aligning with human cognition, and ultimately enhancing their general NLU capacities. To address this gap, our study delves into LLMs' nuanced semantic comprehension capabilities, particularly regarding common words with uncommon meanings. The idea stems from foundational principles of human communication within psychology, which underscore accurate shared understandings of word semantics. Specifically, this paper presents the construction of a Lexical Semantic Comprehension (LeSC) dataset with novel evaluation metrics, the first benchmark encompassing both fine-grained and cross-lingual dimensions. Evaluating open-source and closed-source models of varied scales and architectures, our extensive empirical experiments demonstrate the inferior performance of existing models on this basic lexical-meaning understanding task. Notably, even the state-of-the-art LLMs GPT-4 and GPT-3.5 lag behind 16-year-old humans by 3.9% and 22.3%, respectively. Additionally, multiple advanced prompting techniques and retrieval-augmented generation are introduced to help alleviate this problem, yet limitations persist. By highlighting these critical shortcomings, this research motivates further investigation and offers novel insights for developing more intelligent LLMs.
https://arxiv.org/abs/2405.05741
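A benchmark of this kind reduces, at evaluation time, to scoring a model's sense choices against gold labels. A minimal harness, where `choose` is a placeholder for prompting a model and parsing its answer (the items in the usage below are invented examples, not from LeSC):

```python
def mc_accuracy(items, choose):
    # `choose(sentence, options)` stands in for prompting a model with the
    # sentence and candidate senses, then parsing out its selection.
    correct = sum(1 for it in items
                  if choose(it["sentence"], it["options"]) == it["answer"])
    return correct / len(items)
```

For example, an item might pair "The ship listed to port." with the options "tilted" and "enumerated": a model that only knows the common sense of "listed" fails, which is precisely the gap the benchmark probes.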
With an ever-growing number of LLMs reporting superlative performance for English, their ability to perform equitably for different dialects of English (i.e., dialect robustness) needs to be ascertained. Specifically, we use English-language (US English or Indian English) conversations between humans who play the word-guessing game of `taboo'. We formulate two evaluative tasks: target word prediction (TWP) (i.e., predict the masked target word in a conversation) and target word selection (TWS) (i.e., select the most likely masked target word in a conversation from among a set of candidate words). Extending MD3, an existing dialectal dataset of taboo-playing conversations, we introduce M-MD3, a target-word-masked version of MD3 with USEng and IndEng subsets. We add two further subsets: AITrans (where dialectal information is removed from IndEng) and AIGen (where LLMs are prompted to generate conversations). Our evaluation uses pre-trained and fine-tuned versions of two closed-source (GPT-4/3.5) and two open-source LLMs (Mistral and Gemma). LLMs perform significantly better for US English than Indian English on both TWP and TWS, across all settings. While GPT-based models perform the best, the comparatively smaller models work more equitably for short conversations (<8 turns). Our results on AIGen and AITrans (the best- and worst-performing subsets, respectively) show that LLMs may learn a dialect of their own based on the composition of the training data, and that dialect robustness is indeed a challenging task. Our evaluation methodology exhibits a novel way to examine attributes of language models using pre-existing dialogue datasets.
https://arxiv.org/abs/2405.05688
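The target-word masking behind M-MD3, and TWS as an argmax over candidates, can be sketched as follows; the `score_fn` callable is a placeholder for the LLM-based likelihood scoring the paper actually uses:

```python
import re

def mask_target(turns, target, mask_token="[MASK]"):
    # Replace every whole-word occurrence of the target (case-insensitive)
    # with a mask token; the gold label is the target itself.
    pattern = re.compile(rf"\b{re.escape(target)}\b", flags=re.IGNORECASE)
    return [pattern.sub(mask_token, t) for t in turns], target

def select_target(masked_turns, candidates, score_fn):
    # TWS as an argmax over candidate words under some plausibility scorer.
    return max(candidates, key=lambda w: score_fn(masked_turns, w))
```

TWP would instead ask the model to generate the masked word freely, making it the harder, open-vocabulary counterpart of TWS.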
Large language models (LLMs) have achieved remarkable performance in various natural language processing tasks, especially in dialogue systems. However, LLMs may also pose security and moral threats, especially in multi-turn conversations, where large models are more easily steered by contextual content into harmful or biased responses. In this paper, we present a novel method to attack LLMs in multi-turn dialogues, called CoA (Chain of Attack). CoA is a semantic-driven contextual multi-turn attack method that adaptively adjusts the attack policy through contextual feedback and semantic relevance during multi-turn dialogue with a large model, causing the model to produce unreasonable or harmful content. We evaluate CoA on different LLMs and datasets, and show that it can effectively expose the vulnerabilities of LLMs and outperform existing attack methods. Our work provides a new perspective and tool for attacking and defending LLMs, and contributes to the security and ethical assessment of dialogue systems.
https://arxiv.org/abs/2405.05610
Given the increasing demand for mental health assistance, artificial intelligence (AI), particularly large language models (LLMs), may be valuable for integration into automated clinical support systems. In this work, we leverage a decision transformer architecture for topic recommendation in counseling conversations between patients and mental health professionals. The architecture is utilized for offline reinforcement learning, and we extract states (dialogue turn embeddings), actions (conversation topics), and rewards (scores measuring the alignment between patient and therapist) from previous turns within a conversation to train a decision transformer model. We demonstrate an improvement over baseline reinforcement learning methods, and propose a novel system of utilizing our model's output as synthetic labels for fine-tuning a large language model for the same task. Although our implementation based on LLaMA-2 7B has mixed results, future work can undoubtedly build on the design.
https://arxiv.org/abs/2405.05060
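The offline-RL setup above amounts to converting each conversation into a trajectory of (state, action, reward) triples. A sketch with placeholder callables standing in for the turn embedder, topic labeller, and alignment scorer:

```python
def build_trajectories(conversations, embed, topic_of, alignment):
    # One (state, action, reward) triple per dialogue turn:
    #   state  - dialogue-turn embedding
    #   action - conversation topic
    #   reward - patient-therapist alignment score
    trajectories = []
    for turns in conversations:
        trajectories.append([(embed(t), topic_of(t), alignment(t)) for t in turns])
    return trajectories
```

A decision transformer is then trained on these trajectories conditioned on return-to-go, so at inference it can propose the topic (action) expected to maximize future alignment.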
Text-to-image person re-identification (ReID) retrieves pedestrian images according to textual descriptions. Manually annotating textual descriptions is time-consuming, restricting the scale of existing datasets and therefore the generalization ability of ReID models. As a result, we study the transferable text-to-image ReID problem, where we train a model on our proposed large-scale database and directly deploy it to various datasets for evaluation. We obtain substantial training data via Multi-modal Large Language Models (MLLMs). Moreover, we identify and address two key challenges in utilizing the obtained textual descriptions. First, an MLLM tends to generate descriptions with similar structures, causing the model to overfit specific sentence patterns. Thus, we propose a novel method that uses MLLMs to caption images according to various templates. These templates are obtained using a multi-turn dialogue with a Large Language Model (LLM). Therefore, we can build a large-scale dataset with diverse textual descriptions. Second, an MLLM may produce incorrect descriptions. Hence, we introduce a novel method that automatically identifies words in a description that do not correspond with the image. This method is based on the similarity between one text token and all patch token embeddings in the image. Then, we mask these words with a larger probability in the subsequent training epoch, alleviating the impact of noisy textual descriptions. The experimental results demonstrate that our methods significantly boost the direct transfer text-to-image ReID performance. Benefiting from the pre-trained model weights, we also achieve state-of-the-art performance in the traditional evaluation settings.
https://arxiv.org/abs/2405.04940
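The noisy-word masking step can be sketched as follows: each word's best cosine similarity against the image's patch embeddings decides whether it gets the baseline or an inflated masking probability in the next epoch. The specific probabilities and threshold below are illustrative assumptions, not the paper's values:

```python
import numpy as np

def word_mask_probs(word_embs, patch_embs, base_p=0.15, noisy_p=0.5, thresh=0.2):
    # Max cosine similarity of each word against all image patches; words
    # matching no patch well are treated as noisy and masked more often.
    w = word_embs / np.linalg.norm(word_embs, axis=1, keepdims=True)
    p = patch_embs / np.linalg.norm(patch_embs, axis=1, keepdims=True)
    max_sim = (w @ p.T).max(axis=1)            # shape: (num_words,)
    return np.where(max_sim < thresh, noisy_p, base_p)
```

Masking poorly-grounded words more aggressively means the text encoder sees them less often, dampening the influence of MLLM hallucinations on training.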
Agents represent one of the most rapidly emerging applications of Large Language Models (LLMs) and Generative AI, with their effectiveness hinging on multimodal capabilities to navigate complex user environments. Conversational Health Agents (CHAs), a prime example of this, are redefining healthcare by offering nuanced support that transcends textual analysis to incorporate emotional intelligence. This paper introduces an LLM-based CHA engineered for rich, multimodal dialogue, especially in the realm of mental health support. It adeptly interprets and responds to users' emotional states by analyzing multimodal cues, thus delivering contextually aware and empathetically resonant verbal responses. Our implementation leverages the versatile openCHA framework, and our comprehensive evaluation involves neutral prompts expressed in diverse emotional tones: sadness, anger, and joy. We evaluate the consistency and repeatability of the planning capability of the proposed CHA. Furthermore, human evaluators critique the CHA's empathic delivery, with findings revealing a striking concordance between the CHA's outputs and evaluators' assessments. These results affirm the indispensable role of vocal (soon multimodal) emotion recognition in strengthening the empathetic connection built by CHAs, cementing their place at the forefront of interactive, compassionate digital health solutions.
https://arxiv.org/abs/2405.04777
Recent developments in large language models (LLMs), while offering a powerful foundation for developing natural language agents, raise safety concerns about them and the autonomous agents built upon them. Deception is one potential capability of AI agents of particular concern; we refer to it as an act or statement that misleads, hides the truth, or promotes a belief that is not true in its entirety or in part. We move away from the conventional understanding of deception as straight-out lying, making objectively selfish decisions, or giving false information, as seen in previous AI safety research. We target a specific category of deception achieved through obfuscation and equivocation. We broadly explain the two types of deception by analogizing them with the rabbit-out-of-the-hat magic trick, where (i) the rabbit either comes out of a hidden trap door or (ii) (our focus) the audience is completely distracted while the magician brings out the rabbit right in front of them using sleight of hand or misdirection. Our novel testbed framework displays the intrinsic deception capabilities of LLM agents in a goal-driven environment, when directed to be deceptive in their natural language generations, in a two-agent adversarial dialogue system built upon the legislative task of "lobbying" for a bill. Within this goal-driven environment, we develop deceptive capacity through a reinforcement learning setup, grounding it in theories from the philosophy of language and cognitive psychology. We find that the lobbyist agent increases its deceptive capabilities by ~40% (relative) through subsequent reinforcement trials of adversarial interactions, and our deception detection mechanism shows a detection capability of up to 92%. Our results highlight potential issues in agent-human interaction, with agents potentially manipulating humans toward their programmed end-goals.
https://arxiv.org/abs/2405.04325
Automatically detecting Alzheimer's Disease (AD) from spontaneous speech plays an important role in its early diagnosis. Recent approaches rely heavily on Transformer architectures due to their efficiency in modelling long-range context dependencies. However, the quadratic growth of self-attention's computational complexity with audio length poses a challenge when deploying such models on edge devices. In this context, we construct a novel framework, namely the Hierarchical Attention-Free Transformer (HAFFormer), to better deal with long speech for AD detection. Specifically, we employ an attention-free module of Multi-Scale Depthwise Convolution to replace self-attention and thus avoid its expensive computation, and a GELU-based Gated Linear Unit to replace the feedforward layer, aiming to automatically filter out redundant information. Moreover, we design a hierarchical structure to force the model to learn a variety of information grains, from the frame level to the dialogue level. In extensive experiments on the ADReSS-M dataset, the introduced HAFFormer achieves results (82.6% accuracy) competitive with other recent work, but with a significant reduction in computational complexity and model size compared to the standard Transformer. This shows the efficiency of HAFFormer in dealing with long audio for AD detection.
https://arxiv.org/abs/2405.03952
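The GELU-based Gated Linear Unit that HAFFormer uses in place of the feedforward layer can be sketched in a few lines of numpy; the projection shapes below are assumptions, and the multi-scale depthwise convolution module is omitted:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def gelu_glu(x, W_value, W_gate):
    # A value projection modulated elementwise by a GELU-activated gate,
    # letting the layer suppress (gate towards zero) redundant features.
    return (x @ W_value) * gelu(x @ W_gate)
```

The gating path is what gives the layer its filtering behaviour: features whose gate pre-activation is strongly negative are multiplied by a near-zero gate and effectively dropped.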
Social media platforms such as Twitter, Reddit, and Sina Weibo play a crucial role in global communication but often encounter strict regulations in geopolitically sensitive regions. This situation has prompted users to ingeniously modify their way of communicating, frequently resorting to coded language in these regulated social media environments. This shift in communication is not merely a strategy to counteract regulation, but a vivid manifestation of language evolution, demonstrating how language naturally evolves under societal and technological pressures. Studying the evolution of language in regulated social media contexts is of significant importance for ensuring freedom of speech, optimizing content moderation, and advancing linguistic research. This paper proposes a multi-agent simulation framework using Large Language Models (LLMs) to explore the evolution of user language in regulated social media environments. The framework employs LLM-driven agents: a supervisory agent that enforces dialogue supervision, and participant agents that evolve their language strategies while engaging in conversation, simulating how communication styles evolve to evade strict social media regulation. The study evaluates the framework's effectiveness through a range of scenarios, from abstract settings to real-world situations. Key findings indicate that LLMs are capable of simulating nuanced language dynamics and interactions in constrained settings, showing improvement in both evading supervision and information accuracy as evolution progresses. Furthermore, we find that LLM agents adopt different strategies for different scenarios.
https://arxiv.org/abs/2405.02858
This paper introduces Stochastic RAG, a novel approach for end-to-end optimization of retrieval-augmented generation (RAG) models that relaxes the simplifying assumptions of marginalization and document independence made in most prior work. Stochastic RAG casts the retrieval process in RAG as stochastic sampling without replacement. Through this formulation, we employ straight-through Gumbel-top-k, which provides a differentiable approximation of sampling without replacement and enables effective end-to-end optimization for RAG. We conduct extensive experiments on seven diverse datasets spanning a wide range of tasks, from open-domain question answering and fact verification to slot filling for relation extraction and dialogue systems. By applying this optimization method to a recent and effective RAG model, we advance state-of-the-art results on six out of seven datasets.
https://arxiv.org/abs/2405.02816
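The Gumbel-top-k trick at the heart of Stochastic RAG samples k distinct documents by perturbing log-scores with Gumbel noise and keeping the top k. The forward sampling step looks like this (the straight-through gradient estimator that makes it differentiable is omitted):

```python
import numpy as np

def gumbel_top_k(log_probs, k, rng):
    # Perturb each log-probability with i.i.d. Gumbel(0, 1) noise and keep
    # the k largest: this draws k indices without replacement from the
    # distribution (the Gumbel-top-k trick).
    gumbel = -np.log(-np.log(rng.uniform(size=log_probs.shape)))
    return np.argsort(log_probs + gumbel)[::-1][:k]

rng = np.random.default_rng(0)
scores = np.log(np.array([0.5, 0.2, 0.15, 0.1, 0.05]))  # toy retrieval scores
docs = gumbel_top_k(scores, k=3, rng=rng)               # 3 distinct doc indices
```

Because top-k over perturbed scores never repeats an index, the draw is without replacement by construction, which is what lets the formulation drop the document-independence assumption.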
Conversational information seeking has evolved rapidly in the last few years with the development of Large Language Models (LLMs), providing the basis for interpreting and responding in a naturalistic manner to user requests. The extended TREC Interactive Knowledge Assistance Track (iKAT) collection aims to enable researchers to test and evaluate their Conversational Search Agents (CSAs). The collection contains a set of 36 personalized dialogues over 20 different topics, each coupled with a Personal Text Knowledge Base (PTKB) that defines the bespoke user personas. A total of 344 turns with approximately 26,000 passages are provided with relevance assessments, as well as additional assessments of generated responses along four key dimensions: relevance, completeness, groundedness, and naturalness. The collection challenges CSAs to efficiently navigate diverse personal contexts, elicit pertinent persona information, and employ context for relevant conversations. The integration of a PTKB and the emphasis on decisional search tasks contribute to the uniqueness of this test collection, making it an essential benchmark for advancing research in conversational and interactive knowledge assistants.
https://arxiv.org/abs/2405.02637
This paper aims to efficiently enable large language models (LLMs) to use external knowledge and goal guidance in conversational recommender system (CRS) tasks. Advanced LLMs (e.g., ChatGPT) are limited in domain-specific CRS tasks for 1) generating grounded responses with recommendation-oriented knowledge, or 2) proactively leading the conversations through different dialogue goals. In this work, we first analyze those limitations through a comprehensive evaluation, showing the necessity of external knowledge and goal guidance which contribute significantly to the recommendation accuracy and language quality. In light of this finding, we propose a novel ChatCRS framework to decompose the complex CRS task into several sub-tasks through the implementation of 1) a knowledge retrieval agent using a tool-augmented approach to reason over external Knowledge Bases and 2) a goal-planning agent for dialogue goal prediction. Experimental results on two multi-goal CRS datasets reveal that ChatCRS sets new state-of-the-art benchmarks, improving language quality of informativeness by 17% and proactivity by 27%, and achieving a tenfold enhancement in recommendation accuracy.
https://arxiv.org/abs/2405.01868
Diagnosing autism spectrum disorder (ASD) by identifying abnormal speech patterns from examiner-patient dialogues presents significant challenges due to the subtle and diverse manifestations of speech-related symptoms in affected individuals. This study presents a comprehensive approach to identify distinctive speech patterns through the analysis of examiner-patient dialogues. Utilizing a dataset of recorded dialogues, we extracted 40 speech-related features, categorized into frequency, zero-crossing rate, energy, spectral characteristics, Mel Frequency Cepstral Coefficients (MFCCs), and balance. These features encompass various aspects of speech such as intonation, volume, rhythm, and speech rate, reflecting the complex nature of communicative behaviors in ASD. We employed machine learning for both classification and regression tasks to analyze these speech features. The classification model aimed to differentiate between ASD and non-ASD cases, achieving an accuracy of 87.75%. Regression models were developed to predict speech pattern related variables and a composite score from all variables, facilitating a deeper understanding of the speech dynamics associated with ASD. The effectiveness of machine learning in interpreting intricate speech patterns and the high classification accuracy underscore the potential of computational methods in supporting the diagnostic processes for ASD. This approach not only aids in early detection but also contributes to personalized treatment planning by providing insights into the speech and communication profiles of individuals with ASD.
https://arxiv.org/abs/2405.05126
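Two of the feature families listed above, zero-crossing rate and short-time energy, are simple to compute per frame; the frame and hop lengths below (25 ms and 10 ms at 16 kHz) are common defaults, not necessarily the paper's settings:

```python
import numpy as np

def frame_features(signal, frame_len=400, hop=160):
    # Per-frame zero-crossing rate and short-time energy (frame_len=400,
    # hop=160 correspond to 25 ms / 10 ms windows at 16 kHz).
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0)  # sign-change rate
        energy = np.mean(frame ** 2)                        # mean squared amplitude
        feats.append((zcr, energy))
    return np.array(feats)
```

Statistics of such per-frame descriptors (means, variances, ranges across an utterance), together with MFCCs and spectral features, would form the kind of 40-dimensional feature vector fed to the classifier and regression models.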