Recent years have witnessed the rapid development of short videos, which usually contain both visual and audio modalities. Background music is important to short videos and can significantly influence viewers' emotions. At present, however, the background music of a short video is generally chosen by the video producer, and automatic music recommendation methods for short videos are lacking. This paper introduces MVBind, an innovative Music-Video embedding space Binding model for cross-modal retrieval. MVBind is a self-supervised approach that acquires inherent knowledge of intermodal relationships directly from data, without the need for manual annotations. Additionally, to compensate for the lack of a corresponding music-video pair dataset for short videos, we construct SVM-10K (Short Video with Music-10K), a dataset consisting mainly of meticulously selected short videos. On this dataset, MVBind shows significantly improved performance compared to baseline methods. The constructed dataset and code will be released to facilitate future research.
https://arxiv.org/abs/2405.09286
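The abstract does not spell out MVBind's training objective, but a common way to bind two embedding spaces self-supervisedly from paired data is a batch-wise InfoNCE contrastive loss, where each video's paired music clip is the positive and the rest of the batch are negatives. A minimal pure-Python sketch (function names and the single video-to-music direction are illustrative assumptions, not the paper's actual formulation):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u)) or 1.0
    nv = math.sqrt(sum(b * b for b in v)) or 1.0
    return dot / (nu * nv)

def binding_loss(video_embs, music_embs, temperature=0.07):
    """Video-to-music InfoNCE: the i-th video/music pair is the positive;
    every other music clip in the batch is a negative. (A symmetric
    music-to-video term would swap the two roles.)"""
    n = len(video_embs)
    total = 0.0
    for i in range(n):
        sims = [cosine(video_embs[i], m) / temperature for m in music_embs]
        log_denom = math.log(sum(math.exp(s) for s in sims))
        total += log_denom - sims[i]  # -log softmax of the positive pair
    return total / n
```

Minimizing this loss pulls matched video/music embeddings together and pushes mismatched ones apart, which is what "binding" the two spaces means in practice.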
Traditional recommendation approaches, including content-based and collaborative filtering, usually focus on similarity between items or users. Existing approaches lack ways of introducing unexpectedness into recommendations, prioritizing globally popular items over exposing users to unforeseen items. This investigation aims to design and evaluate a novel layer on top of recommender systems that incorporates relational information and suggests items with a user-defined degree of surprise. We propose a Knowledge Graph (KG) based recommender system that encodes user interactions on item catalogs. Our study explores whether network-level metrics on KGs can influence the degree of surprise in recommendations. We hypothesize that surprisingness correlates with certain network metrics, treating user profiles as subgraphs within a larger catalog KG. The resulting solution reranks recommendations based on their impact on structural graph metrics. Our research contributes to optimizing recommendations to reflect these metrics. We experimentally evaluate our approach on two datasets: LastFM listening histories and synthetic Netflix viewing profiles. We find that reranking items based on complex-network metrics leads to a more unexpected and surprising composition of recommendation lists.
https://arxiv.org/abs/2405.08465
Factorization-based models have gained popularity since the Netflix challenge (2007). Since then, various factorization-based models have been developed and proven efficient in predicting users' ratings of items. A major concern is that explaining the recommendations generated by such methods is non-trivial because the explicit meaning of the latent factors they learn is not always clear. In response, we propose a novel model that combines factorization-based methods with argumentation frameworks (AFs). The integration of AFs provides clear meaning at each stage of the model, enabling it to produce easily understandable explanations for its recommendations. In this model, for every user-item interaction, an AF is defined in which the features of items are considered as arguments, and the users' ratings of these features determine the strength and polarity of these arguments. This perspective allows our model to treat feature attribution as a structured argumentation procedure, where each calculation is marked with explicit meaning, enhancing its inherent interpretability. Additionally, our framework seamlessly incorporates side information, such as user contexts, leading to more accurate predictions. We anticipate at least three practical applications for our model: creating explanation templates, providing interactive explanations, and generating contrastive explanations. Through testing on real-world datasets, we have found that our model, along with its variants, not only surpasses existing argumentation-based methods but also competes effectively with current context-free and context-aware methods.
https://arxiv.org/abs/2405.08131
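The feature-as-argument idea above can be made concrete with a toy aggregation: each feature the user has rated acts as an argument whose polarity (support vs. attack) and strength come from how far the rating sits from a neutral point. This is a simplified illustration of the scheme, not the paper's actual argumentation semantics; the neutral point `base` and the averaging rule are assumptions:

```python
def predict_rating(feature_ratings, item_features, base=3.0):
    """Each feature of an item is an argument: the user's past rating of
    the feature sets its polarity (supports the item if above `base`,
    attacks it if below) and its strength (distance from `base`).
    The prediction aggregates all arguments around the neutral point."""
    args = [feature_ratings[f] - base for f in item_features
            if f in feature_ratings]
    if not args:
        return base  # no arguments: fall back to the neutral rating
    return base + sum(args) / len(args)
```

Because every intermediate quantity (polarity, strength, aggregate) has an explicit meaning, the same numbers can be read back as an explanation, e.g. "recommended because you rate action highly".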
The novel coronavirus disease (COVID-19), a highly infectious respiratory disease caused by SARS-CoV-2, has emerged as an unprecedented healthcare crisis. The pandemic has had a devastating impact on the health, well-being, and economy of the global population. Early screening and diagnosis of symptomatic patients play a crucial role in isolating patients to help stop community transmission, and early treatment helps reduce the mortality rate. Although the RT-PCR test is the gold standard for COVID-19 testing, it is a manual, laborious, time-consuming, uncomfortable, and invasive process. Due to its accessibility, availability, lower cost, ease of sanitisation, and portable setup, chest X-ray imaging can serve as an effective screening and diagnostic tool. In this study, we first highlight the limitations of existing datasets and studies in terms of data quality, data imbalance, and evaluation strategy. Second, we curate a large-scale COVID-19 chest X-ray dataset from many publicly available COVID-19 imaging databases and propose a pre-processing pipeline to improve its quality. We propose CoVScreen, a CNN architecture trained and tested on the curated dataset. Experimental results across different classification scenarios and various evaluation metrics demonstrate the effectiveness of the proposed methodology in screening for COVID-19 infection.
https://arxiv.org/abs/2405.07674
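The abstract flags data imbalance and evaluation strategy as weaknesses of prior work. A standard guard (sketched here generically, not as the paper's exact protocol) is to report per-class precision and recall rather than overall accuracy, since accuracy can look good while the minority COVID class is entirely missed:

```python
def per_class_metrics(y_true, y_pred, labels):
    """Precision and recall per class. On imbalanced data such as
    COVID-19 X-ray corpora, accuracy alone hides minority-class
    failures; per-class recall exposes them."""
    out = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        out[c] = {
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
        }
    return out
```

In the example below a classifier that always predicts "normal" reaches 75% accuracy yet has zero recall on the COVID class, which is exactly the failure mode the curated evaluation strategy is meant to catch.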
The last year has witnessed considerable interest in Large Language Models (LLMs) for their potential applications in recommender systems, which may mitigate the persistent issue of data sparsity. Though great efforts have been made on user-item graph augmentation to improve graph-based recommendation performance, such methods may fail on the dynamic graph recommendation task, which involves both structural and temporal graph dynamics and the inherent complexity of processing time-evolving data. To bridge this gap, in this paper we propose a novel framework, called DynLLM, to deal with the dynamic graph recommendation task with LLMs. Specifically, DynLLM harnesses the power of LLMs to generate multi-faceted user profiles based on the rich textual features of historical purchase records, including crowd segments, personal interests, preferred categories, and favored brands, which in turn supplement and enrich the underlying relationships between users and items. Along this line, to fuse the multi-faceted profiles with temporal graph embeddings, we engage LLMs to derive the corresponding profile embeddings, and further employ a distilled attention mechanism to refine them, alleviating noisy signals while assessing and adjusting the relevance of each distilled facet embedding for seamless integration with temporal graph embeddings from continuous-time dynamic graphs (CTDGs). Extensive experiments on two real e-commerce datasets validate the improvements of DynLLM over a wide range of state-of-the-art baseline methods.
https://arxiv.org/abs/2405.07580
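The fusion step above (relevance-weighted facet embeddings combined with the temporal graph embedding) can be sketched as a single attention pass. The details of DynLLM's distilled attention are not in the abstract; this is a minimal stand-in where relevance is a dot product against the graph embedding and the weighted facets are added to it:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def fuse_profiles(graph_emb, facet_embs):
    """Weight each LLM-derived facet embedding by its relevance to the
    temporal graph embedding (dot product + softmax), then add the
    weighted sum to the graph embedding."""
    scores = [sum(g * f for g, f in zip(graph_emb, fe)) for fe in facet_embs]
    weights = softmax(scores)
    fused = list(graph_emb)
    for w, fe in zip(weights, facet_embs):
        for i, f in enumerate(fe):
            fused[i] += w * f
    return fused
```

A facet aligned with the graph embedding receives a larger weight, which mirrors the "assessing and adjusting the relevance of each facet" step; a noisy, misaligned facet is down-weighted rather than dropped.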
Conversational Recommender System (CRS) leverages real-time feedback from users to dynamically model their preferences, thereby enhancing the system's ability to provide personalized recommendations and improving the overall user experience. CRS has demonstrated significant promise, prompting researchers to concentrate their efforts on developing user simulators that are both more realistic and trustworthy. The emergence of Large Language Models (LLMs) has marked the onset of a new epoch in computational capabilities, exhibiting human-level intelligence in various tasks. Research efforts have been made to utilize LLMs for building user simulators to evaluate the performance of CRS. Although these efforts showcase innovation, they are accompanied by certain limitations. In this work, we introduce a Controllable, Scalable, and Human-Involved (CSHI) simulator framework that manages the behavior of user simulators across various stages via a plugin manager. CSHI customizes the simulation of user behavior and interactions to provide a more lifelike and convincing user interaction experience. Through experiments and case studies in two conversational recommendation scenarios, we show that our framework can adapt to a variety of conversational recommendation settings and effectively simulate users' personalized preferences. Consequently, our simulator is able to generate feedback that closely mirrors that of real users. This facilitates a reliable assessment of existing CRS studies and promotes the creation of high-quality conversational recommendation datasets.
https://arxiv.org/abs/2405.08035
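CSHI manages simulator behavior across stages via a plugin manager. The abstract does not give the API, so the sketch below shows the generic pattern under assumed names: plugins register per stage, and each stage's plugins transform a shared conversation context in order.

```python
class PluginManager:
    """Registers stage-specific plugins and runs them in order over a
    conversation context. Hypothetical names, not the paper's API."""
    def __init__(self):
        self._stages = {}

    def register(self, stage, plugin):
        self._stages.setdefault(stage, []).append(plugin)

    def run(self, stage, context):
        for plugin in self._stages.get(stage, []):
            context = plugin(context)  # each plugin returns the new context
        return context

manager = PluginManager()
# A preference plugin injects the simulated user's taste; a feedback
# plugin turns the current context into a user utterance.
manager.register("preference", lambda ctx: {**ctx, "genre": "jazz"})
manager.register("feedback",
                 lambda ctx: {**ctx, "reply": f"I like {ctx['genre']}"})

turn = manager.run("preference", {"user": "u1"})
turn = manager.run("feedback", turn)
```

Swapping a plugin changes the simulated behavior at exactly one stage, which is what makes the simulator controllable and lets a human step in by replacing a plugin with manual input.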
This study explores the application of recurrent neural networks to recognize emotions conveyed in music, aiming to enhance music recommendation systems and support therapeutic interventions by tailoring music to fit listeners' emotional states. We utilize Russell's Emotion Quadrant to categorize music into four distinct emotional regions and develop models capable of accurately predicting these categories. Our approach involves extracting a comprehensive set of audio features using Librosa and applying various recurrent neural network architectures, including standard RNNs, Bidirectional RNNs, and Long Short-Term Memory (LSTM) networks. Initial experiments are conducted on a dataset of 900 audio clips labeled according to the emotional quadrants. We compare the performance of our neural network models against a set of baseline classifiers and analyze their effectiveness in capturing the temporal dynamics inherent in musical expression. The results indicate that simpler RNN architectures may perform comparably to, or even better than, more complex models, particularly on smaller datasets. We also ran the same experiments on two larger datasets: one augmented from our original dataset and one drawn from other sources. This research not only enhances our understanding of the emotional impact of music but also demonstrates the potential of neural networks in creating more personalized and emotionally resonant music recommendation and therapy systems.
https://arxiv.org/abs/2405.06747
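Russell's quadrants partition the valence-arousal plane into four emotional regions, which is the label space described above. The mapping itself is just a sign check on the two axes; the quadrant names below are the conventional descriptors, used here illustratively:

```python
def russell_quadrant(valence, arousal):
    """Map a (valence, arousal) point in [-1, 1] x [-1, 1] to one of
    Russell's four emotion quadrants."""
    if valence >= 0:
        return "happy/excited" if arousal >= 0 else "calm/content"
    return "angry/anxious" if arousal >= 0 else "sad/depressed"
```

A regression model that predicts continuous valence and arousal can thus be turned into the four-class classifier used for the 900 labeled clips.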
Longstanding data labeling practices in machine learning involve collecting and aggregating labels from multiple annotators. But what should we do when annotators disagree? Though annotator disagreement has long been seen as a problem to minimize, new perspectivist approaches challenge this assumption by treating disagreement as a valuable source of information. In this position paper, we examine practices and assumptions surrounding the causes of disagreement--some challenged by perspectivist approaches, and some that remain to be addressed--as well as practical and normative challenges for work operating under these assumptions. We conclude with recommendations for the data labeling pipeline and avenues for future research engaging with subjectivity and disagreement.
https://arxiv.org/abs/2405.05860
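The perspectivist stance above treats disagreement as signal rather than noise. A simple way to operationalize that in a labeling pipeline, sketched here as an illustration rather than anything from the paper, is to keep the full label distribution per item and use its entropy to flag items where aggregation into a single majority label would discard information:

```python
import math
from collections import Counter

def label_distribution(labels):
    """Fraction of annotators choosing each label."""
    n = len(labels)
    return {k: v / n for k, v in Counter(labels).items()}

def disagreement_entropy(labels):
    """Shannon entropy (bits) of the annotator label distribution:
    0 means unanimity; higher values flag items worth keeping
    disaggregated instead of majority-voting."""
    return -sum(p * math.log2(p)
                for p in label_distribution(labels).values())
```

Items above an entropy threshold can be routed to adjudication, released with soft labels, or analyzed for systematic annotator perspectives.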
Given the increasing demand for mental health assistance, artificial intelligence (AI), particularly large language models (LLMs), may be valuable for integration into automated clinical support systems. In this work, we leverage a decision transformer architecture for topic recommendation in counseling conversations between patients and mental health professionals. The architecture is utilized for offline reinforcement learning, and we extract states (dialogue turn embeddings), actions (conversation topics), and rewards (scores measuring the alignment between patient and therapist) from previous turns within a conversation to train a decision transformer model. We demonstrate an improvement over baseline reinforcement learning methods, and propose a novel system of utilizing our model's output as synthetic labels for fine-tuning a large language model for the same task. Although our implementation based on LLaMA-2 7B has mixed results, future work can undoubtedly build on the design.
https://arxiv.org/abs/2405.05060
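The state/action/reward extraction above feeds a decision transformer, which is typically trained on return-to-go conditioned tuples: each position pairs the remaining cumulative reward with its state and action. A minimal sketch of that data preparation (the exact conditioning format in the paper is not given in the abstract):

```python
def build_training_tuples(turns):
    """turns: list of (state, action, reward) extracted from a
    counseling conversation, where states are turn embeddings, actions
    are topics, and rewards score patient-therapist alignment.
    Returns (return_to_go, state, action) tuples for training."""
    tuples = []
    for i, (state, action, _) in enumerate(turns):
        rtg = sum(r for _, _, r in turns[i:])  # reward still to come
        tuples.append((rtg, state, action))
    return tuples
```

At inference time the model is conditioned on a high target return-to-go, so it proposes the topic most associated with strong patient-therapist alignment from that state onward.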
Collaborative filtering (CF) methods for recommendation systems have been extensively researched, ranging from matrix factorization and autoencoder-based methods to graph filtering-based methods. Recently, lightweight methods that require almost no training have been proposed to reduce overall computation. However, existing methods still have room to improve the trade-offs among accuracy, efficiency, and robustness. In particular, there are no well-designed closed-form studies for balanced CF in terms of the aforementioned trade-offs. In this paper, we design SVD-AE, a simple yet effective singular value decomposition (SVD)-based linear autoencoder, whose closed-form solution can be defined based on SVD for CF. SVD-AE does not require iterative training processes, as its closed-form solution can be calculated at once. Furthermore, given the noisy nature of the rating matrix, we explore the robustness of existing CF methods and of SVD-AE against such noisy interactions. As a result, we demonstrate that our simple design choice based on truncated SVD can strengthen the noise robustness of the recommendation while improving efficiency. Code is available at this https URL.
https://arxiv.org/abs/2405.04746
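The abstract states that SVD-AE's closed-form solution is defined via truncated SVD. One standard way such a linear autoencoder is written (a sketch of the general construction, not necessarily the paper's exact formulation): given the rating matrix $R$ with rank-$k$ truncated SVD, taking the encoder to be $V_k$ and the decoder to be $V_k^\top$ yields a one-shot reconstruction

```latex
R \approx U_k \Sigma_k V_k^\top,
\qquad
\hat{R} = R\, V_k V_k^\top .
```

No iterative training is needed: a single truncated SVD of $R$ determines both encoder and decoder, and choosing a smaller $k$ discards the low-energy components, which is consistent with the claimed noise robustness.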
Augmenting Large Language Models (LLMs) with image-understanding capabilities has resulted in a boom of high-performing Vision-Language Models (VLMs). While studying the alignment of LLMs to human values has received widespread attention, the safety of VLMs has not received the same attention. In this paper, we explore the impact of jailbreaking on three state-of-the-art VLMs, each using a distinct modeling approach. By comparing each VLM to its respective LLM backbone, we find that each VLM is more susceptible to jailbreaking. We consider this an undesirable outcome of visual instruction tuning, which imposes a forgetting effect on an LLM's safety guardrails. Therefore, we provide recommendations for future work: evaluation strategies should aim to highlight the weaknesses of a VLM, and safety measures should be taken into account during visual instruction tuning.
https://arxiv.org/abs/2405.04403
Contemporary recommender systems predominantly rely on collaborative filtering techniques, employing ID embeddings to capture latent associations among users and items. However, this approach overlooks the wealth of semantic information embedded within textual descriptions of items, leading to suboptimal performance in cold-start scenarios and long-tail user recommendations. Leveraging the capabilities of Large Language Models (LLMs) pretrained on massive text corpora presents a promising avenue for enhancing recommender systems by integrating open-world domain knowledge. In this paper, we propose an Llm-driven knowlEdge Adaptive RecommeNdation (LEARN) framework that synergizes open-world knowledge with collaborative knowledge. We address computational complexity concerns by utilizing pretrained LLMs as item encoders and freezing the LLM parameters to avoid catastrophic forgetting and preserve open-world knowledge. To bridge the gap between the open-world and collaborative domains, we design a twin-tower structure supervised by the recommendation task and tailored for practical industrial application. Through offline experiments on a large-scale industrial dataset and online A/B tests, we demonstrate the efficacy of our approach.
https://arxiv.org/abs/2405.03988
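The twin-tower idea above can be sketched in a few lines: both towers consume frozen-LLM item embeddings, the user tower pools the embeddings of the user's history, and the score is a dot product. Projection heads and the recommendation-task supervision are omitted; mean pooling is an assumption for illustration, not necessarily LEARN's aggregation:

```python
def twin_tower_score(user_history_embs, item_emb):
    """User tower: mean-pool frozen-LLM embeddings of the user's
    history items. Item tower: the candidate's frozen-LLM embedding.
    The relevance score is their dot product."""
    dim = len(item_emb)
    user_emb = [sum(e[i] for e in user_history_embs) / len(user_history_embs)
                for i in range(dim)]
    return sum(u * v for u, v in zip(user_emb, item_emb))
```

Because the item encoder is frozen, cold-start items get meaningful embeddings from their text alone, while the trainable parts of the towers learn the collaborative signal.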
Drug discovery is a complex process that involves sequentially screening and examining a vast array of molecules to identify those with the target properties. This process, also referred to as sequential experimentation, faces challenges due to the vast search space, the rarity of target molecules, and constraints imposed by limited data and experimental budgets. To address these challenges, we introduce a human-in-the-loop framework for sequential experiments in drug discovery. This collaborative approach combines human expert knowledge with deep learning algorithms, enhancing the discovery of target molecules within a specified experimental budget. The proposed algorithm processes experimental data to recommend both promising molecules and those that could improve its performance to human experts. Human experts retain the final decision-making authority based on these recommendations and their domain expertise, including the ability to override algorithmic recommendations. We applied our method to drug discovery tasks using real-world data and found that it consistently outperforms all baseline methods, including those which rely solely on human or algorithmic input. This demonstrates the complementarity between human experts and the algorithm. Our results provide key insights into the levels of humans' domain knowledge, the importance of meta-knowledge, and effective work delegation strategies. Our findings suggest that such a framework can significantly accelerate the development of new vaccines and drugs by leveraging the best of both human and artificial intelligence.
https://arxiv.org/abs/2405.03942
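The human-in-the-loop protocol above (algorithm proposes within a budget, expert retains final authority including overrides) can be sketched as a selection loop. The refill-after-veto behavior is an assumption for illustration; the paper's exact delegation strategy is not given in the abstract:

```python
def select_for_screening(candidates, algo_score, expert_override, budget):
    """Rank molecules by the algorithm's score, then let the expert
    veto or keep each proposal; vetoed slots are refilled from the
    next-best candidates until the experimental budget is spent."""
    ranked = sorted(candidates, key=algo_score, reverse=True)
    chosen = []
    for mol in ranked:
        if len(chosen) == budget:
            break
        if expert_override(mol):  # expert rejects this molecule
            continue
        chosen.append(mol)
    return chosen
```

The complementarity claim in the abstract corresponds to `expert_override` encoding domain knowledge the score function lacks, so the screened set is better than either party's solo choice.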
With their impressive performance on various downstream tasks, large language models (LLMs) have been widely integrated into production pipelines, such as recruitment and recommendation systems. A known issue of models trained on natural language data is the presence of human biases, which can impact the fairness of the system. This paper investigates LLMs' behavior with respect to gender stereotypes in the context of occupation decision making. Our framework is designed to investigate and quantify the presence of gender stereotypes in LLMs' behavior via multi-round question answering. Inspired by prior works, we construct a dataset by leveraging a standard occupation classification knowledge base released by authoritative agencies. We tested three LLMs (RoBERTa-large, GPT-3.5-turbo, and Llama2-70b-chat) and found that all models exhibit gender stereotypes analogous to human biases, but with different preferences. The distinct preferences of GPT-3.5-turbo and Llama2-70b-chat may imply that current alignment methods are insufficient for debiasing and could introduce new biases contradicting traditional gender stereotypes.
https://arxiv.org/abs/2405.06687
In time-critical decisions, human decision-makers can interact with AI-enabled situation-aware software to evaluate many imminent and possible scenarios, retrieve billions of facts, and estimate different outcomes based on trillions of parameters in a fraction of a second. In high-order reasoning, "what-if" questions can be used to challenge the assumptions or pre-conditions of the reasoning, "why-not" questions to challenge the method applied in the reasoning, "so-what" questions to challenge the purpose of the decision, and "how-about" questions to challenge the applicability of the method. When these high-order reasoning questions are applied to assist human decision-making, they can help humans make time-critical decisions and avoid false-negative or false-positive errors. In this paper, we present a model of high-order reasoning to offer recommendations in evidence-based medicine in a time-critical fashion for applications in the ICU. Our system uses a Large Language Model (LLM). The experiments demonstrated that the LLM exhibited optimal performance in the "what-if" scenario, achieving a similarity of 88.52% with the treatment plans of human doctors. In the "why-not" scenario, the best-performing model tended to opt for alternative treatment plans in 70% of cases for patients who died after being discharged from the ICU. In the "so-what" scenario, the optimal model provided a detailed analysis of the motivation and significance of treatment plans for ICU patients, with its reasoning achieving a similarity of 55.6% with actual diagnostic information. In the "how-about" scenario, the top-performing LLM demonstrated a content similarity of 66.5% in designing treatment plans that transfer to similar diseases. Meanwhile, LLMs managed to predict the life status of patients after their discharge from the ICU with an accuracy of 70%.
https://arxiv.org/abs/2405.03010
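The four question types above are essentially prompt schemas aimed at different parts of a decision. A minimal sketch of how they could be templated (the wording and slot names are illustrative assumptions, not the paper's actual prompts):

```python
# One template per high-order reasoning probe; {slots} are filled
# from the case at hand before being sent to the LLM.
TEMPLATES = {
    "what-if":  "What if {assumption} did not hold for this patient?",
    "why-not":  "Why not use {alternative} instead of the chosen plan?",
    "so-what":  "What is the clinical purpose of {decision}?",
    "how-about": "Would {method} also apply to {similar_case}?",
}

def build_probe(kind, **slots):
    """Instantiate a high-order reasoning question of the given kind."""
    return TEMPLATES[kind].format(**slots)
```

Each probe targets a different failure mode: "what-if" stresses assumptions, "why-not" the method, "so-what" the purpose, and "how-about" transferability.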
Conversational recommender systems have emerged as a potent solution for efficiently eliciting user preferences. These systems interactively present queries associated with "key terms" to users and leverage user feedback to estimate user preferences more efficiently. Nonetheless, most existing algorithms adopt a centralized approach. In this paper, we introduce FedConPE, a phase elimination-based federated conversational bandit algorithm, where $M$ agents collaboratively solve a global contextual linear bandit problem with the help of a central server while ensuring secure data management. To effectively coordinate all the clients and aggregate their collected data, FedConPE uses an adaptive approach to construct key terms that minimize uncertainty across all dimensions in the feature space. Furthermore, compared with existing federated linear bandit algorithms, FedConPE offers improved computational and communication efficiency as well as enhanced privacy protections. Our theoretical analysis shows that FedConPE is minimax near-optimal in terms of cumulative regret. We also establish upper bounds for communication costs and conversation frequency. Comprehensive evaluations demonstrate that FedConPE outperforms existing conversational bandit algorithms while using fewer conversations.
https://arxiv.org/abs/2405.02881
Traditional recommender systems such as matrix factorization methods rely on learning a shared dense embedding space to represent both items and user preferences. Sequence models such as RNNs, GRUs, and, recently, Transformers have also excelled at sequential recommendation. This task requires understanding the sequential structure present in users' historical interactions to predict the next item they may like. Building upon the success of Large Language Models (LLMs) in a variety of tasks, researchers have recently explored using LLMs, pretrained on vast corpora of text, for sequential recommendation. To use LLMs in sequential recommendation, both the history of user interactions and the model's prediction of the next item are expressed in text form. We propose CALRec, a two-stage LLM finetuning framework that finetunes a pretrained LLM in a two-tower fashion using a mixture of two contrastive losses and a language modeling loss: the LLM is first finetuned on a data mixture from multiple domains, followed by another round of target-domain finetuning. Our model significantly outperforms many state-of-the-art baselines (+37% in Recall@1 and +24% in NDCG@10), and systematic ablation studies reveal that (i) both stages of finetuning are crucial and, when combined, yield improved performance, and (ii) contrastive alignment is effective among the target domains explored in our experiments.
https://arxiv.org/abs/2405.02429
The Adversarial Markov Decision Process (AMDP) is a learning framework that deals with unknown and varying tasks in decision-making applications like robotics and recommendation systems. A major limitation of the AMDP formalism, however, is its pessimistic regret analysis: although the cost function can change from one episode to the next, the evolution in many settings is not adversarial. To address this, we introduce and study a new variant of AMDP, which aims to minimize regret while utilizing a set of cost predictors. For this setting, we develop a new policy search method that achieves a sublinear optimistic regret with high probability, that is, a regret bound that gracefully degrades with the estimation power of the cost predictors. Establishing such optimistic regret bounds is nontrivial given that (i) as we demonstrate, the existing importance-weighted cost estimators cannot establish optimistic bounds, and (ii) the feedback model of AMDP is different from (and more realistic than) that of existing optimistic online learning works. Our result, in particular, hinges upon developing a novel optimistically biased cost estimator that leverages cost predictors and enables a high-probability regret analysis without imposing restrictive assumptions. We further discuss practical extensions of the proposed scheme and demonstrate its efficacy numerically.
https://arxiv.org/abs/2405.02188
Structured science summaries, or research contributions described using properties or dimensions beyond traditional keywords, enhance science findability. Current methods, such as those used by the Open Research Knowledge Graph (ORKG), involve manually curating properties to describe research papers' contributions in a structured manner, but this is labor-intensive and inconsistent across domain-expert curators. We propose using Large Language Models (LLMs) to automatically suggest these properties. However, it is essential to assess the readiness of LLMs like GPT-3.5, Llama 2, and Mistral for this task before application. Our study performs a comprehensive comparative analysis between ORKG's manually curated properties and those generated by the aforementioned state-of-the-art LLMs. We evaluate LLM performance from four unique perspectives: semantic alignment and deviation from ORKG properties, fine-grained property mapping accuracy, SciNCL embedding-based cosine similarity, and expert surveys comparing manual annotations with LLM outputs. These evaluations occur within a multidisciplinary science setting. Overall, LLMs show potential as recommendation systems for structuring science, but further finetuning is recommended to improve their alignment with scientific tasks and their mimicry of human expertise.
https://arxiv.org/abs/2405.02105
This paper aims to efficiently enable large language models (LLMs) to use external knowledge and goal guidance in conversational recommender system (CRS) tasks. Advanced LLMs (e.g., ChatGPT) are limited in domain-specific CRS tasks in 1) generating grounded responses with recommendation-oriented knowledge and 2) proactively leading the conversation through different dialogue goals. In this work, we first analyze those limitations through a comprehensive evaluation, showing the necessity of external knowledge and goal guidance, which contribute significantly to recommendation accuracy and language quality. In light of this finding, we propose a novel ChatCRS framework that decomposes the complex CRS task into several sub-tasks through the implementation of 1) a knowledge retrieval agent using a tool-augmented approach to reason over external knowledge bases and 2) a goal-planning agent for dialogue goal prediction. Experimental results on two multi-goal CRS datasets reveal that ChatCRS sets new state-of-the-art benchmarks, improving the language quality of informativeness by 17% and proactivity by 27%, and achieving a tenfold improvement in recommendation accuracy.
https://arxiv.org/abs/2405.01868