With the advent of image super-resolution (SR) algorithms, how to evaluate the quality of generated SR images has become an urgent task. Although full-reference methods perform well in SR image quality assessment (SR-IQA), their reliance on high-resolution (HR) images limits their practical applicability. Leveraging the reconstruction information that is available, such as low-resolution (LR) images and scale factors, is a promising way to enhance assessment performance for SR-IQA when no HR reference is available. In this letter, we attempt to evaluate the perceptual quality and reconstruction fidelity of SR images by taking LR images and scale factors into account. Specifically, we propose a novel dual-branch reduced-reference SR-IQA network, \ie, Perception- and Fidelity-aware SR-IQA (PFIQA). The perception-aware branch evaluates the perceptual quality of SR images by combining the global modeling of Vision Transformer (ViT) with the local relation modeling of ResNet, and incorporates the scale factor to enable comprehensive visual perception. Meanwhile, the fidelity-aware branch assesses the reconstruction fidelity between LR and SR images through their visual perception. The combination of the two branches aligns closely with the human visual system, enabling a comprehensive evaluation of SR images. Experimental results indicate that our PFIQA outperforms current state-of-the-art models across three widely-used SR-IQA benchmarks. Notably, PFIQA excels in assessing the quality of real-world SR images.
https://arxiv.org/abs/2405.09472
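PFIQA's dual-branch layout maps naturally onto a small amount of glue code. Below is a minimal, illustrative PyTorch sketch of the idea only (ViT plus ResNet features and a scale-factor embedding for the perception branch, an LR-vs-SR feature comparison for the fidelity branch); the module choices, dimensions, and names are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torchvision.models as tvm


class PFIQASketch(nn.Module):
    """Illustrative dual-branch reduced-reference SR-IQA model (not the authors' code)."""

    def __init__(self, feat_dim=256):
        super().__init__()
        # Perception-aware branch: global (ViT) and local (ResNet) features of the SR image.
        self.vit = tvm.vit_b_16()            # pretrained weights would be loaded in practice
        self.vit.heads = nn.Identity()       # expose the 768-d class-token feature
        resnet = tvm.resnet18()
        self.cnn = nn.Sequential(*list(resnet.children())[:-1])  # 512-d pooled feature
        self.scale_embed = nn.Linear(1, 32)  # scale factor as a conditioning signal
        self.perception_head = nn.Linear(768 + 512 + 32, feat_dim)
        # Fidelity-aware branch: compare SR features against (upsampled) LR features.
        self.fidelity_head = nn.Linear(2 * 512, feat_dim)
        self.regressor = nn.Sequential(nn.Linear(2 * feat_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, sr, lr_up, scale):
        # sr, lr_up: (B, 3, 224, 224) crops; scale: (B, 1), e.g. 4.0 for x4 SR.
        g = self.vit(sr)
        l = self.cnn(sr).flatten(1)
        s = self.scale_embed(scale)
        perception = self.perception_head(torch.cat([g, l, s], dim=1))
        fidelity = self.fidelity_head(torch.cat([l, self.cnn(lr_up).flatten(1)], dim=1))
        return self.regressor(torch.cat([perception, fidelity], dim=1)).squeeze(1)


scores = PFIQASketch()(torch.rand(2, 3, 224, 224), torch.rand(2, 3, 224, 224),
                       torch.tensor([[4.0], [4.0]]))
```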
In this paper, we present a simple but effective method to enhance blind video quality assessment (BVQA) models for social media videos. Motivated by previous studies that leverage pre-trained features extracted from various computer vision models as the feature representation for BVQA, we further explore rich quality-aware features from pre-trained blind image quality assessment (BIQA) and BVQA models as auxiliary features to help the BVQA model handle the complex distortions and diverse content of social media videos. Specifically, we use SimpleVQA, a BVQA model that consists of a trainable Swin Transformer-B and a fixed SlowFast, as our base model. The Swin Transformer-B and SlowFast components are responsible for extracting spatial and motion features, respectively. Then, we extract three kinds of features from Q-Align, LIQE, and FAST-VQA to capture frame-level quality-aware features, frame-level quality-aware and scene-specific features, and spatiotemporal quality-aware features, respectively. After concatenating these features, we employ a multi-layer perceptron (MLP) network to regress them into quality scores. Experimental results demonstrate that the proposed model achieves the best performance on three public social media VQA datasets. Moreover, the proposed model won first place in the CVPR NTIRE 2024 Short-form UGC Video Quality Assessment Challenge. The code is available at \url{this https URL}.
https://arxiv.org/abs/2405.08745
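The fusion step at the heart of this method, concatenating pre-extracted quality-aware features and regressing them with an MLP, reduces to a few lines. A minimal sketch follows; the feature dimensions and names are illustrative assumptions rather than the released code.

```python
import torch
import torch.nn as nn

# Assumed dimensions for the pre-extracted descriptors (illustrative only).
DIMS = {"swin_spatial": 1024, "slowfast_motion": 2304, "q_align": 128, "liqe": 256, "fast_vqa": 768}


class FusionMLP(nn.Module):
    """Concatenate quality-aware features from several frozen models and regress a score."""

    def __init__(self, dims=DIMS, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(sum(dims.values()), hidden), nn.GELU(), nn.Dropout(0.1), nn.Linear(hidden, 1)
        )

    def forward(self, feats):
        # feats: dict mapping feature name -> (B, dim) tensor, e.g. averaged over frames.
        x = torch.cat([feats[k] for k in DIMS], dim=1)
        return self.mlp(x).squeeze(1)


feats = {k: torch.rand(4, d) for k, d in DIMS.items()}
print(FusionMLP()(feats).shape)  # torch.Size([4])
```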
With recent advances in deep learning, numerous algorithms have been developed to enhance video quality, reduce visual artefacts and improve perceptual quality. However, little research has been reported on the quality assessment of enhanced content - the evaluation of enhancement methods is often based on quality metrics that were designed for compression applications. In this paper, we propose a novel blind deep video quality assessment (VQA) method specifically for enhanced video content. It employs a new Recurrent Memory Transformer (RMT) based network architecture to obtain video quality representations, which is optimised through a novel content-quality-aware contrastive learning strategy based on a new database containing 13K training patches with enhanced content. The extracted quality representations are then combined through linear regression to generate video-level quality indices. The proposed method, RMT-BVQA, has been evaluated on the VDPVE (VQA Dataset for Perceptual Video Enhancement) database through a five-fold cross validation. The results show its superior correlation performance when compared to ten existing no-reference quality metrics.
https://arxiv.org/abs/2405.08621
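The contrastive stage can be approximated with a standard InfoNCE-style objective. The sketch below treats patches that share a content/quality group as positives, which is a simplification of the paper's content-quality-aware strategy rather than its exact formulation.

```python
import torch
import torch.nn.functional as F


def info_nce(embeddings, group_ids, temperature=0.1):
    """InfoNCE-style loss: patches sharing a group id are positives, all others negatives."""
    z = F.normalize(embeddings, dim=1)                 # (N, D)
    sim = z @ z.t() / temperature                      # (N, N) scaled cosine similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool)
    pos = (group_ids.unsqueeze(0) == group_ids.unsqueeze(1)) & ~self_mask
    sim = sim.masked_fill(self_mask, float("-inf"))    # exclude self-similarity
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_log_prob = log_prob.masked_fill(~pos, 0.0)     # keep only positive-pair terms
    has_pos = pos.any(dim=1)
    loss = -pos_log_prob.sum(dim=1)[has_pos] / pos.sum(dim=1)[has_pos]
    return loss.mean()


loss = info_nce(torch.randn(8, 128), torch.tensor([0, 0, 1, 1, 2, 2, 3, 3]))
```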
Portrait images typically consist of a salient person against diverse backgrounds. With the development of mobile devices and image processing techniques, users can conveniently capture portrait images anytime and anywhere. However, the quality of these portraits may suffer from degradation caused by unfavorable environmental conditions, subpar photography techniques, and inferior capturing devices. In this paper, we introduce a dual-branch network for portrait image quality assessment (PIQA), which can effectively address how the salient person and the background of a portrait image influence its visual quality. Specifically, we utilize two backbone networks (\textit{i.e.,} Swin Transformer-B) to extract the quality-aware features from the entire portrait image and the facial image cropped from it. To enhance the quality-aware feature representation of the backbones, we pre-train them on the large-scale video quality assessment dataset LSVQ and the large-scale facial image quality assessment dataset GFIQA. Additionally, we leverage LIQE, an image scene classification and quality assessment model, to capture quality-aware and scene-specific features as auxiliary features. Finally, we concatenate these features and regress them into quality scores via a multi-layer perceptron (MLP). We employ the fidelity loss to train the model in a learning-to-rank manner to mitigate inconsistencies in quality scores in the portrait image quality assessment dataset PIQ. Experimental results demonstrate that the proposed model achieves superior performance on the PIQ dataset, validating its effectiveness. The code is available at \url{this https URL}.
https://arxiv.org/abs/2405.08555
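Fidelity-loss training for learning-to-rank IQA is usually written over image pairs under a Thurstone-style preference model. A minimal sketch of that common formulation follows; the exact variant used in the paper may differ.

```python
import torch


def fidelity_ranking_loss(pred_i, pred_j, label):
    """Pairwise fidelity loss for learning-to-rank IQA.

    label is 1 if image i is preferred over image j, else 0. The predicted preference
    probability follows a Thurstone model with unit variance per image:
    p_hat = Phi((pred_i - pred_j) / sqrt(2)).
    """
    p_hat = 0.5 * (1 + torch.erf((pred_i - pred_j) / 2))
    label = label.float()
    fidelity = torch.sqrt(p_hat * label + 1e-8) + torch.sqrt((1 - p_hat) * (1 - label) + 1e-8)
    return (1 - fidelity).mean()


loss = fidelity_ranking_loss(torch.randn(16), torch.randn(16), torch.randint(0, 2, (16,)))
```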
This paper describes our approach to the MEDIQA-CORR shared task, which involves error detection and correction in clinical notes curated by medical professionals. This task involves handling three subtasks: detecting the presence of errors, identifying the specific sentence containing the error, and correcting it. Through our work, we aim to assess the capabilities of Large Language Models (LLMs) trained on vast corpora of internet data that contain both factual and unreliable information. We propose to comprehensively address all subtasks together, and suggest employing a unique prompt-based in-context learning strategy. We evaluate its efficacy in this specialized task, which demands a combination of general reasoning and medical knowledge. Because prediction errors in medical systems can have grave consequences, we propose leveraging self-consistency and ensemble methods to enhance error detection and correction performance.
https://arxiv.org/abs/2405.08373
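Self-consistency, as proposed here, amounts to sampling several completions and taking a majority vote over the extracted answers. A model-agnostic sketch follows, where `query_llm` is a hypothetical placeholder for the actual chat-completion call and the answer-extraction rule is a simplifying assumption.

```python
from collections import Counter


def query_llm(prompt: str, temperature: float = 0.7) -> str:
    """Hypothetical placeholder for the actual LLM API call."""
    raise NotImplementedError


def self_consistent_answer(prompt: str, n_samples: int = 5) -> str:
    """Sample several completions and return the most frequent final answer."""
    answers = []
    for _ in range(n_samples):
        completion = query_llm(prompt, temperature=0.7)      # stochastic sampling
        answers.append(completion.strip().splitlines()[-1])  # assume the final line holds the answer
    return Counter(answers).most_common(1)[0][0]
```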
This paper presents BARKPLUG V.2, a Large Language Model (LLM)-based chatbot system built using Retrieval Augmented Generation (RAG) pipelines to enhance the user experience and access to information within academic settings. The objective of BARKPLUG V.2 is to provide information to users about various campus resources, including academic departments, programs, campus facilities, and student resources, in an interactive fashion within a university setting. Our system leverages university data as an external data corpus and ingests it into our RAG pipelines for domain-specific question-answering tasks. We evaluate the effectiveness of our system in generating accurate and pertinent responses for Mississippi State University, as a case study, using quantitative measures and frameworks such as Retrieval Augmented Generation Assessment (RAGAS). Furthermore, we evaluate the usability of this system via subjective satisfaction surveys using the System Usability Scale (SUS). Our system demonstrates impressive quantitative performance, with a mean RAGAS score of 0.96, and a positive user experience, as validated by usability assessments.
https://arxiv.org/abs/2405.08120
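A RAG pipeline of this kind boils down to embed, retrieve, and prompt. The sketch below shows that skeleton with cosine-similarity retrieval over pre-computed document embeddings; `embed` and `generate` are hypothetical placeholders for whichever embedding model and LLM the system actually uses.

```python
import numpy as np


def embed(texts):
    """Hypothetical embedding model; returns one vector per input text."""
    raise NotImplementedError


def generate(prompt):
    """Hypothetical LLM call."""
    raise NotImplementedError


def answer_with_rag(question, documents, doc_embeddings, top_k=3):
    """Retrieve the top-k most similar documents and condition the LLM answer on them."""
    q = embed([question])[0]
    doc_norm = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    scores = doc_norm @ (q / np.linalg.norm(q))
    top = np.argsort(scores)[::-1][:top_k]
    context = "\n\n".join(documents[i] for i in top)
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
    return generate(prompt)
```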
Due to the concise and structured nature of tables, the knowledge contained therein may be incomplete or missing, posing a significant challenge for table question answering (TableQA) and data analysis systems. Most existing datasets either fail to address the issue of external knowledge in TableQA or only utilize unstructured text as supplementary information for tables. In this paper, we propose to use a knowledge base (KB) as the external knowledge source for TableQA and construct a dataset, KET-QA, with fine-grained gold evidence annotation. Each table in the dataset corresponds to a sub-graph of the entire KB, and every question requires the integration of information from both the table and the sub-graph to be answered. To extract pertinent information from the vast knowledge sub-graph and apply it to TableQA, we design a retriever-reasoner structured pipeline model. Experimental results demonstrate that our model consistently achieves remarkable relative performance improvements ranging from 1.9 to 6.5 times and absolute improvements of 11.66% to 44.64% on EM scores across three distinct settings (fine-tuning, zero-shot, and few-shot), in comparison with solely relying on table information in the traditional TableQA manner. However, even the best model achieves only a 60.23% EM score, which still lags behind human-level performance, highlighting the challenging nature of KET-QA for the question-answering community. We also provide a human evaluation of error cases to further analyze the aspects in which the model can be improved. Project page: this https URL.
https://arxiv.org/abs/2405.08099
Diagnosing and managing a patient is a complex, sequential decision-making process that requires physicians to obtain information -- such as which tests to perform -- and to act upon it. Recent advances in artificial intelligence (AI) and large language models (LLMs) promise to profoundly impact clinical care. However, current evaluation schemes rely too heavily on static medical question-answering benchmarks, falling short on the interactive decision-making that is required in real-life clinical work. Here, we present AgentClinic: a multimodal benchmark to evaluate LLMs in their ability to operate as agents in simulated clinical environments. In our benchmark, the doctor agent must uncover the patient's diagnosis through dialogue and active data collection. We present two open benchmarks: a multimodal image and dialogue environment, AgentClinic-NEJM, and a dialogue-only environment, AgentClinic-MedQA. We embed cognitive and implicit biases in both patient and doctor agents to emulate realistic interactions between biased agents. We find that introducing bias leads to large reductions in the diagnostic accuracy of the doctor agents, as well as reduced compliance, confidence, and follow-up consultation willingness in patient agents. Evaluating a suite of state-of-the-art LLMs, we find that several models that excel in benchmarks like MedQA perform poorly in AgentClinic-MedQA. We find that the LLM used in the patient agent is an important factor for performance in the AgentClinic benchmark. We show that both too few and too many interactions reduce the diagnostic accuracy of doctor agents. The code and data for this work are publicly available at this https URL.
https://arxiv.org/abs/2405.07960
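The doctor-patient interaction in AgentClinic is, at its core, a turn-limited dialogue loop between two LLM-backed agents. A schematic sketch follows; `chat` is a hypothetical placeholder for an LLM call, and the prompts and stopping rule are chosen purely for illustration.

```python
def chat(system_prompt: str, history: list) -> str:
    """Hypothetical placeholder for an LLM chat call."""
    raise NotImplementedError


def run_consultation(case_description: str, max_turns: int = 10) -> str:
    """Doctor agent questions a patient agent until it commits to a diagnosis or turns run out."""
    doctor_sys = ("You are a physician. Ask one question per turn; "
                  "when confident, reply 'DIAGNOSIS: <condition>'.")
    patient_sys = f"You are a patient. Answer truthfully based only on this case: {case_description}"
    history = []
    for _ in range(max_turns):
        doctor_msg = chat(doctor_sys, history)
        history.append(f"Doctor: {doctor_msg}")
        if doctor_msg.startswith("DIAGNOSIS:"):
            return doctor_msg.split("DIAGNOSIS:", 1)[1].strip()
        history.append(f"Patient: {chat(patient_sys, history)}")
    return "undetermined"
```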
In this paper, we introduce EconLogicQA, a rigorous benchmark designed to assess the sequential reasoning capabilities of large language models (LLMs) within the intricate realms of economics, business, and supply chain management. Diverging from traditional benchmarks that predict subsequent events individually, EconLogicQA poses a more challenging task: it requires models to discern and sequence multiple interconnected events, capturing the complexity of economic logic. EconLogicQA comprises an array of multi-event scenarios derived from economic articles, which necessitate an insightful understanding of both temporal and logical event relationships. Through comprehensive evaluations, we show that EconLogicQA effectively gauges an LLM's proficiency in navigating the sequential complexities inherent in economic contexts. We provide a detailed description of the EconLogicQA dataset and report the outcomes of evaluating the benchmark across various leading-edge LLMs, thereby offering a thorough perspective on their sequential reasoning potential in economic contexts. Our benchmark dataset is available at this https URL.
https://arxiv.org/abs/2405.07938
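Scoring a model on this kind of event-ordering task is simple once predictions are parsed into ordered lists. The illustrative scorer below computes exact-match accuracy over full sequences, which is one plausible metric rather than the benchmark's official script.

```python
def sequence_exact_match(predictions, references):
    """Fraction of examples whose predicted event order matches the gold order exactly."""
    assert len(predictions) == len(references)
    correct = sum(1 for pred, gold in zip(predictions, references) if list(pred) == list(gold))
    return correct / len(references)


# Example: each item is an ordering of event labels A-D extracted from an article.
preds = [["B", "A", "C", "D"], ["A", "C", "B", "D"]]
golds = [["B", "A", "C", "D"], ["A", "B", "C", "D"]]
print(sequence_exact_match(preds, golds))  # 0.5
```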
In this technical report, we present the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF), which is widely reported in the recent large language model (LLM) literature to outperform its offline counterpart by a large margin. However, existing open-source RLHF projects are still largely confined to the offline learning setting. Here, we aim to fill this gap and provide a detailed recipe for online iterative RLHF that is easy to reproduce. In particular, since online human feedback is usually infeasible for open-source communities with limited resources, we start by constructing preference models using a diverse set of open-source datasets and use the constructed proxy preference model to approximate human feedback. Then, we discuss the theoretical insights and algorithmic principles behind online iterative RLHF, followed by a detailed practical implementation. Our trained LLM, SFR-Iterative-DPO-LLaMA-3-8B-R, achieves impressive performance on LLM chatbot benchmarks, including AlpacaEval-2, Arena-Hard, and MT-Bench, as well as other academic benchmarks such as HumanEval and TruthfulQA. We have shown that supervised fine-tuning (SFT) and iterative RLHF can obtain state-of-the-art performance with fully open-source datasets. Further, we have made our models, curated datasets, and comprehensive step-by-step code guidebooks publicly available. Please refer to this https URL and this https URL for more detailed information.
https://arxiv.org/abs/2405.07863
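The policy-update step in such an iterative recipe is commonly a DPO objective over (chosen, rejected) pairs scored by the proxy preference model. A minimal sketch of the standard DPO loss follows; the report's exact algorithm and hyperparameters may differ.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: make the policy prefer chosen over rejected responses
    relative to a frozen reference model."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


loss = dpo_loss(torch.randn(8), torch.randn(8), torch.randn(8), torch.randn(8))
```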
This paper undertakes an empirical study to revisit the latest advancements in Multimodal Large Language Models (MLLMs): the Video Assistant. This study, namely FreeVA, aims to extend an existing image-based MLLM to the video domain in a training-free manner. The study provides an essential, must-know baseline and reveals several surprising findings: 1) FreeVA, leveraging only an offline image-based MLLM without additional training, excels in zero-shot video question-answering (e.g., MSVD-QA, ActivityNet-QA, and MSRVTT-QA), even surpassing state-of-the-art methods that involve video instruction tuning. 2) While mainstream video-based MLLMs typically initialize with an image-based MLLM (e.g., LLaVA) and then fine-tune using video instruction tuning, the study indicates that utilizing the widely adopted VideoInstruct-100K for video instruction tuning does not actually lead to better performance compared to not training at all. 3) The evaluation metrics commonly used in existing works are significantly influenced by changes in the GPT API version over time. If ignored, this could affect the fairness and uniformity of comparisons between different methods and impact the analysis and judgment of researchers in the field. The advancement of MLLMs is currently thriving, drawing numerous researchers into the field. We aim for this work to serve as a plug-and-play, simple yet effective baseline, encouraging the direct evaluation of existing MLLMs in the video domain while also standardizing the field of video conversational models to a certain extent. We also encourage researchers to reconsider: have current video MLLM methods truly acquired knowledge beyond image MLLMs? Code is available at this https URL
https://arxiv.org/abs/2405.07798
There has been increasing interest in investigating the behaviours of large language models (LLMs) and LLM-powered chatbots by treating an LLM as a participant in a psychological experiment. We therefore developed an R package called "MacBehaviour" that aims to interact with more than 60 language models in one package (e.g., OpenAI's GPT family, the Claude family, Gemini, the Llama family, and open-source models) and streamline the experimental process of LLM behaviour experiments. The package offers a comprehensive set of functions designed for LLM experiments, covering experiment design, stimuli presentation, model behaviour manipulation, and logging of responses and token probabilities. To demonstrate the utility and effectiveness of "MacBehaviour," we conducted three validation experiments on three LLMs (GPT-3.5, Llama-2 7B, and Vicuna-1.5 13B) to replicate the sound-gender association in LLMs. The results consistently showed that these models exhibit human-like tendencies to infer gender from novel personal names based on their phonology, as previously demonstrated (Cai et al., 2023). In summary, "MacBehaviour" is an R package for machine behaviour studies which offers a user-friendly interface and comprehensive features to simplify and standardize the experimental process.
https://arxiv.org/abs/2405.07495
While vision-language pretrained models (VLMs) excel in various multimodal understanding tasks, their potential in fine-grained audio-visual reasoning, particularly for audio-visual question answering (AVQA), remains largely unexplored. AVQA presents specific challenges for VLMs due to the requirement of visual understanding at the region level and seamless integration with the audio modality. Previous VLM-based AVQA methods merely used CLIP as a feature encoder and underutilized its knowledge, and, like most AVQA methods, mishandled audio and video as separate entities in a dual-stream framework. This paper proposes a new CLIP-powered target-aware single-stream (TASS) network for AVQA that exploits the image-text matching knowledge of the pretrained model through the natural correspondence between audio and visual signals. It consists of two key components: the target-aware spatial grounding module (TSG+) and the single-stream joint temporal grounding module (JTG). Specifically, we propose a TSG+ module to transfer the image-text matching knowledge from CLIP models to our region-text matching process without corresponding ground-truth labels. Moreover, unlike previous separate dual-stream networks that still required an additional audio-visual fusion module, JTG unifies audio-visual fusion and question-aware temporal grounding in a simplified single-stream architecture. It treats audio and video as a cohesive entity and further extends the pretrained image-text knowledge to audio-text matching by preserving their temporal correlation with our proposed cross-modal synchrony (CMS) loss. Extensive experiments conducted on the MUSIC-AVQA benchmark verify the effectiveness of our proposed method over existing state-of-the-art methods.
https://arxiv.org/abs/2405.07451
ChatGPT, the AI-powered chatbot with a massive user base of hundreds of millions, has become a global phenomenon. However, the use of Conversational AI Systems (CAISs) like ChatGPT for research in the field of Social Simulation is still limited. Specifically, there is no evidence of its usage in Agent-Based Social Simulation (ABSS) model design. While scepticism towards anything new is inherent to human nature, we firmly believe it is imperative to initiate the use of this innovative technology to support ABSS model design. This paper presents a proof-of-concept that demonstrates how CAISs can facilitate the development of innovative conceptual ABSS models in a concise timeframe and with minimal required upfront case-based knowledge. By employing advanced prompt engineering techniques and adhering to the Engineering ABSS framework, we have constructed a comprehensive prompt script that enables the design of ABSS models with or by the CAIS. The effectiveness of the script is demonstrated through an illustrative case study concerning the use of adaptive architecture in museums. Despite occasional inaccuracies and divergences in conversation, the CAIS proved to be a valuable companion for ABSS modellers.
https://arxiv.org/abs/2405.08032
We present MedConceptsQA, a dedicated open-source benchmark for medical concept question answering. The benchmark comprises questions about various medical concepts across different vocabularies: diagnoses, procedures, and drugs. The questions are categorized into three levels of difficulty: easy, medium, and hard. We conducted evaluations of the benchmark using various Large Language Models. Our findings show that pre-trained clinical Large Language Models achieved accuracy levels close to random guessing on this benchmark, despite being pre-trained on medical data. However, GPT-4 achieves an absolute average improvement of nearly 27%-37% (27% for zero-shot learning and 37% for few-shot learning) when compared to clinical Large Language Models. Our benchmark serves as a valuable resource for evaluating the understanding and reasoning of medical concepts by Large Language Models. Our benchmark is available at this https URL
https://arxiv.org/abs/2405.07348
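Evaluating an LLM on a multiple-choice benchmark like MedConceptsQA reduces to prompting, extracting a letter choice, and computing accuracy. A schematic zero-shot scorer is sketched below; `query_llm` is a hypothetical placeholder and the answer-extraction rule is a simplifying assumption.

```python
import re


def query_llm(prompt: str) -> str:
    """Hypothetical placeholder for the actual LLM call."""
    raise NotImplementedError


def evaluate_multiple_choice(questions):
    """questions: list of dicts with 'question', 'options' (letter -> text), and 'answer' (letter)."""
    correct = 0
    for q in questions:
        options = "\n".join(f"{letter}. {text}" for letter, text in q["options"].items())
        prompt = f"{q['question']}\n{options}\nAnswer with a single letter."
        reply = query_llm(prompt)
        match = re.search(r"\b([A-D])\b", reply)   # naive answer extraction
        if match and match.group(1) == q["answer"]:
            correct += 1
    return correct / len(questions)
```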
Artificial Intelligence Generated Content (AIGC) has grown rapidly in recent years, among which AI-based image generation has gained widespread attention due to its efficient and imaginative image creation ability. However, AI-generated Images (AIGIs) may not satisfy human preferences due to their unique distortions, which highlights the necessity of understanding and evaluating human preferences for AIGIs. To this end, in this paper, we first establish a novel Image Quality Assessment (IQA) database for AIGIs, termed AIGCIQA2023+, which provides human visual preference scores and detailed preference explanations from three perspectives: quality, authenticity, and correspondence. Then, based on the constructed AIGCIQA2023+ database, this paper presents a MINT-IQA model to evaluate and explain human preferences for AIGIs from Multi-perspectives with INstruction Tuning. Specifically, the MINT-IQA model first learns and evaluates human preferences for AI-generated images from multiple perspectives; then, via the vision-language instruction tuning strategy, MINT-IQA attains a powerful ability to understand and explain human visual preferences for AIGIs, which can be used as feedback to further improve its assessment capabilities. Extensive experimental results demonstrate that the proposed MINT-IQA model achieves state-of-the-art performance in understanding and evaluating human visual preferences for AIGIs, and the proposed model also achieves competitive results on traditional IQA tasks compared with state-of-the-art IQA models. The AIGCIQA2023+ database and MINT-IQA model will be released to facilitate future research.
https://arxiv.org/abs/2405.07346
Educational scholars have analyzed various image data acquired from teaching and learning situations, such as photos that show classroom dynamics, students' drawings related to the learning content, textbook illustrations, etc. Unquestionably, most qualitative analysis of and explanation on image data has been conducted by human researchers, without machine-based automation. This was partly because most image-processing artificial intelligence models were not accessible to general educational scholars, or were not explainable due to their complex deep neural network architectures. However, the recent development of Visual Question Answering (VQA) techniques has produced usable visual language models, which receive from the user a question about a given image and return an answer, both in natural language. In particular, GPT-4V, released by OpenAI, has opened up state-of-the-art visual language model services so that VQA can be used for a variety of purposes. However, VQA and GPT-4V have not yet been widely applied to educational studies. In this position paper, we suggest that GPT-4V contributes to realizing VQA for education. By 'realizing' VQA, we denote two meanings: (1) GPT-4V realizes the utilization of VQA techniques by any educational scholar without technical or accessibility barriers, and (2) GPT-4V makes educational scholars realize the usefulness of VQA for educational research. Given these, this paper aims to introduce VQA for educational studies so that it provides a milestone for educational research methodology. Chapter II reviews the development of VQA techniques, which culminates in the release of GPT-4V. Chapter III reviews the use of image analysis in educational studies. Chapter IV demonstrates how GPT-4V can be used for each research usage reviewed in Chapter III, with operating prompts provided. Finally, Chapter V discusses the future implications.
https://arxiv.org/abs/2405.07163
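For orientation, querying a vision-capable OpenAI model with an image and a natural-language question currently looks roughly like the snippet below (OpenAI Python SDK v1 chat-completions interface; the model name, prompt, and image URL are illustrative and the exact API may change over time).

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable model; the GPT-4V era used "gpt-4-vision-preview"
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What classroom activity is shown in this photo?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/classroom.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```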
Existing action quality assessment (AQA) methods mainly learn deep representations at the video level for scoring diverse actions. Due to the lack of a fine-grained understanding of actions in videos, they suffer severely from low credibility and interpretability, and are thus insufficient for stringent applications, such as Olympic diving events. We argue that a fine-grained understanding of actions requires the model to perceive and parse actions in both time and space, which is also the key to the credibility and interpretability of the AQA technique. Based on this insight, we propose a new fine-grained spatial-temporal action parser named \textbf{FineParser}. It learns human-centric foreground action representations by focusing on target action regions within each frame and exploiting their fine-grained alignments in time and space to minimize the impact of invalid backgrounds during the assessment. In addition, we construct fine-grained annotations of human-centric foreground action masks for the FineDiving dataset, called \textbf{FineDiving-HM}. With refined annotations on diverse target action procedures, FineDiving-HM can promote the development of real-world AQA systems. Through extensive experiments, we demonstrate the effectiveness of FineParser, which outperforms state-of-the-art methods while supporting more tasks of fine-grained action understanding. Data and code are available at \url{this https URL}.
https://arxiv.org/abs/2405.06887
An important handicap of document analysis research is that documents tend to be copyrighted or contain private information, which prohibits their open publication and the creation of centralised, large-scale document datasets. Instead, documents are scattered in private data silos, making extensive training over heterogeneous data a tedious task. In this work, we explore the use of a federated learning (FL) scheme as a way to train a shared model on decentralised private document data. We focus on the problem of Document VQA, a task particularly suited to this approach, as the type of reasoning capabilities required from the model can be quite different in diverse domains. Enabling training over heterogeneous document datasets can thus substantially enrich DocVQA models. We assemble existing DocVQA datasets from diverse domains to reflect the data heterogeneity in real-world applications. We explore the self-pretraining technique in this multi-modal setting, where the same data is used for both pretraining and finetuning, making it relevant for privacy preservation. We further propose combining self-pretraining with a Federated DocVQA training method using centralized adaptive optimization, which outperforms the FedAvg baseline. With extensive experiments, we also present a multi-faceted analysis on training DocVQA models with FL, which provides insights for future research on this task. We show that our pretraining strategies can effectively learn and scale up under federated training with diverse DocVQA datasets, and that tuning hyperparameters is essential for practical document tasks under federation.
https://arxiv.org/abs/2405.06636
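The FedAvg baseline referenced above is easy to state: in each round, clients train locally and the server averages their weights, optionally weighted by local dataset size. A minimal sketch of that aggregation step follows; the paper's centralized adaptive-optimization variant would replace the plain average with a server-side optimizer update.

```python
import torch


def fedavg_aggregate(client_state_dicts, client_sizes):
    """Weighted average of client parameters, weights proportional to local dataset size."""
    total = float(sum(client_sizes))
    averaged = {}
    for key in client_state_dicts[0]:
        averaged[key] = sum(
            sd[key].float() * (n / total) for sd, n in zip(client_state_dicts, client_sizes)
        )
    return averaged


# Example with two tiny "clients" that share the same architecture.
m1, m2 = torch.nn.Linear(4, 1), torch.nn.Linear(4, 1)
global_weights = fedavg_aggregate([m1.state_dict(), m2.state_dict()], client_sizes=[100, 300])
```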
Community Question-Answering (CQA) forums have revolutionized how people seek information, especially information related to their healthcare needs, placing their trust in the collective wisdom of the public. However, there can be several answers in response to a single query, which makes it hard to grasp the key information related to the specific health concern. Typically, CQA forums feature a single top-voted answer as a representative summary for each query. However, a single answer overlooks the alternative solutions and other information frequently offered in other responses. Our research focuses on aspect-based summarization of health answers to address this limitation. Summarizing responses under different aspects, such as suggestions, information, personal experiences, and questions, can enhance the usability of these platforms. We formalize a multi-stage annotation guideline and contribute a unique dataset comprising aspect-based, human-written health answer summaries. We build an automated multi-faceted answer summarization pipeline with this dataset based on task-specific fine-tuning of several state-of-the-art models. The pipeline leverages question similarity to retrieve relevant answer sentences, subsequently classifying them into the appropriate aspect type. Following this, we employ several recent abstractive summarization models to generate aspect-based summaries. Finally, we present a comprehensive human analysis and find that our summaries rank high in capturing relevant content and a wide range of solutions.
https://arxiv.org/abs/2405.06295
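The three-stage pipeline described above (retrieve answer sentences by question similarity, classify each into an aspect, summarize per aspect) can be sketched as plain glue code. In the sketch below, `embed`, `classify_aspect`, and `summarize` are hypothetical placeholders standing in for the fine-tuned models, not the paper's components.

```python
import numpy as np

ASPECTS = ("suggestion", "information", "experience", "question")


def embed(texts):               # hypothetical sentence-embedding model
    raise NotImplementedError


def classify_aspect(sentence):  # hypothetical aspect classifier
    raise NotImplementedError


def summarize(sentences):       # hypothetical abstractive summarizer
    raise NotImplementedError


def aspect_summaries(question, answer_sentences, top_k=20):
    """Retrieve relevant sentences, bucket them by aspect, and summarize each bucket."""
    q_vec = embed([question])[0]
    sent_vecs = embed(answer_sentences)
    sims = sent_vecs @ q_vec / (np.linalg.norm(sent_vecs, axis=1) * np.linalg.norm(q_vec))
    relevant = [answer_sentences[i] for i in np.argsort(sims)[::-1][:top_k]]
    buckets = {a: [] for a in ASPECTS}
    for sentence in relevant:
        buckets.setdefault(classify_aspect(sentence), []).append(sentence)
    return {aspect: summarize(sents) for aspect, sents in buckets.items() if sents}
```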