This study evaluates the performance of Recurrent Neural Network (RNN) and Transformer models in replicating cross-language structural priming, a key indicator of abstract grammatical representations in human language processing. Focusing on Chinese-English priming, which involves two typologically distinct languages, we examine how these models handle the robust phenomenon of structural priming, where exposure to a particular sentence structure increases the likelihood of selecting a similar structure subsequently. Additionally, we utilize large language models (LLMs) to measure the cross-lingual structural priming effect. Our findings indicate that Transformers outperform RNNs in generating primed sentence structures, challenging the conventional belief that human sentence processing primarily involves recurrent and immediate processing and suggesting a role for cue-based retrieval mechanisms. Overall, this work contributes to our understanding of how computational models may reflect human cognitive processes in multilingual contexts.
https://arxiv.org/abs/2405.09508
With the advent of image super-resolution (SR) algorithms, how to evaluate the quality of generated SR images has become an urgent task. Although full-reference methods perform well in SR image quality assessment (SR-IQA), their reliance on high-resolution (HR) images limits their practical applicability. Leveraging available reconstruction information as much as possible for SR-IQA, such as low-resolution (LR) images and the scale factors, is a promising way to enhance assessment performance for SR-IQA without HR images for reference. In this letter, we attempt to evaluate the perceptual quality and reconstruction fidelity of SR images considering LR images and scale factors. Specifically, we propose a novel dual-branch reduced-reference SR-IQA network, i.e., Perception- and Fidelity-aware SR-IQA (PFIQA). The perception-aware branch evaluates the perceptual quality of SR images by leveraging the merits of global modeling of Vision Transformer (ViT) and local relation of ResNet, and incorporating the scale factor to enable comprehensive visual perception. Meanwhile, the fidelity-aware branch assesses the reconstruction fidelity between LR and SR images through their visual perception. The combination of the two branches substantially aligns with the human visual system, enabling a comprehensive SR image evaluation. Experimental results indicate that our PFIQA outperforms current state-of-the-art models across three widely-used SR-IQA benchmarks. Notably, PFIQA excels in assessing the quality of real-world SR images.
https://arxiv.org/abs/2405.09472
This paper presents a comprehensive analysis of explainable fact-checking through a series of experiments, focusing on the ability of large language models to verify public health claims and provide explanations or justifications for their veracity assessments. We examine the effectiveness of zero/few-shot prompting and parameter-efficient fine-tuning across various open and closed-source models, examining their performance in both isolated and joint tasks of veracity prediction and explanation generation. Importantly, we employ a dual evaluation approach comprising previously established automatic metrics and a novel set of criteria through human evaluation. Our automatic evaluation indicates that, within the zero-shot scenario, GPT-4 emerges as the standout performer, but in few-shot and parameter-efficient fine-tuning contexts, open-source models demonstrate their capacity to not only bridge the performance gap but, in some instances, surpass GPT-4. Human evaluation reveals yet more nuance as well as indicating potential problems with the gold explanations.
https://arxiv.org/abs/2405.09454
This paper introduces the Global-Local Image Perceptual Score (GLIPS), an image metric designed to assess the photorealistic image quality of AI-generated images with a high degree of alignment to human visual perception. Traditional metrics such as FID and KID scores do not align closely with human evaluations. The proposed metric incorporates advanced transformer-based attention mechanisms to assess local similarity and Maximum Mean Discrepancy (MMD) to evaluate global distributional similarity. To evaluate the performance of GLIPS, we conducted a human study on photorealistic image quality. Comprehensive tests across various generative models demonstrate that GLIPS consistently outperforms existing metrics like FID, SSIM, and MS-SSIM in terms of correlation with human scores. Additionally, we introduce the Interpolative Binning Scale (IBS), a refined scaling method that enhances the interpretability of metric scores by aligning them more closely with human evaluative standards. The proposed metric and scaling approach not only provide more reliable assessments of AI-generated images but also suggest pathways for future enhancements in image generation technologies.
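The global branch described above rests on Maximum Mean Discrepancy between two sets of feature vectors. A minimal pure-Python sketch of a biased squared-MMD estimate with a Gaussian kernel (the kernel bandwidth and toy feature vectors are illustrative choices, not the paper's settings):

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    # RBF kernel between two feature vectors
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2 * sigma ** 2))

def mmd2(X, Y, sigma=1.0):
    """Biased estimate of squared Maximum Mean Discrepancy between samples X and Y."""
    k = lambda a, b: gaussian_kernel(a, b, sigma)
    m, n = len(X), len(Y)
    xx = sum(k(a, b) for a in X for b in X) / (m * m)
    yy = sum(k(a, b) for a in Y for b in Y) / (n * n)
    xy = sum(k(a, b) for a in X for b in Y) / (m * n)
    return xx + yy - 2 * xy

# Identical samples give (near-)zero MMD; a shifted distribution gives a large one.
same = mmd2([[0.0], [0.1], [0.2]], [[0.0], [0.1], [0.2]])
shifted = mmd2([[0.0], [0.1], [0.2]], [[5.0], [5.1], [5.2]])
```

In a metric like GLIPS the inputs would be deep features of real and generated images rather than raw scalars, and the kernel choice matters; this sketch only shows the estimator's shape.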
https://arxiv.org/abs/2405.09426
Recently, large language models (LLMs) have outperformed human experts in predicting the results of neuroscience experiments (Luo et al., 2024). What is the basis for this performance? One possibility is that statistical patterns in that specific scientific literature, as opposed to emergent reasoning abilities arising from broader training, underlie LLMs' performance. To evaluate this possibility, we trained a relatively small 124M-parameter GPT-2 model on next-word prediction over 1.3 billion tokens of domain-specific text. Despite being orders of magnitude smaller than larger LLMs trained on trillions of tokens, the small models achieved expert-level performance in predicting neuroscience results. Small models trained on the neuroscience literature succeeded both when trained from scratch using a tokenizer specifically trained on neuroscience text and when the neuroscience literature was used to fine-tune a pretrained GPT-2. Our results indicate that expert-level performance may be attained by even small LLMs through domain-specific, auto-regressive training approaches.
https://arxiv.org/abs/2405.09395
Synthetic aperture radar (SAR) is essential in actively acquiring information for Earth observation. SAR Automatic Target Recognition (ATR) focuses on detecting and classifying various target categories under different image conditions. Current deep learning-based SAR ATR methods are typically designed for specific datasets and applications. Varying target characteristics, scene background information, and sensor parameters across ATR datasets challenge the generalization of those methods. This paper aims to achieve general SAR ATR based on a foundation model with Self-Supervised Learning (SSL). Our motivation is to break through the limitations of specific datasets and conditions and obtain universal perceptual capabilities across targets, scenes, and sensors. A foundation model named SARATR-X is proposed, covering four aspects: pre-training dataset, model backbone, SSL, and evaluation task. First, we integrated 14 datasets with various target categories and imaging conditions as a pre-training dataset. Second, different model backbones were discussed to find the most suitable approaches for remote-sensing images. Third, we applied two-stage training and SAR gradient features to ensure the diversity and scalability of SARATR-X. Finally, SARATR-X achieved competitive and superior performance on 5 datasets with 8 task settings, which shows that the foundation model can achieve universal SAR ATR. We believe it is time to embrace foundation models for SAR image interpretation in the era of big data.
https://arxiv.org/abs/2405.09365
Although language models (LMs) have boosted the performance of Question Answering, they still need plenty of data, and data annotation is a time-consuming process. This especially applies to Question Answering, where possibly large documents have to be parsed and annotated with questions and their corresponding answers. Furthermore, Question Answering models often only work well for the domain they were trained on. Since annotation is costly, we argue that domain-agnostic knowledge from LMs, such as linguistic understanding, is sufficient to create a well-curated dataset. With this motivation, we show that using large language models can improve Question Answering performance on various datasets in the few-shot setting compared to state-of-the-art approaches. For this, we perform data generation leveraging the Prompting framework, suggesting that language models contain valuable task-agnostic knowledge that can be used beyond the common pre-training/fine-tuning scheme. As a result, we consistently outperform previous approaches on few-shot Question Answering.
https://arxiv.org/abs/2405.09335
Background: Rapid advancements in natural language processing have led to the development of large language models with the potential to revolutionize mental health care. These models have shown promise in assisting clinicians and providing support to individuals experiencing various psychological challenges. Objective: This study aims to compare the performance of two large language models, GPT-4 and Chat-GPT, in responding to a set of 18 psychological prompts, to assess their potential applicability in mental health care settings. Methods: A blind methodology was employed, with a clinical psychologist evaluating the models' responses without knowledge of their origins. The prompts encompassed a diverse range of mental health topics, including depression, anxiety, and trauma, to ensure a comprehensive assessment. Results: The results demonstrated a significant difference in performance between the two models (p > 0.05). GPT-4 achieved an average rating of 8.29 out of 10, while Chat-GPT received an average rating of 6.52. The clinical psychologist's evaluation suggested that GPT-4 was more effective at generating clinically relevant and empathetic responses, thereby providing better support and guidance to potential users. Conclusions: This study contributes to the growing body of literature on the applicability of large language models in mental health care settings. The findings underscore the importance of continued research and development in the field to optimize these models for clinical use. Further investigation is necessary to understand the specific factors underlying the performance differences between the two models and to explore their generalizability across various populations and mental health conditions.
https://arxiv.org/abs/2405.09300
Despite the recent ubiquity of large language models and their high zero-shot prompted performance across a wide range of tasks, it is still not known how well they perform on tasks which require processing of potentially idiomatic language. In particular, how well do such models perform in comparison to encoder-only models fine-tuned specifically for idiomaticity tasks? In this work, we attempt to answer this question by looking at the performance of a range of LLMs (both local and software-as-a-service models) on three idiomaticity datasets: SemEval 2022 Task 2a, FLUTE, and MAGPIE. Overall, we find that whilst these models do give competitive performance, they do not match the results of fine-tuned task-specific models, even at the largest scales (e.g. for GPT-4). Nevertheless, we do see consistent performance improvements across model scale. Additionally, we investigate prompting approaches to improve performance, and discuss the practicalities of using LLMs for these tasks.
https://arxiv.org/abs/2405.09279
Hallucination and omission, long-standing problems in machine translation (MT), become more pronounced when a large language model (LLM) is used for MT, because an LLM is itself susceptible to these phenomena. In this work, we mitigate the problem in an LLM-based MT model by guiding it toward better word alignment. We first study the correlation between word alignment and the phenomena of hallucination and omission in MT. Then we propose utilizing word alignment as a preference signal to optimize the LLM-based MT model. The preference data are constructed by selecting chosen and rejected translations from multiple MT tools. Subsequently, direct preference optimization is used to optimize the LLM-based model toward the preference signal. Given the absence of evaluators specifically designed for hallucination and omission in MT, we further propose selecting hard instances and utilizing GPT-4 to directly evaluate how well the models mitigate these issues. We experimentally verify the soundness of these evaluation methods and present extensive results demonstrating the effectiveness of word alignment-based preference optimization in mitigating hallucination and omission.
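Direct preference optimization on a chosen/rejected translation pair reduces to a logistic loss on the reference-adjusted log-probability margin. A minimal sketch (the beta value and the toy sequence log-probabilities are illustrative, not the paper's settings):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.
    Inputs are sequence log-probabilities under the policy and the frozen reference model."""
    # Margin by which the policy favors the chosen translation more than the reference does
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # Negative log-sigmoid of the scaled margin
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When the policy matches the reference the margin is zero and the loss is log 2; shifting probability mass toward the chosen (better-aligned) translation lowers it, which is the gradient signal the optimization exploits.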
https://arxiv.org/abs/2405.09223
In this paper, we present the findings of our Project ALPINE, which stands for "Autoregressive Learning for Planning In NEtworks." Project ALPINE initiates a theoretical investigation into the development of planning capabilities in Transformer-based language models through their autoregressive learning mechanisms, aiming to identify any potential limitations in their planning abilities. We abstract planning as a network path-finding task where the objective is to generate a valid path from a specified source node to a designated target node. In terms of expressiveness, we show that the Transformer is capable of executing path-finding by embedding the adjacency and reachability matrices within its weights. Our theoretical analysis of the gradient-based learning dynamic of the Transformer reveals that the Transformer is capable of learning both the adjacency matrix and a limited form of the reachability matrix. These theoretical insights are then validated through experiments, which demonstrate that the Transformer indeed learns the adjacency matrix and an incomplete reachability matrix, which aligns with the predictions made in our theoretical analysis. Additionally, when applying our methodology to a real-world planning benchmark, called Blocksworld, our observations remain consistent. Our theoretical and empirical analyses further unveil a potential limitation of Transformer in path-finding: it cannot identify reachability relationships through transitivity, and thus would fail when path concatenation is needed to generate a path. In summary, our findings shed new light on how the internal mechanisms of autoregressive learning enable planning in networks. This study may contribute to our understanding of the general planning capabilities in other related domains.
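The transitivity limitation can be illustrated with a toy generator that, like the learned model in the analysis above, consults only an adjacency list and a (possibly incomplete) reachability table when emitting the next node. The function names and the 4-node chain are hypothetical, chosen only to make the failure mode concrete:

```python
def greedy_path(adj, reach, source, target, max_steps=10):
    """Generate a path by repeatedly choosing a neighbor from which the target is
    (believed to be) reachable, mimicking adjacency + partial reachability knowledge."""
    path = [source]
    node = source
    for _ in range(max_steps):
        if node == target:
            return path
        # A neighbor qualifies if the table says the target is reachable from it
        nxt = [v for v in adj.get(node, []) if target in reach.get(v, set()) or v == target]
        if not nxt:
            return None  # generation stalls: no neighbor is believed to reach the target
        node = nxt[0]
        path.append(node)
    return None

# Chain 0 -> 1 -> 2 -> 3
adj = {0: [1], 1: [2], 2: [3]}
full_reach = {1: {2, 3}, 2: {3}}      # transitively closed reachability
one_hop = {1: {2}, 2: {3}}            # only direct successors recorded
```

With `full_reach` the generator produces 0, 1, 2, 3; with the incomplete `one_hop` table it fails at the source, mirroring the claim that a model which never learns transitive reachability cannot concatenate sub-paths.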
https://arxiv.org/abs/2405.09220
Transformer-based long context generative models power emerging AI applications like hour-long video understanding and project-level coding agents. Deploying long context transformers (e.g., 100K to 10M tokens) is prohibitively expensive compared to short context (e.g., 4K tokens) model variants. Reducing the cost of long-context transformers has become a pressing research and engineering challenge starting in 2024. This work describes a concurrent programming framework for quantitatively analyzing the efficiency challenges in serving multiple long-context requests under a limited GPU high-bandwidth memory (HBM) budget. We give a detailed analysis of how all additional computational costs, compared to 4K context, trace back to one single source: the large size of the KV cache. We use a 34B GPT-3.5 level model of 50K context on A100 NVLink as a running example, and describe how its large KV cache causes four types of deployment challenges: (1) prefilling long inputs takes much longer compute time and GPU memory than short inputs; (2) after prefilling, the large KV cache residing on the GPU HBM substantially restricts the number of concurrent users being served; (3) during decoding, repeatedly reading the KV cache from HBM to SM largely increases latency; (4) when KV cache memory overflows, swapping it from HBM to DDR causes significant context switching latency. We use this framework to analyze existing works and identify possibilities of combining them to build end-to-end systems. Overall, this work offers a foundational framework for analyzing long context transformer deployment and identifies directions towards reducing the inference cost of 1M context to be as cheap as 4K.
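The KV-cache bottleneck is easy to quantify: per token, every layer stores one key and one value vector per attention head. A back-of-the-envelope sketch with a hypothetical 34B-class configuration (48 layers, 64 heads of dimension 128, fp16; the actual model's shapes may differ):

```python
def kv_cache_bytes(num_layers, num_heads, head_dim, seq_len, dtype_bytes=2, batch=1):
    # K and V (factor of 2) each store [batch, seq_len, num_heads, head_dim] per layer
    return 2 * num_layers * num_heads * head_dim * seq_len * dtype_bytes * batch

# Hypothetical 34B-class config: 48 layers, 64 heads * 128 dim (hidden 8192), fp16
per_user = kv_cache_bytes(48, 64, 128, 50_000)
print(f"{per_user / 2**30:.1f} GiB per 50K-token user")
```

Under these assumed shapes a single 50K-token request holds roughly 73 GiB of KV cache, close to an entire 80 GB A100's HBM, which makes concrete why prefill memory, concurrency, decode bandwidth, and HBM-to-DDR swapping all trace back to the same source.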
https://arxiv.org/abs/2405.08944
This paper proposes leveraging vision-language pretraining on bone X-rays paired with French reports to address downstream tasks of interest in bone radiography. A practical processing pipeline is introduced to anonymize and process French medical reports. Pretraining then consists of the self-supervised alignment of visual and textual embedding spaces derived from deep model encoders. The resulting image encoder is then used to handle various downstream tasks, including quantification of osteoarthritis, estimation of bone age on pediatric wrists, and bone fracture and anomaly detection. Our approach demonstrates competitive performance on downstream tasks compared to alternatives requiring a significantly larger amount of human expert annotations. Our work stands as the first study to integrate French reports to shape the embedding space devoted to bone X-ray representations, capitalizing on the large quantity of paired image and report data available in a hospital. By relying on generic vision-language deep models in a language-specific scenario, it contributes to the deployment of vision models for wider healthcare applications.
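Self-supervised alignment of visual and textual embedding spaces is typically a symmetric InfoNCE-style contrastive objective over image-report pairs. A minimal sketch (the temperature and the 2x2 toy similarity matrix are illustrative, not the paper's exact loss):

```python
import math

def info_nce(sim, temperature=0.07):
    """Symmetric contrastive loss over an image-text similarity matrix.
    sim[i][j] is the similarity of image i and report j; true pairs lie on the diagonal."""
    n = len(sim)

    def ce_row(row, target):
        # Cross-entropy of a softmax over one row, computed with the log-sum-exp trick
        logits = [s / temperature for s in row]
        mx = max(logits)
        lse = mx + math.log(sum(math.exp(l - mx) for l in logits))
        return lse - logits[target]

    img2txt = sum(ce_row(sim[i], i) for i in range(n)) / n
    txt2img = sum(ce_row([sim[j][i] for j in range(n)], i) for i in range(n)) / n
    return (img2txt + txt2img) / 2
```

A perfectly aligned batch (identity-like similarity matrix) drives the loss toward zero, while an uninformative uniform matrix leaves it at log n, which is the signal that shapes the shared embedding space.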
https://arxiv.org/abs/2405.08932
A recent study by De et al. (2022) has reported that large-scale representation learning through pre-training on a public dataset significantly enhances differentially private (DP) learning in downstream tasks, despite the high dimensionality of the feature space. To theoretically explain this phenomenon, we consider the setting of a layer-peeled model in representation learning, which results in interesting phenomena related to learned features in deep learning and transfer learning, known as Neural Collapse (NC). Within the framework of NC, we establish an error bound indicating that the misclassification error is independent of dimension when the distance between actual features and the ideal ones is smaller than a threshold. Additionally, the quality of the features in the last layer is empirically evaluated under different pre-trained models within the framework of NC, showing that a more powerful transformer leads to a better feature representation. Furthermore, we reveal that DP fine-tuning is less robust compared to fine-tuning without DP, particularly in the presence of perturbations. These observations are supported by both theoretical analyses and experimental evaluation. Moreover, to enhance the robustness of DP fine-tuning, we suggest several strategies, such as feature normalization or employing dimension reduction methods like Principal Component Analysis (PCA). Empirically, we demonstrate a significant improvement in testing accuracy by conducting PCA on the last-layer features.
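The PCA remedy applied to last-layer features can be sketched in pure Python with power iteration for the top principal direction. This is a didactic stand-in for a library routine, with toy 2-D features; in practice one would keep several components and use an optimized implementation:

```python
import random

def pca_top_component(X, iters=500, seed=0):
    """Top principal direction of the rows of X via power iteration (didactic sketch)."""
    n, d = len(X), len(X[0])
    mean = [sum(col) / n for col in zip(*X)]
    Xc = [[x - m for x, m in zip(row, mean)] for row in X]
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in range(d)]  # positive start vector
    for _ in range(iters):
        # w = (Xc^T Xc) v, computed as Xc^T (Xc v) without forming the d x d matrix
        s = [sum(x * vi for x, vi in zip(row, v)) for row in Xc]
        w = [sum(Xc[i][j] * s[i] for i in range(n)) for j in range(d)]
        norm = sum(wi * wi for wi in w) ** 0.5
        v = [wi / norm for wi in w]
    return mean, v

def project_1d(X, mean, v):
    """Reduce each feature row to its coordinate along the top component."""
    return [sum((x - m) * vi for x, m, vi in zip(row, mean, v)) for row in X]
```

Projecting high-dimensional last-layer features onto a few such directions before the DP fine-tuning step is the dimension-reduction strategy the text suggests for improving robustness.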
https://arxiv.org/abs/2405.08920
We propose a simple strategy for masking image patches during visual-language contrastive learning that improves the quality of the learned representations and the training speed. During each iteration of training, we randomly mask clusters of visually similar image patches, as measured by their raw pixel intensities. This provides an extra learning signal, beyond the contrastive training itself, since it forces a model to predict words for masked visual structures solely from context. It also speeds up training by reducing the amount of data used in each image. We evaluate the effectiveness of our model by pre-training on a number of benchmarks, finding that it outperforms other masking strategies, such as FLIP, on the quality of the learned representation.
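The masking step can be mimicked in a few lines: compute raw-intensity means per patch, then mask the patches most similar to a randomly chosen seed patch. This nearest-neighbour grouping is a simplification of the clustering described above, and the patch size and mask ratio are illustrative:

```python
import random

def patch_means(image, patch=2):
    """Mean intensity of each non-overlapping patch of a 2-D grayscale image."""
    h, w = len(image), len(image[0])
    means = []
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            vals = [image[i + di][j + dj] for di in range(patch) for dj in range(patch)]
            means.append(sum(vals) / len(vals))
    return means

def mask_similar_patches(means, ratio=0.5):
    """Mask a random seed patch plus the patches closest to it in raw pixel intensity."""
    seed = random.randrange(len(means))
    order = sorted(range(len(means)), key=lambda k: abs(means[k] - means[seed]))
    n_mask = max(1, int(ratio * len(means)))
    return set(order[:n_mask])
```

Because whole visually coherent regions disappear together, the model must predict the caption words for those structures from context alone, which is the extra learning signal the strategy relies on.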
https://arxiv.org/abs/2405.08815
We used a dictionary built from biomedical terminology extracted from various sources such as DrugBank, MedDRA, MedlinePlus, and TCMGeneDIT to tag more than 8 million Instagram posts by users who had mentioned an epilepsy-relevant drug at least once between 2010 and early 2016. A random sample of 1,771 posts with 2,947 term matches was evaluated by human annotators to identify false positives. OpenAI's GPT series models were compared against human annotation. Frequent terms with a high false-positive rate were removed from the dictionary. Analysis of the estimated false-positive rates of the annotated terms revealed 8 ambiguous terms (plus synonyms) used in Instagram posts, which were removed from the original dictionary. To study the effect of removing those terms, we constructed knowledge networks using the refined and the original dictionaries and performed an eigenvector-centrality analysis on both networks. We show that the refined dictionary thus produced leads to a significantly different rank of important terms, as measured by their eigenvector centrality in the knowledge networks. Furthermore, the most important terms obtained after refinement are of greater medical relevance. In addition, we show that OpenAI's GPT series models fare worse than human annotators in this task.
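The tagging and refinement stages amount to dictionary matching plus a per-term false-positive estimate. A minimal sketch with hypothetical terms and posts (whole-word, case-insensitive matching; the study's actual pipeline and term list are more involved):

```python
import re

def tag_posts(posts, dictionary):
    """Return (post, matched_terms) pairs using case-insensitive whole-word matching."""
    patterns = {t: re.compile(r"\b" + re.escape(t) + r"\b", re.IGNORECASE)
                for t in dictionary}
    return [(p, [t for t, rx in patterns.items() if rx.search(p)]) for p in posts]

def false_positive_rate(term, matches, annotations):
    """Fraction of a term's matches judged false positives.
    `annotations` maps (post, term) -> True when annotators confirmed the match."""
    hits = [(p, ts) for p, ts in matches if term in ts]
    if not hits:
        return 0.0
    fp = sum(1 for p, _ in hits if not annotations.get((p, term), False))
    return fp / len(hits)
```

Terms whose estimated false-positive rate exceeds a threshold would then be dropped from the dictionary before rebuilding the knowledge network, mirroring the refinement step described above.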
https://arxiv.org/abs/2405.08784
Deep learning has enabled breakthroughs in automated diagnosis from medical imaging, with many successful applications in ophthalmology. However, standard medical image classification approaches only assess disease presence at the time of acquisition, neglecting the common clinical setting of longitudinal imaging. For slow, progressive eye diseases like age-related macular degeneration (AMD) and primary open-angle glaucoma (POAG), patients undergo repeated imaging over time to track disease progression and forecasting the future risk of developing disease is critical to properly plan treatment. Our proposed Longitudinal Transformer for Survival Analysis (LTSA) enables dynamic disease prognosis from longitudinal medical imaging, modeling the time to disease from sequences of fundus photography images captured over long, irregular time periods. Using longitudinal imaging data from the Age-Related Eye Disease Study (AREDS) and Ocular Hypertension Treatment Study (OHTS), LTSA significantly outperformed a single-image baseline in 19/20 head-to-head comparisons on late AMD prognosis and 18/20 comparisons on POAG prognosis. A temporal attention analysis also suggested that, while the most recent image is typically the most influential, prior imaging still provides additional prognostic value.
https://arxiv.org/abs/2405.08780
Indian folk paintings have a rich mosaic of symbols, colors, textures, and stories making them an invaluable repository of cultural legacy. The paper presents a novel approach to classifying these paintings into distinct art forms and tagging them with their unique salient features. A custom dataset named FolkTalent, comprising 2279 digital images of paintings across 12 different forms, has been prepared using websites that are direct outlets of Indian folk paintings. Tags covering a wide range of attributes like color, theme, artistic style, and patterns are generated using GPT4, and verified by an expert for each painting. Classification employs a Random Forest ensemble over fine-tuned Convolutional Neural Network (CNN) models to classify Indian folk paintings, achieving an accuracy of 91.83%. Tagging is accomplished via fine-tuned CNN-based backbones with a custom classifier attached on top to perform multi-label image classification. The generated tags offer a deeper insight into the painting, enabling an enhanced search experience based on theme and visual attributes. The proposed hybrid model sets a new benchmark in folk painting classification and tagging, significantly contributing to cataloging India's folk-art heritage.
https://arxiv.org/abs/2405.08776
We present Hunyuan-DiT, a text-to-image diffusion transformer with fine-grained understanding of both English and Chinese. To construct Hunyuan-DiT, we carefully design the transformer structure, text encoder, and positional encoding. We also build from scratch a whole data pipeline to update and evaluate data for iterative model optimization. For fine-grained language understanding, we train a Multimodal Large Language Model to refine the captions of the images. Finally, Hunyuan-DiT can perform multi-turn multimodal dialogue with users, generating and refining images according to the context. Through our holistic human evaluation protocol with more than 50 professional human evaluators, Hunyuan-DiT sets a new state-of-the-art in Chinese-to-image generation compared with other open-source models. Code and pretrained models are publicly available at this http URL
https://arxiv.org/abs/2405.08748
In this paper, we present a simple but effective method to enhance blind video quality assessment (BVQA) models for social media videos. Motivated by previous studies that leverage pre-trained features extracted from various computer vision models as the feature representation for BVQA, we further explore rich quality-aware features from pre-trained blind image quality assessment (BIQA) and BVQA models as auxiliary features to help the BVQA model handle the complex distortions and diverse content of social media videos. Specifically, we use SimpleVQA, a BVQA model that consists of a trainable Swin Transformer-B and a fixed SlowFast, as our base model. The Swin Transformer-B and SlowFast components are responsible for extracting spatial and motion features, respectively. We then extract three kinds of features from Q-Align, LIQE, and FAST-VQA to capture frame-level quality-aware features, frame-level quality-aware features along with scene-specific features, and spatiotemporal quality-aware features, respectively. We concatenate these features and employ a multi-layer perceptron (MLP) network to regress them into quality scores. Experimental results demonstrate that the proposed model achieves the best performance on three public social media VQA datasets. Moreover, the proposed model won first place in the CVPR NTIRE 2024 Short-form UGC Video Quality Assessment Challenge. The code is available at \url{this https URL}.
https://arxiv.org/abs/2405.08745