Modern democracies face a critical issue of declining citizen participation in decision-making. Online discussion forums are an important avenue for enhancing citizen participation. This thesis proposal 1) identifies the challenges involved in facilitating large-scale online discussions with Natural Language Processing (NLP), 2) suggests solutions to these challenges by incorporating hybrid human-AI technologies, and 3) investigates what these technologies can reveal about individual perspectives in online discussions. We propose a three-layered hierarchy for representing perspectives that can be obtained by a mixture of human intelligence and large language models. We illustrate how these representations can draw insights into the diversity of perspectives and allow us to investigate interactions in online discussions.
https://arxiv.org/abs/2405.09439
Care-giving and assistive robotics, driven by advancements in AI, offer promising solutions to meet the growing demand for care as the number of individuals requiring assistance rises. This creates a pressing need for efficient and safe assistive devices, particularly in light of heightened demand due to war-related injuries. While cost has been a barrier to accessibility, technological progress can democratize these solutions. Safety remains a paramount concern, especially given the intricate interactions between assistive robots and humans. This study explores the application of reinforcement learning (RL) and imitation learning to improving policy design for assistive robots. The proposed approach makes risky policies safer without additional environmental interactions. Through experiments in simulated environments, the enhancement of conventional RL approaches on tasks related to assistive robotics is demonstrated.
https://arxiv.org/abs/2405.07603
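The abstract above states that risky policies are made safer without further environment interaction but does not spell out the mechanism. As a purely hypothetical illustration (not the paper's actual method), one way to post-process a learned policy offline is to blend its action toward a conservative baseline whenever an estimated risk exceeds a threshold; `shield_action`, the risk estimate, and the baseline are all invented for this sketch:

```python
def shield_action(policy_action, safe_action, risk, threshold=0.5):
    """Blend a risky policy's action toward a safe baseline action.

    The blend weight grows with the estimated risk, so the policy is
    modified purely offline: no new environment interactions are needed.
    """
    if risk <= threshold:
        return policy_action
    # Linear blend: pure policy action at the threshold,
    # pure safe action at risk == 1.0.
    w = min(1.0, (risk - threshold) / (1.0 - threshold))
    return [(1 - w) * p + w * s for p, s in zip(policy_action, safe_action)]
```

For a one-dimensional action of 1.0 with a safe baseline of 0.0, the output stays at 1.0 below the risk threshold and moves linearly toward 0.0 as risk approaches 1.0.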
Since their inception, programming languages have trended towards greater readability and lower barriers for programmers. Following this trend, natural language can be a promising type of programming language that provides great flexibility and usability and helps democratize programming. However, the inherent vagueness, ambiguity, and verbosity of natural language pose significant challenges in developing an interpreter that can accurately understand the programming logic and execute instructions written in natural language. Fortunately, recent advancements in Large Language Models (LLMs) have demonstrated remarkable proficiency in interpreting complex natural language. Inspired by this, we develop a novel system for Code Representation and Execution (CoRE), which employs an LLM as the interpreter to interpret and execute natural language instructions. The proposed system unifies natural language programming, pseudo-code programming, and flow programming under the same representation for constructing language agents, while the LLM serves as the interpreter to interpret and execute the agent programs. In this paper, we begin by defining the programming syntax that structures natural language instructions logically. During execution, we incorporate external memory to minimize redundancy. Furthermore, we equip the designed interpreter with the capability to invoke external tools, compensating for the limitations of LLMs in specialized domains or when accessing real-time information. This work is open-source at this https URL.
https://arxiv.org/abs/2405.06907
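The execution model described above (an interpreter stepping through natural-language instructions, with external memory carrying intermediate results) can be sketched minimally. This is not CoRE's actual implementation; `run_program`, `toy_interpret`, and the two toy instructions are invented for illustration, with the LLM replaced by a stub:

```python
def run_program(steps, interpret, memory=None):
    """Execute a list of natural-language steps with an interpreter.

    `interpret(step, memory)` returns (result, next_index or None);
    a real system would back this with an LLM call. External memory
    carries intermediate results between steps to avoid redundancy.
    """
    memory = {} if memory is None else memory
    i = 0
    while i < len(steps):
        result, jump = interpret(steps[i], memory)
        memory[f"step_{i}"] = result
        i = jump if jump is not None else i + 1
    return memory

# A stub standing in for the LLM interpreter: it "understands" only
# two toy instructions.
def toy_interpret(step, memory):
    if step.startswith("set x to "):
        return int(step.rsplit(" ", 1)[1]), None
    if step == "double x":
        return memory["step_0"] * 2, None
    return None, None
```

Running `run_program(["set x to 3", "double x"], toy_interpret)` leaves 3 and 6 in memory, mimicking how a sequence of instructions accumulates state.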
Large Language Models (LLMs) deployed on edge devices learn through fine-tuning and updating a certain portion of their parameters. Although such learning methods can be optimized to reduce resource utilization, the overall required resources remain a heavy burden on edge devices. Instead, Retrieval-Augmented Generation (RAG), a resource-efficient LLM learning method, can improve the quality of the LLM-generated content without updating model parameters. However, the RAG-based LLM may involve repetitive searches on the profile data in every user-LLM interaction. This search can lead to significant latency along with the accumulation of user data. Conventional efforts to decrease latency result in restricting the size of saved user data, thus reducing the scalability of RAG as user data continuously grows. It remains an open question: how to free RAG from the constraints of latency and scalability on edge devices? In this paper, we propose a novel framework to accelerate RAG via Computing-in-Memory (CiM) architectures. It accelerates matrix multiplications by performing in-situ computation inside the memory while avoiding the expensive data transfer between the computing unit and memory. Our framework, Robust CiM-backed RAG (RoCR), which uses a novel contrastive learning-based training method together with noise-aware training, enables RAG to efficiently search profile data with CiM. To the best of our knowledge, this is the first work utilizing CiM to accelerate RAG.
https://arxiv.org/abs/2405.04700
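The matrix multiplication that CiM would evaluate in situ is the inner-product similarity search at the heart of retrieval. A plain-Python sketch of that computation (purely illustrative; `retrieve` and its signature are invented here, and real systems operate on high-dimensional learned embeddings):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def retrieve(query, profile_embeddings, k=1):
    """Return indices of the k profile entries most similar to the query.

    The inner products below are exactly the matrix multiplication a
    Computing-in-Memory array would perform inside the memory itself,
    avoiding the compute-unit-to-memory data transfer.
    """
    scores = [dot(query, e) for e in profile_embeddings]
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
```

On edge devices, the cost of this search grows with the stored profile data, which is the latency/scalability tension the paper targets.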
In this paper, we introduce SUTRA, a multilingual Large Language Model architecture capable of understanding, reasoning, and generating text in over 50 languages. SUTRA's design uniquely decouples core conceptual understanding from language-specific processing, which facilitates scalable and efficient multilingual alignment and learning. Employing a Mixture of Experts framework in both language and concept processing, SUTRA demonstrates both computational efficiency and responsiveness. Through extensive evaluations, SUTRA is demonstrated to surpass existing models like GPT-3.5 and Llama2 by 20-30% on leading Massive Multitask Language Understanding (MMLU) benchmarks for multilingual tasks. SUTRA models are also online LLMs that can use knowledge from the internet to provide hallucination-free, factual, and up-to-date responses while retaining their multilingual capabilities. Furthermore, we explore the broader implications of its architecture for the future of multilingual AI, highlighting its potential to democratize access to AI technology globally and to improve the equity and utility of AI in regions with predominantly non-English languages. Our findings suggest that SUTRA not only fills pivotal gaps in multilingual model capabilities but also establishes a new benchmark for operational efficiency and scalability in AI applications.
https://arxiv.org/abs/2405.06694
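The abstract does not detail SUTRA's Mixture of Experts design, so the following is only the generic top-k-gated MoE pattern, not the paper's architecture; the function names and the scalar "experts" are invented for illustration:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(x, experts, gate_scores, k=2):
    """Route input x to the top-k experts by gate score and combine
    their outputs with renormalized gate weights. Only k of the
    experts run per input, which is where the compute savings come
    from in MoE architectures."""
    top = sorted(range(len(experts)), key=lambda i: -gate_scores[i])[:k]
    weights = softmax([gate_scores[i] for i in top])
    return sum(w * experts[i](x) for w, i in zip(weights, top))
```

In a real model the experts are feed-forward sub-networks and the gate scores come from a learned router; here scalar functions stand in for both.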
Document image restoration is a crucial aspect of Document AI systems, as the quality of document images significantly influences the overall performance. Prevailing methods address distinct restoration tasks independently, leading to intricate systems and the incapability to harness the potential synergies of multi-task learning. To overcome this challenge, we propose DocRes, a generalist model that unifies five document image restoration tasks including dewarping, deshadowing, appearance enhancement, deblurring, and binarization. To instruct DocRes to perform various restoration tasks, we propose a novel visual prompt approach called Dynamic Task-Specific Prompt (DTSPrompt). The DTSPrompt for different tasks comprises distinct prior features, which are additional characteristics extracted from the input image. Beyond its role as a cue for task-specific execution, DTSPrompt can also serve as supplementary information to enhance the model's performance. Moreover, DTSPrompt is more flexible than prior visual prompt approaches as it can be seamlessly applied and adapted to inputs with high and variable resolutions. Experimental results demonstrate that DocRes achieves competitive or superior performance compared to existing state-of-the-art task-specific models. This underscores the potential of DocRes across a broader spectrum of document image restoration tasks. The source code is publicly available at this https URL
https://arxiv.org/abs/2405.04408
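The DTSPrompt idea above (a per-task prompt built from prior features extracted from the input itself) can be mimicked with a registry of extractors. Everything below is hypothetical: the `PRIORS` mapping, the toy "images" as flat intensity lists, and the specific priors are stand-ins, not the features DocRes actually computes:

```python
# Hypothetical prior-feature extractors keyed by restoration task.
# Real DTSPrompt priors are image-derived maps; here each "image" is
# just a flat list of pixel intensities in [0, 1].
PRIORS = {
    "binarization": lambda img: [1.0 if p > 0.5 else 0.0 for p in img],
    "deshadowing": lambda img: [min(p * 1.5, 1.0) for p in img],
}

def build_prompt(task, image):
    """Assemble a dynamic task-specific prompt: the input plus its
    task-dependent prior channel, which both cues the task and adds
    supplementary information for the model."""
    prior = PRIORS[task](image)
    return {"task": task, "input": image, "prior": prior}
```

Because the prior is computed per input, this kind of prompt adapts naturally to inputs of any resolution, which matches the flexibility claim in the abstract.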
Even though Non-rigid Structure-from-Motion (NRSfM) has been extensively studied and great progress has been made, there are still key challenges that hinder its broad real-world application: 1) the inherent motion/rotation ambiguity requires either explicit camera motion recovery with extra constraints or complex Procrustean Alignment; 2) existing low-rank modeling of the global shape can over-penalize drastic deformations in the 3D shape sequence. This paper proposes to resolve the above issues from a spatial-temporal modeling perspective. First, we propose a novel Temporally-smooth Procrustean Alignment module that estimates 3D deforming shapes and adjusts the camera motion by aligning the 3D shape sequence consecutively. Our new alignment module removes the requirement of a complex reference 3D shape during alignment, which is more conducive to non-isotropic deformation modeling. Second, we propose a spatial-weighted approach to enforce the low-rank constraint adaptively at different locations to better accommodate drastic spatially-variant deformation reconstruction. Our modeling outperforms existing low-rank-based methods, and extensive experiments across different datasets validate the effectiveness of our method.
https://arxiv.org/abs/2405.04309
We explore the capabilities of an augmented democracy system built on off-the-shelf LLMs fine-tuned on data summarizing individual preferences across 67 policy proposals collected during the 2022 Brazilian presidential elections. We use a train-test cross-validation setup to estimate the accuracy with which the LLMs predict both a subject's individual political choices and the aggregate preferences of the full sample of participants. At the individual level, the accuracy of the out-of-sample predictions lies in the range of 69%-76%, and the models are significantly better at predicting the preferences of liberal and college-educated participants. At the population level, we aggregate preferences using an adaptation of the Borda score and compare the ranking of policy proposals obtained from a probabilistic sample of participants and from data augmented using LLMs. We find that the augmented data predicts the preferences of the full population of participants better than probabilistic samples alone when these represent less than 30% to 40% of the total population. These results indicate that LLMs are potentially useful for the construction of systems of augmented democracy.
https://arxiv.org/abs/2405.03452
This paper presents GeoContrastNet, a language-agnostic framework for structured document understanding (DU) that integrates a contrastive learning objective with graph attention networks (GATs), emphasizing the significant role of geometric features. We propose a novel methodology that combines geometric edge features with visual features within an overall two-staged GAT-based framework, demonstrating promising results in both link prediction and semantic entity recognition. Our findings reveal that combining both geometric and visual features could match the capabilities of large DU models that rely heavily on Optical Character Recognition (OCR) features in terms of performance accuracy and efficiency. This approach underscores the critical importance of relational layout information between the named text entities in a semi-structured layout of a page. Specifically, our results highlight the model's proficiency in identifying key-value relationships within the FUNSD dataset for forms and also discovering the spatial relationships in table-structured layouts for RVLCDIP business invoices. Our code and pretrained models will be accessible on our official GitHub.
https://arxiv.org/abs/2405.03104
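The geometric edge features emphasized above are layout relations between pairs of text boxes. As an illustrative (not the paper's exact) feature set, one can compute center offsets, distance, angle, and relative size for a GAT edge:

```python
import math

def geometric_edge_features(box_a, box_b):
    """Geometry-only features for an edge between two text boxes.

    Boxes are (x, y, w, h). Returns the center offset, Euclidean
    distance, angle, and log size ratio -- the kind of OCR-free
    layout signal a graph attention edge could carry.
    """
    ax, ay = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx, by = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    dx, dy = bx - ax, by - ay
    return {
        "dx": dx,
        "dy": dy,
        "dist": math.hypot(dx, dy),
        "angle": math.atan2(dy, dx),
        "log_size_ratio": math.log((box_b[2] * box_b[3])
                                   / (box_a[2] * box_a[3])),
    }
```

Features like these depend only on box geometry, which is why a model built on them can stay language-agnostic and avoid OCR text entirely.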
Recent advancements in AI have democratized its deployment as a healthcare assistant. While pretrained models from large-scale visual and audio datasets have demonstrably generalized to this task, surprisingly, no studies have explored pretrained speech models, which, as human-originated sounds, would intuitively bear a closer resemblance to lung sounds. This paper explores the efficacy of pretrained speech models for respiratory sound classification. We find that there is a characterization gap between speech and lung sound samples, and to bridge this gap, data augmentation is essential. However, the most widely used augmentation technique for audio and speech, SpecAugment, requires a 2-dimensional spectrogram format and cannot be applied to models pretrained on speech waveforms. To address this, we propose RepAugment, an input-agnostic representation-level augmentation technique that outperforms SpecAugment and is also suitable for respiratory sound classification with waveform-pretrained models. Experimental results show that our approach outperforms SpecAugment, demonstrating a substantial improvement in the accuracy of minority disease classes of up to 7.14%.
https://arxiv.org/abs/2405.02996
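The key contrast above is where the augmentation happens: SpecAugment masks a 2-D spectrogram, while a representation-level technique perturbs the encoder's output vector, so it works regardless of the input format. A generic sketch of the latter (additive noise plus random zeroing; this is not RepAugment's actual recipe, and the function name is our own):

```python
import random

def rep_augment(representation, noise_std=0.1, drop_prob=0.2, rng=None):
    """Augment a 1-D representation vector instead of the input:
    add Gaussian noise, then zero out random dimensions. Because it
    acts on the encoder output, the same augmentation applies to
    spectrogram- and waveform-pretrained models alike."""
    rng = rng or random.Random(0)
    noisy = [v + rng.gauss(0.0, noise_std) for v in representation]
    return [0.0 if rng.random() < drop_prob else v for v in noisy]
```

During training the perturbed representation is fed to the classification head, exposing it to variation the small respiratory-sound dataset does not contain.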
As interest in large language models (LLMs) grows, the importance of accuracy in automatic speech recognition (ASR) has become more pronounced. This is particularly true for lectures that include specialized terminology, where the success rate of traditional ASR models tends to be low, posing a challenging problem. A method to improve ASR performance for specialized terminology using the word frequency difference approach has been proposed. Through experiments and data analysis, we investigate whether this proposal effectively addresses the issue. Additionally, we introduce the power law as the theoretical foundation for the relative frequency
https://arxiv.org/abs/2405.02995
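A word frequency difference approach, as referenced above, compares how often a word occurs in a domain corpus versus a general corpus; because word frequencies follow a power law (Zipf's law), relative rather than absolute frequencies are the natural quantity to compare. A hypothetical sketch (the function, threshold, and smoothing floor are our assumptions, not the paper's method):

```python
from collections import Counter

def specialized_terms(domain_tokens, general_tokens, ratio=5.0):
    """Flag words whose relative frequency in the domain corpus is at
    least `ratio` times their relative frequency in the general
    corpus. Words unseen in the general corpus get a small floor to
    avoid division by zero."""
    d, g = Counter(domain_tokens), Counter(general_tokens)
    nd, ng = len(domain_tokens), len(general_tokens)
    floor = 1.0 / (ng + 1)
    flagged = []
    for word, count in d.items():
        rel_d = count / nd
        rel_g = g[word] / ng if g[word] else floor
        if rel_d / rel_g >= ratio:
            flagged.append(word)
    return flagged
```

Terms flagged this way could then be prioritized in an ASR system's vocabulary or language model for lecture transcription.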
This paper presents a set of intersectional feminist principles for conducting equitable, ethical, and sustainable AI research. In Data Feminism (2020), we offered seven principles for examining and challenging unequal power in data science. Here, we present a rationale for why feminism remains deeply relevant for AI research, rearticulate the original principles of data feminism with respect to AI, and introduce two potential new principles related to environmental impact and consent. Together, these principles help to 1) account for the unequal, undemocratic, extractive, and exclusionary forces at work in AI research, development, and deployment; 2) identify and mitigate predictable harms in advance of unsafe, discriminatory, or otherwise oppressive systems being released into the world; and 3) inspire creative, joyful, and collective ways to work towards a more equitable, sustainable world in which all of us can thrive.
https://arxiv.org/abs/2405.01286
This paper presents Callico, a web-based open source platform designed to simplify the annotation process in document recognition projects. The move towards data-centric AI in machine learning and deep learning underscores the importance of high-quality data, and the need for specialised tools that increase the efficiency and effectiveness of generating such data. For document image annotation, Callico offers dual-display annotation for digitised documents, enabling simultaneous visualisation and annotation of scanned images and text. This capability is critical for OCR and HTR model training, document layout analysis, named entity recognition, form-based key value annotation or hierarchical structure annotation with element grouping. The platform supports collaborative annotation with versatile features backed by a commitment to open source development, high-quality code standards and easy deployment via Docker. Illustrative use cases - including the transcription of the Belfort municipal registers, the indexing of French World War II prisoners for the ICRC, and the extraction of personal information from the Socface project's census lists - demonstrate Callico's applicability and utility.
https://arxiv.org/abs/2405.01071
Recent breakthroughs in single-image 3D portrait reconstruction have enabled telepresence systems to stream 3D portrait videos from a single camera in real-time, potentially democratizing telepresence. However, per-frame 3D reconstruction exhibits temporal inconsistency and forgets the user's appearance. On the other hand, self-reenactment methods can render coherent 3D portraits by driving a personalized 3D prior, but fail to faithfully reconstruct the user's per-frame appearance (e.g., facial expressions and lighting). In this work, we recognize the need to maintain both coherent identity and dynamic per-frame appearance to enable the best possible realism. To this end, we propose a new fusion-based method that fuses a personalized 3D subject prior with per-frame information, producing temporally stable 3D videos with faithful reconstruction of the user's per-frame appearances. Trained only using synthetic data produced by an expression-conditioned 3D GAN, our encoder-based method achieves both state-of-the-art 3D reconstruction accuracy and temporal consistency on in-studio and in-the-wild datasets.
https://arxiv.org/abs/2405.00794
In this study, we formulate an OCR-free sequence generation model for visual document understanding (VDU). Our model not only parses text from document images but also extracts the spatial coordinates of the text based on the multi-head architecture. Named the Coordinate-aware End-to-end Document Parser (CREPE), our method uniquely integrates these capabilities by introducing a special token for OCR text and token-triggered coordinate decoding. We also propose a weakly-supervised framework for cost-efficient training, requiring only parsing annotations without high-cost coordinate annotations. Our experimental evaluations demonstrate CREPE's state-of-the-art performance on document parsing tasks. Beyond that, CREPE's adaptability is further highlighted by its successful use in other document understanding tasks such as layout analysis, document visual question answering, and so on. CREPE's combined OCR and semantic parsing abilities not only mitigate the error-propagation issues of existing OCR-dependent methods but also significantly enhance the functionality of sequence generation models, ushering in a new era for document understanding studies.
https://arxiv.org/abs/2405.00260
We present Aptly, an extension of the MIT App Inventor platform enabling mobile app development via natural language powered by code-generating large language models (LLMs). Aptly complements App Inventor's block language with a text language designed to allow visual code generation via text-based LLMs. We detail the technical aspects of how the Aptly server integrates LLMs with a realtime collaboration function to facilitate the automated creation and editing of mobile apps given user instructions. The paper concludes with insights from a study of a pilot implementation involving high school students, which examines Aptly's practicality and user experience. The findings underscore Aptly's potential as a tool that democratizes app development and fosters technological creativity.
https://arxiv.org/abs/2405.00229
In the current digital era, the rapid spread of misinformation on online platforms presents significant challenges to societal well-being, public trust, and democratic processes, influencing critical decision making and public opinion. To address these challenges, there is a growing need for automated fake news detection mechanisms. Pre-trained large language models (LLMs) have demonstrated exceptional capabilities across various natural language processing (NLP) tasks, prompting exploration into their potential for verifying news claims. Instead of employing LLMs in a non-agentic way, where LLMs generate responses based on direct prompts in a single shot, our work introduces FactAgent, an agentic approach of utilizing LLMs for fake news detection. FactAgent enables LLMs to emulate human expert behavior in verifying news claims without any model training, following a structured workflow. This workflow breaks down the complex task of news veracity checking into multiple sub-steps, where LLMs complete simple tasks using their internal knowledge or external tools. At the final step of the workflow, LLMs integrate all findings throughout the workflow to determine the news claim's veracity. Compared to manual human verification, FactAgent offers enhanced efficiency. Experimental studies demonstrate the effectiveness of FactAgent in verifying claims without the need for any training process. Moreover, FactAgent provides transparent explanations at each step of the workflow and during final decision-making, offering insights into the reasoning process of fake news detection for end users. FactAgent is highly adaptable, allowing for straightforward updates to its tools that LLMs can leverage within the workflow, as well as updates to the workflow itself using domain knowledge. This adaptability enables FactAgent's application to news verification across various domains.
https://arxiv.org/abs/2405.01593
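The structured workflow described above decomposes veracity checking into sub-steps whose findings are integrated at the end. The skeleton of such an agentic workflow can be sketched as follows; the stub tools and the majority-vote integration are our invented stand-ins for FactAgent's LLM-backed steps and final LLM integration:

```python
def fact_check(claim, tools):
    """Run each sub-step tool over the claim, collect its finding,
    then integrate: the claim is labeled fake only if a majority of
    tools raise a red flag (a simple stand-in for the LLM's final
    integration step). The per-tool findings double as a transparent
    explanation of the decision."""
    findings = [(name, tool(claim)) for name, tool in tools.items()]
    flags = sum(1 for _, suspicious in findings if suspicious)
    verdict = "fake" if flags > len(findings) / 2 else "real"
    return verdict, findings

# Stub sub-step tools; a real FactAgent step would query an LLM's
# internal knowledge or an external tool such as web search.
TOOLS = {
    "sensational_language": lambda c: "SHOCKING" in c.upper(),
    "all_caps": lambda c: c.isupper(),
    "missing_source": lambda c: "according to" not in c.lower(),
}
```

Because the tool registry is just a dictionary, individual checks can be swapped or added with domain knowledge without retraining anything, mirroring the adaptability claim in the abstract.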
Document semantic segmentation is a promising avenue that can facilitate document analysis tasks, including optical character recognition (OCR), form classification, and document editing. Although several synthetic datasets have been developed to distinguish handwriting from printed text, they fall short in class variety and document diversity. We demonstrate the limitations of training on existing datasets when solving the National Archives Form Semantic Segmentation dataset (NAFSS), which we introduce. To address these limitations, we propose the most comprehensive document semantic segmentation synthesis pipeline to date, incorporating preprinted text, handwriting, and document backgrounds from over 10 sources to create the Document Element Layer INtegration Ensemble 8K, or DELINE8K, dataset. Our customized dataset exhibits superior performance on the NAFSS benchmark, establishing it as a promising tool for further research. The DELINE8K dataset is available at this https URL.
https://arxiv.org/abs/2404.19259
Graphics Processing Units (GPUs) have become the leading hardware accelerator for deep learning applications and are used widely in training and inference of transformers; transformers have achieved state-of-the-art performance in many areas of machine learning and are especially used in most modern Large Language Models (LLMs). However, GPUs require large amounts of energy, which poses environmental concerns, demands high operational costs, and causes GPUs to be unsuitable for edge computing. We develop an accelerator for transformers, namely Llama 2, an open-source state-of-the-art LLM, using high-level synthesis (HLS) on Field Programmable Gate Arrays (FPGAs). HLS allows us to rapidly prototype FPGA designs without writing code at the register-transfer level (RTL). We name our method HLSTransform, and the FPGA designs we synthesize with HLS achieve up to a 12.75x and an 8.25x reduction in energy used per token on the Xilinx Virtex UltraScale+ VU9P FPGA compared to an Intel Xeon Broadwell E5-2686 v4 CPU and an NVIDIA RTX 3090 GPU respectively, while increasing inference speeds by up to 2.46x compared to the CPU and maintaining 0.53x the speed of an RTX 3090 GPU despite the GPU's 4 times higher base clock rate. With the lack of existing open-source FPGA accelerators for transformers, we open-source our code and document our steps for synthesis. We hope this work will serve as a step in democratizing the use of FPGAs in transformer inference and inspire research into energy-efficient inference methods as a whole. The code can be found on this https URL.
https://arxiv.org/abs/2405.00738
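The two ratios reported per baseline (energy per token and throughput) combine into a relative power figure, since power = energy/token × tokens/second. A quick arithmetic check using only the numbers stated in the abstract:

```python
def relative_power(energy_ratio, speed_ratio):
    """Power relative to a baseline: (energy per token relative to the
    baseline) times (tokens per second relative to the baseline)."""
    return energy_ratio * speed_ratio

# Factors reported in the abstract above:
fpga_vs_cpu = relative_power(1 / 12.75, 2.46)  # roughly 0.19x CPU power
fpga_vs_gpu = relative_power(1 / 8.25, 0.53)   # roughly 0.06x GPU power
```

So even while running at about half the GPU's token rate, the FPGA draws only a few percent of its power, which is what makes the design attractive for edge deployment.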
Documents are 2-dimensional carriers of written communication, and as such their interpretation requires a multi-modal approach where textual and visual information are efficiently combined. Document Visual Question Answering (Document VQA), due to this multi-modal nature, has garnered significant interest from both the document understanding and natural language processing communities. The state-of-the-art single-page Document VQA methods show impressive performance, yet in multi-page scenarios, these methods struggle. They have to concatenate all pages into one large page for processing, demanding substantial GPU resources, even for evaluation. In this work, we propose a novel method and efficient training strategy for multi-page Document VQA tasks. In particular, we employ a visual-only document representation, leveraging the encoder from a document understanding model, Pix2Struct. Our approach utilizes a self-attention scoring mechanism to generate relevance scores for each document page, enabling the retrieval of pertinent pages. This adaptation allows us to extend single-page Document VQA models to multi-page scenarios without constraints on the number of pages during evaluation, all with minimal demand for GPU resources. Our extensive experiments demonstrate not only achieving state-of-the-art performance without the need for Optical Character Recognition (OCR), but also sustained performance in scenarios extending to documents of nearly 800 pages compared to a maximum of 20 pages in the MP-DocVQA dataset. Our code is publicly available at \url{this https URL}.
https://arxiv.org/abs/2404.19024
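The retrieval step above scores each page against the question and keeps only the pertinent ones, so the downstream model never sees the whole document at once. An illustrative sketch, with plain dot products standing in for the paper's learned self-attention scoring (the function and embeddings are hypothetical):

```python
def select_pages(question_emb, page_embs, k=2):
    """Score every page embedding against the question embedding and
    keep the top-k pages, returned in original page order so that
    reading order is preserved. Because scoring is per page, the
    document can have any number of pages at evaluation time."""
    scores = [sum(q * p for q, p in zip(question_emb, e))
              for e in page_embs]
    top = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
    return sorted(top)
```

Only the selected pages are then passed to the VQA model, which is how the approach keeps GPU demand flat as documents grow toward hundreds of pages.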