Named Entity Recognition and Relation Extraction are two crucial and challenging subtasks in the field of Information Extraction. Despite the successes achieved by traditional approaches, fundamental research questions remain open. First, most recent studies use parameter sharing for a single subtask or shared features for both subtasks, ignoring their semantic differences. Second, information interaction mainly focuses on the two subtasks, leaving the fine-grained information interaction among the subtask-specific features of encoding subjects, relations, and objects unexplored. Motivated by these limitations, we propose a novel model to jointly extract entities and relations. The main novelties are as follows: (1) We propose to decouple the feature encoding process into three parts, namely encoding subjects, encoding objects, and encoding relations. Thanks to this, we are able to use fine-grained subtask-specific features. (2) We propose novel inter-aggregation and intra-aggregation strategies to enhance the information interaction and to construct individual fine-grained subtask-specific features, respectively. The experimental results demonstrate that our model outperforms several previous state-of-the-art models. Extensive additional experiments further confirm the effectiveness of our model.
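The decoupled encoding plus aggregation idea can be sketched in miniature. This is an illustrative reconstruction, not the authors' code: `project`, `intra_aggregate`, and `inter_aggregate` are hypothetical helper names, and the inter-aggregation here is a plain mean where the paper uses a learned strategy.

```python
import math

def project(tokens, W):
    # apply one subtask-specific linear map to each shared token embedding
    return [[sum(t[i] * W[i][j] for i in range(len(t))) for j in range(len(W[0]))]
            for t in tokens]

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def intra_aggregate(feats):
    # intra-aggregation: self-attention within one subtask's feature set
    out = []
    for q in feats:
        scores = softmax([sum(a * b for a, b in zip(q, k)) for k in feats])
        out.append([sum(w * k[j] for w, k in zip(scores, feats)) for j in range(len(q))])
    return out

def inter_aggregate(subj, rel, obj):
    # inter-aggregation: exchange information across the three subtask views
    # (a simple mean here; the paper's strategy is learned)
    return [[(s + r + o) / 3.0 for s, r, o in zip(sv, rv, ov)]
            for sv, rv, ov in zip(subj, rel, obj)]

# three decoupled encoders over one shared token sequence
tokens = [[1.0, 0.0], [0.0, 1.0]]
W_subj = W_rel = W_obj = [[1.0, 0.0], [0.0, 1.0]]  # identity maps for the demo
subj = intra_aggregate(project(tokens, W_subj))
rel = intra_aggregate(project(tokens, W_rel))
obj = intra_aggregate(project(tokens, W_obj))
fused = inter_aggregate(subj, rel, obj)
```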
https://arxiv.org/abs/2405.08311
Event relation extraction (ERE) is a critical and fundamental challenge for natural language processing. Existing work mainly focuses on directly modeling the entire document, which cannot effectively handle long-range dependencies and information redundancy. To address these issues, we propose a cluster-aware compression method for improving event relation extraction (TacoERE), which explores a compression-then-extraction paradigm. Specifically, we first introduce document clustering for modeling event dependencies. It splits the document into intra- and inter-clusters, where intra-clusters aim to enhance the relations within the same cluster, while inter-clusters attempt to model related events at arbitrary distances. Second, we utilize cluster summarization to simplify and highlight the important text content of clusters, mitigating information redundancy and event distance. We have conducted extensive experiments with both pre-trained language models, such as RoBERTa, and large language models, such as ChatGPT and GPT-4, on three ERE datasets, i.e., MAVEN-ERE, EventStoryLine and HiEve. Experimental results demonstrate that TacoERE is an effective method for ERE.
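The compression-then-extraction pipeline can be approximated with a toy clustering step. This is a hedged stand-in, assuming word-overlap similarity in place of the paper's document clustering, and extractive selection in place of its cluster summarization; `cluster_sentences` and `summarize` are hypothetical names.

```python
def jaccard(a, b):
    # crude sentence similarity by word overlap
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def cluster_sentences(sentences, threshold=0.3):
    # greedy single-pass clustering: join the first cluster whose seed is similar enough
    clusters = []
    for s in sentences:
        for c in clusters:
            if jaccard(s, c[0]) >= threshold:
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters

def summarize(cluster):
    # extractive "summary": keep the sentence most similar to the rest of its cluster
    return max(cluster, key=lambda s: sum(jaccard(s, t) for t in cluster))
```

A downstream extractor would then see one short summary per cluster instead of the full document, which is the redundancy-reduction step the abstract describes.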
https://arxiv.org/abs/2405.06890
This paper introduces Stochastic RAG--a novel approach for end-to-end optimization of retrieval-augmented generation (RAG) models that relaxes the simplifying assumptions of marginalization and document independence made in most prior work. Stochastic RAG casts the retrieval process in RAG as stochastic sampling without replacement. Through this formulation, we employ straight-through Gumbel-top-k, which provides a differentiable approximation of sampling without replacement and enables effective end-to-end optimization for RAG. We conduct extensive experiments on seven diverse datasets covering a wide range of tasks, from open-domain question answering to fact verification, slot-filling for relation extraction, and dialogue systems. By applying this optimization method to a recent and effective RAG model, we advance state-of-the-art results on six out of seven datasets.
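The Gumbel-top-k trick at the heart of this formulation is easy to sketch. Only the sampling side, which needs no autograd, is shown; in the straight-through estimator the forward pass uses these hard top-k picks while gradients flow through a soft relaxation. `gumbel_top_k` is an illustrative name, not the paper's API.

```python
import math
import random

def gumbel_top_k(log_scores, k, rng=random):
    # perturb each log-score with Gumbel(0,1) noise; the indices of the k largest
    # perturbed values are an exact sample without replacement from the
    # distribution proportional to exp(log_scores) (the Gumbel-top-k trick)
    perturbed = [s - math.log(-math.log(max(rng.random(), 1e-12)))
                 for s in log_scores]
    return sorted(range(len(log_scores)), key=lambda i: -perturbed[i])[:k]
```

For k = 1 this reduces to the familiar Gumbel-max trick; documents with higher retrieval scores are sampled into the top-k far more often.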
https://arxiv.org/abs/2405.02816
Methods for relation extraction from text mostly focus on high precision, at the cost of limited recall. High recall is crucial, though, to populate long lists of object entities that stand in a specific relation with a given subject. Cues for relevant objects can be spread across many passages in long texts, which poses the challenge of extracting long lists from long texts. We present the L3X method, which tackles the problem in two stages: (1) recall-oriented generation using a large language model (LLM) with judicious techniques for retrieval augmentation, and (2) precision-oriented scrutinization to validate or prune candidates. Our L3X method outperforms LLM-only generation by a substantial margin.
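The precision-oriented second stage can be sketched as a corroboration filter over the recall-oriented candidates. This is a hedged simplification (substring matching against retrieved passages), not the paper's scrutinization model; `scrutinize` is a hypothetical name.

```python
def scrutinize(candidates, passages, min_support=1):
    # keep an LLM-generated candidate only if enough source passages mention it,
    # pruning hallucinated objects to restore precision
    kept = []
    for cand in candidates:
        support = sum(cand.lower() in p.lower() for p in passages)
        if support >= min_support:
            kept.append((cand, support))
    return kept
```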
https://arxiv.org/abs/2405.02732
The Financial Relation Extraction (FinRE) task involves identifying the entities and their relation, given a piece of financial statement/text. To solve this FinRE problem, we propose a simple but effective strategy that improves the performance of pre-trained language models by augmenting them with Named Entity Recognition (NER) and Part-Of-Speech (POS) information, together with different approaches to combining this information. Experiments on a financial relations dataset show promising results and highlight the benefits of incorporating NER and POS in existing models. Our dataset and codes are available at this https URL.
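One simple way to combine NER and POS information with a language model, concatenating one-hot tag features onto each contextual token vector, can be sketched as below. This is an assumed combination scheme for illustration (the paper compares several); `augment` and the tag vocabularies are hypothetical.

```python
def augment(token_vecs, ner_tags, pos_tags, ner_vocab, pos_vocab):
    # concatenate each contextual token vector with one-hot NER and POS features
    def one_hot(tag, vocab):
        v = [0.0] * len(vocab)
        v[vocab.index(tag)] = 1.0
        return v
    return [tv + one_hot(n, ner_vocab) + one_hot(p, pos_vocab)
            for tv, n, p in zip(token_vecs, ner_tags, pos_tags)]
```

The augmented vectors then feed the relation classifier in place of the raw token embeddings.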
https://arxiv.org/abs/2405.06665
Large Language Models (LLMs) have swiftly emerged as vital resources for different applications in the biomedical and healthcare domains; however, these models encounter issues such as generating inaccurate information or hallucinations. Retrieval-augmented generation provides a way for these models to update their knowledge and enhance their performance. In contrast to previous retrieval-augmented LMs, which utilize specialized cross-attention mechanisms to help the LLM encode retrieved text, BiomedRAG adopts a simpler approach by directly inputting the retrieved chunk-based documents into the LLM. This straightforward design is easily applicable to existing retrieval and language models, effectively bypassing noise information in retrieved documents, particularly in noise-intensive tasks. Moreover, we demonstrate the potential for utilizing the LLM to supervise the retrieval model in the biomedical domain, enabling it to retrieve the documents that assist the LM in improving its predictions. Our experiments reveal that with the tuned scorer, \textsc{BiomedRAG} attains superior performance across 5 biomedical NLP tasks, encompassing information extraction (triple extraction, relation extraction), text classification, link prediction, and question answering, leveraging over 9 datasets. For instance, in the triple extraction task, \textsc{BiomedRAG} outperforms other triple extraction systems with micro-F1 scores of 81.42 and 88.83 on the GIT and ChemProt corpora, respectively.
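The "directly input retrieved chunks into the LLM" design, as opposed to architectural cross-attention, amounts to plain prompt concatenation. A minimal sketch, with an assumed prompt layout and the hypothetical name `build_prompt` (not BiomedRAG's actual template):

```python
def build_prompt(text, retrieved_chunks, k=3):
    # prepend the top-k retrieved chunks verbatim; no special encoder machinery
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(retrieved_chunks[:k]))
    return (
        "Context:\n" + context +
        "\n\nExtract (head, relation, tail) triples from the input.\n"
        "Input: " + text + "\nTriples:"
    )
```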
https://arxiv.org/abs/2405.00465
This paper presents a comprehensive exploration of relation extraction utilizing advanced language models, specifically Chain of Thought (CoT) and Graphical Reasoning (GRE) techniques. We demonstrate how leveraging in-context learning with GPT-3.5 can significantly enhance the extraction process, particularly through detailed example-based reasoning. Additionally, we introduce a novel graphical reasoning approach that dissects relation extraction into sequential sub-tasks, improving precision and adaptability in processing complex relational data. Our experiments, conducted on multiple datasets, including manually annotated data, show considerable improvements in performance metrics, underscoring the effectiveness of our methodologies.
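Dissecting relation extraction into sequential sub-tasks can be illustrated as a pair of prompt builders, one per step. These templates are assumptions for illustration, not the paper's actual prompts:

```python
def entity_prompt(sentence):
    # sub-task 1: ask the model to surface candidate entities first
    return f"Sentence: {sentence}\nStep 1 - list the entities mentioned:\nEntities:"

def relation_prompt(sentence, head, tail):
    # sub-task 2: given two entities, elicit chain-of-thought reasoning
    # about the relation before committing to a label
    return (
        f"Sentence: {sentence}\n"
        f"Step 2 - how are '{head}' and '{tail}' related?\n"
        "Let's think step by step.\nRelation:"
    )
```

Each step's model output feeds the next prompt, so errors can be localized to a sub-task rather than a single monolithic extraction call.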
https://arxiv.org/abs/2405.00216
This paper introduces uRAG--a framework with a unified retrieval engine that serves multiple downstream retrieval-augmented generation (RAG) systems. Each RAG system consumes the retrieval results for a unique purpose, such as open-domain question answering, fact verification, entity linking, and relation extraction. We introduce a generic training guideline that standardizes the communication between the search engine and the downstream RAG systems that engage in optimizing the retrieval model. This lays the groundwork for us to build a large-scale experimentation ecosystem consisting of 18 RAG systems that engage in training and 18 unknown RAG systems that use uRAG as new users of the search engine. Using this experimentation ecosystem, we answer a number of fundamental research questions that improve our understanding of the promises and challenges in developing search engines for machines.
https://arxiv.org/abs/2405.00175
Domain-Specific Chinese Relation Extraction (DSCRE) aims to extract relations between entities from domain-specific Chinese text. Despite the rapid development of PLMs in recent years, especially LLMs, DSCRE still faces three core challenges: complex network structure design, poor awareness, and the high cost of fine-tuning. Given the impressive performance of large language models (LLMs) in natural language processing, we propose a new framework called CRE-LLM. This framework is based on fine-tuning open-source LLMs, such as Llama-2, ChatGLM2, and Baichuan2. CRE-LLM enhances the logic-awareness and generative capabilities of the model by constructing an appropriate prompt and utilizing open-source LLMs for instruction-supervised fine-tuning. It then directly extracts the relations of the given entities from the input textual data, improving on prior CRE approaches. To demonstrate the effectiveness of the proposed framework, we conducted extensive experiments on two domain-specific CRE datasets, FinRE and SanWen. The experimental results show that CRE-LLM is significantly superior and robust, achieving state-of-the-art (SOTA) performance on the FinRE dataset. This paper introduces a novel approach, combining LLMs with triples, to semantically more complex domain-specific relation extraction (DSCRE) tasks. Our code is publicly available.
https://arxiv.org/abs/2404.18085
Relation extraction (RE) aims to identify relations between entities mentioned in texts. Although large language models (LLMs) have demonstrated impressive in-context learning (ICL) abilities in various tasks, they still suffer from poor performance compared to most supervised fine-tuned RE methods. Utilizing ICL for RE with LLMs encounters two challenges: (1) retrieving good demonstrations from training examples, and (2) enabling LLMs to exhibit strong ICL abilities in RE. On the one hand, retrieving good demonstrations is a non-trivial process in RE, which easily results in low relevance regarding entities and relations. On the other hand, ICL with an LLM achieves poor performance in RE when RE differs from language modeling in nature or when the LLM is not large enough. In this work, we propose a novel recall-retrieve-reason RE framework that synergizes LLMs with retrieval corpora (training examples) to enable relevant retrieving and reliable in-context reasoning. Specifically, we distill consistent ontological knowledge from training datasets to let LLMs generate relevant entity pairs, grounded by the retrieval corpora, as valid queries. These entity pairs are then used to retrieve relevant training examples from the retrieval corpora as demonstrations for LLMs to conduct better ICL via instruction tuning. Extensive experiments on different LLMs and RE datasets demonstrate that our method generates relevant and valid entity pairs and boosts the ICL abilities of LLMs, achieving competitive or new state-of-the-art performance on sentence-level RE compared to previous supervised fine-tuning methods and ICL-based methods.
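The retrieve step, turning recalled entity pairs into demonstrations, can be sketched as a lookup over the training corpus. This is a simplified stand-in (exact entity matching rather than the paper's retrieval model); `retrieve_demonstrations` and the corpus row layout are assumptions.

```python
def retrieve_demonstrations(entity_pairs, corpus, k=2):
    # corpus rows: (sentence, head, tail, relation) training examples;
    # for each recalled entity pair, pull up to k matching examples
    # to serve as in-context demonstrations
    demos = []
    for head, tail in entity_pairs:
        hits = [ex for ex in corpus if ex[1] == head or ex[2] == tail]
        demos.extend(hits[:k])
    return demos
```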
https://arxiv.org/abs/2404.17809
Relation extraction (RE) is an important task that aims to identify the relationships between entities in texts. While large language models (LLMs) have revealed remarkable in-context learning (ICL) capability for general zero- and few-shot learning, recent studies indicate that current LLMs still struggle with zero- and few-shot RE. Previous studies have mainly been dedicated to designing prompt formats and selecting good examples for improving ICL-based RE. Although both factors are vital for ICL, if one can fundamentally boost the ICL capability of LLMs in RE, the zero- and few-shot RE performance via ICL will be significantly improved. To this end, we introduce \textsc{Micre} (\textbf{M}eta \textbf{I}n-\textbf{C}ontext learning of LLMs for \textbf{R}elation \textbf{E}xtraction), a new meta-training framework for zero- and few-shot RE where an LLM is tuned to do ICL on a diverse collection of RE datasets (i.e., learning to learn in context for RE). Through meta-training, the model learns a new RE task in context more effectively by conditioning on a few training examples, with no parameter updates or task-specific templates at inference time, enabling better zero- and few-shot task generalization. We experiment with \textsc{Micre} on various LLMs of different model scales and 12 public RE datasets, and then evaluate it on unseen RE benchmarks under zero- and few-shot settings. \textsc{Micre} delivers comparable or superior performance compared to a range of baselines, including supervised fine-tuning and typical in-context learning methods. We find that the gains are particularly significant for larger model scales, and that using a diverse set of meta-training RE datasets is key to the improvements. Empirically, we show that \textsc{Micre} can transfer relation semantic knowledge via the relation label names during inference on target RE datasets.
https://arxiv.org/abs/2404.17807
Dialogue relation extraction (DRE) aims to extract relations between two arguments within a dialogue, which is more challenging than standard RE due to the higher person pronoun frequency and lower information density in dialogues. However, existing DRE methods still suffer from two serious issues: (1) they are hard-pressed to capture long and sparse multi-turn information, and (2) they struggle to extract golden relations based on partial dialogues; this motivates us to discover more effective methods that can alleviate the above issues. We notice that the rise of large language models (LLMs) has sparked considerable interest in evaluating their performance across diverse tasks. To this end, we initially investigate the capabilities of different LLMs in DRE, considering both proprietary models and open-source models. Interestingly, we discover that LLMs significantly alleviate both issues in existing DRE methods. Generally, we have the following findings: (1) scaling up model size substantially boosts the overall DRE performance and achieves exceptional results, tackling the difficulty of capturing long and sparse multi-turn information; (2) LLMs encounter a much smaller performance drop from the entire-dialogue setting to the partial-dialogue setting compared to existing methods; (3) LLMs deliver competitive or superior performance under both full-shot and few-shot settings compared to the current state of the art; (4) LLMs show modest performance on inverse relations but much stronger improvements on general relations, and they can handle dialogues of various lengths, especially longer sequences.
https://arxiv.org/abs/2404.17802
Document-level Relation Extraction (DocRE) is the task of extracting all semantic relationships from a document. While studies have been conducted on English DocRE, limited attention has been given to DocRE in non-English languages. This work delves into effectively utilizing existing English resources to promote DocRE studies in non-English languages, with Japanese as the representative case. As an initial attempt, we construct a dataset by transferring an English dataset to Japanese. However, models trained on such a dataset suffer from low recall. We investigate the error cases and attribute the failure to differences in surface structure and semantics between documents translated from English and those written by native speakers. We thus switch to exploring whether the transferred dataset can assist human annotation of Japanese documents. In our proposal, annotators edit relation predictions from a model trained on the transferred dataset. Quantitative analysis shows that relation recommendations suggested by the model help reduce the human edit steps by approximately 50% compared with the previous approach. Experiments quantify the performance of existing DocRE models on our collected dataset, portraying the challenges of Japanese and cross-lingual DocRE.
https://arxiv.org/abs/2404.16506
Expert curation is essential to capture knowledge of enzyme functions from the scientific literature in FAIR open knowledgebases but cannot keep pace with the rate of new discoveries and new publications. In this work we present EnzChemRED, for Enzyme Chemistry Relation Extraction Dataset, a new training and benchmarking dataset to support the development of Natural Language Processing (NLP) methods such as (large) language models that can assist enzyme curation. EnzChemRED consists of 1,210 expert-curated PubMed abstracts in which enzymes and the chemical reactions they catalyze are annotated using identifiers from the UniProt Knowledgebase (UniProtKB) and the ontology of Chemical Entities of Biological Interest (ChEBI). We show that fine-tuning pre-trained language models with EnzChemRED can significantly boost their ability to identify mentions of proteins and chemicals in text (Named Entity Recognition, or NER) and to extract the chemical conversions in which they participate (Relation Extraction, or RE), with an average F1 score of 86.30% for NER, 86.66% for RE on chemical conversion pairs, and 83.79% for RE on chemical conversion pairs and linked enzymes. We combine the best-performing methods after fine-tuning using EnzChemRED to create an end-to-end pipeline for knowledge extraction from text and apply this to abstracts at PubMed scale to create a draft map of enzyme functions in the literature to guide curation efforts in UniProtKB and the reaction knowledgebase Rhea. The EnzChemRED corpus is freely available at this https URL.
https://arxiv.org/abs/2404.14209
Information resources such as newspapers have produced unstructured text data in various languages related to the corona outbreak since December 2019. Analyzing these unstructured texts is time-consuming without representing them in a structured format; therefore, representing them in a structured format is crucial. An information extraction pipeline with essential tasks -- named entity tagging and relation extraction -- might be applied to these texts to accomplish this goal. This study proposes a data annotation pipeline to generate training data from corona news articles, covering both generic and domain-specific entities. Named entity recognition models are trained on this annotated corpus and then evaluated on test sentences manually annotated by domain experts to assess the performance of the trained models. The code base and demonstration are available at this https URL.
https://arxiv.org/abs/2404.13439
Representing unstructured data in a structured form is highly significant for information system management, in order to analyze and interpret it. To do this, the unstructured data might be converted into Knowledge Graphs by leveraging an information extraction pipeline whose main tasks are named entity recognition and relation extraction. This thesis aims to develop a novel continual relation extraction method to identify relations (interconnections) between entities in a data stream coming from the real world. The domain-specific data of this thesis is corona news from German and Austrian newspapers.
https://arxiv.org/abs/2404.17593
Information Extraction (IE) is a transformative process that converts unstructured text data into a structured format by employing entity and relation extraction (RE) methodologies. The identification of the relation between a pair of entities plays a crucial role within this framework. Despite the existence of various techniques for relation extraction, their efficacy heavily relies on access to labeled data and substantial computational resources. To address these challenges, Large Language Models (LLMs) emerge as promising solutions; however, they might return hallucinated responses due to their own training data. To overcome these limitations, this work proposes Retrieval-Augmented Generation-based Relation Extraction (RAG4RE), offering a pathway to enhance the performance of relation extraction tasks. We evaluate the effectiveness of our RAG4RE approach utilizing different LLMs. Through the utilization of established benchmarks, such as the TACRED, TACREV, Re-TACRED, and SemEval RE datasets, we aim to comprehensively evaluate the efficacy of our RAG4RE approach. In particular, we leverage prominent LLMs including Flan T5, Llama2, and Mistral in our investigation. The results of our study demonstrate that our RAG4RE approach surpasses the performance of traditional RE approaches based solely on LLMs, particularly evident on the TACRED dataset and its variations. Furthermore, our approach exhibits remarkable performance compared to previous RE methodologies across both the TACRED and TACREV datasets, underscoring its efficacy and potential for advancing RE tasks in natural language processing.
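A retrieval-augmented RE query of this kind can be sketched in two steps: find a similar labeled example, then splice it into the prompt. This is a hedged approximation (word-overlap similarity, an assumed prompt layout); `most_similar` and `rag4re_prompt` are hypothetical names, not the paper's API.

```python
def most_similar(query, corpus):
    # corpus rows: (sentence, head, tail, relation); similarity is word overlap
    def overlap(a, b):
        sa, sb = set(a.lower().split()), set(b.lower().split())
        return len(sa & sb) / max(1, len(sa | sb))
    return max(corpus, key=lambda ex: overlap(query, ex[0]))

def rag4re_prompt(sentence, head, tail, retrieved):
    # ground the query with the retrieved labeled example to curb hallucination
    ex_s, ex_h, ex_t, ex_r = retrieved
    return (f"Example: '{ex_s}' -> relation between '{ex_h}' and '{ex_t}': {ex_r}\n"
            f"Query: '{sentence}' -> relation between '{head}' and '{tail}':")
```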
https://arxiv.org/abs/2404.13397
Extracting structured information from unstructured text is critical for many downstream NLP applications and is traditionally achieved by closed information extraction (cIE). However, existing approaches for cIE suffer from two limitations: (i) they are often pipelines, which makes them prone to error propagation, and/or (ii) they are restricted to the sentence level, which prevents them from capturing long-range dependencies and results in expensive inference times. We address these limitations by proposing REXEL, a highly efficient and accurate model for the joint task of document-level cIE (DocIE). REXEL performs mention detection, entity typing, entity disambiguation, coreference resolution and document-level relation classification in a single forward pass to yield facts fully linked to a reference knowledge graph. It is on average 11 times faster than competitive existing approaches in a similar setting, and performs competitively both when optimised for any of the individual subtasks and for a variety of combinations of different joint tasks, surpassing the baselines by an average of more than 6 F1 points. The combination of speed and accuracy makes REXEL an accurate, cost-efficient system for extracting structured information at web scale. We also release an extension of the DocRED dataset to enable benchmarking of future work on DocIE, which is available at this https URL.
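The single-forward-pass design, one shared document encoding consumed by every subtask head, can be sketched structurally. This is an architectural outline under assumed names (`JointDocIE`, the head callables), not REXEL's implementation:

```python
class JointDocIE:
    """Encode the document once; every subtask head reads the shared representation."""

    def __init__(self, encode, heads):
        self.encode = encode        # document -> shared representation
        self.heads = heads          # subtask name -> head(shared_rep) -> prediction
        self.encode_calls = 0       # demonstrates the single-pass property

    def forward(self, document):
        self.encode_calls += 1
        rep = self.encode(document)
        return {name: head(rep) for name, head in self.heads.items()}
```

Because all heads share one encoding, adding a subtask costs one extra head call rather than another full pipeline stage, which is where the speed and the absence of pipeline error propagation come from.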
https://arxiv.org/abs/2404.12788
Joint entity and relation extraction plays a pivotal role in various applications, notably in the construction of knowledge graphs. Despite recent progress, existing approaches often fall short in two key aspects: richness of representation and coherence in output structure. These models often rely on handcrafted heuristics for computing entity and relation representations, potentially leading to loss of crucial information. Furthermore, they disregard task- and/or dataset-specific constraints, resulting in output structures that lack coherence. In our work, we introduce EnriCo, which mitigates these shortcomings. First, to foster rich and expressive representations, our model leverages attention mechanisms that allow both entities and relations to dynamically determine the pertinent information required for accurate extraction. Second, we introduce a series of decoding algorithms designed to infer the highest-scoring solutions while adhering to task- and dataset-specific constraints, thus promoting structured and coherent outputs. Our model demonstrates competitive performance compared to baselines when evaluated on joint IE datasets.
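Constraint-respecting decoding can be illustrated in its simplest form: discard candidate triples whose types violate the schema, then keep the highest-scoring survivors. This is a toy filter-and-rank sketch, not EnriCo's decoding algorithms; `constrained_decode` and the tuple layout are assumptions.

```python
def constrained_decode(candidates, type_constraints, top_n=3):
    # candidates: (score, head_type, relation, tail_type) tuples;
    # type_constraints: set of (head_type, relation, tail_type) the schema allows
    valid = [c for c in candidates if (c[1], c[2], c[3]) in type_constraints]
    # return the highest-scoring schema-coherent candidates
    return sorted(valid, key=lambda c: -c[0])[:top_n]
```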
https://arxiv.org/abs/2404.12493
Information extraction (IE) is an important task in Natural Language Processing (NLP), involving the extraction of named entities and their relationships from unstructured text. In this paper, we propose a novel approach to this task by formulating it as graph structure learning (GSL). By formulating IE as GSL, we enhance the model's ability to dynamically refine and optimize the graph structure during the extraction process. This formulation allows for better interaction and structure-informed decisions for entity and relation prediction, in contrast to previous models that have separate or untied predictions for these tasks. When compared against state-of-the-art baselines on joint entity and relation extraction benchmarks, our model, GraphER, achieves competitive results.
https://arxiv.org/abs/2404.12491