Ontology embeddings map classes, relations, and individuals in ontologies into $\mathbb{R}^n$, where similarity between entities can be computed and new axioms inferred. For ontologies in the Description Logic $\mathcal{EL}^{++}$, several embedding methods have been developed that explicitly generate models of an ontology. However, these methods have some limitations: they do not distinguish between statements that are unprovable and statements that are provably false, and may therefore use entailed statements as negatives. Furthermore, they do not utilize the deductive closure of an ontology to identify statements that are inferred but not asserted. We evaluated a set of embedding methods for $\mathcal{EL}^{++}$ ontologies based on a high-dimensional ball representation of concept descriptions, incorporating several modifications that aim to make use of the ontology's deductive closure. In particular, we designed novel negative losses that account both for the deductive closure and for different types of negatives. We demonstrate that our embedding methods improve over the baseline ontology embedding on the task of knowledge base or ontology completion.
https://arxiv.org/abs/2405.04868
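To make the geometric intuition concrete, below is a minimal PyTorch sketch of a ball-based subsumption loss together with deductive-closure-aware negative sampling in the spirit of the abstract; the loss forms, margins, and helper names are illustrative assumptions, not the paper's exact implementation.

```python
import random
import torch
import torch.nn as nn

class BallEmbedding(nn.Module):
    """Each class is an n-ball: a center in R^n plus a radius."""
    def __init__(self, n_classes: int, dim: int = 50):
        super().__init__()
        self.center = nn.Embedding(n_classes, dim)
        self.radius = nn.Embedding(n_classes, 1)

    def subsumption_loss(self, c, d, margin: float = 0.1):
        # C SubClassOf D holds when the ball of C lies inside the ball of D.
        dist = torch.linalg.norm(self.center(c) - self.center(d), dim=-1)
        return torch.relu(dist + self.radius(c).squeeze(-1)
                          - self.radius(d).squeeze(-1) - margin).mean()

    def negative_loss(self, c, d, margin: float = 0.1):
        # For a pair known NOT to be entailed, push C's ball out of D's ball.
        dist = torch.linalg.norm(self.center(c) - self.center(d), dim=-1)
        return torch.relu(self.radius(c).squeeze(-1) + self.radius(d).squeeze(-1)
                          + margin - dist).mean()

def sample_negatives(positives, entailed, n_classes):
    """Corrupt the superclass of each positive pair, but reject any candidate
    found in `entailed` (the asserted axioms plus the deductive closure), so
    that no entailed statement is ever used as a negative."""
    negatives = []
    for c, _ in positives:
        d = random.randrange(n_classes)
        if (c, d) not in entailed:
            negatives.append((c, d))
    return negatives
```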
All fields of knowledge are being impacted by Artificial Intelligence. In particular, the Deep Learning paradigm enables the development of data analysis tools that support subject matter experts in a variety of sectors, from physics to the recognition of ancient languages. Palaeontology is now following this trend as well. This study explores the capability of Convolutional Neural Networks (CNNs), a class of Deep Learning algorithms specifically crafted for computer vision tasks, to classify images of isolated fossil shark teeth gathered from online datasets as well as from the authors' experience with Peruvian Miocene and Italian Pliocene fossil assemblages. The shark taxa included in the final, composite dataset (which consists of more than one thousand images) are representative of both extinct and extant genera, namely Carcharhinus, Carcharias, Carcharocles, Chlamydoselachus, Cosmopolitodus, Galeocerdo, Hemipristis, Notorynchus, Prionace and Squatina. We developed a CNN, named SharkNet-X, specifically tailored to our recognition task, reaching a 5-fold cross-validated mean accuracy of 0.85 in identifying images containing a single shark tooth. Furthermore, we visualized the features extracted by the last dense layer of the CNN using the dimensionality-reduction technique t-SNE. In addition, to understand and explain the behaviour of the CNN while giving a palaeontological perspective on the results, we applied the explainability method SHAP. To the best of our knowledge, this is the first instance in which this method is applied to the field of palaeontology. The main goal of this work is to showcase how Deep Learning techniques can aid in identifying isolated fossil shark teeth, paving the way for new information tools that automate the recognition and classification of fossils.
https://arxiv.org/abs/2405.04189
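As an illustration of the feature-visualization step, here is a short sketch using scikit-learn's t-SNE on the last-dense-layer activations; the array file names are hypothetical placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Hypothetical inputs: activations of SharkNet-X's last dense layer
# (one row per tooth image) and the corresponding genus labels.
features = np.load("sharknet_features.npy")
labels = np.load("sharknet_labels.npy")

embedded = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, cmap="tab10", s=8)
plt.title("t-SNE of SharkNet-X last-dense-layer features")
plt.show()
```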
Deep learning models are often unaware of the inherent constraints of the task they are applied to. However, many downstream tasks require logical consistency. For ontology classification tasks, such constraints include subsumption and disjointness relations between classes. In order to increase the consistency of deep learning models, we propose a semantic loss that combines a label-based loss with terms penalising subsumption or disjointness violations. Our evaluation on the ChEBI ontology shows that the semantic loss is able to decrease the number of consistency violations by several orders of magnitude without decreasing classification performance. In addition, we use the semantic loss for unsupervised learning. We show that this can further improve consistency on data from a distribution outside the scope of the supervised training.
https://arxiv.org/abs/2405.02083
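One plausible way to realize such a semantic loss is sketched below: a label-based binary cross-entropy term plus hinge penalties for subsumption violations (a subclass scored above its superclass) and disjointness violations. The exact penalty shapes in the paper may differ.

```python
import torch
import torch.nn.functional as F

def semantic_loss(probs, labels, subsumptions, disjointness, weight=1.0):
    """probs/labels: (batch, n_classes); subsumptions and disjointness are
    pairs of index tensors, one (sub, super) or (a, b) pair per column."""
    label_loss = F.binary_cross_entropy(probs, labels)
    sub, sup = subsumptions
    # If C is subsumed by D, p(C) should not exceed p(D).
    sub_penalty = torch.relu(probs[:, sub] - probs[:, sup]).mean()
    a, b = disjointness
    # If C and D are disjoint, p(C) + p(D) should not exceed 1.
    dis_penalty = torch.relu(probs[:, a] + probs[:, b] - 1.0).mean()
    # For unsupervised data, drop the label term and keep only the penalties.
    return label_loss + weight * (sub_penalty + dis_penalty)
```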
The growing reliance on digital twins across various industries and domains brings with it semantic interoperability challenges. Ontologies are a well-known strategy for addressing such challenges, though, given the complexity of the phenomenon, there is a risk of reintroducing the interoperability challenges at the level of ontology representations. In the interest of avoiding such pitfalls, we introduce and defend characterizations of digital twins within the context of the Common Core Ontologies, an extension of the widely used Basic Formal Ontology. We provide a set of definitions and design patterns relevant to the domain of digital twins, highlighted by illustrative use cases of digital twins and their physical counterparts. In doing so, we provide a foundation on which to build more sophisticated ontological content related to digital twins.
https://arxiv.org/abs/2405.00960
Ontological representations of qualities, dispositions, and roles have been refined over the past decade, clarifying subtle distinctions in life science research. After articulating a widely used characterization of these entities within the context of Basic Formal Ontology (BFO), we identify gaps in this treatment and motivate the need to supplement the BFO characterization. By way of supplement, we propose definitions for grounding relations holding between qualities and dispositions, and between dispositions and roles, illustrating our proposal by representing subtle aspects of host-pathogen interactions.
https://arxiv.org/abs/2405.00197
The term credential encompasses educational certificates, degrees, certifications, and government-issued licenses. An occupational credential is a verification of an individual's qualification or competence issued by a third party with relevant authority. Job seekers often leverage such credentials as evidence that desired qualifications are satisfied by their holders. Many U.S. education and workforce development organizations have recognized the importance of credentials for employment and the challenges of understanding the value of credentials. In this study, we identified and ontologically defined credential and credential-related terms at the textual and semantic levels based on the Occupation Ontology (OccO), a BFO-based ontology. Different credential types and their authorization logic are modeled. We additionally defined a high-level hierarchy of credential-related terms and relations among many of them; this work was initiated in concert with the Alabama Talent Triad (ATT) program, which aims to connect learners, earners, employers, and education/training providers through credentials and skills. To our knowledge, our research provides the first systematic ontological modeling of the important domain of credentials and related content, supporting enhanced credential data and knowledge integration in the future.
https://arxiv.org/abs/2405.00186
In our daily lives, as in science and in all other domains, we encounter huge numbers of dispositions (tendencies, potentials, powers) which are realized in processes such as sneezing, sweating, shedding dandruff, and so on. Among this plethora of what we can think of as mere dispositions is a subset of dispositions in whose realizations we have an interest: a car responding well when driven on ice, a rabbit's lungs responding well when it is chased by a wolf, and so on. We call the latter capabilities, and we attempt to provide a robust ontological account of what capabilities are that is of sufficient generality to serve a variety of purposes, for example by providing a useful extension to ontology-based research in areas where capabilities data are currently being collected in siloed fashion.
https://arxiv.org/abs/2405.00183
Despite the widespread application of knowledge graphs (KGs) in tasks such as question answering and intelligent conversational systems, existing KGs face two major challenges: coarse information granularity and a deficiency in timeliness. These considerably hinder the retrieval and analysis of in-context, fine-grained, and up-to-date knowledge from KGs, particularly in highly specialized themes (e.g., specialized scientific research) and rapidly evolving contexts (e.g., breaking news or disaster tracking). To tackle these challenges, we propose a theme-specific knowledge graph (ThemeKG), a KG constructed from a theme-specific corpus, and design an unsupervised framework for ThemeKG construction (named TKGCon). The framework takes a raw theme-specific corpus and generates a high-quality KG that includes salient entities and relations under the theme. Specifically, we start with an entity ontology of the theme from Wikipedia, based on which we then generate candidate relations with Large Language Models (LLMs) to construct a relation ontology. To parse the documents from the theme corpus, we first map the extracted entity pairs to the ontology and retrieve the candidate relations. Finally, we incorporate the context and the ontology to consolidate the relations for entity pairs. We observe that directly prompting GPT-4 for a theme-specific KG leads to inaccurate entities (such as "two main types" as one entity in the query result) and unclear (such as "is", "has") or wrong relations (such as "have due to", "to start"). In contrast, by constructing the theme-specific KG step by step, our model outperforms GPT-4 and consistently identifies accurate entities and relations. Experimental results also show that our framework excels in evaluations against various KG construction baselines.
https://arxiv.org/abs/2404.19146
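A schematic of the TKGCon flow as described, with an abstract `llm` callable and hypothetical helpers (`category_pairs`, `category_of`, `extract_entity_pairs`) standing in for components the abstract does not specify:

```python
def build_theme_kg(corpus, entity_ontology, llm):
    """Entity ontology (from Wikipedia) -> LLM-proposed candidate relations
    -> ontology-guided relation consolidation per extracted entity pair."""
    # Step 1: build the relation ontology over entity-category pairs.
    relation_ontology = {}
    for cat_a, cat_b in entity_ontology.category_pairs():
        reply = llm(f"List plausible relations between '{cat_a}' and '{cat_b}'.")
        relation_ontology[(cat_a, cat_b)] = [r.strip() for r in reply.splitlines()]

    # Step 2: parse documents, map entity pairs to categories, retrieve the
    # candidate relations, and consolidate them using the sentence context.
    triples = []
    for doc in corpus:
        # extract_entity_pairs(doc) is a hypothetical NER/pairing helper.
        for head, tail, sentence in extract_entity_pairs(doc):
            cats = (entity_ontology.category_of(head),
                    entity_ontology.category_of(tail))
            candidates = relation_ontology.get(cats, [])
            relation = llm(f"Context: {sentence}\nChoose the relation between "
                           f"'{head}' and '{tail}' from {candidates}.").strip()
            triples.append((head, relation, tail))
    return triples
```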
Ethical reasoning is a crucial skill for Large Language Models (LLMs). However, moral values are not universal, but rather influenced by language and culture. This paper explores how three prominent LLMs -- GPT-4, ChatGPT, and Llama2-70B-Chat -- perform ethical reasoning in different languages and whether their moral judgements depend on the language in which they are prompted. We extend the study of ethical reasoning of LLMs by Rao et al. (2023) to a multilingual setup, following their framework of probing LLMs with ethical dilemmas and policies from three branches of normative ethics: deontology, virtue, and consequentialism. We experiment with six languages: English, Spanish, Russian, Chinese, Hindi, and Swahili. We find that GPT-4 is the most consistent and unbiased ethical reasoner across languages, while ChatGPT and Llama2-70B-Chat show significant moral value bias when we move to languages other than English. Interestingly, the nature of this bias varies significantly across languages for all LLMs, including GPT-4.
https://arxiv.org/abs/2404.18460
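A minimal sketch of the multilingual probing loop implied by this setup, with hypothetical `translate` and `ask` callables wrapping a translation system and the LLM under test:

```python
LANGUAGES = ["English", "Spanish", "Russian", "Chinese", "Hindi", "Swahili"]

def probe_llm(model, dilemma, policy, translate, ask):
    """Pose the same dilemma/policy pair in every language and collect the
    model's judgements; disagreement across languages signals language bias."""
    judgements = {}
    for language in LANGUAGES:
        prompt = translate(
            f"Dilemma: {dilemma}\nPolicy: {policy}\n"
            "Is the action morally permissible? Answer yes or no.", language)
        judgements[language] = ask(model, prompt)
    return judgements
```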
The Common Core Ontologies (CCO) are designed as a mid-level ontology suite that extends the Basic Formal Ontology. CCO has been increasingly adopted by a broad group of users and applications and has been proposed as the first standard mid-level ontology. Despite these successes, documentation of the contents and design patterns of the CCO has been comparatively minimal. This paper is a step toward providing enhanced documentation for the suite through a discussion of the contents of the eleven ontologies that collectively comprise the Common Core Ontologies.
https://arxiv.org/abs/2404.17758
Mid-level ontologies are used to integrate terminologies and data across disparate domains. There are, however, no clear, defensible criteria for determining whether a given ontology should count as mid-level, because we lack a rigorous characterization of what the middle level of generality is supposed to contain. Attempts to provide such a characterization have failed, we believe, because they have focused on specifying what is characteristic of those single ontologies that have been advanced as mid-level ontologies. Unfortunately, single ontologies of this sort are generally a mixture of top- and mid-level terms, and sometimes even of domain-level terms. To gain clarity, we aim to specify the necessary and sufficient conditions for a collection of one or more ontologies to inhabit what we call a mid-level architecture.
https://arxiv.org/abs/2404.17757
In our work, we systematize and analyze implicit ontological commitments in the responses generated by large language models (LLMs), focusing on ChatGPT 3.5 as a case study. We investigate how LLMs, despite having no explicit ontology, exhibit implicit ontological categorizations that are reflected in the texts they generate. The paper proposes an approach to understanding the ontological commitments of LLMs by defining ontology as a theory that provides a systematic account of the ontological commitments of some text. We investigate the ontological assumptions of ChatGPT and present a systematized account, i.e., GPT's top-level ontology. This includes a taxonomy, which is available as an OWL file, as well as a discussion about ontological assumptions (e.g., about its mereology or presentism). We show that in some aspects GPT's top-level ontology is quite similar to existing top-level ontologies. However, there are significant challenges arising from the flexible nature of LLM-generated texts, including ontological overload, ambiguity, and inconsistency.
https://arxiv.org/abs/2405.01581
Capability ontologies are increasingly used to model the functionalities of systems or machines. Creating such ontological models, with all the properties of and constraints on capabilities, is very complex and can only be done by ontology experts. However, Large Language Models (LLMs) have shown that they can generate machine-interpretable models from natural language text input and thus support engineers and ontology experts. Therefore, this paper investigates how LLMs can be used to create capability ontologies. We present a study with a series of experiments in which capabilities of varying complexity are generated using different prompting techniques and different LLMs. Errors in the generated ontologies are recorded and compared. To analyze the quality of the generated ontologies, a semi-automated approach based on RDF syntax checking, OWL reasoning, and SHACL constraints is used. The results of this study are very promising because, even for complex capabilities, the generated ontologies are almost free of errors.
https://arxiv.org/abs/2404.17524
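A sketch of such a semi-automated check using rdflib and pySHACL, assuming the SHACL shapes file encodes the capability model's constraints:

```python
from rdflib import Graph
from pyshacl import validate

def check_generated_ontology(turtle_text: str, shapes_path: str):
    """Three checks in the spirit of the paper: (1) RDF syntax -- parsing the
    generated Turtle raises on syntax errors; (2) OWL reasoning -- pySHACL
    materializes OWL-RL inferences before validation; (3) SHACL constraints."""
    data = Graph().parse(data=turtle_text, format="turtle")   # check 1
    shapes = Graph().parse(shapes_path, format="turtle")
    conforms, _, report_text = validate(
        data, shacl_graph=shapes, inference="owlrl")          # checks 2 + 3
    return conforms, report_text
```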
Objective: Clinical trials are essential for advancing pharmaceutical interventions, but they face a bottleneck in selecting eligible participants. Although leveraging electronic health records (EHR) for recruitment has gained popularity, the complex nature of unstructured medical texts makes it challenging to identify participants efficiently. Natural Language Processing (NLP) techniques have emerged as a solution, with a recent focus on transformer models. In this study, we aimed to evaluate the performance of a prompt-based large language model on the cohort selection task from unstructured medical notes collected in the EHR. Methods: To process the medical records, we selected the sentences of each record most relevant to the trial's eligibility criteria. The SNOMED CT concepts related to each eligibility criterion were collected, and the medical records were annotated with MedCAT based on the SNOMED CT ontology. Annotated sentences containing concepts that matched the criteria-relevant terms were extracted. A prompt-based large language model (a Generative Pre-trained Transformer (GPT) in this study) was then used with the extracted sentences as the training set. To assess its effectiveness, we evaluated the model's performance on the dataset from the 2018 n2c2 challenge, which aimed to classify the medical records of 311 patients against 13 eligibility criteria using NLP techniques. Results: Our proposed model achieved overall micro and macro F measures of 0.9061 and 0.8060, which are among the highest scores obtained in experiments with this dataset. Conclusion: The application of a prompt-based large language model to classify patients based on eligibility criteria achieved promising scores. In addition, we proposed an extractive-summarization method aided by the SNOMED CT ontology that can also be applied to other medical texts.
https://arxiv.org/abs/2404.16198
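A sketch of the extraction step, assuming MedCAT's model-pack API (`CAT.load_model_pack`, `get_entities`) and using naive sentence splitting; the paths and criterion concept sets are placeholders:

```python
from medcat.cat import CAT

def criterion_sentences(note: str, criterion_cuis: set, cat: CAT):
    """Keep only sentences whose MedCAT/SNOMED CT annotations contain a
    concept relevant to the eligibility criterion; these sentences become
    the input for the prompt-based model."""
    kept = []
    for sentence in note.split(". "):          # naive sentence splitting
        entities = cat.get_entities(sentence).get("entities", {}).values()
        if any(ent.get("cui") in criterion_cuis for ent in entities):
            kept.append(sentence)
    return kept

# Usage (the model-pack path and concept id are hypothetical):
# cat = CAT.load_model_pack("snomed_model_pack.zip")
# sentences = criterion_sentences(note_text, {"271737000"}, cat)
```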
The success of contrastive language-image pretraining (CLIP) relies on supervision from the pairing between images and captions, which tends to be noisy in web-crawled data. We present Mixture of Data Experts (MoDE) and learn a system of CLIP data experts via clustering. Each data expert is trained on one data cluster, making it less sensitive to false-negative noise in other clusters. At inference time, we ensemble their outputs by applying weights determined through the correlation between task metadata and cluster conditions. To estimate this correlation precisely, the samples in one cluster should be semantically similar, but the number of data experts should still be reasonable for training and inference. As such, we consider the ontology in human language and propose to use fine-grained cluster centers to represent each data expert at a coarse-grained level. Experimental studies show that four CLIP data experts on ViT-B/16 outperform the ViT-L/14 by OpenAI CLIP and OpenCLIP on zero-shot image classification, but at less than 35% of the training cost. Meanwhile, MoDE can train all data experts asynchronously and can flexibly include new data experts. The code is available at this https URL.
https://arxiv.org/abs/2404.16030
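The inference-time ensembling can be pictured as below, with one center per expert for brevity (the paper represents each expert by fine-grained cluster centers); the shapes and softmax temperature are illustrative:

```python
import numpy as np

def ensemble_logits(expert_logits, expert_centers, task_embedding, tau=0.1):
    """expert_logits: (n_experts, n_classes) outputs of each CLIP data expert;
    expert_centers: (n_experts, d) cluster centers; task_embedding: (d,)
    embedding of the task metadata (e.g., class names)."""
    similarity = expert_centers @ task_embedding      # correlation proxy
    weights = np.exp(similarity / tau)
    weights /= weights.sum()                          # softmax over experts
    return (weights[:, None] * expert_logits).sum(axis=0)
```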
Expert curation is essential to capture knowledge of enzyme functions from the scientific literature in FAIR open knowledgebases, but it cannot keep pace with the rate of new discoveries and new publications. In this work we present EnzChemRED, the Enzyme Chemistry Relation Extraction Dataset, a new training and benchmarking dataset to support the development of Natural Language Processing (NLP) methods, such as (large) language models, that can assist enzyme curation. EnzChemRED consists of 1,210 expert-curated PubMed abstracts in which enzymes and the chemical reactions they catalyze are annotated using identifiers from the UniProt Knowledgebase (UniProtKB) and the ontology of Chemical Entities of Biological Interest (ChEBI). We show that fine-tuning pre-trained language models with EnzChemRED can significantly boost their ability to identify mentions of proteins and chemicals in text (Named Entity Recognition, or NER) and to extract the chemical conversions in which they participate (Relation Extraction, or RE), with average F1 scores of 86.30% for NER, 86.66% for RE on chemical conversion pairs, and 83.79% for RE on chemical conversion pairs and linked enzymes. We combine the methods that perform best after fine-tuning on EnzChemRED to create an end-to-end pipeline for knowledge extraction from text and apply it to abstracts at PubMed scale to create a draft map of enzyme functions in the literature, guiding curation efforts in UniProtKB and the reaction knowledgebase Rhea. The EnzChemRED corpus is freely available at this https URL.
https://arxiv.org/abs/2404.14209
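As an illustration of the NER fine-tuning step, here is a minimal Hugging Face sketch; the base checkpoint, label set, and dataset preparation are placeholders rather than the paper's exact configuration:

```python
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

LABELS = ["O", "B-Protein", "I-Protein", "B-Chemical", "I-Chemical"]

def finetune_ner(train_dataset, base_model="bert-base-cased"):
    """train_dataset: a tokenized dataset with token-aligned label ids,
    e.g. built from the EnzChemRED abstracts."""
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForTokenClassification.from_pretrained(
        base_model, num_labels=len(LABELS))
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="enzchemred-ner", num_train_epochs=3),
        train_dataset=train_dataset,
    )
    trainer.train()
    return model, tokenizer
```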
Ontology matching is defined as finding a relationship or correspondence between two or more entities in two or more ontologies. To solve the interoperability problem between domain ontologies, semantically similar entities in these ontologies must be found and aligned before merging them. GraphMatcher, developed in this study, is an ontology matching system that uses a graph attention approach to compute a higher-level representation of a class together with its surrounding terms. GraphMatcher obtained remarkable results in the Ontology Alignment Evaluation Initiative (OAEI) 2022 conference track. Its code is available at this https URL.
https://arxiv.org/abs/2404.14450
Adverse drug events (ADEs) significantly impact clinical research and public health, contributing to failures in clinical trials and leading to increased healthcare costs. The accurate prediction and management of ADEs are crucial for developing safer, more effective medications and enhancing patient outcomes. To support this effort, we introduce CT-ADE, a novel dataset compiled to enhance the predictive modeling of ADEs. Encompassing over 12,000 instances extracted from clinical trial results, the CT-ADE dataset integrates drug, patient population, and contextual information for multilabel ADE classification tasks in monopharmacy treatments, providing a comprehensive resource for developing advanced predictive models. To mirror the complex nature of ADEs, annotations are standardized at the system organ class level of the Medical Dictionary for Regulatory Activities (MedDRA) ontology. Preliminary analyses using baseline models have demonstrated promising results, achieving a 73.33% F1 score and 81.54% balanced accuracy, highlighting CT-ADE's potential to advance ADE prediction. CT-ADE provides an essential tool for researchers aiming to leverage artificial intelligence and machine learning to enhance patient safety and minimize the impact of ADEs on pharmaceutical research and development. Researchers interested in using the CT-ADE dataset can find all necessary resources at this https URL.
https://arxiv.org/abs/2404.12827
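A minimal multilabel baseline in the spirit of the task, predicting one probability per MedDRA system organ class (27 SOCs) from an encoded drug/population/context text; the encoder is a stand-in for any text encoder:

```python
import torch
import torch.nn as nn

class ADESocClassifier(nn.Module):
    """Multilabel head over MedDRA system organ classes (SOCs)."""
    def __init__(self, encoder: nn.Module, hidden_dim: int, n_soc: int = 27):
        super().__init__()
        self.encoder = encoder            # maps input ids -> (batch, hidden_dim)
        self.head = nn.Linear(hidden_dim, n_soc)

    def forward(self, inputs):
        return self.head(self.encoder(inputs))   # one logit per SOC

# Multilabel training objective: binary cross-entropy over the SOC logits.
loss_fn = nn.BCEWithLogitsLoss()
```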
Current open-domain neural semantic parsers show impressive performance. However, closer inspection of the symbolic meaning representations they produce reveals significant weaknesses: they sometimes merely copy character sequences from the source text to form symbolic concepts, defaulting to the most frequent word sense in the training distribution. By leveraging the hierarchical structure of a lexical ontology, we introduce a novel compositional symbolic representation for concepts based on their position in the taxonomical hierarchy. This representation provides richer semantic information and enhances interpretability. We introduce a neural "taxonomical" semantic parser to utilize this new representation system of predicates, and compare it with a standard neural semantic parser trained on the traditional meaning representation format, employing a novel challenge set and evaluation metric. Our experimental findings demonstrate that the taxonomical model, trained on much richer and more complex meaning representations, performs slightly below the traditional model on the standard evaluation metrics, but outperforms it when dealing with out-of-vocabulary concepts. This finding is encouraging for research in computational semantics that aims to combine data-driven distributional meanings with knowledge-based symbolic representations.
https://arxiv.org/abs/2404.12698
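The idea of taxonomy-positional concept codes can be sketched with WordNet: encode a concept by its hypernym path so that related concepts share prefixes. The path-joining scheme below is illustrative, not the paper's exact encoding.

```python
from nltk.corpus import wordnet as wn   # requires: nltk.download("wordnet")

def taxonomical_code(synset_name: str) -> str:
    """Represent a concept by its hypernym path from the root, so that
    taxonomically close concepts share a code prefix."""
    synset = wn.synset(synset_name)
    path = synset.hypernym_paths()[0]           # root -> ... -> concept
    return ".".join(s.name().split(".")[0] for s in path)

# taxonomical_code("dog.n.01") yields a code like
# "entity.physical_entity. ... .canine.dog"; an out-of-vocabulary concept
# still inherits a meaningful prefix from its hypernyms.
```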
Different entities with the same name can be difficult to distinguish. Handling confusing entity mentions is a crucial skill for language models (LMs). For example, given the question "Where was Michael Jordan educated?" and a set of documents discussing different people named Michael Jordan, can LMs distinguish the entity mentions and generate a cohesive answer to the question? To test this ability, we introduce a new benchmark, AmbigDocs. By leveraging Wikipedia's disambiguation pages, we identify sets of documents belonging to different entities who share an ambiguous name. From these documents, we generate questions containing an ambiguous name and their corresponding sets of answers. Our analysis reveals that current state-of-the-art models often yield ambiguous answers or incorrectly merge information belonging to different entities. We establish an ontology categorizing four types of incomplete answers, together with automatic evaluation metrics to identify these categories. We lay the foundation for future work on reasoning across multiple documents with ambiguous entities.
https://arxiv.org/abs/2404.12447