We used a dictionary built from biomedical terminology extracted from various sources, such as DrugBank, MedDRA, MedlinePlus, and TCMGeneDIT, to tag more than 8 million Instagram posts made between 2010 and early 2016 by users who mentioned an epilepsy-relevant drug at least once. A random sample of 1,771 posts with 2,947 term matches was evaluated by human annotators to identify false positives. OpenAI's GPT series models were compared against human annotation. Frequent terms with a high false-positive rate were removed from the dictionary. Analysis of the estimated false-positive rates of the annotated terms revealed 8 ambiguous terms (plus synonyms) used in Instagram posts, which were removed from the original dictionary. To study the effect of removing those terms, we constructed knowledge networks using the refined and the original dictionaries and performed an eigenvector-centrality analysis on both networks. We show that the refined dictionary leads to a significantly different ranking of important terms, as measured by their eigenvector centrality in the knowledge networks. Furthermore, the most important terms obtained after refinement are of greater medical relevance. In addition, we show that OpenAI's GPT series models fare worse than human annotators on this task.
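The eigenvector-centrality comparison can be sketched in a few lines. This is a toy illustration only: the two co-occurrence networks below are invented, not the paper's Instagram data, and the term names are made-up stand-ins for dictionary entries.

```python
import numpy as np

def eigenvector_centrality(adj, iters=100):
    """Shifted power iteration; adding I keeps convergence on bipartite graphs."""
    m = adj + np.eye(adj.shape[0])
    v = np.ones(adj.shape[0])
    for _ in range(iters):
        v = m @ v
        v /= np.linalg.norm(v)
    return v

terms = ["seizure", "keppra", "dizzy", "patient"]
# Original network: "dizzy" (an ambiguous lay term) co-occurs with everything.
orig = np.array([[0, 1, 1, 1],
                 [1, 0, 1, 0],
                 [1, 1, 0, 1],
                 [1, 0, 1, 0]], float)
# Refined network: the ambiguous term's edges are removed from the dictionary.
refined = orig.copy()
refined[2, :] = 0
refined[:, 2] = 0

for name, adj in [("original", orig), ("refined", refined)]:
    c = eigenvector_centrality(adj)
    ranking = [terms[i] for i in np.argsort(-c)]
    print(name, ranking)
```

Removing the ambiguous node changes which terms dominate the ranking, which is the effect the refinement study measures on the real networks.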
https://arxiv.org/abs/2405.08784
With the proliferation of edge devices, the attack surface of these devices has increased significantly. The decentralized deployment of threat intelligence on edge devices, coupled with adaptive machine learning techniques such as the in-context learning feature of large language models (LLMs), represents a promising paradigm for enhancing cybersecurity on low-powered edge devices. This approach involves the deployment of lightweight machine learning models directly onto edge devices to analyze local data streams, such as network traffic and system logs, in real time. Additionally, distributing computational tasks to an edge server reduces latency and improves responsiveness while also enhancing privacy by processing sensitive data locally. LLM servers can enable these edge servers to autonomously adapt to evolving threats and attack patterns, continuously updating their models to improve detection accuracy and reduce false positives. Furthermore, collaborative learning mechanisms facilitate secure and trustworthy peer-to-peer knowledge sharing among edge devices, enhancing the collective intelligence of the network and enabling dynamic threat mitigation measures such as device quarantine in response to detected anomalies. The scalability and flexibility of this approach make it well suited for diverse and evolving network environments, as edge devices only send suspicious information such as network traffic and system log changes, offering a resilient and efficient solution to combat emerging cyber threats at the network edge. Thus, our proposed framework can improve edge-computing security through better cyber threat detection and mitigation, for example by isolating compromised edge devices from the network.
https://arxiv.org/abs/2405.08755
Navigating the complex landscape of news articles involves understanding the various actors or entities involved, referred to as news stakeholders. These stakeholders, ranging from policymakers to opposition figures, citizens, and more, play pivotal roles in shaping news narratives. Recognizing their stakeholder types, reflecting their roles, political alignments, social standing, and more, is paramount for a nuanced comprehension of news content. Despite existing works focusing on salient entity extraction, coverage variations, and political affiliations through social media data, the automated detection of stakeholder roles within news content remains an underexplored domain. In this paper, we bridge this gap by introducing an effective approach to classify stakeholder types in news articles. Our method involves transforming the stakeholder classification problem into a natural language inference task, utilizing contextual information from news articles and external knowledge to enhance the accuracy of stakeholder type detection. Moreover, our proposed model showcases efficacy in zero-shot settings, further extending its applicability to diverse news contexts.
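The reformulation of classification as natural language inference can be sketched concretely. The label set and hypothesis template below are invented for illustration; the paper's actual stakeholder taxonomy and templates may differ.

```python
# Illustrative sketch: casting stakeholder-type classification as NLI.
STAKEHOLDER_TYPES = ["policymaker", "opposition figure", "citizen"]

def to_nli_pairs(sentence, entity):
    """Build one (premise, hypothesis) pair per candidate stakeholder type."""
    return [
        (sentence, f"{entity} is a {label}.")
        for label in STAKEHOLDER_TYPES
    ]

pairs = to_nli_pairs(
    "The minister announced a new tax reform on Monday.", "The minister"
)
for premise, hypothesis in pairs:
    print(f"premise: {premise!r}  hypothesis: {hypothesis!r}")
```

An off-the-shelf NLI model then scores each pair, and the hypothesis with the highest entailment probability yields the predicted stakeholder type; this is also what makes zero-shot use possible, since new labels only require new hypotheses.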
https://arxiv.org/abs/2405.08751
Non-prehensile manipulation enables fast interactions with objects by circumventing the need to grasp and ungrasp as well as handling objects that cannot be grasped through force closure. Current approaches to non-prehensile manipulation focus on static contacts, avoiding the underactuation that comes with sliding. However, the ability to control sliding contact, essentially removing the no-slip constraint, opens up new possibilities in dynamic manipulation. In this paper, we explore a challenging dynamic non-prehensile manipulation task that requires the consideration of the full spectrum of hybrid contact modes. We leverage recent methods in contact-implicit MPC to handle the multi-modal planning aspect of the task. We demonstrate, with careful consideration of integration between the simple model used for MPC and the low-level tracking controller, how contact-implicit MPC can be adapted to dynamic tasks. Surprisingly, despite the known inaccuracies of frictional rigid contact models, our method is able to react to these inaccuracies while still quickly performing the task. Moreover, we do not use common aids such as reference trajectories or motion primitives, highlighting the generality of our approach. To the best of our knowledge, this is the first application of contact-implicit MPC to a dynamic manipulation task in three dimensions.
https://arxiv.org/abs/2405.08731
Large-scale Vision-Language Models (VLMs) have demonstrated exceptional performance in natural vision tasks, motivating researchers across domains to explore domain-specific VLMs. However, the construction of powerful domain-specific VLMs demands vast amounts of annotated data, substantial electrical energy, and computing resources, primarily accessible to industry, thereby hindering VLM research in academia. To address this challenge and foster sustainable and equitable VLM research, we present the Generalized Domain Prompt Learning (GDPL) framework. GDPL facilitates the transfer of VLMs' robust recognition capabilities from natural vision to specialized domains, without the need for extensive data or resources. By leveraging small-scale domain-specific foundation models and minimal prompt samples, GDPL empowers the language branch with domain knowledge through quaternion networks, uncovering cross-modal relationships between domain-specific vision features and natural vision-based contextual embeddings. Simultaneously, GDPL guides the vision branch into specific domains through hierarchical propagation of generated vision prompt features, grounded in well-matched vision-language relations. Furthermore, to fully harness the domain adaptation potential of VLMs, we introduce a novel low-rank adaptation approach. Extensive experiments across diverse domains like remote sensing, medical imaging, geology, Synthetic Aperture Radar, and fluid dynamics, validate the efficacy of GDPL, demonstrating its ability to achieve state-of-the-art domain recognition performance in a prompt learning paradigm. Our framework paves the way for sustainable and inclusive VLM research, transcending the barriers between academia and industry.
https://arxiv.org/abs/2405.08668
Since the release of ChatGPT and GPT-4, large language models (LLMs) and multimodal large language models (MLLMs) have garnered significant attention due to their powerful and general capabilities in understanding, reasoning, and generation, thereby offering new paradigms for the integration of artificial intelligence with medicine. This survey comprehensively overviews the development background and principles of LLMs and MLLMs, as well as explores their application scenarios, challenges, and future directions in medicine. Specifically, this survey begins by focusing on the paradigm shift, tracing the evolution from traditional models to LLMs and MLLMs, summarizing the model structures to provide detailed foundational knowledge. Subsequently, the survey details the entire process from constructing and evaluating to using LLMs and MLLMs with a clear logic. Following this, to emphasize the significant value of LLMs and MLLMs in healthcare, we survey and summarize 6 promising applications in healthcare. Finally, the survey discusses the challenges faced by medical LLMs and MLLMs and proposes a feasible approach and direction for the subsequent integration of artificial intelligence with medicine. Thus, this survey aims to provide researchers with a valuable and comprehensive reference guide from the perspectives of the background, principles, and clinical applications of LLMs and MLLMs.
https://arxiv.org/abs/2405.08603
The nature of diversity in real-world environments necessitates neural network models to expand from closed category settings to accommodate novel emerging categories. In this paper, we study the open-vocabulary object detection (OVD), which facilitates the detection of novel object classes under the supervision of only base annotations and open-vocabulary knowledge. However, we find that the inadequacy of neighboring relationships between regions during the alignment process inevitably constrains the performance on recent distillation-based OVD strategies. To this end, we propose Neighboring Region Attention Alignment (NRAA), which performs alignment within the attention mechanism of a set of neighboring regions to boost the open-vocabulary inference. Specifically, for a given proposal region, we randomly explore the neighboring boxes and conduct our proposed neighboring region attention (NRA) mechanism to extract relationship information. Then, this interaction information is seamlessly provided into the distillation procedure to assist the alignment between the detector and the pre-trained vision-language models (VLMs). Extensive experiments validate that our proposed model exhibits superior performance on open-vocabulary benchmarks.
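The attention over neighboring regions builds on generic scaled dot-product attention, which can be sketched as follows. Dimensions and features are made up, and the paper's actual NRA mechanism may differ in detail; this only shows the ingredient it builds on.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 8
proposal = rng.normal(size=(1, d))    # query: the proposal region's feature
neighbors = rng.normal(size=(5, d))   # keys/values: randomly explored neighbor boxes

# Attend from the proposal to its neighbors to extract relationship information.
attn = softmax(proposal @ neighbors.T / np.sqrt(d))   # (1, 5) weights
context = attn @ neighbors                            # relationship-aware feature
print(attn.shape, context.shape)
```

In the framework, a feature like `context` is what gets fed into the distillation loss to help align the detector with the pre-trained VLM.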
https://arxiv.org/abs/2405.08593
Image stitching aims to construct a wide field of view with high spatial resolution, which cannot be achieved in a single exposure. Typically, conventional image stitching techniques, other than deep learning, require complex computation and are thus computationally expensive, especially for stitching large raw images. In this study, inspired by the multiscale feature of fluid turbulence, we developed a fast feature point detection algorithm named local-peak scale-invariant feature transform (LP-SIFT), based on multiscale local peaks and the scale-invariant feature transform method. By combining LP-SIFT and RANSAC in image stitching, the stitching speed can be improved by orders of magnitude compared with the original SIFT method. Nine large images (over 2600*1600 pixels each), arranged randomly without prior knowledge, can be stitched within 158.94 s. The algorithm is highly practical for applications requiring a wide field of view in diverse application scenes, e.g., terrain mapping, biological analysis, and even criminal investigation.
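The "local peak" idea can be illustrated with a minimal single-scale sketch (not the authors' implementation): candidate feature points are pixels that dominate their neighborhood, found here with plain NumPy comparisons. In LP-SIFT the same test is applied at multiple scales before SIFT descriptors and RANSAC matching.

```python
import numpy as np

def local_peaks(img):
    """Return (row, col) of pixels strictly greater than all 8 neighbors."""
    core = img[1:-1, 1:-1]
    mask = np.ones_like(core, dtype=bool)
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            # Shifted view of the image aligned with the interior `core`.
            mask &= core > img[1 + dr:img.shape[0] - 1 + dr,
                               1 + dc:img.shape[1] - 1 + dc]
    rows, cols = np.nonzero(mask)
    return [(int(r) + 1, int(c) + 1) for r, c in zip(rows, cols)]

img = np.zeros((7, 7))
img[2, 3] = 5.0   # a synthetic bright peak
img[5, 1] = 3.0   # another peak
print(local_peaks(img))  # → [(2, 3), (5, 1)]
```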
https://arxiv.org/abs/2405.08578
Although pre-training on a large amount of data is beneficial for robot learning, current paradigms only perform large-scale pretraining for visual representations, whereas representations for other modalities are trained from scratch. In contrast to the abundance of visual data, it is unclear what relevant internet-scale data may be used for pretraining other modalities such as tactile sensing. Such pretraining becomes increasingly crucial in the low-data regimes common in robotics applications. In this paper, we address this gap by using contact microphones as an alternative tactile sensor. Our key insight is that contact microphones capture inherently audio-based information, allowing us to leverage large-scale audio-visual pretraining to obtain representations that boost the performance of robotic manipulation. To the best of our knowledge, our method is the first approach leveraging large-scale multisensory pre-training for robotic manipulation. For supplementary information including videos of real robot experiments, please see this https URL.
https://arxiv.org/abs/2405.08576
In visual tasks, large teacher models capture essential features and deep information, enhancing performance. However, distilling this information into smaller student models often leads to performance loss due to structural differences and capacity limitations. To tackle this, we propose a distillation framework based on graph knowledge, including a multi-level feature alignment strategy and an attention-guided mechanism to provide a targeted learning trajectory for the student model. We emphasize spectral embedding (SE) as a key technique in our distillation process, which merges the student's feature space with the relational knowledge and structural complexities similar to the teacher network. This method captures the teacher's understanding in a graph-based representation, enabling the student model to more accurately mimic the complex structural dependencies present in the teacher model. Compared to methods that focus only on specific distillation areas, our strategy not only considers key features within the teacher model but also endeavors to capture the relationships and interactions among feature sets, encoding these complex pieces of information into a graph structure to understand and utilize the dynamic relationships among these pieces of information from a global perspective. Experiments show that our method outperforms previous feature distillation methods on the CIFAR-100, MS-COCO, and Pascal VOC datasets, proving its efficiency and applicability.
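Spectral embedding on a toy graph can be sketched as below. This is generic SE via Laplacian eigenvectors, the ingredient used to encode relational structure, not the paper's distillation code; the graph is invented.

```python
import numpy as np

def spectral_embedding(adj, dim=2):
    """Embed nodes with eigenvectors of the graph Laplacian L = D - A."""
    deg = np.diag(adj.sum(axis=1))
    lap = deg - adj
    vals, vecs = np.linalg.eigh(lap)       # eigenvalues in ascending order
    # Skip the trivial constant eigenvector (eigenvalue 0).
    return vecs[:, 1:1 + dim]

# Two triangles joined by one edge: nodes 0-2 and 3-5.
adj = np.zeros((6, 6))
for a, b in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    adj[a, b] = adj[b, a] = 1.0
emb = spectral_embedding(adj, dim=2)

# The Fiedler vector (first embedding axis) separates the two clusters,
# capturing structural dependencies that a pointwise feature loss would miss.
print(np.sign(emb[:, 0]))
```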
https://arxiv.org/abs/2405.08547
Recent advances in knowledge graph embedding (KGE) rely on Euclidean/hyperbolic orthogonal relation transformations to model intrinsic logical patterns and topological structures. However, existing approaches are confined to rigid relational orthogonalization with restricted dimension and homogeneous geometry, leading to deficient modeling capability. In this work, we move beyond these approaches in terms of both dimension and geometry by introducing a powerful framework named GoldE, which features a universal orthogonal parameterization based on a generalized form of Householder reflection. Such parameterization can naturally achieve dimensional extension and geometric unification with theoretical guarantees, enabling our framework to simultaneously capture crucial logical patterns and inherent topological heterogeneity of knowledge graphs. Empirically, GoldE achieves state-of-the-art performance on three standard benchmarks. Codes are available at this https URL.
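The classical construction that GoldE generalizes can be sketched in a few lines: any unit vector u defines a Householder reflection H = I - 2uuᵀ, and products of reflections remain orthogonal, so unconstrained parameter vectors parameterize orthogonal relation matrices. This is the standard form, not the paper's generalized one.

```python
import numpy as np

def householder(u):
    """Reflection H = I - 2 u u^T for a (normalized) parameter vector u."""
    u = u / np.linalg.norm(u)
    return np.eye(len(u)) - 2.0 * np.outer(u, u)

rng = np.random.default_rng(0)
d = 4
# Compose k reflections from k unconstrained parameter vectors.
Q = np.eye(d)
for _ in range(3):
    Q = Q @ householder(rng.normal(size=d))

# Q is orthogonal by construction, so the relation transformation preserves
# norms -- the property KGE models exploit to capture logical patterns.
print(np.allclose(Q @ Q.T, np.eye(d)))  # → True
```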
https://arxiv.org/abs/2405.08540
The SemEval task on Argument Reasoning in Civil Procedure is challenging in that it requires understanding legal concepts and inferring complex arguments. Currently, most Large Language Models (LLM) excelling in the legal realm are principally purposed for classification tasks, hence their reasoning rationale is subject to contention. The approach we advocate involves using a powerful teacher-LLM (ChatGPT) to extend the training dataset with explanations and generate synthetic data. The resulting data are then leveraged to fine-tune a small student-LLM. Contrary to previous work, our explanations are not directly derived from the teacher's internal knowledge. Instead they are grounded in authentic human analyses, therefore delivering a superior reasoning signal. Additionally, a new `mutation' method generates artificial data instances inspired from existing ones. We are publicly releasing the explanations as an extension to the original dataset, along with the synthetic dataset and the prompts that were used to generate both. Our system ranked 15th in the SemEval competition. It outperforms its own teacher and can produce explanations aligned with the original human analyses, as verified by legal experts.
https://arxiv.org/abs/2405.08502
Traditional recommendation proposals, including content-based and collaborative filtering, usually focus on similarity between items or users. Existing approaches lack ways of introducing unexpectedness into recommendations, prioritizing globally popular items over exposing users to unforeseen items. This investigation aims to design and evaluate a novel layer on top of recommender systems suited to incorporate relational information and suggest items with a user-defined degree of surprise. We propose a Knowledge Graph (KG) based recommender system by encoding user interactions on item catalogs. Our study explores whether network-level metrics on KGs can influence the degree of surprise in recommendations. We hypothesize that surprisingness correlates with certain network metrics, treating user profiles as subgraphs within a larger catalog KG. The achieved solution reranks recommendations based on their impact on structural graph metrics. Our research contributes to optimizing recommendations to reflect the metrics. We experimentally evaluate our approach on two datasets of LastFM listening histories and synthetic Netflix viewing profiles. We find that reranking items based on complex network metrics leads to a more unexpected and surprising composition of recommendation lists.
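The reranking idea can be shown with a toy sketch. The catalog graph, the metric (graph density), and the scoring rule are all invented for illustration; the paper studies several network-level metrics on much richer KGs.

```python
from itertools import combinations

# Catalog KG as an undirected edge set over items (e.g., shared artist/genre).
edges = {frozenset(e) for e in [
    ("a", "b"), ("b", "c"), ("a", "c"),   # tight cluster the user knows
    ("c", "d"), ("d", "e"),               # a path leading elsewhere
]}
profile = ["a", "b", "c"]   # user profile = a subgraph of the catalog KG

def density(nodes):
    """Fraction of possible item pairs that are connected in the KG."""
    pairs = list(combinations(nodes, 2))
    hits = sum(frozenset(p) in edges for p in pairs)
    return hits / len(pairs) if pairs else 0.0

def surprise(candidate):
    """How much adding the candidate dilutes the profile subgraph's density."""
    return density(profile) - density(profile + [candidate])

candidates = ["d", "e"]
reranked = sorted(candidates, key=surprise, reverse=True)
print(reranked)  # → ['e', 'd']
```

Item "e" sits farther from the user's cluster, so it perturbs the structural metric more and is ranked as the more surprising recommendation.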
https://arxiv.org/abs/2405.08465
Stereo image super-resolution (SR) refers to the reconstruction of a high-resolution (HR) image from a pair of low-resolution (LR) images as typically captured by a dual-camera device. To enhance the quality of SR images, most previous studies focused on increasing the number and size of feature maps and introducing complex and computationally intensive structures, resulting in models with high computational complexity. Here, we propose a simple yet efficient stereo image SR model called NAFRSSR, which is modified from the previous state-of-the-art model NAFSSR by introducing recursive connections and lightweighting the constituent modules. Our NAFRSSR model is composed of nonlinear activation free and group convolution-based blocks (NAFGCBlocks) and depth-separated stereo cross attention modules (DSSCAMs). The NAFGCBlock improves feature extraction and reduces the number of parameters by removing the simple channel attention mechanism from NAFBlock and using group convolution. The DSSCAM enhances feature fusion and reduces the number of parameters by replacing the 1x1 pointwise convolution in SCAM with a weight-shared 3x3 depthwise convolution. Besides, we propose to incorporate a trainable edge detection operator into NAFRSSR to further improve the model performance. Four variants of NAFRSSR with different sizes, namely, NAFRSSR-Mobile (NAFRSSR-M), NAFRSSR-Tiny (NAFRSSR-T), NAFRSSR-Super (NAFRSSR-S) and NAFRSSR-Base (NAFRSSR-B), are designed, and they all exhibit fewer parameters, higher PSNR/SSIM, and faster speed than the previous state-of-the-art models. In particular, to the best of our knowledge, NAFRSSR-M is the lightest (0.28M parameters) and fastest (50 ms inference time) model achieving an average PSNR/SSIM as high as 24.657 dB/0.7622 on the benchmark datasets. Codes and models will be released at this https URL.
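A back-of-envelope check (our own arithmetic, not from the paper) shows why swapping a 1x1 pointwise convolution for a 3x3 depthwise convolution cuts parameters; weight sharing across modules, as in DSSCAM, reduces the count further still.

```python
def pointwise_params(channels):
    """1x1 conv mixing C input channels into C output channels: C * C weights."""
    return channels * channels

def depthwise_params(channels, kernel=3):
    """Depthwise conv: one k x k kernel per channel, no cross-channel mixing."""
    return channels * kernel * kernel

c = 64  # a typical feature width, chosen for illustration
print(pointwise_params(c), depthwise_params(c))  # → 4096 576
```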
https://arxiv.org/abs/2405.08423
This paper describes our approach to the MEDIQA-CORR shared task, which involves error detection and correction in clinical notes curated by medical professionals. This task involves handling three subtasks: detecting the presence of errors, identifying the specific sentence containing the error, and correcting it. Through our work, we aim to assess the capabilities of Large Language Models (LLMs) trained on a vast corpora of internet data that contain both factual and unreliable information. We propose to comprehensively address all subtasks together, and suggest employing a unique prompt-based in-context learning strategy. We will evaluate its efficacy in this specialized task demanding a combination of general reasoning and medical knowledge. In medical systems where prediction errors can have grave consequences, we propose leveraging self-consistency and ensemble methods to enhance error correction and error detection performance.
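The self-consistency component can be sketched minimally: sample several corrections from the model and keep the majority answer. The sampled outputs below are hypothetical stand-ins for real LLM responses.

```python
from collections import Counter

def self_consistent(samples):
    """Return the most frequent candidate and its agreement ratio."""
    (answer, votes), = Counter(samples).most_common(1)
    return answer, votes / len(samples)

samples = [
    "The patient was prescribed amoxicillin.",
    "The patient was prescribed amoxicillin.",
    "The patient was prescribed ampicillin.",
    "The patient was prescribed amoxicillin.",
    "The patient was prescribed ampicillin.",
]
answer, agreement = self_consistent(samples)
print(answer, agreement)
```

A low agreement ratio can additionally serve as an abstention signal, which matters in clinical settings where a wrong correction is costlier than no correction.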
https://arxiv.org/abs/2405.08373
Reports regarding the misuse of $\textit{Generative AI}$ ($\textit{GenAI}$) to create harmful deepfakes are emerging daily. Recently, defensive watermarking, which enables $\textit{GenAI}$ providers to hide fingerprints in their images to later use for deepfake detection, has been on the rise. Yet, its potential has not been fully explored. We present $\textit{UnMarker}$ -- the first practical $\textit{universal}$ attack on defensive watermarking. Unlike existing attacks, $\textit{UnMarker}$ requires no detector feedback, no unrealistic knowledge of the scheme or similar models, and no advanced denoising pipelines that may not be available. Instead, being the product of an in-depth analysis of the watermarking paradigm revealing that robust schemes must construct their watermarks in the spectral amplitudes, $\textit{UnMarker}$ employs two novel adversarial optimizations to disrupt the spectra of watermarked images, erasing the watermarks. Evaluations against the $\textit{SOTA}$ prove its effectiveness, not only defeating traditional schemes while retaining superior quality compared to existing attacks but also breaking $\textit{semantic}$ watermarks that alter the image's structure, reducing the best detection rate to $43\%$ and rendering them useless. To our knowledge, $\textit{UnMarker}$ is the first practical attack on $\textit{semantic}$ watermarks, which have been deemed the future of robust watermarking. $\textit{UnMarker}$ casts doubt on the very potential of this countermeasure and exposes its paradoxical nature, as designing schemes for robustness inevitably compromises other robustness aspects.
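The attack surface can be illustrated with a toy spectral manipulation (this is not UnMarker's actual adversarial optimization): if a watermark lives in spectral amplitudes, rescaling amplitudes while keeping phases fixed disturbs what an amplitude-based detector reads while preserving the image's coarse structure.

```python
import numpy as np

rng = np.random.default_rng(1)
img = rng.random((32, 32))   # stand-in for a watermarked image

spec = np.fft.fft2(img)
amp, phase = np.abs(spec), np.angle(spec)

# Perturb amplitudes multiplicatively; leave the phase untouched.
amp_attacked = amp * (1.0 + 0.05 * rng.standard_normal(amp.shape))
attacked = np.fft.ifft2(amp_attacked * np.exp(1j * phase)).real

# Pixel values barely change, but the amplitude statistics do.
print(float(np.abs(attacked - img).mean()))
```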
https://arxiv.org/abs/2405.08363
Autonomous Vehicles (AVs) heavily rely on sensors and communication networks like Global Positioning System (GPS) to navigate autonomously. Prior research has indicated that networks like GPS are vulnerable to cyber-attacks such as spoofing and jamming, thus posing serious risks like navigation errors and system failures. These threats are expected to intensify with the widespread deployment of AVs, making it crucial to detect and mitigate such attacks. This paper proposes GPS Intrusion Detection System, or GPS-IDS, an Anomaly Behavior Analysis (ABA)-based intrusion detection framework to detect GPS spoofing attacks on AVs. The framework uses a novel physics-based vehicle behavior model where a GPS navigation model is integrated into the conventional dynamic bicycle model for accurate AV behavior representation. Temporal features derived from this behavior model are analyzed using machine learning to detect normal and abnormal navigation behavior. The performance of the GPS-IDS framework is evaluated on the AV-GPS-Dataset - a real-world dataset collected by the team using an AV testbed. The dataset has been publicly released for the global research community. To the best of our knowledge, this dataset is the first of its kind and will serve as a useful resource to address such security challenges.
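The physics-based anomaly check can be sketched with a simplified model. The paper integrates a GPS model into a *dynamic* bicycle model; the kinematic bicycle model and the spoofed fix below are illustrative stand-ins only.

```python
import math

def bicycle_step(x, y, yaw, v, steer, wheelbase=2.7, dt=0.1):
    """Advance the vehicle pose one time step under the kinematic bicycle model."""
    x += v * math.cos(yaw) * dt
    y += v * math.sin(yaw) * dt
    yaw += v / wheelbase * math.tan(steer) * dt
    return x, y, yaw

# Predict where the vehicle should be after 1 s of driving straight at 10 m/s.
x, y, yaw = 0.0, 0.0, 0.0
for _ in range(10):
    x, y, yaw = bicycle_step(x, y, yaw, v=10.0, steer=0.0)

# A large residual between the model's prediction and the reported GPS fix
# is the kind of behavioral anomaly the framework's ML stage flags.
spoofed_gps = (25.0, 3.0)   # hypothetical spoofed position report
residual = math.hypot(spoofed_gps[0] - x, spoofed_gps[1] - y)
print(round(x, 2), round(residual, 2))
```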
https://arxiv.org/abs/2405.08359
DNN-based watermarking methods are rapidly developing and delivering impressive performances. Recent advances achieve resolution-agnostic image watermarking by reducing the variant resolution watermarking problem to a fixed resolution watermarking problem. However, such a reduction process can potentially introduce artifacts and low robustness. To address this issue, we propose the first, to the best of our knowledge, Resolution-Agnostic Image WaterMarking (RAIMark) framework by watermarking the implicit neural representation (INR) of the image. Unlike previous methods, our method does not rely on the previous reduction process, directly watermarking the continuous signal instead of image pixels and thus achieving resolution-agnostic watermarking. Precisely, given an arbitrary-resolution image, we fit an INR for the target image. As a continuous signal, such an INR can be sampled to obtain images with variant resolutions. Then, we quickly fine-tune the fitted INR to get a watermarked INR conditioned on a binary secret message. A pre-trained watermark decoder extracts the hidden message from any sampled images with arbitrary resolutions. By directly watermarking the INR, we achieve resolution-agnostic watermarking with increased robustness. Extensive experiments show that our method outperforms previous methods with significant improvements, improving bit accuracy by 7%$\sim$29% on average. Notably, we observe that previous methods are vulnerable to at least one watermarking attack (e.g. JPEG, crop, resize), while ours is robust against all watermarking attacks.
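The resolution-agnostic property rests on the fact that an INR is just a function of continuous coordinates, so it can be sampled on any grid. The sketch below uses an analytic function as a stand-in for a fitted (and watermarked) INR; a real INR would be a trained coordinate MLP.

```python
import numpy as np

def inr(coords):
    """Stand-in continuous 'image': intensity as a function of (x, y) in [0,1]^2."""
    x, y = coords[..., 0], coords[..., 1]
    return 0.5 + 0.5 * np.sin(2 * np.pi * x) * np.cos(2 * np.pi * y)

def sample(resolution):
    """Evaluate the continuous signal on a resolution x resolution pixel grid."""
    xs = np.linspace(0.0, 1.0, resolution)
    grid = np.stack(np.meshgrid(xs, xs, indexing="ij"), axis=-1)
    return inr(grid)

# The same underlying signal yields images at any requested resolution,
# so a watermark embedded in the signal survives every sampling choice.
low, high = sample(16), sample(64)
print(low.shape, high.shape)  # → (16, 16) (64, 64)
```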
https://arxiv.org/abs/2405.08340
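The key property the abstract relies on is that an INR is a network mapping continuous coordinates to colors, so one fitted model can be rendered at any resolution. A minimal sketch of that idea (the tiny sinusoidal network and all names here are illustrative, not the paper's architecture):

```python
import numpy as np

# Toy INR: maps continuous (x, y) coordinates in [0, 1]^2 to RGB values.
# A single fitted network can therefore be sampled at any output resolution,
# which is the property RAIMark exploits to watermark the continuous signal
# rather than a fixed pixel grid. Weights are random here for illustration.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 64))   # input layer: (x, y) -> hidden
W2 = rng.normal(size=(64, 3))   # output layer: hidden -> RGB

def inr(coords):
    """Evaluate the toy INR at an (N, 2) array of coordinates."""
    h = np.sin(coords @ W1)      # sinusoidal activation, as in SIREN-style INRs
    return np.tanh(h @ W2)       # RGB values in (-1, 1)

def sample_image(height, width):
    """Render the continuous signal on an arbitrary pixel grid."""
    ys, xs = np.meshgrid(np.linspace(0, 1, height),
                         np.linspace(0, 1, width), indexing="ij")
    coords = np.stack([xs.ravel(), ys.ravel()], axis=1)
    return inr(coords).reshape(height, width, 3)

# The same fitted INR yields images at variant resolutions with no resize step;
# in the paper, fine-tuning this network embeds the secret message so that a
# decoder recovers it from any such sample.
low = sample_image(32, 32)
high = sample_image(128, 96)
```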
Grasp generation for dexterous hands often requires a large number of grasp annotations, especially for functional grasping, which requires the grasp pose to be convenient for the subsequent use of the object. However, annotating high-DoF dexterous hand poses is rather challenging. This prompts us to explore how people manipulate new objects based on past grasp experience. We find that, when grasping new items, people are adept at discovering and leveraging various similarities between objects, including shape, layout, and grasp type. In light of this, we analyze and collect grasp-related similarity relationships among 51 common tool-like object categories and annotate semantic grasp representations for 1,768 objects. These data are organized into a knowledge graph, which supports inference in our proposed cross-category functional grasp synthesis. Through extensive experiments, we demonstrate that this grasp-related knowledge indeed contributes to functional grasp transfer across unknown or entirely new categories of objects. We will publicly release the dataset and code to facilitate future research.
https://arxiv.org/abs/2405.08310
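The knowledge-graph idea can be pictured as categories linked by typed similarity edges (shape, layout, grasp type) that justify transferring annotated grasps to a new category. A hedged sketch, with entirely hypothetical category names and relations:

```python
# Hypothetical sketch of the abstract's idea: grasp-related similarity
# relations between tool categories form a knowledge graph, and walking the
# graph from a target category yields source categories whose annotated
# grasps may transfer. The graph contents below are illustrative only.
similarity_graph = {
    "screwdriver": [("knife", "shape"), ("wrench", "grasp_type")],
    "knife": [("screwdriver", "shape"), ("spatula", "layout")],
    "wrench": [("screwdriver", "grasp_type")],
    "spatula": [("knife", "layout")],
}

def transfer_candidates(category, relation=None):
    """Return neighbour categories whose grasps may transfer to `category`,
    optionally filtered by the kind of similarity justifying the transfer."""
    edges = similarity_graph.get(category, [])
    return [c for c, r in edges if relation is None or r == relation]
```

For example, `transfer_candidates("screwdriver", "shape")` would nominate knife grasps as a shape-based starting point for synthesizing a screwdriver grasp.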
Surgical intervention is crucial to patient healthcare, and many studies have developed advanced algorithms to provide understanding and decision-making assistance for surgeons. Despite great progress, these algorithms are developed for single, specific tasks and scenarios and in practice require the manual combination of different functions, which limits their applicability. An intelligent and versatile surgical assistant is therefore expected to accurately understand the surgeon's intentions and accordingly conduct specific tasks to support the surgical process. In this work, leveraging advanced multimodal large language models (MLLMs), we propose a Versatile Surgery Assistant (VS-Assistant) that can accurately understand the surgeon's intention and complete a series of surgical understanding tasks on demand, e.g., surgical scene analysis, surgical instrument detection, and segmentation. Specifically, to achieve superior surgical multimodal understanding, we devise a mixture-of-projectors (MOP) module to align the surgical MLLM in VS-Assistant, balancing natural and surgical knowledge. Moreover, we devise a surgical Function-Calling Tuning strategy that enables VS-Assistant to understand surgical intentions and thus make a series of surgical function calls on demand to meet the surgeons' needs. Extensive experiments on neurosurgery data confirm that VS-Assistant understands the surgeon's intention more accurately than existing MLLMs, resulting in superior performance on textual analysis and visual tasks. Source code and models will be made public.
https://arxiv.org/abs/2405.08272
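The function-calling pattern the abstract describes — mapping a surgeon's free-text request to specific analysis functions invoked on demand — can be sketched as a simple registry and dispatcher. The registry, the keyword routing, and the function names below are hypothetical stand-ins for the learned MLLM behaviour:

```python
# Hedged sketch of on-demand surgical function calling: registered analysis
# functions are looked up from a request and invoked. In VS-Assistant this
# routing is learned by the tuned MLLM; here a keyword table stands in for it.
FUNCTIONS = {}

def register(name):
    """Decorator adding a callable to the function registry under `name`."""
    def deco(fn):
        FUNCTIONS[name] = fn
        return fn
    return deco

@register("detect_instruments")
def detect_instruments(frame):
    return f"instruments detected in {frame}"

@register("segment_scene")
def segment_scene(frame):
    return f"segmentation mask for {frame}"

# Illustrative intent routing: trigger keyword -> registered function name.
KEYWORDS = {"instrument": "detect_instruments", "segment": "segment_scene"}

def dispatch(request, frame):
    """Invoke every registered function whose trigger keyword appears in the request."""
    calls = [name for kw, name in KEYWORDS.items() if kw in request.lower()]
    return [FUNCTIONS[name](frame) for name in calls]
```

A request mentioning both instruments and segmentation would fan out to both functions, mirroring the "series of surgical function calls on demand" the paper describes.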