Esophageal cancer is one of the most common cancers worldwide and ranks sixth in cancer-related mortality. Accurate computer-assisted diagnosis of cancer progression can help physicians effectively customize personalized treatment plans. Currently, CT-based cancer diagnosis methods have received much attention for their comprehensive ability to examine patients' conditions. However, multi-modality-based methods may introduce information redundancy, leading to underperformance. In addition, efficient and effective interactions between multi-modal representations remain underexplored, and the prognostic correlations within multi-modality features lack insightful investigation. In this work, we introduce a multi-modal heterogeneous graph-based conditional feature-guided diffusion model for lymph node metastasis diagnosis based on CT images as well as clinical measurements and radiomics data. To explore the intricate relationships between multi-modal features, we construct a heterogeneous graph. Following this, a conditional feature-guided diffusion approach is applied to eliminate information redundancy. Moreover, we propose a masked relational representation learning strategy, aiming to uncover the latent prognostic correlations and priorities of primary tumor and lymph node image representations. Various experimental results validate the effectiveness of our proposed method. The code is available at this https URL.
https://arxiv.org/abs/2405.09539
With the advent of image super-resolution (SR) algorithms, how to evaluate the quality of generated SR images has become an urgent task. Although full-reference methods perform well in SR image quality assessment (SR-IQA), their reliance on high-resolution (HR) images limits their practical applicability. Leveraging the available reconstruction information as much as possible, such as low-resolution (LR) images and scale factors, is a promising way to enhance SR-IQA performance without HR images for reference. In this letter, we attempt to evaluate the perceptual quality and reconstruction fidelity of SR images considering LR images and scale factors. Specifically, we propose a novel dual-branch reduced-reference SR-IQA network, i.e., Perception- and Fidelity-aware SR-IQA (PFIQA). The perception-aware branch evaluates the perceptual quality of SR images by leveraging the merits of the global modeling of Vision Transformers (ViT) and the local relation modeling of ResNet, and incorporates the scale factor to enable comprehensive visual perception. Meanwhile, the fidelity-aware branch assesses the reconstruction fidelity between LR and SR images through their visual perception. The combination of the two branches aligns substantially with the human visual system, enabling a comprehensive SR image evaluation. Experimental results indicate that our PFIQA outperforms current state-of-the-art models across three widely used SR-IQA benchmarks. Notably, PFIQA excels in assessing the quality of real-world SR images.
https://arxiv.org/abs/2405.09472
This paper introduces the Global-Local Image Perceptual Score (GLIPS), an image metric designed to assess the photorealistic quality of AI-generated images with a high degree of alignment to human visual perception. Traditional metrics such as FID and KID scores do not align closely with human evaluations. The proposed metric incorporates advanced transformer-based attention mechanisms to assess local similarity and Maximum Mean Discrepancy (MMD) to evaluate global distributional similarity. To evaluate the performance of GLIPS, we conducted a human study on photorealistic image quality. Comprehensive tests across various generative models demonstrate that GLIPS consistently outperforms existing metrics like FID, SSIM, and MS-SSIM in terms of correlation with human scores. Additionally, we introduce the Interpolative Binning Scale (IBS), a refined scaling method that enhances the interpretability of metric scores by aligning them more closely with human evaluative standards. The proposed metric and scaling approach not only provide more reliable assessments of AI-generated images but also suggest pathways for future enhancements in image generation technologies.
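GLIPS's global term is the Maximum Mean Discrepancy between feature distributions. As a rough sketch of that statistic (the RBF kernel and its bandwidth are assumptions here, not details from the paper), a biased squared MMD can be computed as:

```python
import numpy as np

def mmd2_rbf(X, Y, gamma=1.0):
    """Biased squared MMD between sample sets X and Y under an RBF kernel."""
    def k(A, B):
        # Pairwise squared distances, then RBF kernel values.
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 8))
Y = rng.normal(loc=2.0, size=(64, 8))
zero_gap = mmd2_rbf(X, X)   # identical samples give 0
shift_gap = mmd2_rbf(X, Y)  # shifted distribution gives a positive gap
```

In a GLIPS-like setting, X and Y would be feature embeddings of real and generated images rather than raw pixels.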
https://arxiv.org/abs/2405.09426
Recent years have witnessed the rapid development of short videos, which usually contain both visual and audio modalities. Background music is important to short videos, as it can significantly influence the emotions of viewers. However, at present, the background music of a short video is generally chosen by the video producer, and automatic music recommendation methods for short videos are lacking. This paper introduces MVBind, an innovative Music-Video embedding space Binding model for cross-modal retrieval. MVBind operates as a self-supervised approach, acquiring inherent knowledge of intermodal relationships directly from data, without the need for manual annotations. Additionally, to compensate for the lack of a corresponding music-video pair dataset for short videos, we construct SVM-10K (Short Video with Music-10K), a dataset that mainly consists of meticulously selected short videos. On this dataset, MVBind achieves significantly improved performance compared to other baseline methods. The constructed dataset and code will be released to facilitate future research.
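The abstract does not specify MVBind's training objective; a symmetric InfoNCE contrastive loss is the usual choice for binding two embedding spaces self-supervisedly, sketched here under that assumption:

```python
import numpy as np

def info_nce(video_emb, music_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired video/music embeddings."""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    m = music_emb / np.linalg.norm(music_emb, axis=1, keepdims=True)
    logits = v @ m.T / temperature   # cosine similarities, sharpened
    labels = np.arange(len(v))       # matching pairs lie on the diagonal
    def xent(lg):
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()
    # Average the video-to-music and music-to-video directions.
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(1)
paired = rng.normal(size=(16, 32))
aligned_loss = info_nce(paired, paired + 0.01 * rng.normal(size=(16, 32)))
random_loss = info_nce(paired, rng.normal(size=(16, 32)))
```

Well-bound pairs drive the loss toward zero, while unrelated music/video embeddings keep it near log of the batch size.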
https://arxiv.org/abs/2405.09286
The problem of hallucination and omission, a long-standing problem in machine translation (MT), is more pronounced when a large language model (LLM) is used in MT because an LLM itself is susceptible to these phenomena. In this work, we mitigate the problem in an LLM-based MT model by guiding it to better word alignment. We first study the correlation between word alignment and the phenomena of hallucination and omission in MT. Then we propose to utilize word alignment as preference to optimize the LLM-based MT model. The preference data are constructed by selecting chosen and rejected translations from multiple MT tools. Subsequently, direct preference optimization is used to optimize the LLM-based model towards the preference signal. Given the absence of evaluators specifically designed for hallucination and omission in MT, we further propose selecting hard instances and utilizing GPT-4 to directly evaluate the performance of the models in mitigating these issues. We verify the rationality of these designed evaluation methods by experiments, followed by extensive results demonstrating the effectiveness of word alignment-based preference optimization to mitigate hallucination and omission.
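The direct preference optimization step can be illustrated with the standard DPO loss on one chosen/rejected translation pair (a generic sketch; the β value and log-likelihoods below are placeholders, not numbers from the paper):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) translation pair.

    logp_* are the policy model's sequence log-likelihoods; ref_* are the
    frozen reference model's log-likelihoods for the same sequences.
    """
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# The loss shrinks as the policy favors the chosen (better word-aligned)
# translation more strongly than the reference model does.
easy = dpo_loss(-10.0, -30.0, -20.0, -20.0)  # policy prefers chosen
hard = dpo_loss(-30.0, -10.0, -20.0, -20.0)  # policy prefers rejected
```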
https://arxiv.org/abs/2405.09223
In this paper, we present the findings of our Project ALPINE which stands for ``Autoregressive Learning for Planning In NEtworks." Project ALPINE initiates a theoretical investigation into the development of planning capabilities in Transformer-based language models through their autoregressive learning mechanisms, aiming to identify any potential limitations in their planning abilities. We abstract planning as a network path-finding task where the objective is to generate a valid path from a specified source node to a designated target node. In terms of expressiveness, we show that the Transformer is capable of executing path-finding by embedding the adjacency and reachability matrices within its weights. Our theoretical analysis of the gradient-based learning dynamic of the Transformer reveals that the Transformer is capable of learning both the adjacency matrix and a limited form of the reachability matrix. These theoretical insights are then validated through experiments, which demonstrate that the Transformer indeed learns the adjacency matrix and an incomplete reachability matrix, which aligns with the predictions made in our theoretical analysis. Additionally, when applying our methodology to a real-world planning benchmark, called Blocksworld, our observations remain consistent. Our theoretical and empirical analyses further unveil a potential limitation of Transformer in path-finding: it cannot identify reachability relationships through transitivity, and thus would fail when path concatenation is needed to generate a path. In summary, our findings shed new light on how the internal mechanisms of autoregressive learning enable planning in networks. This study may contribute to our understanding of the general planning capabilities in other related domains.
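The decision rule the analysis attributes to the Transformer, picking a next node that is adjacent to the current one and from which the target is reachable, can be mimicked explicitly (an illustrative simulation, not the model itself); zeroing an entry of the reachability matrix reproduces the concatenation failure mode:

```python
import numpy as np

def find_path(adj, reach, source, target, max_steps=10):
    """Greedy path generation driven by adjacency and reachability matrices."""
    path, cur = [source], source
    for _ in range(max_steps):
        if cur == target:
            return path
        # Next-node candidates: adjacent to cur, with target reachable from them.
        candidates = [u for u in range(len(adj))
                      if adj[cur][u] and (u == target or reach[u][target])]
        if not candidates:
            return None  # mirrors the failure when reachability is incomplete
        cur = candidates[0]
        path.append(cur)
    return None

# Tiny DAG: 0 -> 1 -> 3, plus a dead-end edge 0 -> 2.
adj = np.array([[0, 1, 1, 0],
                [0, 0, 0, 1],
                [0, 0, 0, 0],
                [0, 0, 0, 0]])
reach = np.array([[0, 1, 1, 1],
                  [0, 0, 0, 1],
                  [0, 0, 0, 0],
                  [0, 0, 0, 0]])
path = find_path(adj, reach, 0, 3)  # [0, 1, 3]
```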
https://arxiv.org/abs/2405.09220
Language models (LMs) as conversational assistants recently became popular tools that help people accomplish a variety of tasks. These typically result from adapting LMs pretrained on general domain text sequences through further instruction-tuning and possibly preference optimisation methods. The evaluation of such LMs would ideally be performed using human judgement, however, this is not scalable. On the other hand, automatic evaluation featuring auxiliary LMs as judges and/or knowledge-based tasks is scalable but struggles with assessing conversational ability and adherence to instructions. To help accelerate the development of LMs as conversational assistants, we propose a novel automatic evaluation task: HumanRankEval (HRE). It consists of a large-scale, diverse and high-quality set of questions, each with several answers authored and scored by humans. To perform evaluation, HRE ranks these answers based on their log-likelihood under the LM's distribution, and subsequently calculates their correlation with the corresponding human rankings. We support HRE's efficacy by investigating how efficiently it separates pretrained and instruction-tuned LMs of various sizes. We show that HRE correlates well with human judgements and is particularly responsive to model changes following instruction-tuning.
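HRE's scoring step, ranking answers by LM log-likelihood and correlating with the human ranking, might look as follows (Spearman rank correlation is an assumption; the abstract does not name the coefficient used):

```python
import numpy as np

def rankdata(x):
    """1-based ranks of the values in x (ties broken by position)."""
    order = np.argsort(x)
    ranks = np.empty(len(x))
    ranks[order] = np.arange(1, len(x) + 1)
    return ranks

def hre_score(lm_loglik, human_scores):
    """Correlation between the LM ranking (by log-likelihood) and the
    human ranking (by score) of the answers to one question."""
    a = rankdata(np.asarray(lm_loglik, dtype=float))
    b = rankdata(np.asarray(human_scores, dtype=float))
    a, b = a - a.mean(), b - b.mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

# Four answers to one question: here the LM agrees with humans exactly.
agree = hre_score([-5.0, -2.0, -9.0, -4.0], [3, 9, 1, 5])
```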
https://arxiv.org/abs/2405.09186
Today's analog/mixed-signal (AMS) integrated circuit (IC) designs demand substantial manual intervention. The advent of multimodal large language models (MLLMs) has unveiled significant potential across various fields, suggesting their applicability in streamlining large-scale AMS IC design as well. A bottleneck in employing MLLMs for automatic AMS circuit generation is the absence of a comprehensive dataset delineating the schematic-netlist relationship. We therefore design an automatic technique for converting schematics into netlists, and create dataset AMSNet, encompassing transistor-level schematics and corresponding SPICE format netlists. With a growing size, AMSNet can significantly facilitate exploration of MLLM applications in AMS circuit design. We have made an initial set of netlists public, and will make both our netlist generation tool and the full dataset available upon publishing of this paper.
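The schematic-to-netlist conversion can be illustrated with a toy SPICE emitter (the device schema and node naming below are hypothetical, not AMSNet's actual intermediate representation):

```python
def to_spice_netlist(devices):
    """Serialize a list of device dicts into SPICE-format netlist lines.

    Each dict carries the device name, its terminal nodes in SPICE order,
    and its model name; this schema is illustrative only.
    """
    lines = []
    for d in devices:
        lines.append(" ".join([d["name"], *d["nodes"], d["model"]]))
    return "\n".join(lines)

# A two-transistor CMOS inverter sketch (drain, gate, source, bulk).
inverter = [
    {"name": "M1", "nodes": ["out", "in", "vdd", "vdd"], "model": "pmos"},
    {"name": "M2", "nodes": ["out", "in", "gnd", "gnd"], "model": "nmos"},
]
netlist = to_spice_netlist(inverter)
```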
https://arxiv.org/abs/2405.09045
Referring Image Segmentation (RIS) consistently requires language and appearance semantics to better understand each other. This need becomes especially acute in hard cases. To achieve it, existing works tend to resort to various trans-representing mechanisms to directly feed forward language semantics along the main RGB branch, which, however, leaves the referent distribution weakly mined in space and lets non-referent semantics contaminate the channels. In this paper, we propose Spatial Semantic Recurrent Mining (S²RM) to achieve high-quality cross-modality fusion. It follows a three-stage working strategy: distributing language features, spatial semantic recurrent coparsing, and parsed-semantic balancing. During fusion, S²RM first generates a constraint-weak yet distribution-aware language feature, then bundles the features of each row and column from rotated features of one modality context to recurrently correlate the relevant semantics contained in features from the other modality context, and finally resorts to self-distilled weights to weigh the contributions of the different parsed semantics. Via coparsing, S²RM transports information from the near and remote slice layers of the generator context to the current slice layer of the parsed context, enabling better bidirectional and structured modeling of global relationships. Besides, we also propose a Cross-scale Abstract Semantic Guided Decoder (CASG) to emphasize the foreground of the referent, finally integrating features of different granularities at a comparatively low cost. Extensive experimental results on four current challenging datasets show that our proposed method performs favorably against other state-of-the-art algorithms.
https://arxiv.org/abs/2405.09006
Large-scale Vision-Language Models (VLMs) have demonstrated exceptional performance in natural vision tasks, motivating researchers across domains to explore domain-specific VLMs. However, the construction of powerful domain-specific VLMs demands vast amounts of annotated data, substantial electrical energy, and computing resources, primarily accessible to industry, yet hindering VLM research in academia. To address this challenge and foster sustainable and equitable VLM research, we present the Generalized Domain Prompt Learning (GDPL) framework. GDPL facilitates the transfer of VLMs' robust recognition capabilities from natural vision to specialized domains, without the need for extensive data or resources. By leveraging small-scale domain-specific foundation models and minimal prompt samples, GDPL empowers the language branch with domain knowledge through quaternion networks, uncovering cross-modal relationships between domain-specific vision features and natural vision-based contextual embeddings. Simultaneously, GDPL guides the vision branch into specific domains through hierarchical propagation of generated vision prompt features, grounded in well-matched vision-language relations. Furthermore, to fully harness the domain adaptation potential of VLMs, we introduce a novel low-rank adaptation approach. Extensive experiments across diverse domains like remote sensing, medical imaging, geology, Synthetic Aperture Radar, and fluid dynamics, validate the efficacy of GDPL, demonstrating its ability to achieve state-of-the-art domain recognition performance in a prompt learning paradigm. Our framework paves the way for sustainable and inclusive VLM research, transcending the barriers between academia and industry.
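GDPL's quaternion networks build on the Hamilton product; the underlying algebra, independent of GDPL's actual layers, is:

```python
import numpy as np

def hamilton(p, q):
    """Hamilton product of quaternions p = (w, x, y, z) and q."""
    w1, x1, y1, z1 = p
    w2, x2, y2, z2 = q
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

# Basis quaternions: the defining identity i*j = k holds, and the
# product is non-commutative (j*i = -k), which is what lets quaternion
# layers entangle the four feature components.
i = np.array([0.0, 1.0, 0.0, 0.0])
j = np.array([0.0, 0.0, 1.0, 0.0])
k = np.array([0.0, 0.0, 0.0, 1.0])
```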
https://arxiv.org/abs/2405.08668
With recent advances in deep learning, numerous algorithms have been developed to enhance video quality, reduce visual artefacts and improve perceptual quality. However, little research has been reported on the quality assessment of enhanced content - the evaluation of enhancement methods is often based on quality metrics that were designed for compression applications. In this paper, we propose a novel blind deep video quality assessment (VQA) method specifically for enhanced video content. It employs a new Recurrent Memory Transformer (RMT) based network architecture to obtain video quality representations, which is optimised through a novel content-quality-aware contrastive learning strategy based on a new database containing 13K training patches with enhanced content. The extracted quality representations are then combined through linear regression to generate video-level quality indices. The proposed method, RMT-BVQA, has been evaluated on the VDPVE (VQA Dataset for Perceptual Video Enhancement) database through a five-fold cross validation. The results show its superior correlation performance when compared to ten existing no-reference quality metrics.
https://arxiv.org/abs/2405.08621
The nature of diversity in real-world environments necessitates neural network models to expand from closed category settings to accommodate novel emerging categories. In this paper, we study the open-vocabulary object detection (OVD), which facilitates the detection of novel object classes under the supervision of only base annotations and open-vocabulary knowledge. However, we find that the inadequacy of neighboring relationships between regions during the alignment process inevitably constrains the performance on recent distillation-based OVD strategies. To this end, we propose Neighboring Region Attention Alignment (NRAA), which performs alignment within the attention mechanism of a set of neighboring regions to boost the open-vocabulary inference. Specifically, for a given proposal region, we randomly explore the neighboring boxes and conduct our proposed neighboring region attention (NRA) mechanism to extract relationship information. Then, this interaction information is seamlessly provided into the distillation procedure to assist the alignment between the detector and the pre-trained vision-language models (VLMs). Extensive experiments validate that our proposed model exhibits superior performance on open-vocabulary benchmarks.
https://arxiv.org/abs/2405.08593
This article explores the adaptive relationship between Encoder Layers and Decoder Layers using the SOTA model Helsinki-NLP/opus-mt-de-en, which translates German to English. The specific method involves introducing a bias-free fully connected layer between the Encoder and Decoder, with different initializations of the layer's weights, and observing the outcomes of fine-tuning versus retraining. Four experiments were conducted in total. The results suggest that directly modifying the pre-trained model structure for fine-tuning yields suboptimal performance. However, upon observing the outcomes of the experiments with retraining, this structural adjustment shows significant potential.
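The structural change under study, a bias-free fully connected layer between encoder and decoder with selectable weight initialization, can be sketched as follows (shapes and the identity/random options are illustrative; this is not the Helsinki-NLP/opus-mt-de-en code):

```python
import numpy as np

class BiasFreeBridge:
    """A bias-free linear map inserted between encoder output and decoder
    input, with a choice of weight initialization."""
    def __init__(self, dim, init="identity", seed=0):
        if init == "identity":
            self.W = np.eye(dim)  # starts as a no-op bridge
        else:
            rng = np.random.default_rng(seed)
            self.W = rng.normal(scale=dim ** -0.5, size=(dim, dim))

    def __call__(self, encoder_states):
        return encoder_states @ self.W  # no bias term, by design

hidden = np.random.default_rng(1).normal(size=(5, 512))  # (seq_len, d_model)
bridge = BiasFreeBridge(512, init="identity")
```

With identity initialization the pre-trained encoder-decoder behavior is preserved at step zero, which is one natural starting point for the fine-tuning-versus-retraining comparison.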
https://arxiv.org/abs/2405.08570
In visual tasks, large teacher models capture essential features and deep information, enhancing performance. However, distilling this information into smaller student models often leads to performance loss due to structural differences and capacity limitations. To tackle this, we propose a distillation framework based on graph knowledge, including a multi-level feature alignment strategy and an attention-guided mechanism to provide a targeted learning trajectory for the student model. We emphasize spectral embedding (SE) as a key technique in our distillation process, which merges the student's feature space with the relational knowledge and structural complexities similar to the teacher network. This method captures the teacher's understanding in a graph-based representation, enabling the student model to more accurately mimic the complex structural dependencies present in the teacher model. Compared to methods that focus only on specific distillation areas, our strategy not only considers key features within the teacher model but also endeavors to capture the relationships and interactions among feature sets, encoding these complex pieces of information into a graph structure to understand and utilize the dynamic relationships among these pieces of information from a global perspective. Experiments show that our method outperforms previous feature distillation methods on the CIFAR-100, MS-COCO, and Pascal VOC datasets, proving its efficiency and applicability.
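Spectral embedding here means eigen-decomposing a graph Laplacian built from pairwise feature affinities; a generic version (not the paper's exact pipeline; the affinity kernel is an assumption) is:

```python
import numpy as np

def spectral_embedding(features, k=2, gamma=0.5):
    """Embed samples into the k smallest non-trivial eigenvectors of the
    normalized graph Laplacian of their affinity graph."""
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    A = np.exp(-gamma * d2)  # RBF affinity graph over the features
    np.fill_diagonal(A, 0.0)
    deg = A.sum(1)
    D_inv_sqrt = np.diag(deg ** -0.5)
    L = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt  # normalized Laplacian
    vals, vecs = np.linalg.eigh(L)  # ascending eigenvalues
    return vecs[:, 1:k + 1]         # skip the trivial eigenvector

X = np.random.default_rng(0).normal(size=(10, 4))
emb = spectral_embedding(X, k=2)
```

In the distillation setting, the graph would be built from teacher features, and the student would be trained to reproduce the resulting relational structure.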
https://arxiv.org/abs/2405.08547
Recent advances in knowledge graph embedding (KGE) rely on Euclidean/hyperbolic orthogonal relation transformations to model intrinsic logical patterns and topological structures. However, existing approaches are confined to rigid relational orthogonalization with restricted dimension and homogeneous geometry, leading to deficient modeling capability. In this work, we move beyond these approaches in terms of both dimension and geometry by introducing a powerful framework named GoldE, which features a universal orthogonal parameterization based on a generalized form of Householder reflection. Such parameterization can naturally achieve dimensional extension and geometric unification with theoretical guarantees, enabling our framework to simultaneously capture crucial logical patterns and inherent topological heterogeneity of knowledge graphs. Empirically, GoldE achieves state-of-the-art performance on three standard benchmarks. Codes are available at this https URL.
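GoldE generalizes the Householder reflection; the classical building block H = I - 2vvᵀ/‖v‖² is orthogonal, and composing several reflections stays orthogonal in any dimension, which is the property the parameterization exploits (vanilla construction shown, not GoldE's generalized form):

```python
import numpy as np

def householder(v):
    """Householder reflection H = I - 2 v v^T / ||v||^2 (always orthogonal)."""
    v = v / np.linalg.norm(v)
    return np.eye(len(v)) - 2.0 * np.outer(v, v)

def orthogonal_from_reflections(vectors):
    """Compose several reflections into one orthogonal relation matrix."""
    Q = np.eye(len(vectors[0]))
    for v in vectors:
        Q = householder(v) @ Q
    return Q

rng = np.random.default_rng(0)
Q = orthogonal_from_reflections([rng.normal(size=6) for _ in range(3)])
```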
https://arxiv.org/abs/2405.08540
The IoT and Business Process Management (BPM) communities co-exist in many shared application domains, such as manufacturing and healthcare. The IoT community has a strong focus on hardware, connectivity and data; the BPM community focuses mainly on finding, controlling, and enhancing the structured interactions among the IoT devices in processes. While the field of Process Mining deals with the extraction of process models and process analytics from process event logs, the data produced by IoT sensors often is at a lower granularity than these process-level events. The fundamental questions about extracting and abstracting process-related data from streams of IoT sensor values are: (1) Which sensor values can be clustered together as part of process events?, (2) Which sensor values signify the start and end of such events?, (3) Which sensor values are related but not essential? This work proposes a framework to semi-automatically perform a set of structured steps to convert low-level IoT sensor data into higher-level process events that are suitable for process mining. The framework is meant to provide a generic sequence of abstract steps to guide the event extraction, abstraction, and correlation, with variation points for plugging in specific analysis techniques and algorithms for each step. To assess the completeness of the framework, we present a set of challenges, how they can be tackled through the framework, and an example on how to instantiate the framework in a real-world demonstration from the field of smart manufacturing. Based on this framework, future research can be conducted in a structured manner through refining and improving individual steps.
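One way to instantiate the framework's abstraction step, clustering contiguous high-activity sensor readings into process events with start and end points, is simple thresholding (thresholding is just one pluggable technique for the framework's variation points):

```python
def extract_events(readings, threshold):
    """Abstract a low-level sensor stream into (start, end) event intervals:
    each contiguous run of readings above the threshold becomes one event."""
    events, start = [], None
    for i, value in enumerate(readings):
        if value > threshold and start is None:
            start = i                      # event begins
        elif value <= threshold and start is not None:
            events.append((start, i - 1))  # event ends
            start = None
    if start is not None:                  # event still open at stream end
        events.append((start, len(readings) - 1))
    return events

# A power sensor on a machine: two bursts of activity -> two process events.
stream = [0.1, 0.2, 3.4, 3.6, 3.1, 0.2, 0.1, 4.0, 4.2, 0.3]
events = extract_events(stream, threshold=1.0)
```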
https://arxiv.org/abs/2405.08528
In recent years, deep learning has greatly streamlined the process of generating realistic fake face images. Aware of the dangers, researchers have developed various tools to spot these counterfeits. Yet none asked the fundamental question: What digital manipulations make a real photographic face image fake, while others do not? In this paper, we put face forgery in a semantic context and define that computational methods that alter semantic face attributes to exceed human discrimination thresholds are sources of face forgery. Guided by our new definition, we construct a large face forgery image dataset, where each image is associated with a set of labels organized in a hierarchical graph. Our dataset enables two new testing protocols to probe the generalization of face forgery detectors. Moreover, we propose a semantics-oriented face forgery detection method that captures label relations and prioritizes the primary task (i.e., real or fake face detection). We show that the proposed dataset successfully exposes the weaknesses of current detectors as the test set and consistently improves their generalizability as the training set. Additionally, we demonstrate the superiority of our semantics-oriented method over traditional binary and multi-class classification-based detectors.
https://arxiv.org/abs/2405.08487
Traditional recommendation proposals, including content-based and collaborative filtering, usually focus on similarity between items or users. Existing approaches lack ways of introducing unexpectedness into recommendations, prioritizing globally popular items over exposing users to unforeseen items. This investigation aims to design and evaluate a novel layer on top of recommender systems suited to incorporate relational information and suggest items with a user-defined degree of surprise. We propose a Knowledge Graph (KG) based recommender system by encoding user interactions on item catalogs. Our study explores whether network-level metrics on KGs can influence the degree of surprise in recommendations. We hypothesize that surprisingness correlates with certain network metrics, treating user profiles as subgraphs within a larger catalog KG. The achieved solution reranks recommendations based on their impact on structural graph metrics. Our research contributes to optimizing recommendations to reflect the metrics. We experimentally evaluate our approach on two datasets of LastFM listening histories and synthetic Netflix viewing profiles. We find that reranking items based on complex network metrics leads to a more unexpected and surprising composition of recommendation lists.
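The reranking idea can be illustrated with node degree as a stand-in surprise metric (the paper studies richer network metrics; here, low degree in the catalog knowledge graph is simply treated as "more surprising"):

```python
def rerank_by_surprise(candidates, degree):
    """Rerank candidate items so that items with low degree in the catalog
    knowledge graph (less globally popular) surface first."""
    return sorted(candidates, key=lambda item: degree.get(item, 0))

# Hypothetical catalog-KG degrees for three candidate tracks.
degree = {"hit_song": 950, "deep_cut": 12, "album_track": 140}
ranked = rerank_by_surprise(["hit_song", "album_track", "deep_cut"], degree)
```

A user-defined degree of surprise could then be obtained by interpolating between this ordering and the base recommender's ordering.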
https://arxiv.org/abs/2405.08465
Few-shot segmentation remains challenging due to the limited labeling information available for unseen classes. Most previous approaches rely on extracting high-level feature maps from the frozen visual encoder to compute pixel-wise similarity as a key prior guidance for the decoder. However, such a prior representation suffers from coarse granularity and poor generalization to new classes, since these high-level feature maps have an obvious category bias. In this work, we propose to replace the visual prior representation with the visual-text alignment capacity to capture more reliable guidance and enhance model generalization. Specifically, we design two kinds of training-free prior information generation strategies that attempt to utilize the semantic alignment capability of the Contrastive Language-Image Pre-training model (CLIP) to locate the target class. Besides, to acquire more accurate prior guidance, we build a high-order relationship of attention maps and utilize it to refine the initial prior information. Experiments on both the PASCAL-5^i and COCO-20^i datasets show that our method obtains a substantial improvement and reaches new state-of-the-art performance.
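At its core, training-free prior generation with CLIP reduces to a cosine-similarity map between a class text embedding and per-pixel visual features (a generic sketch; the feature shapes and normalization here are assumptions, not the paper's exact strategy):

```python
import numpy as np

def prior_map(pixel_feats, text_emb):
    """Cosine similarity between each pixel feature and a class text
    embedding, yielding a coarse localization prior for the target class."""
    p = pixel_feats / np.linalg.norm(pixel_feats, axis=-1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    return p @ t  # (H, W) similarity map

H, W, C = 4, 4, 8
rng = np.random.default_rng(0)
feats = rng.normal(size=(H, W, C))
text = rng.normal(size=C)
feats[2, 3] = 5.0 * text  # plant one pixel that strongly matches the class
prior = prior_map(feats, text)
```

The resulting map peaks where the visual features align with the class text, which is what makes it usable as a decoder prior without any training.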
https://arxiv.org/abs/2405.08458
In the training process of Deep Reinforcement Learning (DRL), agents require repetitive interactions with the environment. With increasing training volume and model complexity, it remains a challenging problem to enhance the data utilization and explainability of DRL training. This paper addresses these challenges by focusing on the temporal correlations within the time dimension of time series. We propose a novel approach to segment multivariate time series into meaningful subsequences and represent the time series based on these subsequences. Furthermore, the subsequences are employed for causal inference to identify fundamental causal factors that significantly impact training outcomes. We design a module to provide feedback on the causality during DRL training. Several experiments demonstrate the feasibility of our approach in common environments, confirming its ability to enhance the effectiveness of DRL training and impart a certain level of explainability to the training process. Additionally, we extend our approach with the prioritized experience replay algorithm, and experimental results demonstrate its continued effectiveness.
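A minimal version of the subsequence segmentation step, splitting wherever a new sample departs from the running segment mean (one of many possible segmentation rules; the paper's actual algorithm is not specified in the abstract):

```python
import numpy as np

def segment(series, tol=1.0):
    """Split a multivariate series of shape (T, d) into subsequences at
    points where the next sample departs from the running segment mean
    by more than tol (Euclidean distance)."""
    segments, start = [], 0
    for t in range(1, len(series)):
        seg_mean = series[start:t].mean(axis=0)
        if np.linalg.norm(series[t] - seg_mean) > tol:
            segments.append((start, t - 1))  # close the current subsequence
            start = t
    segments.append((start, len(series) - 1))
    return segments

# A 2-D trajectory with one clear regime change.
series = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 4.9]])
parts = segment(series, tol=1.0)
```

Each (start, end) subsequence can then serve as a unit for the causal-inference step described above.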
https://arxiv.org/abs/2405.08380