Esophageal cancer is one of the most common cancers worldwide and ranks sixth in cancer-related mortality. Accurate computer-assisted diagnosis of cancer progression can help physicians effectively customize personalized treatment plans. Currently, CT-based cancer diagnosis methods have received much attention for their comprehensive ability to examine patients' conditions. However, multi-modal methods may introduce information redundancy, leading to underperformance. In addition, efficient and effective interactions between multi-modal representations remain underexplored, with little insight into the prognostic correlations among multi-modality features. In this work, we introduce a multi-modal heterogeneous graph-based conditional feature-guided diffusion model for lymph node metastasis diagnosis based on CT images as well as clinical measurements and radiomics data. To explore the intricate relationships between multi-modal features, we construct a heterogeneous graph. A conditional feature-guided diffusion approach is then applied to eliminate information redundancy. Moreover, we propose a masked relational representation learning strategy, aiming to uncover the latent prognostic correlations and priorities of primary tumor and lymph node image representations. Extensive experimental results validate the effectiveness of our proposed method. The code is available at this https URL.
https://arxiv.org/abs/2405.09539
A recent study by De et al. (2022) has reported that large-scale representation learning through pre-training on a public dataset significantly enhances differentially private (DP) learning in downstream tasks, despite the high dimensionality of the feature space. To theoretically explain this phenomenon, we consider the setting of a layer-peeled model in representation learning, which results in interesting phenomena related to learned features in deep learning and transfer learning, known as Neural Collapse (NC). Within the framework of NC, we establish an error bound indicating that the misclassification error is independent of dimension when the distance between actual features and the ideal ones is smaller than a threshold. Additionally, the quality of the features in the last layer is empirically evaluated under different pre-trained models within the framework of NC, showing that a more powerful transformer leads to a better feature representation. Furthermore, we reveal that DP fine-tuning is less robust compared to fine-tuning without DP, particularly in the presence of perturbations. These observations are supported by both theoretical analyses and experimental evaluation. Moreover, to enhance the robustness of DP fine-tuning, we suggest several strategies, such as feature normalization or employing dimension reduction methods like Principal Component Analysis (PCA). Empirically, we demonstrate a significant improvement in testing accuracy by conducting PCA on the last-layer features.
https://arxiv.org/abs/2405.08920
This paper addresses the problem of self-supervised general-purpose audio representation learning. We explore the use of Joint-Embedding Predictive Architectures (JEPA) for this task, which consists of splitting an input mel-spectrogram into two parts (context and target), computing neural representations for each, and training the neural network to predict the target representations from the context representations. We investigate several design choices within this framework and study their influence through extensive experiments by evaluating our models on various audio classification benchmarks, including environmental sounds, speech and music downstream tasks. We focus notably on which part of the input data is used as context or target and show experimentally that it significantly impacts the model's quality. In particular, we notice that some effective design choices in the image domain lead to poor performance on audio, thus highlighting major differences between these two modalities.
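The context/target split at the heart of the JEPA setup can be sketched as follows. This is a toy illustration under stated assumptions: a contiguous time-wise split and a 60% context fraction are placeholders, not the paper's actual masking design.

```python
import numpy as np

def split_context_target(mel, context_frac=0.6, rng=None):
    """Split a mel-spectrogram along time into context and target blocks.

    mel has shape (n_mels, n_frames). The context is a random contiguous
    block of frames; everything else becomes the target. Both the
    time-wise split and the 60% fraction are illustrative assumptions.
    """
    rng = rng or np.random.default_rng(0)
    n_mels, n_frames = mel.shape
    n_ctx = int(n_frames * context_frac)
    start = rng.integers(0, n_frames - n_ctx + 1)
    context_mask = np.zeros(n_frames, dtype=bool)
    context_mask[start:start + n_ctx] = True
    context = mel[:, context_mask]
    target = mel[:, ~context_mask]
    return context, target

mel = np.random.default_rng(1).normal(size=(80, 100))  # 80 mel bins, 100 frames
ctx, tgt = split_context_target(mel)
print(ctx.shape, tgt.shape)  # (80, 60) (80, 40)
```

In the full architecture, separate encoders would embed `ctx` and `tgt`, and a predictor would map context representations to target representations.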
https://arxiv.org/abs/2405.08679
Integrating the different data modalities of cancer patients can significantly improve the performance of patient survival prediction. However, most existing methods ignore the simultaneous utilization of rich semantic features at different scales in pathology images. Moreover, when collecting multimodal data and extracting features, intra-modality data may be missing, introducing noise into the multimodal data. To address these challenges, this paper proposes a new end-to-end framework, FORESEE, for robustly predicting patient survival by mining multimodal information. Specifically, a cross-fusion transformer effectively utilizes features at the cellular level, tissue level, and tumor heterogeneity level and correlates them with prognosis through a cross-scale feature cross-fusion method, enhancing the representation ability of pathological image features. Second, a hybrid attention encoder (HAE) uses a denoising contextual attention module to obtain the contextual relationship features and local detail features of the molecular data, while its channel attention module obtains global features of the molecular data. Furthermore, to address the issue of missing information within modalities, we propose an asymmetrically masked triplet masked autoencoder to reconstruct the lost information within modalities. Extensive experiments demonstrate the superiority of our method over state-of-the-art methods on four benchmark datasets in both complete and missing settings.
https://arxiv.org/abs/2405.07702
In the character animation field, modern supervised keyframe interpolation models have demonstrated exceptional performance in constructing natural human motions from sparse pose definitions. As supervised models, large motion datasets are necessary to facilitate the learning process; however, since motion is represented with fixed hierarchical skeletons, such datasets are incompatible for skeletons outside the datasets' native configurations. Consequently, the expected availability of a motion dataset for desired skeletons severely hinders the feasibility of learned interpolation in practice. To combat this limitation, we propose Point Cloud-based Motion Representation Learning (PC-MRL), an unsupervised approach to enabling cross-compatibility between skeletons for motion interpolation learning. PC-MRL consists of a skeleton obfuscation strategy using temporal point cloud sampling, and an unsupervised skeleton reconstruction method from point clouds. We devise a temporal point-wise K-nearest neighbors loss for unsupervised learning. Moreover, we propose First-frame Offset Quaternion (FOQ) and Rest Pose Augmentation (RPA) strategies to overcome necessary limitations of our unsupervised point cloud-to-skeletal motion process. Comprehensive experiments demonstrate the effectiveness of PC-MRL in motion interpolation for desired skeletons without supervision from native datasets.
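A point-wise K-nearest-neighbors loss of the kind described can be sketched per frame as below. This is a simplified stand-in under stated assumptions: it handles a single frame's point clouds, uses squared Euclidean distances, and omits the symmetric term and temporal structure a full method would need.

```python
import numpy as np

def knn_loss(pred, target, k=3):
    """Mean distance from each predicted point to its k nearest target points.

    pred and target have shape (n_points, 3). A simplified, per-frame
    version of a point-wise K-nearest-neighbors loss; temporal weighting
    and the symmetric direction are intentionally omitted here.
    """
    # Pairwise squared distances: (n_pred, n_target)
    d2 = ((pred[:, None, :] - target[None, :, :]) ** 2).sum(-1)
    knn = np.sort(d2, axis=1)[:, :k]          # k smallest distances per point
    return knn.mean()

rng = np.random.default_rng(0)
cloud = rng.normal(size=(64, 3))
assert knn_loss(cloud, cloud, k=1) == 0.0     # identical clouds give zero loss
print(knn_loss(cloud, cloud + 0.1, k=3))      # small shift gives a small positive loss
```

In an unsupervised pipeline, this loss would compare clouds sampled from the reconstructed skeleton against the obfuscated input clouds.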
https://arxiv.org/abs/2405.07444
The ability of deep networks to learn superior representations hinges on leveraging the proper inductive biases, considering the inherent properties of datasets. In tabular domains, it is critical to effectively handle heterogeneous features (both categorical and numerical) in a unified manner and to grasp irregular functions like piecewise constant functions. To address the challenges in the self-supervised learning framework, we propose a novel pretext task based on the classical binning method. The idea is straightforward: reconstructing the bin indices (either orders or classes) rather than the original values. This pretext task provides the encoder with an inductive bias to capture the irregular dependencies, mapping from continuous inputs to discretized bins, and mitigates the feature heterogeneity by setting all features to have category-type targets. Our empirical investigations ascertain several advantages of binning: capturing the irregular function, compatibility with encoder architecture and additional modifications, standardizing all features into equal sets, grouping similar values within a feature, and providing ordering information. Comprehensive evaluations across diverse tabular datasets corroborate that our method consistently improves tabular representation learning performance for a wide range of downstream tasks. The codes are available in this https URL.
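The binning pretext target can be illustrated concretely: every numerical column is replaced by quantile-bin indices, which the encoder would then be trained to reconstruct as category-type targets. The quantile edges and the choice of 10 bins are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def bin_targets(x, n_bins=10):
    """Replace each numerical column with its quantile-bin index.

    x has shape (n_rows, n_features). Returns integer targets in
    [0, n_bins - 1] per cell; these discretized indices, not the raw
    values, serve as the reconstruction targets of the pretext task.
    """
    qs = np.linspace(0, 1, n_bins + 1)[1:-1]          # interior quantile levels
    edges = np.quantile(x, qs, axis=0)                # (n_bins - 1, n_features)
    # Column-wise digitize: index of the bin each value falls into.
    return np.stack(
        [np.digitize(x[:, j], edges[:, j]) for j in range(x.shape[1])], axis=1
    )

rng = np.random.default_rng(0)
table = rng.normal(size=(1000, 4))
targets = bin_targets(table, n_bins=10)
print(targets.shape, targets.min(), targets.max())  # (1000, 4) 0 9
```

Because every feature now has the same categorical target space, heterogeneous columns are standardized into equal sets, which is one of the advantages the abstract lists.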
https://arxiv.org/abs/2405.07414
In Reinforcement Learning (RL), training a policy from scratch with online experiences can be inefficient because of the difficulties in exploration. Recently, offline RL has provided a promising solution by supplying an initialized offline policy, which can be refined through online interactions. However, existing approaches primarily perform offline and online learning in the same task, without considering the task generalization problem in offline-to-online adaptation. In real-world applications, it is common that we only have an offline dataset from a specific task while aiming for fast online adaptation to several tasks. To address this problem, our work builds upon the investigation of successor representations for task generalization in online RL and extends the framework to incorporate offline-to-online learning. We demonstrate that the conventional paradigm using successor features cannot effectively utilize offline data or improve the performance for the new task by online fine-tuning. To mitigate this, we introduce a novel methodology that leverages offline data to acquire an ensemble of successor representations and subsequently constructs ensemble Q functions. This approach enables robust representation learning from datasets with different coverage and facilitates fast adaptation of Q functions towards new tasks during the online fine-tuning phase. Extensive empirical evaluations provide compelling evidence showcasing the superior performance of our method in generalizing to diverse or even unseen tasks.
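The successor-feature decomposition behind the ensemble Q construction can be sketched numerically. Under the standard identity Q(s, a) = psi(s, a) . w, each ensemble member contributes one Q estimate; the mean aggregation, shapes, and dimensions below are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def ensemble_q(psis, w):
    """Q-values from an ensemble of successor features for reward weights w.

    psis: (n_members, n_actions, d) successor features for one state;
    w: (d,) task reward weights. Each member yields one Q estimate via
    Q = psi . w; averaging is one simple aggregation (a pessimistic min
    is another). All shapes here are illustrative.
    """
    qs = psis @ w                  # (n_members, n_actions)
    return qs.mean(axis=0)         # aggregate over the ensemble

rng = np.random.default_rng(0)
psis = rng.normal(size=(5, 4, 8))  # 5 members, 4 actions, 8-dim features
w = rng.normal(size=8)             # task-specific reward weights
q = ensemble_q(psis, w)
print(int(q.argmax()))             # greedy action under the ensemble Q
```

Fast adaptation to a new task then amounts to fitting a new `w` against observed rewards while reusing the learned successor representations.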
https://arxiv.org/abs/2405.07223
An effective pre-training framework with universal 3D representations is highly desired for perceiving large-scale dynamic scenes. However, establishing such an ideal framework that is both task-generic and label-efficient poses a challenge in unifying the representation of the same primitive across diverse scenes. Current contrastive 3D pre-training methods typically follow a frame-level consistency, which focuses on the 2D-3D relationships in each detached image. Such myopic consistency greatly hampers the promising path toward a universal pre-training framework: (1) the cross-scene semantic self-conflict, i.e., the intense collision between primitive segments of the same semantics from different scenes; (2) the lack of a globally unified bond that pushes cross-scene semantic consistency into 3D representation learning. To address the above challenges, we propose a CSC framework that places scene-level semantic consistency at its core, bridging the connection between similar semantic segments across various scenes. To achieve this goal, we combine the coherent semantic cues provided by the vision foundation model with the knowledge-rich cross-scene prototypes derived from complementary multi-modality information. These allow us to train a universal 3D pre-training model that facilitates various downstream tasks with less fine-tuning effort. Empirically, we achieve consistent improvements over SOTA pre-training approaches in semantic segmentation (+1.4% mIoU), object detection (+1.0% mAP), and panoptic segmentation (+3.0% PQ) using their task-specific 3D networks on nuScenes. Code is released at this https URL, hoping to inspire future research.
https://arxiv.org/abs/2405.07201
Inductive representation learning on temporal heterogeneous graphs is crucial for scalable deep learning on heterogeneous information networks (HINs) that are time-varying, such as citation networks. However, most existing approaches are not inductive and thus cannot handle new nodes or edges. Moreover, previous temporal graph embedding methods are often trained with the temporal link prediction task to simulate the link formation process of temporal graphs, while ignoring the evolution of high-order topological structures on temporal graphs. To fill these gaps, we propose a Continuous-Time Representation Learning (CTRL) model on temporal HINs. To preserve heterogeneous node features and temporal structures, CTRL integrates three components in a single layer: 1) a \emph{heterogeneous attention} unit that measures the semantic correlation between nodes, 2) an \emph{edge-based Hawkes process} that captures the temporal influence between heterogeneous nodes, and 3) a \emph{dynamic centrality} term that indicates the dynamic importance of a node. We train the CTRL model with a future event (subgraph) prediction task to capture the evolution of the high-order network structure. Extensive experiments have been conducted on three benchmark datasets. The results demonstrate that our model significantly boosts performance and outperforms various state-of-the-art approaches. Ablation studies further demonstrate the effectiveness of the model design.
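The temporal kernel of a Hawkes-style process can be made concrete in a couple of lines: each historical interaction on an edge contributes an exponentially decayed influence. The decay rate below is a hypothetical hyperparameter, and a full edge-based Hawkes process would additionally learn base intensities and excitation weights per edge type.

```python
import numpy as np

def hawkes_weights(event_times, now, decay=0.5):
    """Exponentially decayed influence of past neighbor events.

    A toy version of the Hawkes temporal kernel: an event at time t
    contributes exp(-decay * (now - t)) to the current intensity.
    The decay rate is an illustrative assumption.
    """
    dts = now - np.asarray(event_times, dtype=float)
    return np.exp(-decay * dts)

w = hawkes_weights([1.0, 3.0, 4.5], now=5.0)
print(np.round(w, 3))  # recent events weigh more than old ones
```

In a CTRL-style layer, such weights would modulate the attention a node pays to each historical neighbor interaction.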
https://arxiv.org/abs/2405.08013
While a number of knowledge graph representation learning (KGRL) methods have been proposed over the past decade, very few theoretical analyses have been conducted on them. In this paper, we present the first PAC-Bayesian generalization bounds for KGRL methods. To analyze a broad class of KGRL models, we propose a generic framework named ReED (Relation-aware Encoder-Decoder), which consists of a relation-aware message passing encoder and a triplet classification decoder. Our ReED framework can express at least 15 different existing KGRL models, including not only graph neural network-based models such as R-GCN and CompGCN but also shallow-architecture models such as RotatE and ANALOGY. Our generalization bounds for the ReED framework provide theoretical grounds for the commonly used tricks in KGRL, e.g., parameter-sharing and weight normalization schemes, and guide desirable design choices for practical KGRL methods. We empirically show that the critical factors in our generalization bounds can explain actual generalization errors on three real-world knowledge graphs.
https://arxiv.org/abs/2405.06418
Time Series Representation Learning (TSRL) focuses on generating informative representations for various Time Series (TS) modeling tasks. Traditional Self-Supervised Learning (SSL) methods in TSRL fall into four main categories: reconstructive, adversarial, contrastive, and predictive, each with a common challenge of sensitivity to noise and intricate data nuances. Recently, diffusion-based methods have shown advanced generative capabilities. However, they primarily target specific application scenarios like imputation and forecasting, leaving a gap in leveraging diffusion models for generic TSRL. Our work, Time Series Diffusion Embedding (TSDE), bridges this gap as the first diffusion-based SSL TSRL approach. TSDE segments TS data into observed and masked parts using an Imputation-Interpolation-Forecasting (IIF) mask. It applies a trainable embedding function, featuring dual-orthogonal Transformer encoders with a crossover mechanism, to the observed part. We train a reverse diffusion process conditioned on the embeddings, designed to predict noise added to the masked part. Extensive experiments demonstrate TSDE's superiority in imputation, interpolation, forecasting, anomaly detection, classification, and clustering. We also conduct an ablation study, present embedding visualizations, and compare inference speed, further substantiating TSDE's efficiency and validity in learning representations of TS data.
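The Imputation-Interpolation-Forecasting (IIF) mask that splits a series into observed and masked parts can be sketched as below. The three regimes follow the abstract's naming, but the sampling ratios and block sizes are illustrative assumptions.

```python
import numpy as np

def iif_mask(n_steps, rng=None):
    """Sample an Imputation-Interpolation-Forecasting style mask.

    Returns a boolean array where True marks masked (to-be-predicted)
    steps. One of three regimes is drawn: scattered points (imputation),
    a contiguous interior block (interpolation), or a suffix
    (forecasting). The ratios used here are illustrative.
    """
    rng = rng or np.random.default_rng(0)
    mask = np.zeros(n_steps, dtype=bool)
    mode = rng.integers(3)
    if mode == 0:                        # imputation: scattered 20% of steps
        idx = rng.choice(n_steps, size=n_steps // 5, replace=False)
        mask[idx] = True
    elif mode == 1:                      # interpolation: interior block
        start = rng.integers(1, n_steps // 2)
        mask[start:start + n_steps // 4] = True
    else:                                # forecasting: final quarter
        mask[-(n_steps // 4):] = True
    return mask

m = iif_mask(100)
print(m.sum(), "steps masked out of", m.size)
```

The embedding function would see only the observed steps (`~m`), and the conditional diffusion process would be trained to denoise the masked ones.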
https://arxiv.org/abs/2405.05959
Embodied AI agents require a fine-grained understanding of the physical world mediated through visual and language inputs. Such capabilities are difficult to learn solely from task-specific data. This has led to the emergence of pre-trained vision-language models as a tool for transferring representations learned from internet-scale data to downstream tasks and new domains. However, commonly used contrastively trained representations such as in CLIP have been shown to fail at enabling embodied agents to gain a sufficiently fine-grained scene understanding -- a capability vital for control. To address this shortcoming, we consider representations from pre-trained text-to-image diffusion models, which are explicitly optimized to generate images from text prompts and as such, contain text-conditioned representations that reflect highly fine-grained visuo-spatial information. Using pre-trained text-to-image diffusion models, we construct Stable Control Representations which allow learning downstream control policies that generalize to complex, open-ended environments. We show that policies learned using Stable Control Representations are competitive with state-of-the-art representation learning approaches across a broad range of simulated control settings, encompassing challenging manipulation and navigation tasks. Most notably, we show that Stable Control Representations enable learning policies that exhibit state-of-the-art performance on OVMM, a difficult open-vocabulary navigation benchmark.
https://arxiv.org/abs/2405.05852
Representation learning, and interpreting learned representations, are key areas of focus in machine learning and neuroscience. Both fields generally use representations as a means to understand or improve a system's computations. In this work, however, we explore surprising dissociations between representation and computation that may pose challenges for such efforts. We create datasets in which we attempt to match the computational role that different features play, while manipulating other properties of the features or the data. We train various deep learning architectures to compute these multiple abstract features about their inputs. We find that their learned feature representations are systematically biased towards representing some features more strongly than others, depending upon extraneous properties such as feature complexity, the order in which features are learned, and the distribution of features over the inputs. For example, features that are simpler to compute or learned first tend to be represented more strongly and densely than features that are more complex or learned later, even if all features are learned equally well. We also explore how these biases are affected by architectures, optimizers, and training regimes (e.g., in transformers, features decoded earlier in the output sequence also tend to be represented more strongly). Our results help to characterize the inductive biases of gradient-based representation learning. These results also highlight a key challenge for interpretability, or for comparing the representations of models and brains: disentangling extraneous biases from the computationally important aspects of a system's internal representations.
https://arxiv.org/abs/2405.05847
Emotion recognition is an important part of affective computing. Extracting emotional cues from human gaits yields benefits such as natural interaction, a nonintrusive nature, and remote detection. Recently, the introduction of self-supervised learning techniques offers a practical solution to the issues arising from the scarcity of labeled data in the field of gait-based emotion recognition. However, due to the limited diversity of gaits and the incompleteness of feature representations for skeletons, the existing contrastive learning methods are usually inefficient for the acquisition of gait emotions. In this paper, we propose a contrastive learning framework utilizing selective strong augmentation (SSA) for self-supervised gait-based emotion representation, which aims to derive effective representations from limited labeled gait data. First, we propose an SSA method for the gait emotion recognition task, which includes upper body jitter and random spatiotemporal mask. The goal of SSA is to generate more diverse and targeted positive samples and prompt the model to learn more distinctive and robust feature representations. Then, we design a complementary feature fusion network (CFFN) that facilitates the integration of cross-domain information to acquire topological structural and global adaptive features. Finally, we implement the distributional divergence minimization loss to supervise the representation learning of the generally and strongly augmented queries. Our approach is validated on the Emotion-Gait (E-Gait) and Emilya datasets and outperforms the state-of-the-art methods under different evaluation protocols.
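The two SSA operations named above, upper-body jitter and a random spatiotemporal mask, can be sketched on a skeleton clip as follows. Which joint indices count as "upper body", the jitter scale, and the mask ratio are illustrative assumptions about the augmentation, not the paper's exact settings.

```python
import numpy as np

def selective_strong_augment(seq, upper_joints, rng=None,
                             jitter_std=0.05, mask_ratio=0.1):
    """Apply upper-body jitter and a random spatiotemporal mask to a gait clip.

    seq has shape (frames, joints, 3). Gaussian noise perturbs only the
    listed upper-body joints; a random fraction of (frame, joint) cells
    is then zeroed out. All hyperparameters here are illustrative.
    """
    rng = rng or np.random.default_rng(0)
    out = seq.copy()
    # Upper-body jitter: small Gaussian noise on the selected joints only.
    noise = rng.normal(scale=jitter_std,
                       size=(seq.shape[0], len(upper_joints), 3))
    out[:, upper_joints, :] += noise
    # Random spatiotemporal mask: zero out a fraction of (frame, joint) cells.
    cells = rng.random(seq.shape[:2]) < mask_ratio
    out[cells] = 0.0
    return out

clip = np.ones((30, 16, 3))                       # 30 frames, 16 joints
aug = selective_strong_augment(clip, upper_joints=[0, 1, 2, 3])
print(aug.shape, (aug != clip).any())
```

In the contrastive framework, such strongly augmented clips would form the targeted positive samples contrasted against generally augmented queries.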
https://arxiv.org/abs/2405.04900
Vision-centric autonomous driving has recently raised wide attention due to its lower cost. Pre-training is essential for extracting a universal representation. However, current vision-centric pre-training typically relies on either 2D or 3D pre-text tasks, overlooking the temporal characteristics of autonomous driving as a 4D scene understanding task. In this paper, we address this challenge by introducing a world model-based autonomous driving 4D representation learning framework, dubbed \emph{DriveWorld}, which is capable of pre-training from multi-camera driving videos in a spatio-temporal fashion. Specifically, we propose a Memory State-Space Model for spatio-temporal modelling, which consists of a Dynamic Memory Bank module for learning temporal-aware latent dynamics to predict future changes and a Static Scene Propagation module for learning spatial-aware latent statics to offer comprehensive scene contexts. We additionally introduce a Task Prompt to decouple task-aware features for various downstream tasks. The experiments demonstrate that DriveWorld delivers promising results on various autonomous driving tasks. When pre-trained with the OpenScene dataset, DriveWorld achieves a 7.5% increase in mAP for 3D object detection, a 3.0% increase in IoU for online mapping, a 5.0% increase in AMOTA for multi-object tracking, a 0.1m decrease in minADE for motion forecasting, a 3.0% increase in IoU for occupancy prediction, and a 0.34m reduction in average L2 error for planning.
https://arxiv.org/abs/2405.04390
Scene text images contain not only style information (font, background) but also content information (character, texture). Different scene text tasks need different information, but previous representation learning methods use tightly coupled features for all tasks, resulting in sub-optimal performance. We propose a Disentangled Representation Learning framework (DARLING) aimed at disentangling these two types of features for improved adaptability in better addressing various downstream tasks (choose what you really need). Specifically, we synthesize a dataset of image pairs with identical style but different content. Based on this dataset, we decouple the two types of features through the supervision design: we directly split the visual representation into style and content features, where the content features are supervised by a text recognition loss and an alignment loss aligns the style features within each image pair. Then, the style features are employed to reconstruct the counterpart image via an image decoder, with a prompt that indicates the counterpart's content. Such an operation effectively decouples the features based on their distinctive properties. To the best of our knowledge, this is the first work in the field of scene text to disentangle the inherent properties of text images. Our method achieves state-of-the-art performance in Scene Text Recognition, Removal, and Editing.
https://arxiv.org/abs/2405.04377
Cross-modal knowledge transfer enhances point cloud representation learning in LiDAR semantic segmentation. Despite its potential, the \textit{weak teacher challenge} arises due to repetitive and non-diverse car camera images and sparse, inaccurate ground truth labels. To address this, we propose the Efficient Image-to-LiDAR Knowledge Transfer (ELiTe) paradigm. ELiTe introduces Patch-to-Point Multi-Stage Knowledge Distillation, transferring comprehensive knowledge from the Vision Foundation Model (VFM), extensively trained on diverse open-world images. This enables effective knowledge transfer to a lightweight student model across modalities. ELiTe employs Parameter-Efficient Fine-Tuning to strengthen the VFM teacher and expedite large-scale model training with minimal costs. Additionally, we introduce the Segment Anything Model based Pseudo-Label Generation approach to enhance low-quality image labels, facilitating robust semantic representations. Efficient knowledge transfer in ELiTe yields state-of-the-art results on the SemanticKITTI benchmark, outperforming real-time inference models. Our approach achieves this with significantly fewer parameters, confirming its effectiveness and efficiency.
https://arxiv.org/abs/2405.04121
Cancer, a leading cause of death globally, occurs due to genomic changes and manifests heterogeneously across patients. To advance research on personalized treatment strategies, the effectiveness of various drugs on cells derived from cancers (`cell lines') is experimentally determined in laboratory settings. Nevertheless, variations in the distribution of genomic data and drug responses between cell lines and humans arise due to biological and environmental differences. Moreover, while genomic profiles of many cancer patients are readily available, the scarcity of corresponding drug response data limits the ability to train machine learning models that can predict drug response in patients effectively. Recent cancer drug response prediction methods have largely followed the paradigm of unsupervised domain-invariant representation learning followed by a downstream drug response classification step. Introducing supervision in both stages is challenging due to heterogeneous patient response to drugs and limited drug response data. This paper addresses these challenges through a novel representation learning method in the first phase and weak supervision in the second. Experimental results on real patient data demonstrate the efficacy of our method (WISER) over state-of-the-art alternatives on predicting personalized drug response.
Cancer, a leading cause of death worldwide, arises from genomic changes and manifests heterogeneously across patients. To advance research on personalized treatment strategies, the effectiveness of various drugs on cells derived from cancers (`cell lines') is experimentally determined in laboratory settings. However, due to biological and environmental differences, the distributions of genomic data and drug responses differ between cell lines and humans. Moreover, while genomic profiles of many cancer patients are readily available, the scarcity of corresponding drug response data limits the ability to train machine learning models that can effectively predict drug response in patients. Recent cancer drug response prediction methods have largely followed the paradigm of unsupervised domain-invariant representation learning followed by a downstream drug response classification step. Introducing supervision in both stages is challenging due to heterogeneous patient responses to drugs and limited drug response data. This paper addresses these challenges through a novel representation learning method in the first stage and weak supervision in the second. Experimental results on real patient data demonstrate that our method (WISER) outperforms state-of-the-art alternatives in predicting personalized drug response.
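For illustration, the weak-supervision stage could take the form of confidence-thresholded pseudo-labeling of patient samples; both the thresholding scheme and the threshold value here are assumptions for this sketch, not necessarily what WISER actually does.

```python
import numpy as np

def confident_pseudo_labels(probs, threshold=0.8):
    """Keep only pseudo-labels whose predicted class probability clears a
    confidence threshold; returns the retained labels and a boolean mask
    over the input samples (illustrative weak-supervision stand-in)."""
    confidence = probs.max(axis=1)
    labels = probs.argmax(axis=1)
    keep = confidence >= threshold
    return labels[keep], keep
```

The retained labels can then supervise a downstream drug response classifier while low-confidence samples are held out.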
https://arxiv.org/abs/2405.04078
Graph representation learning has become a hot research topic due to its powerful nonlinear fitting capability in extracting representative node embeddings. However, for sequential data such as speech signals, most traditional methods merely focus on the static graph created within a sequence, and largely overlook the intrinsic evolving patterns of these data. This may reduce the efficiency of graph representation learning for sequential data. For this reason, we propose an adaptive graph representation learning method based on dynamically evolved graphs, which are consecutively constructed on a series of subsequences segmented by a sliding window. In doing so, the method better captures local and global context information within a long sequence. Moreover, we introduce a weighted approach to update the node representation rather than the conventional average one, where the weights are calculated by a novel matrix computation based on the degrees of neighboring nodes. Finally, we construct a learnable graph convolutional layer that combines the graph structure loss and classification loss to optimize the graph structure. To verify the effectiveness of the proposed method, we conducted experiments for speech emotion recognition on the IEMOCAP and RAVDESS datasets. Experimental results show that the proposed method outperforms the latest (non-)graph-based models.
Owing to its powerful nonlinear fitting capability in extracting representative node embeddings, graph representation learning has become a hot research topic. However, for sequential data such as speech signals, most traditional methods focus only on the static graph created within a sequence and largely overlook the intrinsic evolving patterns of these data. This may reduce the effectiveness of graph representation learning for sequential data. We therefore propose an adaptive graph representation learning method based on dynamically evolved graphs, which are consecutively constructed on a series of subsequences segmented by a sliding window. In doing so, local and global context information within a long sequence is better captured. Moreover, we introduce a weighted approach to update the node representation instead of the conventional average one, where the weights are obtained via a novel matrix computation based on the degrees of neighboring nodes. Finally, we construct a learnable graph convolutional layer that combines the graph structure loss and the classification loss to optimize the graph structure. To verify the effectiveness of the proposed method, we conducted speech emotion recognition experiments on the IEMOCAP and RAVDESS datasets. Experimental results show that the proposed method outperforms the latest (non-)graph-based models.
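The two concrete mechanisms above (sliding-window subsequence segmentation and degree-weighted node updates) can be sketched as follows; the window and hop sizes and the exact weighting matrix are simplified assumptions, not the paper's precise computation.

```python
import numpy as np

def window_indices(seq_len, win, hop):
    """Index arrays of the subsequences cut by a sliding window."""
    return [np.arange(s, s + win) for s in range(0, seq_len - win + 1, hop)]

def degree_weighted_update(X, A):
    """Update each node as a degree-weighted (rather than plain) average of
    its neighbours: neighbour j contributes in proportion to its degree
    (a simplified stand-in for the paper's matrix computation)."""
    deg = A.sum(axis=1)                                  # node degrees
    W = A * deg[None, :]                                 # weight neighbour j by deg[j]
    W = W / np.clip(W.sum(axis=1, keepdims=True), 1e-12, None)
    return W @ X
```

Each window would yield one graph whose node features are frame embeddings; the update then mixes neighbour features with high-degree neighbours weighted more heavily.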
https://arxiv.org/abs/2405.03956
Graph Neural Networks (GNNs) have excelled in learning from graph-structured data, especially in understanding the relationships within a single graph, i.e., intra-graph relationships. Despite their successes, GNNs are limited by neglecting the context of relationships across graphs, i.e., inter-graph relationships. Recognizing the potential to extend this capability, we introduce Relating-Up, a plug-and-play module that enhances GNNs by exploiting inter-graph relationships. This module incorporates a relation-aware encoder and a feedback training strategy. The former enables GNNs to capture relationships across graphs, enriching relation-aware graph representations through collective context. The latter utilizes a feedback loop mechanism for the recursive refinement of these representations, leveraging insights from inter-graph dynamics to guide the feedback loop. The synergy between these two innovations results in a robust and versatile module. Relating-Up enhances the expressiveness of GNNs, enabling them to encapsulate a wider spectrum of graph relationships with greater precision. Our evaluations across 16 benchmark datasets demonstrate that integrating Relating-Up into GNN architectures substantially improves performance, positioning Relating-Up as a formidable choice for a broad spectrum of graph representation learning tasks.
Graph Neural Networks (GNNs) excel at processing graph-structured data, especially at understanding relationships within a single graph, i.e., intra-graph relationships. Despite these successes, GNNs are limited by neglecting the context of relationships across graphs, i.e., inter-graph relationships. To extend this capability, we introduce Relating-Up, a plug-and-play module that enhances GNNs by exploiting inter-graph relationships. The module comprises a relation-aware encoder and a feedback training strategy. The former enables GNNs to capture relationships across graphs, enriching relation-aware graph representations through collective context. The latter uses a feedback loop mechanism to recursively refine these representations, leveraging insights from inter-graph dynamics to guide the feedback loop. The synergy between these two innovations yields a robust and versatile module. Relating-Up enhances the expressiveness of GNNs, enabling them to encapsulate a wider spectrum of graph relationships with greater precision. Our evaluation on 16 benchmark datasets shows that integrating Relating-Up into GNN architectures substantially improves performance, positioning it as a strong choice for a broad range of graph representation learning tasks.
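A minimal sketch of a relation-aware encoder over a batch of graph-level embeddings: each graph attends to the other graphs in the batch and mixes in their embeddings residually. The similarity measure, temperature, and residual form are assumptions for illustration, not the paper's exact module; it requires a batch of at least two graphs.

```python
import numpy as np

def relate_up(G, tau=1.0):
    """Enrich each graph-level embedding (rows of G) with an
    attention-weighted mixture of the *other* graphs in the batch,
    added residually."""
    S = G @ G.T / tau                 # pairwise graph similarities
    np.fill_diagonal(S, -np.inf)      # a graph does not attend to itself
    S = S - S.max(axis=1, keepdims=True)
    A = np.exp(S)
    A = A / A.sum(axis=1, keepdims=True)
    return G + A @ G                  # residual relation-aware embedding
```

In a full pipeline, `G` would come from a GNN readout over each graph, and the enriched embeddings would feed the classifier, with the feedback training strategy refining them across epochs.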
https://arxiv.org/abs/2405.03950