In many applications, there is a demand for algorithms that can align partially overlapping point sets while remaining invariant to the corresponding transformations. This work presents a method that meets these requirements by minimizing the objective function of the robust point matching (RPM) algorithm. First, we show that the RPM objective is a cubic polynomial. Then, through variable substitution, we transform the RPM objective into a quadratic function. Leveraging the convex envelope of bilinear monomials, we relax the resulting objective function, obtaining a lower-bound problem that can be conveniently decomposed into distinct linear assignment and low-dimensional convex quadratic program components, both amenable to efficient optimization. Furthermore, a branch-and-bound (BnB) algorithm is devised that branches only over the transformation parameters, thereby boosting the convergence rate. Empirical evaluations demonstrate that the proposed method is more robust than prevailing state-of-the-art approaches to non-rigid deformation, positional noise, and outliers, particularly when the outliers remain distinct from the inliers.
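For reference, the convex envelope used in such relaxations is the standard McCormick envelope: for a bilinear monomial $w = xy$ with $x \in [x_L, x_U]$ and $y \in [y_L, y_U]$, it is given by the four linear inequalities below (a textbook fact; the paper's specific variable bounds come from its own transformation parametrization).

```latex
% McCormick envelope of the bilinear monomial w = xy on a box
\begin{aligned}
w &\ge x_L\, y + x\, y_L - x_L\, y_L, \qquad & w &\ge x_U\, y + x\, y_U - x_U\, y_U,\\
w &\le x_U\, y + x\, y_L - x_U\, y_L, \qquad & w &\le x_L\, y + x\, y_U - x_L\, y_U.
\end{aligned}
```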
https://arxiv.org/abs/2405.08589
This paper addresses the problem of pathological lung segmentation, a significant challenge in medical image analysis, particularly pronounced in cases of peripheral opacities (severe fibrosis and consolidation) because of the textural similarity between lung tissue and surrounding areas. To overcome these challenges, this paper employs CycleGAN for unpaired image-to-image translation, providing an augmentation method able to generate fake pathological images matching an existing ground truth. Although previous studies have employed CycleGAN, they often neglect the challenge of shape deformation, which is crucial for accurate medical image segmentation. Our work introduces an innovative strategy that incorporates additional loss functions. Specifically, it proposes an L1 loss based on the lung surrounding, whose shape is constrained to remain unchanged across the transition from the healthy to the pathological domain. The lung surrounding is derived from ground-truth lung masks available in the healthy domain. Furthermore, preprocessing steps, such as cropping based on rib/vertebra locations, are applied to refine the input to the CycleGAN, ensuring that the network focuses on the lung region. This is essential to avoid extraneous biases, such as the zoom-effect bias, which can divert attention from the main task. The method is applied to enhance, in a semi-supervised manner, the lung segmentation process by employing a U-Net model trained with on-the-fly data augmentation that incorporates synthetic pathological tissues generated by the CycleGAN model. Preliminary results from this research demonstrate significant qualitative and quantitative improvements, setting a new benchmark in the field of pathological lung segmentation. Our code is available at this https URL
https://arxiv.org/abs/2405.08556
Multi-Head Attention (MHA) is a key component of the Transformer. In MHA, attention heads work independently, causing problems such as the low-rank bottleneck of attention score matrices and head redundancy. We propose Dynamically Composable Multi-Head Attention (DCMHA), a parameter- and computation-efficient attention architecture that tackles the shortcomings of MHA and increases the expressive power of the model by dynamically composing attention heads. At the core of DCMHA is a $\it{Compose}$ function that transforms the attention score and weight matrices in an input-dependent way. DCMHA can be used as a drop-in replacement for MHA in any Transformer architecture to obtain the corresponding DCFormer. DCFormer significantly outperforms the Transformer across architectures and model scales in language modeling, matching the performance of models with ~1.7x-2.0x compute. For example, DCPythia-6.9B outperforms the open-source Pythia-12B on both pretraining perplexity and downstream task evaluation. The code and models are available at this https URL.
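As a rough illustration of input-dependent head composition (not the authors' implementation: the module name, the low-rank parametrization, the identity initialization, and all dimensions are assumptions), a minimal PyTorch sketch of a Compose-style step on the attention scores might look like this:

```python
import torch
import torch.nn as nn

class ComposeSketch(nn.Module):
    """Hypothetical sketch of input-dependent head composition.

    Mixes the H per-head attention score maps with an H x H matrix that
    depends on the query representation, so heads can share information.
    The paper applies such a transform to both score and weight matrices;
    this sketch only shows the score case.
    """
    def __init__(self, num_heads: int, d_model: int, rank: int = 2):
        super().__init__()
        self.h = num_heads
        self.rank = rank
        # static (input-independent) head mixing, initialized to identity
        self.w_static = nn.Parameter(torch.eye(num_heads))
        # low-rank, input-dependent mixing predicted from the query tokens
        self.to_dyn = nn.Linear(d_model, 2 * num_heads * rank)

    def forward(self, scores: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # scores: (B, H, Tq, Tk), query: (B, Tq, d_model)
        b, h, tq, tk = scores.shape
        u, v = self.to_dyn(query).chunk(2, dim=-1)      # (B, Tq, H*rank) each
        u = u.view(b, tq, h, self.rank)
        v = v.view(b, tq, h, self.rank)
        dyn = torch.einsum('btir,btjr->btij', u, v)     # (B, Tq, H, H)
        mix = self.w_static + dyn                       # per-query mixing matrix
        # compose: the new score of head i is a mixture of all heads' scores
        return torch.einsum('btij,bjtk->bitk', mix, scores)
```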
https://arxiv.org/abs/2405.08553
Class-incremental learning (CIL) has emerged as a means to learn new classes incrementally without catastrophic forgetting of previous classes. Recently, CIL has undergone a paradigm shift towards dynamic architectures due to their superior performance. However, these models are still limited in the following aspects: (i) Data augmentation (DA), which is tightly coupled with CIL, remains under-explored in dynamic architecture scenarios. (ii) Feature representation. The discriminativeness of dynamic features is sub-optimal and leaves room for refinement. (iii) Classifier. The misalignment between dynamic features and the classifier constrains the capabilities of the model. To tackle these drawbacks, we propose the Dynamic Feature Learning and Matching (DFLM) model, addressing all three perspectives. Specifically, we first introduce class weight information and non-stationary functions to extend the mix DA method for dynamically adjusting the focus on memory during training. Then, a von Mises-Fisher (vMF) classifier is employed to effectively model the dynamic feature distribution and implicitly learn its discriminative properties. Finally, a matching loss is proposed to facilitate the alignment between the learned dynamic features and the classifier by minimizing the distribution distance. Extensive experiments on CIL benchmarks validate that our proposed model achieves significant performance improvements over existing methods.
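In its simplest form, a von Mises-Fisher classifier head reduces to cosine similarity between L2-normalized features and class mean directions, scaled by a concentration parameter. A minimal sketch (the kappa value and head details here are illustrative assumptions, not the paper's configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VMFClassifier(nn.Module):
    """Minimal sketch of a von Mises-Fisher style classifier head.

    Features and class prototypes are L2-normalized so that logits are
    cosine similarities scaled by a concentration parameter kappa; this
    models class-conditional feature distributions on the unit sphere.
    """
    def __init__(self, feat_dim: int, num_classes: int, kappa: float = 16.0):
        super().__init__()
        self.proto = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.kappa = kappa

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        feats = F.normalize(feats, dim=-1)        # points on the unit sphere
        proto = F.normalize(self.proto, dim=-1)   # vMF mean directions
        return self.kappa * feats @ proto.t()     # (B, num_classes) logits

# usage: loss = F.cross_entropy(VMFClassifier(512, 100)(features), labels)
```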
https://arxiv.org/abs/2405.08533
Image matching remains challenging in scenes with large viewpoint or illumination changes, or with low texture. In this paper, we propose a Transformer-based pseudo-3D image matching method. It upgrades the 2D features extracted from the source image to 3D features with the help of a reference image, and matches them to the 2D features extracted from the destination image via coarse-to-fine 3D matching. Our key insight is that by introducing the reference image, the source image's fine points are screened and their feature descriptors are further enriched from 2D to 3D, which improves matching performance against the destination image. Experimental results on multiple datasets show that the proposed method achieves the state of the art on homography estimation, pose estimation, and visual localization, especially in challenging scenes.
https://arxiv.org/abs/2405.08434
From a feature matching perspective, optical flow estimation for event cameras involves identifying event correspondences by comparing feature similarity across accompanying event frames. In this work, we introduce an effective and robust high-dimensional (HD) feature descriptor for event frames, utilizing Vector Symbolic Architectures (VSA). The topological similarity among neighboring variables within VSA contributes to the enhanced representation similarity of feature descriptors for flow-matching points, while its structured symbolic representation capacity facilitates feature fusion from both event polarities and multiple spatial scales. Based on this HD feature descriptor, we propose a novel feature matching framework for event-based optical flow, encompassing both model-based (VSA-Flow) and self-supervised learning (VSA-SM) methods. In VSA-Flow, accurate optical flow estimation validates the effectiveness of HD feature descriptors. In VSA-SM, a novel similarity maximization method based on the HD feature descriptor is proposed to learn optical flow in a self-supervised way from events alone, eliminating the need for auxiliary grayscale images. Evaluation results demonstrate that our VSA-based method achieves superior accuracy compared with both model-based and self-supervised learning methods on the DSEC benchmark, while remaining competitive with both on the MVSEC benchmark. This contribution marks a significant advancement in event-based optical flow within the feature matching methodology.
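To make the VSA machinery concrete, here is a minimal NumPy sketch of bipolar hypervectors with binding and bundling, the standard mechanism for fusing, e.g., per-polarity descriptors into a single frame descriptor (the dimensionality and bipolar encoding are common VSA choices assumed here, not necessarily the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8192  # hypervector dimensionality

def rand_hv():
    """Random bipolar hypervector (a common VSA choice)."""
    return rng.choice((-1, 1), size=D)

def bind(a, b):
    """Binding (elementwise product): associates two symbols."""
    return a * b

def bundle(*hvs):
    """Bundling (majority sign of the sum): superposes symbols."""
    s = np.sum(hvs, axis=0)
    return np.where(s >= 0, 1, -1)

def sim(a, b):
    """Normalized dot product; near 0 for unrelated hypervectors."""
    return float(a @ b) / D

# e.g. fuse per-polarity descriptors into one event-frame descriptor
pos_key, neg_key = rand_hv(), rand_hv()
pos_desc, neg_desc = rand_hv(), rand_hv()   # stand-ins for real descriptors
fused = bundle(bind(pos_key, pos_desc), bind(neg_key, neg_desc))
print(sim(bind(pos_key, fused), pos_desc))  # ~0.5: pos_desc is recoverable
```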
https://arxiv.org/abs/2405.08300
An authentic 3D hand avatar with all identifiable information, such as hand shape and texture, is necessary for immersive experiences in AR/VR. In this paper, we present a universal hand model (UHM), which 1) can universally represent high-fidelity 3D hand meshes of arbitrary identities (IDs) and 2) can be adapted to each person with a short phone scan to produce an authentic hand avatar. For effective universal hand modeling, we perform tracking and modeling at the same time, whereas previous 3D hand models perform them separately. The conventional separate pipeline suffers from accumulated errors in the tracking stage, which cannot be recovered in the modeling stage. Ours, in contrast, does not suffer from accumulated errors while having a much more concise overall pipeline. We additionally introduce a novel image matching loss function to address skin sliding during tracking and modeling, an issue that existing works have largely overlooked. Finally, using learned priors from our UHM, we effectively adapt the UHM to each person's short phone scan to produce an authentic hand avatar.
https://arxiv.org/abs/2405.07933
Many natural language processing systems operate over tokenizations of text to address the open-vocabulary problem. In this paper, we give and analyze an algorithm for the efficient construction of deterministic finite automata designed to operate directly on tokenizations produced by the popular byte pair encoding technique. This makes it possible to apply many existing techniques and algorithms to the tokenized case, such as pattern matching, equivalence checking of tokenization dictionaries, and composing tokenized languages in various ways.
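For intuition about the language such automata recognize, here is a toy Python sketch: a token sequence is canonical exactly when re-encoding its underlying text with the BPE merge rules reproduces it. The brute-force check below only illustrates the membership condition that the constructed DFA decides efficiently; it is not the paper's construction.

```python
def bpe_encode(text, merges):
    """Toy byte-pair encoding: greedily apply merges in priority order.

    `merges` is an ordered list of pairs, highest priority first, as in
    standard BPE; tokens start as single characters.
    """
    rank = {pair: i for i, pair in enumerate(merges)}
    toks = list(text)
    while True:
        # find the highest-priority adjacent pair present in the sequence
        pairs = [(rank[(a, b)], i)
                 for i, (a, b) in enumerate(zip(toks, toks[1:]))
                 if (a, b) in rank]
        if not pairs:
            return toks
        _, i = min(pairs)
        toks[i:i + 2] = [toks[i] + toks[i + 1]]

def is_canonical(tokens, merges):
    """A token sequence is canonical iff re-encoding its text reproduces it."""
    return bpe_encode("".join(tokens), merges) == list(tokens)

merges = [("a", "b"), ("ab", "c")]
print(bpe_encode("abcab", merges))              # ['abc', 'ab']
print(is_canonical(["ab", "c", "ab"], merges))  # False: not what BPE produces
```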
https://arxiv.org/abs/2405.07671
Object detection techniques for Unmanned Aerial Vehicles (UAVs) rely on Deep Neural Networks (DNNs), which are vulnerable to adversarial attacks. Nonetheless, adversarial patches generated by existing algorithms in the UAV domain pay very little attention to the naturalness of the patches. Moreover, imposing constraints directly on adversarial patches makes it difficult to generate patches that appear natural to the human eye while ensuring a high attack success rate. We notice that patches look natural when their overall color is consistent with the environment. Therefore, we propose a new method named Environmental Matching Attack (EMA) to address the issue of optimizing the adversarial patch under color constraints. To the best of our knowledge, this paper is the first to consider natural patches in the UAV domain. The EMA method exploits the strong prior knowledge of a pretrained Stable Diffusion model to guide the optimization direction of the adversarial patch, where text guidance can restrict the color of the patch. To better match the environment, the contrast and brightness of the patch are appropriately adjusted. Instead of optimizing the adversarial patch itself, we optimize an adversarial perturbation patch initialized to zero, so that the model can better trade off attack performance and naturalness. Experiments conducted on the DroneVehicle and Carpk datasets show that our work reaches nearly the same attack performance in the digital attack (within 2 mAP$\%$), surpasses the baseline method in specific physical scenarios, and exhibits a significant advantage in naturalness, both in visualization and in color difference with the environment.
https://arxiv.org/abs/2405.07595
Point cloud registration is a fundamental task for estimating rigid transformations between point clouds. Previous studies have used geometric information for feature extraction, matching, and transformation estimation. Recently, owing to the advancement of RGB-D sensors, researchers have attempted to utilize visual information to improve registration performance. However, these studies focused on extracting distinctive features by deep feature fusion, which cannot effectively mitigate the negative effects of each feature's weaknesses and cannot sufficiently leverage the valid information. In this paper, we propose a new feature combination framework, which applies a looser but more effective fusion and achieves better performance. An explicit filter based on transformation consistency is designed for the combination framework, which can overcome each feature's weaknesses. An adaptive threshold determined by the error distribution is also proposed to extract more valid information from the two types of features. Owing to this distinctive design, our proposed framework can estimate more accurate correspondences and is applicable to both hand-crafted and learning-based feature descriptors. Experiments on ScanNet show that our method achieves state-of-the-art performance, with a rotation accuracy of 99.1%.
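A minimal sketch of an explicit transformation-consistency filter, assuming 3D point correspondences, the standard Kabsch rigid fit, and a residual-quantile rule as a stand-in for the paper's error-distribution-based adaptive threshold:

```python
import numpy as np

def rigid_fit(src, dst):
    """Least-squares rigid transform (Kabsch): dst ~ R @ src + t.

    src, dst: (N, 3) corresponding points.
    """
    cs, cd = src.mean(0), dst.mean(0)
    H = (src - cs).T @ (dst - cd)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cd - R @ cs

def consistency_filter(src, dst, quantile=0.7):
    """Keep correspondences consistent with the fitted transform.

    Residuals under the fitted transform give an inlier mask; the
    threshold is adapted to the residual distribution.
    """
    R, t = rigid_fit(src, dst)
    resid = np.linalg.norm(src @ R.T + t - dst, axis=1)
    return resid <= np.quantile(resid, quantile)  # boolean inlier mask
```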
https://arxiv.org/abs/2405.07594
Linking (aligning) biomedical concepts across diverse data sources enables various integrative analyses, but is challenging due to discrepancies in concept naming conventions. Various strategies have been developed to overcome this challenge, such as those based on string-matching rules, manually crafted thesauri, and machine learning models. However, these methods are constrained by limited prior biomedical knowledge and can hardly generalize beyond their limited rules, thesauri, or training samples. Recently, large language models (LLMs) have exhibited impressive results on diverse biomedical NLP tasks due to their unprecedentedly rich prior knowledge and strong zero-shot prediction abilities. However, LLMs suffer from issues including high costs, limited context length, and unreliable predictions. In this research, we propose PromptLink, a novel biomedical concept linking framework that leverages LLMs. It first employs a biomedical-specialized pre-trained language model to generate candidate concepts that fit in the LLM context window. Then it utilizes an LLM to link concepts through two-stage prompts: the first-stage prompt elicits the biomedical prior knowledge from the LLM for the concept linking task, and the second-stage prompt asks the LLM to reflect on its own predictions to further enhance their reliability. Empirical results on the concept linking task between two EHR datasets and an external biomedical KG demonstrate the effectiveness of PromptLink. Furthermore, PromptLink is a generic framework that does not rely on additional prior knowledge, context, or training data, making it well-suited for concept linking across various types of data sources. The source code is available at this https URL.
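Schematically, the two-stage prompting could look like the Python sketch below; the prompt wording and the `ask_llm` callable are illustrative assumptions, not the paper's templates or API:

```python
def link_concept(query, candidates, ask_llm):
    """Two-stage prompting sketch in the spirit of PromptLink.

    `ask_llm` is any callable that sends a prompt string to an LLM and
    returns its text reply; `candidates` come from a biomedical encoder's
    nearest neighbors so they fit in the context window.
    """
    stage1 = (
        "You are a biomedical expert. Which of the following concepts "
        f"refers to the same entity as '{query}'?\n"
        + "\n".join(f"- {c}" for c in candidates)
        + "\nAnswer with the single best candidate."
    )
    first_pick = ask_llm(stage1)
    stage2 = (
        f"You previously linked '{query}' to '{first_pick}'. Reflect on "
        "this choice: is it correct? Reply with the final answer, or "
        "'none' if no candidate matches."
    )
    return ask_llm(stage2)
```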
https://arxiv.org/abs/2405.07500
Text-based person retrieval (TPR) aims to retrieve images of a person from an extensive array of candidates based on a given textual description. The core challenge lies in mapping visual and textual data into a unified latent space. While existing TPR methods concentrate on recognizing explicit and positive characteristics, they often neglect the critical influence of negative descriptors, resulting in potential false positives that fulfill positive criteria but should be excluded by negative descriptors. To alleviate these issues, we introduce DualFocus, a unified framework that integrates positive and negative descriptors to enhance the interpretative accuracy of vision-language foundation models with respect to textual queries. DualFocus employs Dual (Positive/Negative) Attribute Prompt Learning (DAPL), which integrates Dual Image-Attribute Contrastive (DIAC) Learning and Sensitive Image-Attributes Matching (SIAM) Learning. In this way, DualFocus enhances the detection of unseen attributes, thereby boosting retrieval precision. To further balance coarse- and fine-grained alignment of visual and textual embeddings, we propose the Dynamic Tokenwise Similarity (DTS) loss, which refines the representation of both matching and non-matching descriptions, thereby enhancing the matching process through a detailed and adaptable similarity assessment. By focusing on token-level comparisons, DualFocus significantly outperforms existing techniques in both precision and robustness. The experimental results highlight DualFocus's superior performance on CUHK-PEDES, ICFG-PEDES, and RSTPReid.
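As a hedged illustration of token-level matching, a late-interaction similarity (each text token matched to its best image token, then averaged) is a plausible stand-in for a dynamic token-wise similarity; the paper's actual weighting scheme may differ:

```python
import torch
import torch.nn.functional as F

def tokenwise_similarity(img_tokens, txt_tokens, txt_mask):
    """Token-level matching score sketch (in the spirit of the DTS loss).

    For every text token, take its best-matching image token and average
    over real (non-padding) tokens.
    img_tokens: (B, Ni, D), txt_tokens: (B, Nt, D),
    txt_mask: (B, Nt), 1 for real tokens and 0 for padding.
    """
    img = F.normalize(img_tokens, dim=-1)
    txt = F.normalize(txt_tokens, dim=-1)
    sim = torch.einsum('bnd,bmd->bnm', txt, img)   # (B, Nt, Ni) cosine sims
    best = sim.max(dim=-1).values                  # best image token per text token
    best = best * txt_mask                         # ignore padding tokens
    return best.sum(-1) / txt_mask.sum(-1).clamp(min=1)  # (B,) matching score
```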
https://arxiv.org/abs/2405.07459
While vision-language pretrained models (VLMs) excel in various multimodal understanding tasks, their potential in fine-grained audio-visual reasoning, particularly for audio-visual question answering (AVQA), remains largely unexplored. AVQA presents specific challenges for VLMs due to the requirement of visual understanding at the region level and seamless integration with the audio modality. Previous VLM-based AVQA methods merely used CLIP as a feature encoder but underutilized its knowledge, and treated audio and video as separate entities in a dual-stream framework, as most AVQA methods do. This paper proposes a new CLIP-powered target-aware single-stream (TASS) network for AVQA that exploits the pretrained model's image-text matching knowledge through the naturally occurring audio-visual correspondence. It consists of two key components: the target-aware spatial grounding module (TSG+) and the single-stream joint temporal grounding module (JTG). Specifically, we propose a TSG+ module to transfer the image-text matching knowledge from CLIP models to our region-text matching process without corresponding ground-truth labels. Moreover, unlike previous separate dual-stream networks that still required an additional audio-visual fusion module, JTG unifies audio-visual fusion and question-aware temporal grounding in a simplified single-stream architecture. It treats audio and video as a cohesive entity and further extends the pretrained image-text knowledge to audio-text matching by preserving their temporal correlation with our proposed cross-modal synchrony (CMS) loss. Extensive experiments conducted on the MUSIC-AVQA benchmark verify the effectiveness of our proposed method over existing state-of-the-art methods.
https://arxiv.org/abs/2405.07451
In visual place recognition, accurately identifying and matching images of locations under varying environmental conditions and viewpoints remains a significant challenge. In this paper, we introduce a new technique, called Bag-of-Queries (BoQ), which learns a set of global queries designed to capture universal place-specific attributes. Unlike existing methods that employ self-attention and generate the queries directly from the input features, BoQ employs distinct learnable global queries, which probe the input features via cross-attention, ensuring consistent information aggregation. In addition, our technique provides an interpretable attention mechanism and integrates with both CNN and Vision Transformer backbones. The performance of BoQ is demonstrated through extensive experiments on 14 large-scale benchmarks. It consistently outperforms current state-of-the-art techniques including NetVLAD, MixVPR and EigenPlaces. Moreover, as a global retrieval technique (one-stage), BoQ surpasses two-stage retrieval methods, such as Patch-NetVLAD, TransVPR and R2Former, all while being orders of magnitude faster and more efficient. The code and model weights are publicly available at this https URL.
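A minimal PyTorch sketch of the core mechanism, a fixed set of learnable global queries probing backbone features via cross-attention (the number of queries, dimensions, and final pooling are illustrative assumptions, not the paper's configuration):

```python
import torch
import torch.nn as nn

class BagOfQueries(nn.Module):
    """Minimal sketch of the Bag-of-Queries idea described above.

    Learnable global queries attend to the backbone's local features via
    cross-attention; the attended outputs are flattened and normalized
    into a global descriptor for retrieval.
    """
    def __init__(self, num_queries: int = 32, dim: int = 256, heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, dim) local features from a CNN or ViT backbone
        q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, feats, feats)  # queries probe the features
        out = self.norm(out)
        return nn.functional.normalize(out.flatten(1), dim=-1)  # global descriptor
```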
https://arxiv.org/abs/2405.07364
Dense vector representations for sentences have made significant progress in recent years, as can be seen on sentence similarity tasks. Real-world phrase retrieval applications, on the other hand, still encounter challenges in using dense representations effectively. We show that when target phrases reside inside noisy context, representing the full sentence with a single dense vector is not sufficient for effective phrase retrieval. We therefore look into representing multiple, sub-sentence, consecutive word spans, each with its own dense vector. We show that this technique is much more effective for phrase mining, yet requires considerable compute to obtain useful span representations. Accordingly, we argue for contextualized word/token embeddings that can be aggregated over arbitrary word spans while maintaining the span's semantic meaning. We introduce a modification to the common contrastive loss used for sentence embeddings that encourages word embeddings to have this property. To demonstrate the effect of this method, we present a dataset based on the STS-B dataset with additional generated text, which requires finding the best-matching paraphrase residing in a larger context, and we report the degree of similarity to the original phrase. We demonstrate on this dataset how our proposed method achieves better results without a significant increase in compute.
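To make the span-representation idea concrete, here is a short sketch assuming mean pooling as the aggregation and a standard InfoNCE loss as a stand-in for the paper's modified contrastive objective:

```python
import torch
import torch.nn.functional as F

def span_embedding(token_embs, start, end):
    """Aggregate contextualized token embeddings into a span vector.

    Mean pooling is one simple aggregation; the point of the method above
    is that a modified contrastive objective makes such aggregates
    preserve the span's meaning. token_embs: (T, D); span is [start, end).
    """
    return F.normalize(token_embs[start:end].mean(0), dim=-1)

def contrastive_loss(anchor_spans, positives, temperature=0.05):
    """Standard InfoNCE over span/paraphrase pairs (a hedged stand-in for
    the paper's modified loss). anchor_spans, positives: (B, D), normalized."""
    logits = anchor_spans @ positives.t() / temperature  # (B, B) similarities
    target = torch.arange(logits.size(0), device=anchor_spans.device)
    return F.cross_entropy(logits, target)               # diagonal = positives
```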
https://arxiv.org/abs/2405.07263
Video-language pre-training is a typical and challenging problem that aims at learning visual and textual representations from large-scale data in a self-supervised way. Existing pre-training approaches either capture the correspondence of image-text pairs or utilize the temporal ordering of frames. However, they do not explicitly explore the natural synchronization between audio and the other two modalities. In this work, we propose an enhanced framework for Video-Language pre-training with Synchronized Audio, termed VLSA, that can learn tri-modal representations in a unified self-supervised transformer. Specifically, our VLSA jointly aggregates embeddings of local patches and global tokens for video, text, and audio. Furthermore, we utilize local-patch masked modeling to learn modality-aware features, and leverage global audio matching to capture audio-guided features for video and text. We conduct extensive experiments on retrieval across text, video, and audio. Our simple model, pre-trained on only 0.9M data, achieves improved results over state-of-the-art baselines. In addition, qualitative visualizations vividly showcase the superiority of our VLSA in learning discriminative visual-textual representations.
https://arxiv.org/abs/2405.07202
In this paper, we show that transferring knowledge from other domains of video understanding, combined with large-scale learning, can improve the robustness of Video Object Segmentation (VOS) under complex circumstances. Namely, we focus on integrating scene global motion knowledge to improve large-scale semi-supervised Video Object Segmentation. Prior works on VOS mostly rely on direct comparison of semantic and contextual features to perform dense matching between current and past frames, passing over the actual motion structure. On the other hand, the Optical Flow Estimation task aims to approximate the scene motion field, exposing global motion patterns which are typically undiscoverable during all-pairs similarity search. We present WarpFormer, an architecture for semi-supervised Video Object Segmentation that exploits existing knowledge in motion understanding to conduct smoother propagation and more accurate matching. Our framework employs a generic pretrained Optical Flow Estimation network whose prediction is used to warp both past frames and instance segmentation masks to the current frame domain. Consequently, the warped segmentation masks are refined and fused together, aiming to inpaint occluded regions and eliminate artifacts caused by flow field imperfections. Additionally, we employ the novel large-scale MOSE 2023 dataset to train the model on various complex scenarios. Our method demonstrates strong performance on DAVIS 2016/2017 validation (93.0% and 85.9%), DAVIS 2017 test-dev (80.6%) and YouTube-VOS 2019 validation (83.8%), competitive with alternative state-of-the-art methods while using a much simpler memory mechanism and instance understanding logic.
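The propagation step relies on standard backward warping with a flow field; a minimal PyTorch sketch of warping a past-frame mask (or feature map) to the current frame (interface names are assumptions, not WarpFormer's code):

```python
import torch
import torch.nn.functional as F

def warp_with_flow(mask: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp a past-frame tensor to the current frame with optical flow.

    mask: (B, C, H, W) past segmentation masks or features;
    flow: (B, 2, H, W) current->past displacements in pixels.
    Standard backward warping via grid_sample.
    """
    b, _, h, w = flow.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=flow.dtype, device=flow.device),
        torch.arange(w, dtype=flow.dtype, device=flow.device),
        indexing="ij",
    )
    x = xs.unsqueeze(0) + flow[:, 0]   # sample locations in the past frame
    y = ys.unsqueeze(0) + flow[:, 1]
    grid = torch.stack(                # normalize to [-1, 1] for grid_sample
        (2.0 * x / (w - 1) - 1.0, 2.0 * y / (h - 1) - 1.0), dim=-1
    )
    return F.grid_sample(mask, grid, align_corners=True)
```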
https://arxiv.org/abs/2405.07031
Recent works on Video Object Segmentation achieved remarkable results by matching dense semantic and instance-level features between the current and previous frames for long-term propagation. Nevertheless, global feature matching ignores scene motion context and fails to satisfy temporal consistency. Even though some methods introduce a local matching branch to achieve smooth propagation, they fail to model complex appearance changes due to the constraints of the local window. In this paper, we present DeVOS (Deformable VOS), an architecture for Video Object Segmentation that combines memory-based matching with motion-guided propagation, resulting in stable long-term modeling and strong temporal consistency. For short-term local propagation, we propose a novel attention mechanism, ADVA (Adaptive Deformable Video Attention), allowing the similarity search region to adapt to query-specific semantic features, which ensures robust tracking of complex shape and scale changes. DeVOS employs optical flow to obtain scene motion features, which are further injected into the deformable attention as strong priors for the learnable offsets. Our method achieves top-rank performance on DAVIS 2017 val and test-dev (88.1%, 83.0%) and YouTube-VOS 2019 val (86.6%), while featuring consistent run-time speed and stable memory consumption.
https://arxiv.org/abs/2405.08715
This letter introduces a novel framework for dense Visual Simultaneous Localization and Mapping (VSLAM) based on Gaussian Splatting. Recent Gaussian Splatting-based SLAM has yielded promising results, but relies on RGB-D input and is weak in tracking. To address these limitations, we integrate, for the first time, advanced sparse visual odometry with a dense Gaussian Splatting scene representation, thereby eliminating the dependency on depth maps typical of Gaussian Splatting-based SLAM systems and enhancing tracking robustness. Here, the sparse visual odometry tracks camera poses in the RGB stream, while Gaussian Splatting handles map reconstruction. These components are interconnected through a Multi-View Stereo (MVS) depth estimation network. We also propose a depth smoothness loss to reduce the negative effect of estimated depth maps. Furthermore, consistency in scale between the sparse visual odometry and the dense Gaussian map is preserved by the Sparse-Dense Adjustment Ring (SDAR). We have evaluated our system across various synthetic and real-world datasets. The accuracy of our pose estimation surpasses existing methods and achieves state-of-the-art performance. Additionally, it outperforms previous monocular methods in terms of novel view synthesis fidelity, matching the results of neural SLAM systems that utilize RGB-D input.
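A common form of such a loss is edge-aware depth smoothness, which penalizes depth gradients except where the image itself has strong gradients; a minimal sketch assuming this standard formulation rather than the paper's exact definition:

```python
import torch

def depth_smooth_loss(depth: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
    """Edge-aware depth smoothness, a common 'depth smoothness' loss.

    Penalizes depth gradients, down-weighted where the image has strong
    gradients (likely true depth edges).
    depth: (B, 1, H, W), image: (B, 3, H, W).
    """
    dzdx = (depth[..., :, 1:] - depth[..., :, :-1]).abs()
    dzdy = (depth[..., 1:, :] - depth[..., :-1, :]).abs()
    didx = (image[..., :, 1:] - image[..., :, :-1]).abs().mean(1, keepdim=True)
    didy = (image[..., 1:, :] - image[..., :-1, :]).abs().mean(1, keepdim=True)
    return (dzdx * torch.exp(-didx)).mean() + (dzdy * torch.exp(-didy)).mean()
```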
https://arxiv.org/abs/2405.06241
Sora unveils the potential of scaling the Diffusion Transformer for generating photorealistic images and videos at arbitrary resolutions, aspect ratios, and durations, yet it still lacks sufficient implementation details. In this technical report, we introduce the Lumina-T2X family - a series of Flow-based Large Diffusion Transformers (Flag-DiT) equipped with zero-initialized attention, as a unified framework designed to transform noise into images, videos, multi-view 3D objects, and audio clips conditioned on text instructions. By tokenizing the latent spatial-temporal space and incorporating learnable placeholders such as [nextline] and [nextframe] tokens, Lumina-T2X seamlessly unifies the representations of different modalities across various spatial-temporal resolutions. This unified approach enables training within a single framework for different modalities and allows for flexible generation of multimodal data at any resolution, aspect ratio, and length during inference. Advanced techniques like RoPE, RMSNorm, and flow matching enhance the stability, flexibility, and scalability of Flag-DiT, enabling the models of Lumina-T2X to scale up to 7 billion parameters and extend the context window to 128K tokens. This is particularly beneficial for creating ultra-high-definition images with our Lumina-T2I model and long 720p videos with our Lumina-T2V model. Remarkably, Lumina-T2I, powered by a 5-billion-parameter Flag-DiT, requires only 35% of the training computational cost of a 600-million-parameter naive DiT. Our further comprehensive analysis underscores Lumina-T2X's preliminary capability in resolution extrapolation, high-resolution editing, generating consistent 3D views, and synthesizing videos with seamless transitions. We expect that the open-sourcing of Lumina-T2X will further foster creativity, transparency, and diversity in the generative AI community.
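Flow matching, one of the techniques listed, trains a velocity field along a simple probability path between noise and data; a generic sketch with a linear (rectified-flow) path, where `model(x_t, t)` is a hypothetical velocity-prediction interface rather than Flag-DiT's actual signature:

```python
import torch

def flow_matching_loss(model, x1: torch.Tensor) -> torch.Tensor:
    """Conditional flow matching with a linear (rectified-flow) path.

    Interpolates x_t = (1 - t) * x0 + t * x1 between noise x0 and data x1
    and regresses the model's velocity onto the constant target x1 - x0.
    A generic sketch of the flow-matching objective, not Lumina-T2X code.
    """
    x0 = torch.randn_like(x1)                 # noise sample
    t = torch.rand(x1.size(0), *([1] * (x1.dim() - 1)), device=x1.device)
    x_t = (1 - t) * x0 + t * x1               # point on the probability path
    v_target = x1 - x0                        # path velocity d x_t / d t
    return ((model(x_t, t) - v_target) ** 2).mean()
```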
https://arxiv.org/abs/2405.05945