Esophageal cancer is one of the most common types of cancer worldwide and ranks sixth in cancer-related mortality. Accurate computer-assisted diagnosis of cancer progression can help physicians effectively customize personalized treatment plans. Currently, CT-based cancer diagnosis methods have received much attention for their comprehensive ability to examine patients' conditions. However, multi-modal based methods may likely introduce information redundancy, leading to underperformance. In addition, efficient and effective interactions between multi-modal representations need to be further explored, lacking insightful exploration of prognostic correlation in multi-modality features. In this work, we introduce a multi-modal heterogeneous graph-based conditional feature-guided diffusion model for lymph node metastasis diagnosis based on CT images as well as clinical measurements and radiomics data. To explore the intricate relationships between multi-modal features, we construct a heterogeneous graph. Following this, a conditional feature-guided diffusion approach is applied to eliminate information redundancy. Moreover, we propose a masked relational representation learning strategy, aiming to uncover the latent prognostic correlations and priorities of primary tumor and lymph node image representations. Various experimental results validate the effectiveness of our proposed method. The code is available at this https URL.
食管癌是全球最常见的癌症之一,在癌症相关死亡率中排名第六。准确的多模态癌症分期诊断方法可以帮助医生有效地个性化治疗方案。目前,基于CT的癌症诊断方法因全面检查患者状况而受到广泛关注。然而,基于多模态的方法可能会引入信息冗余,导致性能下降。此外,需要进一步探索多模态表示之间的高效且有效的交互作用,缺乏对多模态特征的预测关联的深入探讨。在这项工作中,我们基于CT图像的 multi-modal 异质图条件特征指导扩散模型研究淋巴结转移的诊断,并使用临床测量和放射学数据。为了探索多模态特征之间的复杂关系,我们构建了一个异质图。接下来,应用条件特征引导扩散方法消除信息冗余。此外,我们提出了一种遮罩关系表示学习策略,旨在揭示原发肿瘤和淋巴结图像表示的潜在预后相关性和优先级。各种实验结果证实了我们提出方法的的有效性。代码可在此处访问:https://www.xxx.com
https://arxiv.org/abs/2405.09539
We present QueryNER, a manually-annotated dataset and accompanying model for e-commerce query segmentation. Prior work in sequence labeling for e-commerce has largely addressed aspect-value extraction which focuses on extracting portions of a product title or query for narrowly defined aspects. Our work instead focuses on the goal of dividing a query into meaningful chunks with broadly applicable types. We report baseline tagging results and conduct experiments comparing token and entity dropping for null and low recall query recovery. Challenging test sets are created using automatic transformations and show how simple data augmentation techniques can make the models more robust to noise. We make the QueryNER dataset publicly available.
我们提出了QueryNER,一个手动标注的电商查询分割数据集和相应的模型,用于电商查询分割。之前的工作在电商序列标注中主要解决了面向值的提取,即针对狭窄定义的方面提取产品标题或查询的部分。我们的工作重点在于将查询划分为有意义的片段,具有广泛应用的类型。我们报告了基线标签结果,并研究了对于空和低召回度的查询恢复,通过自动转换进行实验。具有挑战性的测试集是用自动转换创建的,展示了简单的数据增强技术如何使模型对噪声更加鲁棒。我们将QueryNER数据集公开发布。
https://arxiv.org/abs/2405.09507
Glass largely blurs the boundary between the real world and the reflection. The special transmittance and reflectance quality have confused the semantic tasks related to machine vision. Therefore, how to clear the boundary built by glass, and avoid over-capturing features as false positive information in deep structure, matters for constraining the segmentation of reflection surface and penetrating glass. We proposed the Fourier Boundary Features Network with Wider Catchers (FBWC), which might be the first attempt to utilize sufficiently wide horizontal shallow branches without vertical deepening for guiding the fine granularity segmentation boundary through primary glass semantic information. Specifically, we designed the Wider Coarse-Catchers (WCC) for anchoring large area segmentation and reducing excessive extraction from a structural perspective. We embed fine-grained features by Cross Transpose Attention (CTA), which is introduced to avoid the incomplete area within the boundary caused by reflection noise. For excavating glass features and balancing high-low layers context, a learnable Fourier Convolution Controller (FCC) is proposed to regulate information integration robustly. The proposed method has been validated on three different public glass segmentation datasets. Experimental results reveal that the proposed method yields better segmentation performance compared with the state-of-the-art (SOTA) methods in glass image segmentation.
玻璃在很大程度上模糊了现实世界和反射之间的边界。特殊的传输和反射质量使机器视觉相关的语义任务变得混乱。因此,如何清除玻璃构建的边界,并避免在深度结构中捕获到的特征作为假阳性信息,对于限制反射表面分割和穿透玻璃是有关于的。我们提出了Fourier边界特征网络(FBWC),这可能是第一个利用足够宽的横向浅分支,而不会导致垂直加深,指导通过主要玻璃语义信息进行细粒度分割边界的尝试。具体来说,我们设计了一个Wider粗 catch器(WCC),用于锚定大面积分割,并减少从结构角度引起的过度提取。我们通过跨变换注意(CTA)嵌入细粒度特征,这是为了避免反射噪声引起的不完整区域。为了挖掘玻璃特征并平衡高低层上下文,我们提出了一个可学习的Fourier卷积控制器(FCC)来调节信息整合的稳健性。所提出的方法已经在三个不同公开玻璃分割数据集上进行了验证。实验结果表明,与最先进的玻璃图像分割方法相比,所提出的方法具有更好的分割性能。
https://arxiv.org/abs/2405.09459
Modern democracies face a critical issue of declining citizen participation in decision-making. Online discussion forums are an important avenue for enhancing citizen participation. This thesis proposal 1) identifies the challenges involved in facilitating large-scale online discussions with Natural Language Processing (NLP), 2) suggests solutions to these challenges by incorporating hybrid human-AI technologies, and 3) investigates what these technologies can reveal about individual perspectives in online discussions. We propose a three-layered hierarchy for representing perspectives that can be obtained by a mixture of human intelligence and large language models. We illustrate how these representations can draw insights into the diversity of perspectives and allow us to investigate interactions in online discussions.
现代民主国家面临着公民参与度下降的一个关键问题。在线讨论论坛是提高公民参与度的重要途径。本论文提纲1)识别了促进大规模在线讨论与自然语言处理(NLP)相关的挑战,2)提出了通过融合人机智能技术来解决这些挑战的建议,3)研究了这些技术如何揭示关于在线讨论中个体观点的信息。我们提出了一个三层表示视角的三层结构,可以通过混合人类智慧和大型语言模型的方式获得。我们说明了这些表示如何揭示视角的多样性,并允许我们研究在线讨论中的互动。
https://arxiv.org/abs/2405.09439
Current orthopedic robotic systems largely focus on navigation, aiding surgeons in positioning a guiding tube but still requiring manual drilling and screw placement. The automation of this task not only demands high precision and safety due to the intricate physical interactions between the surgical tool and bone but also poses significant risks when executed without adequate human oversight. As it involves continuous physical interaction, the robot should collaborate with the surgeon, understand the human intent, and always include the surgeon in the loop. To achieve this, this paper proposes a new cognitive human-robot collaboration framework, including the intuitive AR-haptic human-robot interface, the visual-attention-based surgeon model, and the shared interaction control scheme for the robot. User studies on a robotic platform for orthopedic surgery are presented to illustrate the performance of the proposed method. The results demonstrate that the proposed human-robot collaboration framework outperforms full robot and full human control in terms of safety and ergonomics.
目前,机器人骨科系统主要关注导航,帮助医生在定位引导管时进行操作,但仍需要手动进行钻孔和螺栓植入。自动化这一任务不仅要求高精度和安全性,是由于手术工具与骨头的复杂物理相互作用所带来的,而且在缺乏充分人类监督的情况下执行也存在重大风险。由于涉及持续的身体交互,机器人应与医生合作,理解人类的意图,并始终将医生纳入循环。为实现这一目标,本文提出了一种新的人机协作框架,包括直观的AR-人机界面、基于视觉注意的医生模型和机器人共享交互控制方案。用户研究在骨科手术机器人平台上展示了所提出方法的有效性。结果表明,与全机器人控制和全人类控制相比,人机协作框架在安全和人机工程方面具有优势。
https://arxiv.org/abs/2405.09359
The multi-scale receptive field and large kernel attention (LKA) module have been shown to significantly improve performance in the lightweight image super-resolution task. However, existing lightweight super-resolution (SR) methods seldom pay attention to designing efficient building block with multi-scale receptive field for local modeling, and their LKA modules face a quadratic increase in computational and memory footprints as the convolutional kernel size increases. To address the first issue, we propose the multi-scale blueprint separable convolutions (MBSConv) as highly efficient building block with multi-scale receptive field, it can focus on the learning for the multi-scale information which is a vital component of discriminative representation. As for the second issue, we revisit the key properties of LKA in which we find that the adjacent direct interaction of local information and long-distance dependencies is crucial to provide remarkable performance. Thus, taking this into account and in order to mitigate the complexity of LKA, we propose a large coordinate kernel attention (LCKA) module which decomposes the 2D convolutional kernels of the depth-wise convolutional layers in LKA into horizontal and vertical 1-D kernels. LCKA enables the adjacent direct interaction of local information and long-distance dependencies not only in the horizontal direction but also in the vertical. Besides, LCKA allows for the direct use of extremely large kernels in the depth-wise convolutional layers to capture more contextual information, which helps to significantly improve the reconstruction performance, and it incurs lower computational complexity and memory footprints. Integrating MBSConv and LCKA, we propose a large coordinate kernel attention network (LCAN).
多尺度 receptive 场和大型内核注意 (LKA) 模块已经被证明在轻量图像超分辨率任务中显著提高了性能。然而,现有的轻量级超分辨率(SR)方法很少关注设计具有多尺度 receptive 场的有效构建模块,并且随着卷积核大小的增加,它们的 LKA 模块的计算和内存足迹呈指数增长。为解决第一个问题,我们提出了多尺度蓝色模板分离卷积(MBSConv)作为具有多尺度 receptive 场的非常高效构建模块,它可以关注多尺度信息,这是判别表示的重要组成部分。对于第二个问题,我们重新审视了 LKA 的关键特性,我们发现邻近信息之间的直接相互作用和长距离依赖关系对提供出色的性能至关重要。因此,考虑到这一点,为了减轻 LKA 的复杂性,我们提出了大型坐标卷积注意(LCKA)模块,它将 LKA 的深度卷积层中的 2D 卷积核拆分为水平和垂直 1D 卷积核。LCKA 不仅使相邻直接相互作用于局部信息和长距离依赖关系,而且在水平和垂直方向上都有。此外,LCKA 允许在深度卷积层中直接使用极其大的卷积核来捕捉更多的上下文信息,从而显著提高重构性能,并使其计算复杂性和内存足迹更低。将 MBSConv 和 LCKA 集成起来,我们提出了大型坐标卷积注意网络 (LCAN)。
https://arxiv.org/abs/2405.09353
Image-guided depth completion aims at generating a dense depth map from sparse LiDAR data and RGB image. Recent methods have shown promising performance by reformulating it as a classification problem with two sub-tasks: depth discretization and probability prediction. They divide the depth range into several discrete depth values as depth categories, serving as priors for scene depth distributions. However, previous depth discretization methods are easy to be impacted by depth distribution variations across different scenes, resulting in suboptimal scene depth distribution priors. To address the above problem, we propose a progressive depth decoupling and modulating network, which incrementally decouples the depth range into bins and adaptively generates multi-scale dense depth maps in multiple stages. Specifically, we first design a Bins Initializing Module (BIM) to construct the seed bins by exploring the depth distribution information within a sparse depth map, adapting variations of depth distribution. Then, we devise an incremental depth decoupling branch to progressively refine the depth distribution information from global to local. Meanwhile, an adaptive depth modulating branch is developed to progressively improve the probability representation from coarse-grained to fine-grained. And the bi-directional information interactions are proposed to strengthen the information interaction between those two branches (sub-tasks) for promoting information complementation in each branch. Further, we introduce a multi-scale supervision mechanism to learn the depth distribution information in latent features and enhance the adaptation capability across different scenes. Experimental results on public datasets demonstrate that our method outperforms the state-of-the-art methods. The code will be open-sourced at [this https URL](this https URL).
图像引导深度完成旨在从稀疏的激光雷达数据和彩色图像中生成密集的深度图。最近的方法通过将其转化为分类问题,将其分为两个子任务:深度离散化和概率预测,通过分段深度值,表现出良好的性能。他们将深度范围划分为几个离散的深度值作为深度类别,作为场景深度分布的预训练。然而,之前的深度离散化方法很容易受到不同场景中深度分布的变化影响,导致场景深度分布预优劣。为了应对这个问题,我们提出了一个逐步深度解耦和调制网络,该网络在多个阶段逐步解耦深度范围并生成多尺度密集深度图。具体来说,我们首先设计了一个Bins初始化模块(BIM),通过探索稀疏深度图中的深度分布信息来构建种子值,适应深度分布的变化。然后,我们设计了一个逐步深度解耦分支,从全局到局部逐步优化深度分布信息。同时,我们还开发了一个自适应深度调制分支,从粗粒度到细粒度逐步改进概率表示。为了增强这两个分支(子任务)之间的信息交互,以提高每个分支的信息互补,我们引入了多尺度监督机制,以学习潜在特征中的深度分布信息,并增强在不同场景下的适应能力。在公开数据集上的实验结果表明,我们的方法优于最先进的方法。代码将在此处[https://this](https://this)公开源码。
https://arxiv.org/abs/2405.09342
While content-based image retrieval (CBIR) has been extensively studied in natural image retrieval, its application to medical images presents ongoing challenges, primarily due to the 3D nature of medical images. Recent studies have shown the potential use of pre-trained vision embeddings for CBIR in the context of radiology image retrieval. However, a benchmark for the retrieval of 3D volumetric medical images is still lacking, hindering the ability to objectively evaluate and compare the efficiency of proposed CBIR approaches in medical imaging. In this study, we extend previous work and establish a benchmark for region-based and multi-organ retrieval using the TotalSegmentator dataset (TS) with detailed multi-organ annotations. We benchmark embeddings derived from pre-trained supervised models on medical images against embeddings derived from pre-trained unsupervised models on non-medical images for 29 coarse and 104 detailed anatomical structures in volume and region levels. We adopt a late interaction re-ranking method inspired by text matching for image retrieval, comparing it against the original method proposed for volume and region retrieval achieving retrieval recall of 1.0 for diverse anatomical regions with a wide size range. The findings and methodologies presented in this paper provide essential insights and benchmarks for the development and evaluation of CBIR approaches in the context of medical imaging.
虽然基于内容的图像检索(CBIR)在自然图像检索中已经得到了广泛研究,但在医学图像中应用时仍然存在挑战,主要原因是医学图像的3D性质。最近的研究表明,在放射学图像检索背景下,预训练视觉嵌入可能有用于CBIR。然而,还没有一个用于检索3D体积医学图像的基准,这阻碍了客观评估和比较所提出的CBIR方法在医学成像中的效率。在这项研究中,我们延长了以前的工作,并使用TotalSegmentator数据集(TS)建立了基于区域的和多器官检索的基准,并对医学图像和非医学图像的预训练嵌入进行了比较。我们对29个粗粒度和104个详细解剖结构的体积和区域水平的预训练嵌入进行了比较,采用了一种类似于文本匹配的晚期交互重新排名方法,将其与体积和区域检索的原始方法进行比较,实现了检索召回率为1.0,具有多样解剖结构的广泛大小范围。本文所提出的研究成果和方法提供了开发和评估CBIR方法在医学成像领域的必要见解和基准。
https://arxiv.org/abs/2405.09334
This paper explores a novel multi-modal alternating learning paradigm pursuing a reconciliation between the exploitation of uni-modal features and the exploration of cross-modal interactions. This is motivated by the fact that current paradigms of multi-modal learning tend to explore multi-modal features simultaneously. The resulting gradient prohibits further exploitation of the features in the weak modality, leading to modality competition, where the dominant modality overpowers the learning process. To address this issue, we study the modality-alternating learning paradigm to achieve reconcilement. Specifically, we propose a new method called ReconBoost to update a fixed modality each time. Herein, the learning objective is dynamically adjusted with a reconcilement regularization against competition with the historical models. By choosing a KL-based reconcilement, we show that the proposed method resembles Friedman's Gradient-Boosting (GB) algorithm, where the updated learner can correct errors made by others and help enhance the overall performance. The major difference with the classic GB is that we only preserve the newest model for each modality to avoid overfitting caused by ensembling strong learners. Furthermore, we propose a memory consolidation scheme and a global rectification scheme to make this strategy more effective. Experiments over six multi-modal benchmarks speak to the efficacy of the method. We release the code at this https URL.
本文探讨了一种新型的多模态交替学习范式,旨在实现单模态特征的充分利用和跨模态相互作用的探索之间的调和。这是由于当前的多模态学习范式倾向于同时探索多模态特征。由此产生的梯度禁止进一步挖掘弱模态特征,导致模态竞争,其中优势模态会压倒学习过程。为了解决这个问题,我们研究了模态交替学习范式以实现调和。具体来说,我们提出了一种名为ReconBoost的新方法,每次更新一个固定的模态。在这里,学习目标通过和谐 regularization 对抗历史模型的竞争进行动态调整。通过选择基于KL的和谐方法,我们证明了所提出的方法类似于Friedman的梯度 Boost (GB)算法,其中更新后的学习者可以纠正他人的错误并帮助提高整体性能。与经典GB的主要区别在于,我们只保留每个模态中最新的模型,以避免由于集成强势学习者而引起的过拟合。此外,我们还提出了一种记忆整合方案和全局矩形化方案,使这种策略更加有效。六个多模态基准的实验结果证实了这种方法的有效性。您可以在以下链接处获取代码:https://url.cn/
https://arxiv.org/abs/2405.09321
One goal of dexterous robotic grasping is to allow robots to handle objects with the same level of flexibility and adaptability as humans. However, it remains a challenging task to generate an optimal grasping strategy for dexterous hands, especially when it comes to delicate manipulation and accurate adjustment the desired grasping poses for objects of varying shapes and sizes. In this paper, we propose a novel dexterous grasp generation scheme called \textbf{\textit{GrainGrasp}} that provides fine-grained contact guidance for each fingertip. In particular, we employ a generative model to predict separate contact maps for each fingertip on the object point cloud, effectively capturing the specifics of finger-object interactions. In addition, we develop a new dexterous grasping optimization algorithm that solely relies on the point cloud as input, eliminating the necessity for complete mesh information of the object. By leveraging the contact maps of different fingertips, the proposed optimization algorithm can generate precise and determinable strategies for human-like object grasping. Experimental results confirm the efficiency of the proposed scheme. Our code is available at this https URL
灵巧机器人抓取的一个目标是使机器人能够像人类一样处理具有相同程度的灵活性和适应性的物体。然而,为灵巧的手生成最优抓取策略仍然是一个具有挑战性的任务,尤其是在处理形状和大小不等的物体时,更是如此。在本文中,我们提出了一个名为 \textbf{\textit{GrainGrasp}} 的新颖灵巧抓取生成方案,为每个手指提供细粒度的接触指导。 特别是,我们采用生成模型预测物体点云上每个手指的单独接触图,有效捕捉了手指与物体之间互动的特定细节。此外,我们还开发了一种仅依赖点云的灵巧抓取优化算法,消除了需要物体完整网格信息的必要性。通过利用不同手指的接触图,所提出的优化算法可以生成人类式物体抓取的精确和可确定策略。实验结果证实了所提出方案的有效性。我们的代码可以从该链接获取:
https://arxiv.org/abs/2405.09310
The graph neural networks has been proved to be an efficient machine learning technique in real life applications. The handwritten recognition is one of the useful area in real life use where both offline and online handwriting recognition are required. The chain code as feature extraction technique has shown significant results in literature and we have been able to use chain codes with graph neural networks. To the best of our knowledge, this work presents first time a novel combination of handwritten trajectories features as chain codes and graph neural networks together. The handwritten trajectories for offline handwritten text has been evaluated using recovery of drawing order, whereas online handwritten trajectories are directly used with chain codes. Our results prove that present combination surpass previous results and minimize error rate in few epochs only.
已经证明,图神经网络在现实生活中应用是有效的机器学习技术。手写识别是现实生活中的一个有用的领域,需要同时进行离线和在线手写识别。作为特征提取技术,链式码在文献中已经显示出显著的成果,我们能够使用图神经网络与链式码一起工作。据我们所知,这项工作首次将手写轨迹特征与链式码和图神经网络相结合,形成了一种新的组合。我们使用恢复绘制顺序来评估手写在线文本的手写轨迹,而在线手写轨迹则直接使用链式码。我们的结果证明,这种结合超出了以前的结果,并且在几轮训练后仅能最小化误差率。
https://arxiv.org/abs/2405.09247
Current speaker diarization systems rely on an external voice activity detection model prior to speaker embedding extraction on the detected speech segments. In this paper, we establish that the attention system of a speaker embedding extractor acts as a weakly supervised internal VAD model and performs equally or better than comparable supervised VAD systems. Subsequently, speaker diarization can be performed efficiently by extracting the VAD logits and corresponding speaker embedding simultaneously, alleviating the need and computational overhead of an external VAD model. We provide an extensive analysis of the behavior of the frame-level attention system in current speaker verification models and propose a novel speaker diarization pipeline using ECAPA2 speaker embeddings for both VAD and embedding extraction. The proposed strategy gains state-of-the-art performance on the AMI, VoxConverse and DIHARD III diarization benchmarks.
当前的讲话者语音识别系统在提取讲话者嵌入之前依赖于外部语音活动检测模型。在本文中,我们证明了发言者嵌入提取器的注意系统充当一个弱监督的内部VAD模型,并且其表现与相应的监督VAD系统相当或者更好。随后,通过同时提取VAD日志和相应的讲话者嵌入,可以高效地实现发言者识别。我们详细分析了当前讲话者验证模型中帧级注意系统的行为,并使用ECAPA2讲话者嵌入提出了用于VAD和嵌入提取的新讲话者识别流程。所提出的策略在AMI、VoxConverse和DIHARD III语调基准上获得了最先进的性能。
https://arxiv.org/abs/2405.09142
Internal Language Model (LM)-based methods use permutation language modeling (PLM) to solve the error correction caused by conditional independence in external LM-based methods. However, random permutations of human interference cause fit oscillations in the model training, and Iterative Refinement (IR) operation to improve multimodal information decoupling also introduces additional overhead. To address these issues, this paper proposes the Hierarchical Attention autoregressive Model with Adaptive Permutation (HAAP) to enhance the location-context-image interaction capability, improving autoregressive generalization with internal LM. First, we propose Implicit Permutation Neurons (IPN) to generate adaptive attention masks to dynamically exploit token dependencies. The adaptive masks increase the diversity of training data and prevent model dependency on a specific order. It reduces the training overhead of PLM while avoiding training fit oscillations. Second, we develop Cross-modal Hierarchical Attention mechanism (CHA) to couple context and image features. This processing establishes rich positional semantic dependencies between context and image while avoiding IR. Extensive experimental results show the proposed HAAP achieves state-of-the-art (SOTA) performance in terms of accuracy, complexity, and latency on several datasets.
内部语言模型(LM)方法使用变换语言建模(PLM)来解决外部LM方法中由条件独立性引起的错误纠正。然而,人类干扰的随机变换导致训练中的拟合振荡,迭代优化(IR)操作为了提高多模态信息解耦还会引入额外的开销。为了应对这些问题,本文提出了具有自适应变换的层次注意力自回归模型(HAAP),以增强位置上下文-图像交互能力,提高自回归模型的泛化。首先,我们提出隐式变换神经元(IPN)来生成自适应的注意力掩码以动态利用词依赖关系。自适应掩码增加了训练数据的多样性,防止了模型对特定顺序的依赖。同时,降低了PLM的训练开销,并避免了训练中的拟合振荡。其次,我们开发了跨模态层次注意力机制(CHA)来将上下文和图像特征耦合。这种处理建立了上下文和图像之间的丰富位置语义依赖关系,同时避免了IR。大量的实验结果表明,与最先进的(SOTA)性能相比,HAAP在准确性、复杂性和延迟方面都取得了卓越的成绩。
https://arxiv.org/abs/2405.09125
Humans use collaborative robots as tools for accomplishing various tasks. The interaction between humans and robots happens in tight shared workspaces. However, these machines must be safe to operate alongside humans to minimize the risk of accidental collisions. Ensuring safety imposes many constraints, such as reduced torque and velocity limits during operation, thus increasing the time to accomplish many tasks. However, for applications such as using collaborative robots as haptic interfaces with intermittent contacts for virtual reality applications, speed limitations result in poor user experiences. This research aims to improve the efficiency of a collaborative robot while improving the safety of the human user. We used Gaussian process models to predict human hand motion and developed strategies for human intention detection based on hand motion and gaze to improve the time for the robot and human security in a virtual environment. We then studied the effect of prediction. Results from comparisons show that the prediction models improved the robot time by 3\% and safety by 17\%. When used alongside gaze, prediction with Gaussian process models resulted in an improvement of the robot time by 2\% and the safety by 13\%.
人类使用协作机器人作为完成各种任务的工具。人类和机器人之间的互动发生在紧密共享的工作空间中。然而,为了最小化意外碰撞的风险,这些机器必须安全地与人类一起操作。确保安全性会带来许多限制,例如在操作期间减小扭矩和速度限制,从而增加完成许多任务的所需时间。然而,对于将协作机器人用作虚拟现实应用中的触觉接口的应用,速度限制会导致用户体验差。这项研究旨在提高协作机器人的效率,同时提高人类用户的可靠性。我们使用高斯过程模型预测人类手部运动,并基于手部动作和眼神来开发了人类意图检测策略,以提高机器人和人类在虚拟环境中的安全时间。然后我们研究了预测的影响。比较结果表明,预测模型提高了机器人的时间3%,安全性提高了17%。当与眼神结合使用时,使用高斯过程模型的预测提高了机器人的时间2%,安全性提高了13%。
https://arxiv.org/abs/2405.09109
This study developed an explainable AI for ship collision avoidance. Initially, a critic network composed of sub-task critic networks was proposed to individually evaluate each sub-task in collision avoidance to clarify the AI decision-making processes involved. Additionally, an attempt was made to discern behavioral intentions through a Q-value analysis and an Attention mechanism. The former focused on interpreting intentions by examining the increment of the Q-value resulting from AI actions, while the latter incorporated the significance of other ships in the decision-making process for collision avoidance into the learning objective. AI's behavioral intentions in collision avoidance were visualized by combining the perceived collision danger with the degree of attention to other ships. The proposed method was evaluated through a numerical experiment. The developed AI was confirmed to be able to safely avoid collisions under various congestion levels, and AI's decision-making process was rendered comprehensible to humans. The proposed method not only facilitates the understanding of DRL-based controllers/systems in the ship collision avoidance task but also extends to any task comprising sub-tasks.
这项研究开发了一个可解释性AI用于避碰。最初,提出了一种由子任务批评网络组成的批评网络,以单独评估避碰中的每个子任务,以阐明涉及避碰AI决策过程。此外,通过Q值分析和关注机制试图通过分析AI行动产生的Q值增量来辨别行为意图。前一个方法专注于通过检查AI行动产生的Q值的增加来解释意图,而另一个方法将其他船舶在避碰决策过程中的重要性纳入学习目标。通过将感知避碰危险与关注其他船舶的程度相结合,可视化了避碰AI的行为意图。所提出的方法通过数值实验进行了评估。经证实,该AI在各种拥塞级别下能够安全避碰,并且AI的决策过程对人类是可理解的。所提出的方法不仅有助于在避碰任务中理解基于强化学习的控制器/系统,而且还能扩展到包括子任务在内的任何任务。
https://arxiv.org/abs/2405.09081
Although face analysis has achieved remarkable improvements in the past few years, designing a multi-task face analysis model is still challenging. Most face analysis tasks are studied as separate problems and do not benefit from the synergy among related tasks. In this work, we propose a novel task-adaptive multi-task face analysis method named as Q-Face, which simultaneously performs multiple face analysis tasks with a unified model. We fuse the features from multiple layers of a large-scale pre-trained model so that the whole model can use both local and global facial information to support multiple tasks. Furthermore, we design a task-adaptive module that performs cross-attention between a set of query vectors and the fused multi-stage features and finally adaptively extracts desired features for each face analysis task. Extensive experiments show that our method can perform multiple tasks simultaneously and achieves state-of-the-art performance on face expression recognition, action unit detection, face attribute analysis, age estimation, and face pose estimation. Compared to conventional methods, our method opens up new possibilities for multi-task face analysis and shows the potential for both accuracy and efficiency.
尽管在过去的几年中,面部识别已经取得了显著的进步,但设计一个多任务面部识别模型仍然具有挑战性。大多数面部识别任务都被单独研究,并没有从相关任务之间的协同作用中受益。在本文中,我们提出了一种名为 Q-Face 的具有新颖性的多任务面部识别方法,该方法使用一个统一模型同时执行多个面部识别任务。我们将来自大型预训练模型的多个层次的特征融合在一起,使整个模型可以利用局部和全局面部信息来支持多个任务。此外,我们还设计了一个任务适应模块,在查询向量和融合多级特征之间进行跨注意,并最终根据每个面部识别任务自适应地提取所需特征。大量实验证明,我们的方法可以同时执行多个任务,在面部表情识别、动作单元检测、面部属性分析、年龄估计和面部姿态估计方面的表现均为最先进的水平。与传统方法相比,我们的方法为多任务面部识别提供了新的可能性,并展示了准确性和效率的潜力。
https://arxiv.org/abs/2405.09059
The detection and tracking of small targets in passive optical remote sensing (PORS) has broad applications. However, most of the previously proposed methods seldom utilize the abundant temporal features formed by target motion, resulting in poor detection and tracking performance for low signal-to-clutter ratio (SCR) targets. In this article, we analyze the difficulty based on spatial features and the feasibility based on temporal features of realizing effective detection. According to this analysis, we use a multi-frame as a detection unit and propose a detection method based on temporal energy selective scaling (TESS). Specifically, we investigated the composition of intensity temporal profiles (ITPs) formed by pixels on a multi-frame detection unit. For the target-present pixel, the target passing through the pixel will bring a weak transient disturbance on the ITP and introduce a change in the statistical properties of ITP. We use a well-designed function to amplify the transient disturbance, suppress the background and noise components, and output the trajectory of the target on the multi-frame detection unit. Subsequently, to solve the contradiction between the detection rate and the false alarm rate brought by the traditional threshold segmentation, we associate the temporal and spatial features of the output trajectory and propose a trajectory extraction method based on the 3D Hough transform. Finally, we model the trajectory of the target and propose a trajectory-based multi-target tracking method. Compared with the various state-of-the-art detection and tracking methods, experiments in multiple scenarios prove the superiority of our proposed methods.
被动光学遥感(PORS)中检测和跟踪小目标具有广泛的应用价值。然而,之前提出的大多数方法很少利用目标运动产生的丰富时变特征,导致低信号-噪声比(SCR)目标检测和跟踪性能较差。在本文中,我们分析基于空间特征和基于时变特征实现有效检测的难度,并根据分析结果提出了一种基于时变能量选择性缩放(TESS)的检测方法。具体来说,我们研究了多帧中像素产生的强度时变轮廓(ITP)的组成。对于目标存在的像素,穿过像素的目标会对ITP产生弱暂态干扰,并改变ITP的统计特性。我们使用一个精心设计的函数来放大暂态干扰,抑制背景和噪声分量,并输出目标在多帧检测单元上的轨迹。为了解决传统阈值分割带来的检测率和误报警率之间的矛盾,我们将输出轨迹的时域和空间特征相关联,并提出了基于3D Hough变换的轨迹提取方法。最后,我们建模了目标轨迹,并提出了基于轨迹的多目标跟踪方法。与各种最先进的检测和跟踪方法相比,多个场景下的实验证明了我们提出方法的优越性。
https://arxiv.org/abs/2405.09054
"How does the person in the bounding box feel?" Achieving human-level recognition of the apparent emotion of a person in real world situations remains an unsolved task in computer vision. Facial expressions are not enough: body pose, contextual knowledge, and commonsense reasoning all contribute to how humans perform this emotional theory of mind task. In this paper, we examine two major approaches enabled by recent large vision language models: 1) image captioning followed by a language-only LLM, and 2) vision language models, under zero-shot and fine-tuned setups. We evaluate the methods on the Emotions in Context (EMOTIC) dataset and demonstrate that a vision language model, fine-tuned even on a small dataset, can significantly outperform traditional baselines. The results of this work aim to help robots and agents perform emotionally sensitive decision-making and interaction in the future.
这个人如何感觉?在现实生活中,对一个人情感的明显认知仍是一个在计算机视觉中尚未解决的任务。仅仅依靠面部表情是不够的:身体姿势、上下文知识以及常识推理都参与了人类完成这个情感理论思维任务的方式。在本文中,我们研究了两种由最近的大型视觉语言模型推动的主要方法:1)图像标题 followed by a language-only LLM,2)在零散和微调设置下的视觉语言模型。我们在情感在上下文中(EMOTIC)数据集上评估这些方法,并证明了即使是对于小型数据集,经过微调的视觉语言模型也显著优于传统基线。本工作的结果旨在帮助机器人和代理在未来的情感敏感决策和交互中发挥作用。
https://arxiv.org/abs/2405.08992
Current datasets for long-form video understanding often fall short of providing genuine long-form comprehension challenges, as many tasks derived from these datasets can be successfully tackled by analyzing just one or a few random frames from a video. To address this issue, we present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding. This paper details our innovative approach for creating a question-answer dataset, utilizing advanced LLMs with human-in-the-loop and building upon human-generated raw data. Our comprehensive dataset comprises 305,000 multiple-choice questions (MCQs), covering various visual and multimodal aspects, including temporal comprehension, understanding human-object interactions, and reasoning about events or actions within a scene. Additionally, we evaluate recent video-centric LLMs, both open-source and proprietary, on the test split of our dataset. The findings reveal that even state-of-the-art video-centric LLMs significantly lag behind human performance in these tasks, highlighting the complexity and challenge inherent in video understanding. The dataset is available at this https URL
目前用于长视频理解的数据集往往无法提供真正的长视频理解挑战,因为许多来源于这些数据集的任务只要分析视频中的一两个随机帧就可以成功解决。为解决这个问题,我们提出了一个名为CinePile的新数据集和基准,专门为真正的长视频理解而设计。本文详细介绍了我们创造问题-答案数据集的方法,利用先进的神经网络与人类交互并基于人类生成的原始数据。我们全面的数据集包括305,000个多选题问题(MCQs),涵盖各种视觉和多模态方面,包括时间理解、理解人-对象交互和推理场景中事件或动作。此外,我们还评估了我们的数据集中的最新视频相关LLM,包括开源和专有版本,在测试集上。研究结果表明,即使是最先进的视频相关LLM在这些任务上也无法与人类 performance相媲美,这突出了视频理解的复杂性和挑战性。该数据集可在此处访问:<https://this URL>
https://arxiv.org/abs/2405.08813
Navigating the complex landscape of news articles involves understanding the various actors or entities involved, referred to as news stakeholders. These stakeholders, ranging from policymakers to opposition figures, citizens, and more, play pivotal roles in shaping news narratives. Recognizing their stakeholder types, reflecting their roles, political alignments, social standing, and more, is paramount for a nuanced comprehension of news content. Despite existing works focusing on salient entity extraction, coverage variations, and political affiliations through social media data, the automated detection of stakeholder roles within news content remains an underexplored domain. In this paper, we bridge this gap by introducing an effective approach to classify stakeholder types in news articles. Our method involves transforming the stakeholder classification problem into a natural language inference task, utilizing contextual information from news articles and external knowledge to enhance the accuracy of stakeholder type detection. Moreover, our proposed model showcases efficacy in zero-shot settings, further extending its applicability to diverse news contexts.
浏览新闻文章的复杂多变的景观,需要理解各种参与者的身份,这些参与者从决策者到反对派人物、公民等,在塑造新闻故事中发挥着关键作用。了解他们的角色、政治观点、社会地位等,对于深入理解新闻内容至关重要。尽管现有的工作集中于通过社交媒体数据突出实体、覆盖差异和政治立场,但自动检测新闻内容中的参与者角色仍然是一个未被探索的领域。在本文中,我们通过引入一种有效的分类新闻文章中参与者类型的方法,跨越了这个领域的空白。我们的方法将参与者分类问题转化为自然语言推理任务,利用新闻文章的上下文信息和外部知识来提高参与者类型检测的准确性。此外,我们所提出的模型在零散设置中表现出优异效果,进一步拓展了其在各种新闻环境中的应用。
https://arxiv.org/abs/2405.08751