Recently, large language models (LLMs) have outperformed human experts in predicting the results of neuroscience experiments (Luo et al., 2024). What is the basis for this performance? One possibility is that statistical patterns in that specific scientific literature, as opposed to emergent reasoning abilities arising from broader training, underlie LLMs' performance. To evaluate this possibility, we trained a relatively small 124M-parameter GPT-2 model, via next-word prediction, on 1.3 billion tokens of domain-specific text. Despite being orders of magnitude smaller than LLMs trained on trillions of tokens, these small models achieved expert-level performance in predicting neuroscience results. Small models trained on the neuroscience literature succeeded both when trained from scratch with a tokenizer fitted to neuroscience text and when the neuroscience literature was used to finetune a pretrained GPT-2. Our results indicate that expert-level performance may be attained even by small LLMs through domain-specific, auto-regressive training.
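A hedged sketch of the perplexity-based forward-choice evaluation this line of work relies on: the model "predicts a result" by preferring whichever version of an abstract it assigns lower perplexity. The token log-probabilities below are toy numbers for illustration, not outputs of an actual GPT-2.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token log-probabilities (natural log)."""
    n = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / n)

def choose_result(logprobs_original, logprobs_altered):
    """Pick the abstract version the model finds more plausible
    (lower perplexity), mirroring a forward-prediction benchmark."""
    if perplexity(logprobs_original) < perplexity(logprobs_altered):
        return "original"
    return "altered"

# Toy log-probs: the model assigns higher probability to the real result.
real = [-1.2, -0.8, -1.0]
fake = [-2.5, -1.9, -2.2]
print(choose_result(real, fake))  # original
```

The benchmark accuracy is then just the fraction of abstract pairs on which the model's choice matches the real experimental outcome.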
https://arxiv.org/abs/2405.09395
This research reports VascularPilot3D, the first fully autonomous 3D endovascular robot navigation system. As an exploration toward autonomous guidewire navigation, VascularPilot3D is developed as a complete navigation system based on intra-operative imaging (fluoroscopic X-ray in this study) and typical endovascular robots. VascularPilot3D adopts previously published fast 3D-2D vessel registration algorithms and guidewire segmentation methods as its perception modules. We additionally propose three modules: a topology-constrained 2D-3D instrument end-point lifting method, a tree-based fast path planning algorithm, and a prior-free endovascular navigation strategy. VascularPilot3D is compatible with most mainstream endovascular robots. Ex-vivo experiments validate that VascularPilot3D achieves a 100% success rate across 25 trials and reduces the human surgeon's overall control loops by 18.38%. VascularPilot3D is promising for general clinical autonomous endovascular navigation.
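The tree-based path planner can be illustrated with a minimal sketch: because a vasculature model is acyclic, breadth-first search recovers the unique vessel path between any two branch points. The vessel names and adjacency structure below are invented for illustration, not taken from the paper.

```python
from collections import deque

def plan_path(tree, start, goal):
    """Unique path between two nodes of a vessel tree (undirected,
    acyclic graph), found with breadth-first search."""
    parents = {start: None}
    q = deque([start])
    while q:
        node = q.popleft()
        if node == goal:
            path = []
            while node is not None:       # backtrack via parent links
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in tree.get(node, []):
            if nxt not in parents:
                parents[nxt] = node
                q.append(nxt)
    return None  # goal not reachable

# Toy vessel tree: aorta branching into two iliac arteries.
vessel_tree = {
    "aorta": ["left_iliac", "right_iliac"],
    "left_iliac": ["aorta", "left_femoral"],
    "right_iliac": ["aorta"],
    "left_femoral": ["left_iliac"],
}
print(plan_path(vessel_tree, "aorta", "left_femoral"))
```

On a tree BFS is linear in the number of vessel segments, which is what makes per-frame replanning cheap enough for an intra-operative loop.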
https://arxiv.org/abs/2405.09375
Recent advances in large language models (LLMs) have led to their extensive global deployment, and ensuring their safety calls for comprehensive and multilingual toxicity evaluations. However, existing toxicity benchmarks are overwhelmingly focused on English, posing serious risks to deploying LLMs in other languages. We address this by introducing PolygloToxicityPrompts (PTP), the first large-scale multilingual toxicity evaluation benchmark, comprising 425K naturally occurring prompts spanning 17 languages. We overcome the scarcity of naturally occurring toxicity in web text and ensure coverage across languages with varying resources by automatically scraping over 100M web-text documents. Using PTP, we benchmark over 60 LLMs to study the impact of model size, prompt language, and instruction- and preference-tuning methods on toxicity. Notably, we find that toxicity increases as language resources decrease or model size increases. Although instruction- and preference-tuning reduce toxicity, the choice of preference-tuning method does not have a significant impact. Our findings shed light on crucial shortcomings of LLM safeguarding and highlight areas for future research.
https://arxiv.org/abs/2405.09373
Synthetic aperture radar (SAR) is essential in actively acquiring information for Earth observation. SAR Automatic Target Recognition (ATR) focuses on detecting and classifying various target categories under different image conditions. Current deep learning-based SAR ATR methods are typically designed for specific datasets and applications. Varying target characteristics, scene background information, and sensor parameters across ATR datasets challenge the generalization of those methods. This paper aims to achieve general SAR ATR based on a foundation model with Self-Supervised Learning (SSL). Our motivation is to break through the limitations of specific datasets and conditions and obtain universal perceptual capabilities across targets, scenes, and sensors. A foundation model named SARATR-X is proposed along four aspects: pre-training dataset, model backbone, SSL, and evaluation task. First, we integrated 14 datasets with various target categories and imaging conditions into a pre-training dataset. Second, different model backbones were discussed to find the most suitable approaches for remote-sensing images. Third, we applied two-stage training and SAR gradient features to ensure the diversity and scalability of SARATR-X. Finally, SARATR-X achieved competitive and superior performance on 5 datasets with 8 task settings, which shows that a foundation model can achieve universal SAR ATR. We believe it is time to embrace foundation models for SAR image interpretation in the era of ever-growing big data.
https://arxiv.org/abs/2405.09365
Current orthopedic robotic systems largely focus on navigation, aiding surgeons in positioning a guiding tube but still requiring manual drilling and screw placement. The automation of this task not only demands high precision and safety due to the intricate physical interactions between the surgical tool and bone but also poses significant risks when executed without adequate human oversight. As it involves continuous physical interaction, the robot should collaborate with the surgeon, understand the human intent, and always include the surgeon in the loop. To achieve this, this paper proposes a new cognitive human-robot collaboration framework, including the intuitive AR-haptic human-robot interface, the visual-attention-based surgeon model, and the shared interaction control scheme for the robot. User studies on a robotic platform for orthopedic surgery are presented to illustrate the performance of the proposed method. The results demonstrate that the proposed human-robot collaboration framework outperforms full robot and full human control in terms of safety and ergonomics.
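The shared interaction control scheme presumably blends surgeon and robot commands according to an estimate of surgeon engagement; a hedged toy sketch (the linear blending rule and the `attention` weight are assumptions for illustration, not the paper's actual controller).

```python
def shared_control(u_human, u_robot, attention):
    """Blend human and robot command vectors; `attention` in [0, 1]
    is the estimated surgeon engagement (1 = surgeon leads)."""
    a = min(max(attention, 0.0), 1.0)  # clamp to a valid weight
    return [a * h + (1.0 - a) * r for h, r in zip(u_human, u_robot)]

# Fully attentive surgeon -> the blended command follows the surgeon.
print(shared_control([1.0, 0.0], [0.0, 1.0], attention=1.0))  # [1.0, 0.0]
# Disengaged surgeon -> the robot's plan takes over.
print(shared_control([1.0, 0.0], [0.0, 1.0], attention=0.0))  # [0.0, 1.0]
```

The point of such a scheme is that the surgeon is never removed from the loop: the robot's contribution grows only as measured engagement drops.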
https://arxiv.org/abs/2405.09359
Localizing oneself during endoscopic procedures can be problematic due to the lack of distinguishable textures and landmarks, as well as difficulties due to the endoscopic device such as a limited field of view and challenging lighting conditions. Expert knowledge shaped by years of experience is required for localization within the human body during endoscopic procedures. In this work, we present a deep learning method based on anatomy recognition, that constructs a surgical path in an unsupervised manner from surgical videos, modelling relative location and variations due to different viewing angles. At inference time, the model can map an unseen video's frames on the path and estimate the viewing angle, aiming to provide guidance, for instance, to reach a particular destination. We test the method on a dataset consisting of surgical videos of transsphenoidal adenomectomies, as well as on a synthetic dataset. An online tool that lets researchers upload their surgical videos to obtain anatomy detections and the weights of the trained YOLOv7 model are available at: this https URL.
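Mapping an unseen frame onto the learned surgical path reduces, at its simplest, to nearest-neighbour search over per-position embeddings along the path; a toy sketch (the 2-D embeddings and Euclidean metric are illustrative assumptions, not the paper's learned representation).

```python
import math

def locate_on_path(frame_embedding, path_embeddings):
    """Map a video frame onto the learned surgical path by
    nearest-neighbour search over per-position embeddings; returns
    the index (relative progress) along the path."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(range(len(path_embeddings)),
               key=lambda i: dist(frame_embedding, path_embeddings[i]))

path = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]  # toy path embeddings
print(locate_on_path([1.1, 0.1], path))  # 1
```

Returning an index along the path is what lets the system express guidance as "you are at step i of the route toward the destination".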
https://arxiv.org/abs/2405.09355
The multi-scale receptive field and the large kernel attention (LKA) module have been shown to significantly improve performance in lightweight image super-resolution. However, existing lightweight super-resolution (SR) methods seldom pay attention to designing efficient building blocks with multi-scale receptive fields for local modeling, and their LKA modules face a quadratic increase in computational and memory footprints as the convolutional kernel size increases. To address the first issue, we propose multi-scale blueprint separable convolutions (MBSConv) as a highly efficient building block with a multi-scale receptive field; it focuses on learning multi-scale information, which is a vital component of discriminative representation. As for the second issue, we revisit the key properties of LKA and find that the adjacent direct interaction of local information and long-distance dependencies is crucial for remarkable performance. Thus, to mitigate the complexity of LKA, we propose a large coordinate kernel attention (LCKA) module which decomposes the 2D convolutional kernels of the depth-wise convolutional layers in LKA into horizontal and vertical 1D kernels. LCKA enables the adjacent direct interaction of local information and long-distance dependencies not only in the horizontal direction but also in the vertical. Besides, LCKA allows the direct use of extremely large kernels in the depth-wise convolutional layers to capture more contextual information, which helps to significantly improve reconstruction performance while incurring lower computational complexity and memory footprints. Integrating MBSConv and LCKA, we propose a large coordinate kernel attention network (LCAN).
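Why decomposing a depth-wise 2D kernel into horizontal and vertical 1D kernels is cheap: a rank-1 k×k kernel costs k² weights applied as one 2D pass but only 2k weights applied as two 1D passes, with identical output. A pure-Python check of that equivalence (LCKA learns the 1D kernels directly rather than factoring an existing 2D one; this only illustrates why the two-pass form is equivalent and cheaper).

```python
def corr2d(img, ker):
    """Valid 2-D cross-correlation (pure-Python reference)."""
    kh, kw = len(ker), len(ker[0])
    H, W = len(img), len(img[0])
    return [[sum(img[i + m][j + n] * ker[m][n]
                 for m in range(kh) for n in range(kw))
             for j in range(W - kw + 1)] for i in range(H - kh + 1)]

# A k x k kernel that is the outer product of a vertical and a
# horizontal 1-D kernel (rank-1) can be applied as two 1-D passes.
v, h = [1.0, 2.0, 1.0], [1.0, 0.0, -1.0]
K = [[vi * hj for hj in h] for vi in v]          # 3x3 kernel: 9 weights
img = [[float(i * 5 + j) for j in range(5)] for i in range(5)]

full = corr2d(img, K)                            # one 2-D pass
horiz = corr2d(img, [h])                         # 1x3 pass
separable = corr2d(horiz, [[vi] for vi in v])    # then 3x1 pass
print(full == separable)  # True: same output from 2k vs k^2 weights
```

For the extremely large kernels the abstract mentions, the gap between k² and 2k is exactly what keeps the memory and compute footprints from growing quadratically.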
https://arxiv.org/abs/2405.09353
Image-guided depth completion aims at generating a dense depth map from sparse LiDAR data and an RGB image. Recent methods have shown promising performance by reformulating it as a classification problem with two sub-tasks: depth discretization and probability prediction. They divide the depth range into several discrete depth values as depth categories, serving as priors for scene depth distributions. However, previous depth discretization methods are easily impacted by depth distribution variations across different scenes, resulting in suboptimal scene depth distribution priors. To address this problem, we propose a progressive depth decoupling and modulating network, which incrementally decouples the depth range into bins and adaptively generates multi-scale dense depth maps in multiple stages. Specifically, we first design a Bins Initializing Module (BIM) to construct the seed bins by exploring the depth distribution information within a sparse depth map, adapting to variations of depth distribution. Then, we devise an incremental depth decoupling branch to progressively refine the depth distribution information from global to local. Meanwhile, an adaptive depth modulating branch is developed to progressively improve the probability representation from coarse-grained to fine-grained. Bi-directional information interactions are introduced to strengthen the exchange between the two branches (sub-tasks) and promote information complementation in each. Further, we introduce a multi-scale supervision mechanism to learn the depth distribution information in latent features and enhance the adaptation capability across different scenes. Experimental results on public datasets demonstrate that our method outperforms the state-of-the-art methods. The code will be open-sourced at [this https URL](this https URL).
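One plausible reading of the Bins Initializing Module is quantile-based binning of the valid sparse depths, so the bin edges follow each scene's depth distribution instead of fixed global edges; a hedged stdlib sketch (the quantile rule is an assumption for illustration, not the paper's exact construction).

```python
import statistics

def init_seed_bins(sparse_depths, n_bins):
    """Seed depth-bin edges from the empirical distribution of valid
    sparse depth samples (quantile cut points), so the discretization
    adapts per scene rather than using fixed global edges."""
    edges = statistics.quantiles(sparse_depths, n=n_bins)  # n-1 cuts
    return [min(sparse_depths)] + edges + [max(sparse_depths)]

# Sparse LiDAR returns clustered near 2 m and 10 m: quantile edges
# concentrate bins where the depth mass actually lies.
depths = [1.9, 2.0, 2.1, 2.2, 9.8, 10.0, 10.2, 10.4]
print(init_seed_bins(depths, n_bins=4))
```

With uniform edges over [1.9, 10.4], most bins would fall in the empty middle of the range; quantile seeding avoids exactly that mismatch between bins and scene depth distribution.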
https://arxiv.org/abs/2405.09342
Existing debiasing methods inevitably make unreasonable or undesired predictions, as they are designed and evaluated to achieve parity across different social groups while leaving individual facts aside, resulting in modifications to existing knowledge. In this paper, we first establish a new bias-mitigation benchmark, BiasKE, leveraging existing and additionally constructed datasets, which systematically assesses debiasing performance with complementary metrics on fairness, specificity, and generalization. Meanwhile, we propose a novel debiasing method, Fairness Stamp (FAST), which enables editable fairness through fine-grained calibration of individual biased knowledge. Comprehensive experiments demonstrate that FAST surpasses state-of-the-art baselines with remarkable debiasing performance while not hampering overall model capability for knowledge preservation, highlighting the prospect of fine-grained debiasing strategies for editable fairness in LLMs.
https://arxiv.org/abs/2405.09341
Although language models (LMs) have boosted the performance of Question Answering, they still need plenty of data. Data annotation, however, is a time-consuming process. This especially applies to Question Answering, where possibly large documents have to be parsed and annotated with questions and their corresponding answers. Furthermore, Question Answering models often only work well for the domain they were trained on. Since annotation is costly, we argue that domain-agnostic knowledge from LMs, such as linguistic understanding, is sufficient to create a well-curated dataset. With this motivation, we show that using large language models can improve Question Answering performance on various datasets in the few-shot setting compared to state-of-the-art approaches. For this, we perform data generation leveraging the Prompting framework, suggesting that language models contain valuable task-agnostic knowledge that can be used beyond the common pre-training/fine-tuning scheme. As a result, we consistently outperform previous approaches on few-shot Question Answering.
https://arxiv.org/abs/2405.09335
While content-based image retrieval (CBIR) has been extensively studied in natural image retrieval, its application to medical images presents ongoing challenges, primarily due to the 3D nature of medical images. Recent studies have shown the potential of pre-trained vision embeddings for CBIR in the context of radiology image retrieval. However, a benchmark for the retrieval of 3D volumetric medical images is still lacking, hindering the ability to objectively evaluate and compare the efficiency of proposed CBIR approaches in medical imaging. In this study, we extend previous work and establish a benchmark for region-based and multi-organ retrieval using the TotalSegmentator dataset (TS) with detailed multi-organ annotations. We benchmark embeddings derived from models pre-trained with supervision on medical images against embeddings derived from models pre-trained without supervision on non-medical images, for 29 coarse and 104 detailed anatomical structures at the volume and region levels. We adopt a late-interaction re-ranking method inspired by text matching for image retrieval and compare it against the original method proposed for volume and region retrieval, achieving a retrieval recall of 1.0 for diverse anatomical regions spanning a wide range of sizes. The findings and methodologies presented in this paper provide essential insights and benchmarks for the development and evaluation of CBIR approaches in the context of medical imaging.
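The late-interaction re-ranking borrowed from text matching is, in its ColBERT-style form, a sum of per-query-vector maximum similarities over the document's vectors; a minimal sketch (the dot-product similarity and toy embeddings are illustrative, not the paper's actual representation).

```python
def late_interaction_score(query_vecs, doc_vecs):
    """ColBERT-style late interaction: for each query vector take its
    maximum dot-product similarity over all document vectors, then sum."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]   # matches both query vectors
doc_b = [[1.0, 0.0], [1.0, 0.0]]   # matches only the first
print(late_interaction_score(query, doc_a))  # 2.0
print(late_interaction_score(query, doc_b))  # 1.0
```

Because each query region only needs its best-matching counterpart in the candidate volume, the score tolerates the anatomical size variation the benchmark is designed to probe.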
https://arxiv.org/abs/2405.09334
Recent advances in computed tomography (CT) imaging, especially with dual-robot systems, have introduced new challenges for scan trajectory optimization. This paper presents a novel approach using Gated Recurrent Units (GRUs) to optimize CT scan trajectories. Our approach exploits the flexibility of robotic CT systems to select projections that enhance image quality by improving resolution and contrast while reducing scan time. We focus on cone-beam CT and employ several projection-based metrics, including absorption, pixel intensities, contrast-to-noise ratio, and data completeness. The GRU network aims to minimize data redundancy and maximize completeness with a limited number of projections. We validate our method using simulated data of a test specimen, focusing on a specific voxel of interest. The results show that the GRU-optimized scan trajectories can outperform traditional circular CT trajectories in terms of image quality metrics. For the used specimen, SSIM improves from 0.38 to 0.49 and CNR increases from 6.97 to 9.08. This finding suggests that the application of GRU in CT scan trajectory optimization can lead to more efficient, cost-effective, and high-quality imaging solutions.
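For reference, the recurrence at the heart of this approach is the standard GRU cell; a scalar pure-Python version fed with toy per-projection quality scores (the weights and the scoring setup are invented for illustration, and the actual network operates on vector-valued projection metrics).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_cell(x, h, w):
    """Single-feature GRU cell (scalar input and state) with weights
    w = (wz, uz, wr, ur, wh, uh); biases omitted for brevity."""
    wz, uz, wr, ur, wh, uh = w
    z = sigmoid(wz * x + uz * h)                 # update gate
    r = sigmoid(wr * x + ur * h)                 # reset gate
    h_tilde = math.tanh(wh * x + uh * (r * h))   # candidate state
    return (1.0 - z) * h + z * h_tilde

# Score a candidate trajectory: feed per-projection quality metrics
# (absorption, CNR, completeness, ...) through the recurrent cell and
# read the final state as a toy trajectory score.
metrics = [0.7, 0.9, 0.4, 0.8]   # toy per-projection scores
h = 0.0
for m in metrics:
    h = gru_cell(m, h, w=(1.0, 0.5, 1.0, 0.5, 1.0, 0.5))
print(round(h, 3))
```

The gating is what lets the network discount projections whose information is redundant with views already selected, which is exactly the redundancy/completeness trade-off the abstract describes.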
https://arxiv.org/abs/2405.09333
This paper explores a novel multi-modal alternating learning paradigm pursuing a reconciliation between the exploitation of uni-modal features and the exploration of cross-modal interactions. This is motivated by the fact that current paradigms of multi-modal learning tend to explore multi-modal features simultaneously. The resulting gradient prohibits further exploitation of the features in the weak modality, leading to modality competition, where the dominant modality overpowers the learning process. To address this issue, we study the modality-alternating learning paradigm to achieve reconcilement. Specifically, we propose a new method called ReconBoost to update a fixed modality each time. Herein, the learning objective is dynamically adjusted with a reconcilement regularization against competition with the historical models. By choosing a KL-based reconcilement, we show that the proposed method resembles Friedman's Gradient-Boosting (GB) algorithm, where the updated learner can correct errors made by others and help enhance the overall performance. The major difference with the classic GB is that we only preserve the newest model for each modality to avoid overfitting caused by ensembling strong learners. Furthermore, we propose a memory consolidation scheme and a global rectification scheme to make this strategy more effective. Experiments over six multi-modal benchmarks speak to the efficacy of the method. We release the code at this https URL.
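The alternating scheme can be caricatured with a linear toy: two single-modality learners are refitted in turns on the residual left by the other, keeping only the newest model per modality, and the alternation converges to the joint optimum. Everything below (linear learners, squared loss, the `recon_boost` name) is a simplification invented for illustration; the actual method uses neural learners and a KL-based reconcilement.

```python
def fit_linear(xs, residuals):
    """Least-squares slope through the origin: one 'modality learner'."""
    num = sum(x * r for x, r in zip(xs, residuals))
    den = sum(x * x for x in xs)
    return num / den

def recon_boost(xa, xb, y, rounds=20):
    """Alternating (boosting-style) updates: refit one modality at a
    time on the residual left by the other, keeping only the newest
    model per modality."""
    wa = wb = 0.0
    for t in range(rounds):
        if t % 2 == 0:  # update modality A against B's residual
            wa = fit_linear(xa, [yi - wb * xbi for yi, xbi in zip(y, xb)])
        else:           # update modality B against A's residual
            wb = fit_linear(xb, [yi - wa * xai for yi, xai in zip(y, xa)])
    return wa, wb

# Ground truth combines both modalities: y = 2*xa + 3*xb.
xa = [1.0, 2.0, 3.0, 4.0]
xb = [4.0, 3.0, 2.0, 1.0]
y = [2 * a + 3 * b for a, b in zip(xa, xb)]
wa, wb = recon_boost(xa, xb, y)
print(round(wa, 2), round(wb, 2))  # 2.0 3.0
```

Note how neither modality can explain `y` alone; the alternation forces each learner to correct the residual error of the other instead of letting one dominate, which is the modality-competition failure the paper targets.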
https://arxiv.org/abs/2405.09321
One goal of dexterous robotic grasping is to allow robots to handle objects with the same level of flexibility and adaptability as humans. However, it remains a challenging task to generate an optimal grasping strategy for dexterous hands, especially when it comes to delicate manipulation and accurate adjustment of the desired grasping poses for objects of varying shapes and sizes. In this paper, we propose a novel dexterous grasp generation scheme called \textbf{\textit{GrainGrasp}} that provides fine-grained contact guidance for each fingertip. In particular, we employ a generative model to predict separate contact maps for each fingertip on the object point cloud, effectively capturing the specifics of finger-object interactions. In addition, we develop a new dexterous grasping optimization algorithm that solely relies on the point cloud as input, eliminating the necessity for complete mesh information of the object. By leveraging the contact maps of different fingertips, the proposed optimization algorithm can generate precise and determinable strategies for human-like object grasping. Experimental results confirm the efficiency of the proposed scheme. Our code is available at this https URL
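A per-fingertip contact map can be sketched as a probability distribution over the object point cloud peaked at points near that fingertip; the softmax-over-negative-distance form and the temperature `tau` below are assumptions for illustration, not the paper's generative model.

```python
import math

def contact_map(points, fingertip, tau=0.05):
    """Toy per-fingertip contact map: softmax over negative distances
    from one fingertip to every point in the object point cloud."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    logits = [-dist(p, fingertip) / tau for p in points]
    m = max(logits)                      # stabilise the softmax
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

cloud = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (1.0, 1.0, 1.0)]
probs = contact_map(cloud, fingertip=(0.0, 0.0, 0.01))
print(probs.index(max(probs)))  # 0: the nearest point dominates
```

Predicting one such map per fingertip, rather than a single hand-level map, is what gives the downstream optimizer per-finger targets to pull each fingertip toward.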
https://arxiv.org/abs/2405.09310
Background: Rapid advancements in natural language processing have led to the development of large language models with the potential to revolutionize mental health care. These models have shown promise in assisting clinicians and providing support to individuals experiencing various psychological challenges. Objective: This study aims to compare the performance of two large language models, GPT-4 and Chat-GPT, in responding to a set of 18 psychological prompts, to assess their potential applicability in mental health care settings. Methods: A blind methodology was employed, with a clinical psychologist evaluating the models' responses without knowledge of their origins. The prompts encompassed a diverse range of mental health topics, including depression, anxiety, and trauma, to ensure a comprehensive assessment. Results: The results demonstrated a significant difference in performance between the two models (p > 0.05). GPT-4 achieved an average rating of 8.29 out of 10, while Chat-GPT received an average rating of 6.52. The clinical psychologist's evaluation suggested that GPT-4 was more effective at generating clinically relevant and empathetic responses, thereby providing better support and guidance to potential users. Conclusions: This study contributes to the growing body of literature on the applicability of large language models in mental health care settings. The findings underscore the importance of continued research and development in the field to optimize these models for clinical use. Further investigation is necessary to understand the specific factors underlying the performance differences between the two models and to explore their generalizability across various populations and mental health conditions.
https://arxiv.org/abs/2405.09300
AI-based analysis of histopathology whole slide images (WSIs) is central in computational pathology. However, image quality can impact model performance. Here, we investigate to what extent unsharp areas of WSIs impact deep convolutional neural network classification performance. We propose a multi-model approach, DeepBlurMM, to alleviate the impact of unsharp image areas and improve model performance. DeepBlurMM uses sigma cut-offs to determine the most suitable model for predicting tiles with various levels of blurring within a single WSI, where sigma is the standard deviation of the Gaussian distribution. Specifically, the cut-offs categorise the tiles into sharp or slight blur, moderate blur, and high blur; each blur level has a corresponding model selected for tile-level predictions. In a simulation study, we demonstrate the application of DeepBlurMM in a binary classification task for breast cancer Nottingham Histological Grade 1 vs 3. Performance, evaluated with 5-fold cross-validation, showed that DeepBlurMM outperformed the base model under moderate-blur and mixed-blur conditions. Unsharp image tiles (local blurriness) at prediction time reduced model performance. The proposed multi-model approach improved performance under some conditions, with the potential to improve quality in both research and clinical applications.
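The sigma cut-off routing is straightforward to sketch; the cut-off values and model names below are illustrative placeholders, not the ones used in the paper.

```python
def select_model(sigma, cutoffs=(0.5, 1.5)):
    """Route a tile to the model trained for its blur level, using
    cut-offs on sigma (std. dev. of the Gaussian blur)."""
    low, high = cutoffs
    if sigma <= low:
        return "sharp_or_slight_blur_model"
    if sigma <= high:
        return "moderate_blur_model"
    return "high_blur_model"

# Tiles from one WSI with mixed local blurriness get different models.
for s in (0.2, 1.0, 2.5):
    print(s, "->", select_model(s))
```

Routing per tile (rather than per slide) is the key design choice: a single WSI can mix sharp and blurred regions, so the model choice has to be made at tile granularity.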
https://arxiv.org/abs/2405.09298
Markedness in natural language is often associated with non-literal meanings in discourse. Differential Object Marking (DOM) in Korean is one instance of this phenomenon, where post-positional markers are selected based on both the semantic features of the noun phrases and the discourse features that are orthogonal to the semantic features. Previous work has shown that distributional models of language recover certain semantic features of words -- do these models capture implied discourse-level meanings as well? We evaluate whether a set of large language models are capable of associating discourse meanings with different object markings in Korean. Results suggest that the discourse meanings of a grammatical marker can be more challenging to encode than those of a discourse marker.
https://arxiv.org/abs/2405.09293
With the benefit of deep learning techniques, recent research has made significant progress in reducing image compression artifacts. Despite their improved performance, prevailing methods only focus on learning a mapping from the compressed image to the original one but ignore the intrinsic attributes of the given compressed images, which greatly harms the performance of downstream parsing tasks. Different from these methods, we propose to decouple the intrinsic attributes into two complementary features for artifact reduction, i.e., compression-insensitive features to regularize the high-level semantic representations during training, and compression-sensitive features to be aware of the compression degree. To achieve this, we first employ adversarial training to regularize the compressed and original encoded features to retain high-level semantics, and we then develop a compression quality-aware feature encoder for the compression-sensitive features. Based on these dual complementary features, we propose a Dual Awareness Guidance Network (DAGN) that utilizes these awareness features as transformation guidance during the decoding phase. In DAGN, we develop a cross-feature fusion module to maintain the consistency of compression-insensitive features by fusing them into the artifact-reduction baseline. Our method achieves an average 2.06 dB PSNR gain on BSD500, outperforming state-of-the-art methods, and requires only 29.7 ms to process one image on BSD500. Besides, experimental results on LIVE1 and LIU4K also demonstrate the efficiency, effectiveness, and superiority of the proposed method in terms of quantitative metrics, visual quality, and downstream machine vision tasks.
https://arxiv.org/abs/2405.09291
Deep learning classifiers are prone to latching onto dominant confounders present in a dataset rather than onto the causal markers associated with the target class, leading to poor generalization and biased predictions. Although explainability via counterfactual image generation has been successful at exposing the problem, bias-mitigation strategies that permit accurate explainability in the presence of dominant and diverse artifacts remain unsolved. In this work, we propose the DeCoDEx framework and show how an external, pre-trained binary artifact detector can be leveraged during inference to guide a diffusion-based counterfactual image generator towards accurate explainability. Experiments on the CheXpert dataset, using both synthetic artifacts and real visual artifacts (support devices), show that the proposed method successfully synthesizes counterfactual images that change the causal pathology markers associated with Pleural Effusion while preserving or ignoring the visual artifacts. Augmentation of ERM and Group-DRO classifiers with the DeCoDEx-generated images substantially improves the results across underrepresented groups that are out of distribution for each class. The code is made publicly available at this https URL.
https://arxiv.org/abs/2405.09288
Recent years have witnessed the rapid development of short videos, which usually contain both visual and audio modalities. Background music is important to short videos and can significantly influence the emotions of viewers. However, at present, the background music of a short video is generally chosen by the video producer, and automatic music recommendation methods for short videos are lacking. This paper introduces MVBind, an innovative Music-Video embedding-space Binding model for cross-modal retrieval. MVBind operates as a self-supervised approach, acquiring inherent knowledge of inter-modal relationships directly from data, without the need for manual annotations. Additionally, to compensate for the lack of a paired music-video dataset for short videos, we construct SVM-10K (Short Video with Music-10K), which mainly consists of meticulously selected short videos. On this dataset, MVBind achieves significantly improved performance compared to other baseline methods. The constructed dataset and code will be released to facilitate future research.
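Cross-modal retrieval in a bound embedding space amounts to ranking candidate music embeddings by similarity to the video embedding; a minimal cosine-similarity sketch with toy 2-D embeddings (the real model embeds both modalities with learned encoders into a shared space).

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    num = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return num / (na * nb)

def recommend_music(video_emb, music_embs):
    """Rank candidate music tracks by cosine similarity to the video
    embedding in the shared (bound) space; returns track indices."""
    return sorted(range(len(music_embs)),
                  key=lambda i: cosine(video_emb, music_embs[i]),
                  reverse=True)

video = [0.9, 0.1]
tracks = [[0.0, 1.0], [1.0, 0.0], [0.7, 0.7]]
print(recommend_music(video, tracks))  # [1, 2, 0]
```

Because the space is learned self-supervised from existing video-music pairings, no manual mood or genre labels are needed at retrieval time.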
https://arxiv.org/abs/2405.09286