Objective: Federated Learning (FL) enables collaborative model training while keeping data local. Currently, most FL studies in radiology are conducted in simulated environments due to numerous hurdles impeding its translation into practice. The few existing real-world FL initiatives rarely communicate the specific measures taken to overcome these hurdles, leaving behind a significant knowledge gap. Despite efforts to implement real-world FL, there is a notable lack of comprehensive assessments comparing FL to less complex alternatives. Materials & Methods: We extensively reviewed the FL literature, categorizing insights, along with our own findings, according to their nature and the phase of establishing an FL initiative, and summarized them into a comprehensive guide. We developed our own FL infrastructure within the German Radiological Cooperative Network (RACOON) and demonstrated its functionality by training FL models on lung pathology segmentation tasks across six university hospitals. We extensively evaluated FL against less complex alternatives in three distinct evaluation scenarios. Results: The proposed guide outlines essential steps, identified hurdles, and proposed solutions for establishing successful FL initiatives and conducting real-world experiments. Our experimental results show that FL outperforms less complex alternatives in all evaluation scenarios, justifying the effort required to translate FL into real-world applications. Discussion & Conclusion: Our proposed guide aims to help future FL researchers circumvent pitfalls and to accelerate the translation of FL into radiological applications. Our results underscore the value of the effort needed to translate FL into real-world applications by demonstrating advantageous performance over alternatives, and emphasize the importance of strategic organization and robust management of distributed data and infrastructure in real-world settings.
https://arxiv.org/abs/2405.09409
While content-based image retrieval (CBIR) has been extensively studied in natural image retrieval, its application to medical images presents ongoing challenges, primarily due to the 3D nature of medical images. Recent studies have shown the potential of pre-trained vision embeddings for CBIR in the context of radiology image retrieval. However, a benchmark for the retrieval of 3D volumetric medical images is still lacking, hindering the ability to objectively evaluate and compare the efficiency of proposed CBIR approaches in medical imaging. In this study, we extend previous work and establish a benchmark for region-based and multi-organ retrieval using the TotalSegmentator dataset (TS) with detailed multi-organ annotations. We benchmark embeddings derived from models pre-trained with supervision on medical images against embeddings derived from models pre-trained without supervision on non-medical images, for 29 coarse and 104 detailed anatomical structures at the volume and region levels. We adopt a late interaction re-ranking method inspired by text matching for image retrieval, comparing it against the original method proposed for volume and region retrieval, and achieve a retrieval recall of 1.0 for diverse anatomical regions spanning a wide range of sizes. The findings and methodologies presented in this paper provide essential insights and benchmarks for the development and evaluation of CBIR approaches in the context of medical imaging.
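As a rough illustration of the late-interaction re-ranking idea referenced above, the sketch below scores a query volume against candidates by matching each query region embedding to its best cosine match among a candidate's region embeddings (a ColBERT-style MaxSim). The arrays here are hypothetical stand-ins for the pre-trained embeddings; the paper's actual scoring may differ.

```python
import numpy as np

def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction (MaxSim) score: for each query region embedding,
    take its best cosine similarity among the candidate's region
    embeddings, then sum over all query regions."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = q @ d.T                    # (n_query_regions, n_doc_regions)
    return sims.max(axis=1).sum()     # best match per query region

def rerank(query_vecs, candidates):
    """Re-rank candidates (dict: name -> region embedding matrix)
    by descending MaxSim score against the query."""
    scored = [(name, maxsim_score(query_vecs, vecs))
              for name, vecs in candidates.items()]
    return sorted(scored, key=lambda t: -t[1])
```

A candidate containing the same regions as the query attains the maximum score (one per query region), so it is ranked first.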
https://arxiv.org/abs/2405.09334
Due to the increasing need for effective security measures and the integration of cameras in commercial products, a huge amount of visual data is created today. Law enforcement agencies (LEAs) are inspecting images and videos to find radicalization, propaganda for terrorist organizations and illegal products on darknet markets. This is time consuming. Instead of an undirected search, LEAs would like to adapt to new crimes and threats, and focus only on data from specific locations, persons or objects, which requires flexible interpretation of image content. Visual concept detection with deep convolutional neural networks (CNNs) is a crucial component to understand the image content. This paper has five contributions. The first contribution allows image-based geo-localization to estimate the origin of an image. CNNs and geotagged images are used to create a model that determines the location of an image by its pixel values. The second contribution enables analysis of fine-grained concepts to distinguish sub-categories in a generic concept. The proposed method encompasses data acquisition and cleaning and concept hierarchies. The third contribution is the recognition of person attributes (e.g., glasses or moustache) to enable query by textual description for a person. The person-attribute problem is treated as a specific sub-task of concept classification. The fourth contribution is an intuitive image annotation tool based on active learning. Active learning allows users to define novel concepts flexibly and train CNNs with minimal annotation effort. The fifth contribution increases the flexibility for LEAs in the query definition by using query expansion. Query expansion maps user queries to known and detectable concepts. Therefore, no prior knowledge of the detectable concepts is required for the users. The methods are validated on data with varying locations (popular and non-touristic locations), varying person attributes (CelebA dataset), and varying number of annotations.
https://arxiv.org/abs/2405.09194
Modeling visual saliency in graphical user interfaces (GUIs) allows us to understand how people perceive GUI designs and what elements attract their attention. One aspect that is often overlooked is the fact that computational models depend on a series of design parameters that are not straightforward to decide. We systematically analyze how different design parameters affect scanpath evaluation metrics using a state-of-the-art computational model (DeepGaze++). We particularly focus on three design parameters: input image size, inhibition-of-return decay, and masking radius. We show that even small variations of these design parameters have a noticeable impact on standard evaluation metrics such as DTW or Eyenalysis. These effects also occur in other scanpath models, such as UMSS and ScanGAN, and in other datasets, such as MASSVIS. Taken together, our results highlight the impact of design decisions on predicting users' viewing behavior on GUIs.
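For readers unfamiliar with the DTW metric mentioned above, a minimal sketch of Dynamic Time Warping over fixation coordinates is given below. It also makes the sensitivity point concrete: simply rescaling the input image (and hence the fixation coordinates) changes the raw DTW value.

```python
import numpy as np

def dtw_distance(scanpath_a, scanpath_b):
    """Dynamic Time Warping distance between two scanpaths,
    each given as an (n, 2) array of fixation coordinates."""
    a = np.asarray(scanpath_a, float)
    b = np.asarray(scanpath_b, float)
    n, m = len(a), len(b)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])  # Euclidean step cost
            acc[i, j] = cost + min(acc[i - 1, j],       # insertion
                                   acc[i, j - 1],       # deletion
                                   acc[i - 1, j - 1])   # match
    return acc[n, m]
```

Identical scanpaths yield a distance of 0, while scaling one path's coordinates produces a nonzero distance even though the viewing behavior is "the same" up to image size.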
https://arxiv.org/abs/2405.08981
This paper addresses the problem of pathological lung segmentation, a significant challenge in medical image analysis, particularly pronounced in cases of peripheral opacities (severe fibrosis and consolidation) because of the textural similarity between lung tissue and surrounding areas. To overcome these challenges, this paper emphasizes the use of CycleGAN for unpaired image-to-image translation, in order to provide an augmentation method able to generate fake pathological images matching an existing ground truth. Although previous studies have employed CycleGAN, they often neglect the challenge of shape deformation, which is crucial for accurate medical image segmentation. Our work introduces an innovative strategy that incorporates additional loss functions. Specifically, it proposes an L1 loss based on the lung surroundings, whose shape is constrained to remain unchanged at the transition from the healthy to the pathological domain. The lung surroundings are derived from the ground-truth lung masks available in the healthy domain. Furthermore, preprocessing steps, such as cropping based on rib/vertebra locations, are applied to refine the input to the CycleGAN, ensuring that the network focuses on the lung region. This is essential to avoid extraneous biases, such as the zoom-effect bias, which can divert attention from the main task. The method is applied to enhance, in a semi-supervised manner, the lung segmentation process by employing a U-Net model trained with on-the-fly data augmentation incorporating synthetic pathological tissues generated by the CycleGAN model. Preliminary results from this research demonstrate significant qualitative and quantitative improvements, setting a new benchmark in the field of pathological lung segmentation. Our code is available at this https URL
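A minimal numpy sketch of the shape-preservation idea behind such a loss is shown below, assuming 2D arrays and a binary lung mask. The paper's actual loss is derived from ground-truth masks in the healthy domain and is combined with the standard CycleGAN objectives; this only illustrates the masked L1 term.

```python
import numpy as np

def surrounding_l1_loss(real_image, translated_image, lung_mask):
    """L1 penalty restricted to the region *outside* the lung mask,
    so the translator may alter lung texture (healthy -> pathological)
    while the surrounding anatomy is pushed to stay unchanged.
    Inputs are 2D arrays; lung_mask is 1 inside the lungs, 0 outside."""
    surrounding = 1.0 - lung_mask                  # complement of the lung mask
    diff = np.abs(real_image - translated_image) * surrounding
    return diff.sum() / max(surrounding.sum(), 1.0)  # mean over surrounding pixels
```

Changing pixels only inside the lungs incurs zero penalty, while any change to the surroundings is penalized, which is exactly the shape constraint the abstract describes.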
https://arxiv.org/abs/2405.08556
In biological evolution, complex neural structures grow from a handful of cellular ingredients. As genomes in nature are bounded in size, this complexity is achieved by a growth process in which cells communicate locally to decide whether to differentiate, proliferate, and connect with other cells. This self-organisation is hypothesized to play an important part in the generalisation and robustness of biological neural networks. Artificial neural networks (ANNs), on the other hand, are traditionally optimized in the space of weights. Thus, the benefits and challenges of growing artificial neural networks remain understudied. Building on the previously introduced Neural Developmental Programs (NDP), in this work we present an algorithm for growing ANNs that solve reinforcement learning tasks. We identify a key challenge: ensuring phenotypic complexity requires maintaining neuronal diversity, but this diversity comes at the cost of optimization stability. To address this, we introduce two mechanisms: (a) equipping neurons with an intrinsic state inherited upon neurogenesis; (b) lateral inhibition, a mechanism inspired by biological growth, which controls the pace of growth, helping diversity persist. We show that both mechanisms contribute to neuronal diversity and that, equipped with them, NDPs achieve results comparable to existing direct and developmental encodings in complex locomotion tasks.
https://arxiv.org/abs/2405.08510
In recent years, deep learning has greatly streamlined the process of generating realistic fake face images. Aware of the dangers, researchers have developed various tools to spot these counterfeits. Yet none asked the fundamental question: What digital manipulations make a real photographic face image fake, while others do not? In this paper, we put face forgery in a semantic context and define that computational methods that alter semantic face attributes to exceed human discrimination thresholds are sources of face forgery. Guided by our new definition, we construct a large face forgery image dataset, where each image is associated with a set of labels organized in a hierarchical graph. Our dataset enables two new testing protocols to probe the generalization of face forgery detectors. Moreover, we propose a semantics-oriented face forgery detection method that captures label relations and prioritizes the primary task (i.e., real or fake face detection). We show that the proposed dataset successfully exposes the weaknesses of current detectors as the test set and consistently improves their generalizability as the training set. Additionally, we demonstrate the superiority of our semantics-oriented method over traditional binary and multi-class classification-based detectors.
https://arxiv.org/abs/2405.08487
Perivascular spaces (PVSs) form a central component of the brain's waste clearance system, the glymphatic system. These structures are visible on MRI images, and their morphology is associated with aging and neurological disease. Manual quantification of PVSs is time-consuming and subjective. Numerous deep learning methods for PVS segmentation have been developed; however, the majority have been developed and evaluated on homogeneous datasets and high-resolution scans, perhaps limiting their applicability for the wide range of image qualities acquired in clinics and research. In this work we train an nnUNet, a top-performing biomedical image segmentation algorithm, on a heterogeneous training sample of manually segmented MRI images of a range of different qualities and resolutions from 6 different datasets. These are compared to publicly available deep learning methods for 3D segmentation of PVSs. The resulting model, PINGU (Perivascular space Identification Nnunet for Generalised Usage), achieved voxel- and cluster-level Dice scores of 0.50 (SD = 0.15) and 0.63 (0.17) in the white matter (WM), and 0.54 (0.11) and 0.66 (0.17) in the basal ganglia (BG). Performance on data from unseen sites was substantially lower for both PINGU (0.20-0.38 (WM, voxel), 0.29-0.58 (WM, cluster), 0.22-0.36 (BG, voxel), 0.46-0.60 (BG, cluster)) and the publicly available algorithms (0.18-0.30 (WM, voxel), 0.29-0.38 (WM, cluster), 0.10-0.20 (BG, voxel), 0.15-0.37 (BG, cluster)), but PINGU strongly outperformed the publicly available algorithms, particularly in the BG. Finally, training PINGU on manual segmentations from a single site with homogeneous scan properties gave marginally lower performance on internal cross-validation, but in some cases gave higher performance on external validation. PINGU stands out as a broad-use PVS segmentation tool, with particular strength in the BG, an area of PVS related to vascular disease and pathology.
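For reference, the voxel-level Dice similarity coefficient reported above can be computed as in the following sketch (binary masks assumed; cluster-level Dice requires an additional connected-component step not shown here):

```python
import numpy as np

def dice_score(pred_mask, gt_mask, eps=1e-8):
    """Voxel-level Dice similarity coefficient between two binary masks:
    2 * |pred AND gt| / (|pred| + |gt|), ranging from 0 (disjoint) to 1."""
    pred = np.asarray(pred_mask, bool)
    gt = np.asarray(gt_mask, bool)
    intersection = np.logical_and(pred, gt).sum()
    return 2.0 * intersection / (pred.sum() + gt.sum() + eps)
```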
https://arxiv.org/abs/2405.08337
Grasp generation for dexterous hands often requires a large number of grasping annotations, especially for functional grasps, which require the grasp pose to be convenient for the subsequent use of the object. However, annotating high-DoF dexterous hand poses is rather challenging. This prompts us to explore how people achieve manipulations of new objects based on past grasp experiences. We find that people are adept at discovering and leveraging various similarities between objects when grasping new items, including shape, layout, and grasp type. In light of this, we analyze and collect grasp-related similarity relationships among 51 common tool-like object categories and annotate semantic grasp representations for 1,768 objects. These data are organized into the form of a knowledge graph, which helps infer our proposed cross-category functional grasp synthesis. Through extensive experiments, we demonstrate that the grasp-related knowledge indeed contributes to achieving functional grasp transfer across unknown or entirely new categories of objects. We will publicly release the dataset and code to facilitate future research.
https://arxiv.org/abs/2405.08310
Histology slide digitization is becoming essential for telepathology (remote consultation), knowledge sharing (education), and using state-of-the-art artificial intelligence algorithms (augmented/automated end-to-end clinical workflows). However, the cumulative costs of digital multi-slide high-speed brightfield scanners, cloud/on-premises storage, and personnel (IT and technicians) make the current slide digitization workflows out of reach for limited-resource settings, further widening the health equity gap; even single-slide manual scanning commercial solutions are costly due to hardware requirements (high-resolution cameras, a high-spec PC/workstation, and support for only high-end microscopes). In this work, we present a new cloud slide digitization workflow for creating scanner-quality whole-slide images (WSIs) from uploaded low-quality videos acquired from inexpensive microscopes with built-in cameras. Specifically, we present a pipeline to create stitched WSIs while automatically deblurring out-of-focus regions, upsampling input 10X images to 40X resolution, and reducing brightness/contrast and light-source illumination variations. We demonstrate the WSI creation efficacy of our workflow on the World Health Organization-declared neglected tropical disease Cutaneous Leishmaniasis (prevalent only in the poorest regions of the world and diagnosed only by sub-specialist dermatopathologists, who are rare in poor countries), as well as other common pathologies on core biopsies of breast, liver, duodenum, stomach and lymph node. The code and pretrained models will be accessible via our GitHub (this https URL), and the cloud platform will be available at this https URL for uploading microscope videos and downloading/viewing WSIs with shareable links (no sign-in required) for telepathology and knowledge sharing.
https://arxiv.org/abs/2405.08169
Synthesizing high-quality photorealistic images conditioned on textual descriptions is very challenging. Generative Adversarial Networks (GANs), the classical model for this task, frequently suffer from low consistency between images and text descriptions and from insufficient richness in the synthesized images. Recently, conditional affine transformations (CAT), such as conditional batch normalization and instance normalization, have been applied to different layers of GANs to control content synthesis in images. CAT is a multi-layer perceptron that independently predicts data based on batch statistics between neighboring layers, with global textual information unavailable to other layers. To address this issue, we first model CAT with a recurrent neural network (RAT) to ensure that different layers can access global information. We then introduce shuffle attention between RATs to mitigate the information forgetting characteristic of recurrent neural networks. Moreover, both our generator and discriminator utilize the powerful pre-trained model CLIP, which has been extensively employed for establishing associations between text and images through the learning of multimodal representations in latent space. The discriminator utilizes CLIP's ability to comprehend complex scenes to accurately assess the quality of the generated images. Extensive experiments have been conducted on the CUB, Oxford, and CelebA-tiny datasets to demonstrate the superiority of the proposed model over current state-of-the-art models. The code is available at this https URL.
https://arxiv.org/abs/2405.08114
Purpose: To introduce a deep learning model capable of multi-organ segmentation in MRI scans, offering a solution to the current limitations in MRI analysis due to challenges in resolution, standardized intensity values, and variability in sequences. Materials and Methods: The model was trained on 1,200 manually annotated MRI scans from the UK Biobank, 221 in-house MRI scans, and 1,228 CT scans, leveraging cross-modality transfer learning from CT segmentation models. A human-in-the-loop annotation workflow was employed to efficiently create high-quality segmentations. The model's performance was evaluated on the NAKO and AMOS22 datasets, containing 600 and 60 MRI examinations, respectively. The Dice Similarity Coefficient (DSC) and Hausdorff Distance (HD) were used to assess segmentation accuracy. The model will be open sourced. Results: The model showcased high accuracy in segmenting well-defined organs, achieving DSC scores of 0.97 for the right and left lungs, and 0.95 for the heart. It also demonstrated robustness in organs like the liver (DSC: 0.96) and kidneys (DSC: 0.95 left, 0.95 right), which present more variability. However, segmentation of smaller and more complex structures such as the portal and splenic veins (DSC: 0.54) and adrenal glands (DSC: 0.65 left, 0.61 right) revealed the need for further model optimization. Conclusion: The proposed model is a robust tool for accurate segmentation of 40 anatomical structures in MRI and CT images. By leveraging cross-modality learning and interactive annotation, the model achieves strong performance and generalizability across diverse datasets, making it a valuable resource for researchers and clinicians. It is open source and can be downloaded from this https URL.
https://arxiv.org/abs/2405.06463
In recent years, diffusion models (DMs) have become a popular method for generating synthetic data. By achieving samples of higher quality, they quickly became superior to generative adversarial networks (GANs) and are the current state-of-the-art method in generative modeling. However, their potential has not yet been exploited in radar, where the lack of available training data is a long-standing problem. In this work, a specific type of DM, namely the denoising diffusion probabilistic model (DDPM), is adapted to the SAR domain. We investigate the network choice and specific diffusion parameters for conditional and unconditional SAR image generation. In our experiments, we show that DDPM qualitatively and quantitatively outperforms state-of-the-art GAN-based methods for SAR image generation. Finally, we show that DDPM profits from pretraining on large-scale clutter data, generating SAR images of even higher quality.
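As background for the DDPM mentioned above, the closed-form forward noising step q(x_t | x_0) can be sketched as follows. This is a generic numpy illustration with an assumed linear beta schedule, not the paper's SAR-specific configuration or network.

```python
import numpy as np

def ddpm_forward_noising(x0, t, betas, rng=None):
    """Sample x_t ~ q(x_t | x_0) in closed form:
        x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise,
    where alpha_bar_t is the cumulative product of (1 - beta_s) up to step t.
    Returns both x_t and the noise, since DDPM training regresses the noise."""
    rng = rng if rng is not None else np.random.default_rng()
    alpha_bar = np.cumprod(1.0 - np.asarray(betas))
    noise = rng.normal(size=np.shape(x0))
    x_t = np.sqrt(alpha_bar[t]) * np.asarray(x0) + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise
```

During training, a network is fit to predict `noise` from `x_t` and `t`; sampling then runs the learned reverse process from pure noise.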
https://arxiv.org/abs/2405.07776
Sign Language Production (SLP) is a challenging task, given the limited resources available and the inherent diversity within sign data. As a result, previous works have suffered from the problem of regression to the mean, leading to under-articulated and incomprehensible signing. In this paper, we propose using dictionary examples and a learnt codebook of facial expressions to create expressive sign language sequences. However, simply concatenating signs and adding the face creates robotic and unnatural sequences. To address this, we present a 7-step approach to effectively stitch sequences together. First, by normalizing each sign into a canonical pose, cropping, and stitching, we create a continuous sequence. Then, by applying filtering in the frequency domain and resampling each sign, we create cohesive natural sequences that mimic the prosody found in the original data. We leverage a SignGAN model to map the output to a photo-realistic signer and present a complete Text-to-Sign (T2S) SLP pipeline. Our evaluation demonstrates the effectiveness of the approach, showcasing state-of-the-art performance across all datasets. Finally, a user evaluation shows that our approach outperforms the baseline model and is capable of producing realistic sign language sequences.
https://arxiv.org/abs/2405.07663
It is an interesting question whether, and how, Large Language Models (LLMs) can understand non-language network data and help us detect unknown malicious flows. This paper takes Carpet Bombing as a case study and shows how to exploit LLMs' powerful capability in the networking area. Carpet Bombing is a new DDoS attack that has increased dramatically in recent years, significantly threatening network infrastructures. It targets multiple victim IPs within subnets, causing congestion on access links and disrupting network services for a vast number of users. Characterized by low rates and multiple vectors, these attacks challenge traditional DDoS defenses. We propose DoLLM, a DDoS detection model that utilizes open-source LLMs as its backbone. By reorganizing non-contextual network flows into Flow-Sequences and projecting them into the LLMs' semantic space as token embeddings, DoLLM leverages the LLMs' contextual understanding to extract flow representations in the overall network context. These representations are used to improve DDoS detection performance. We evaluate DoLLM with the public CIC-DDoS2019 dataset and real NetFlow traces from a top-3 countrywide ISP. The tests have proven that DoLLM possesses strong detection capabilities. Its F1 score increased by up to 33.3% in zero-shot scenarios and by at least 20.6% in real ISP traces.
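The Flow-Sequence reorganization is only named in the abstract; as a loose, hypothetical illustration of the idea, one could time-order flow records and chunk them into fixed-length sequences before embedding each record as a token. The paper's actual grouping strategy may differ substantially.

```python
def build_flow_sequences(flows, seq_len=8):
    """Hypothetical Flow-Sequence construction: sort otherwise
    non-contextual flow records by timestamp and chunk them into
    fixed-length sequences, each of which would then be projected
    into the LLM's embedding space as a token sequence.
    `flows` is a list of dicts carrying at least a 'ts' timestamp."""
    ordered = sorted(flows, key=lambda f: f["ts"])
    return [ordered[i:i + seq_len] for i in range(0, len(ordered), seq_len)]
```

The point of the sequencing step is to give the LLM backbone temporal context that per-flow classifiers lack.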
https://arxiv.org/abs/2405.07638
In this paper, we consider pluractional markers in Kaqchikel, Karuk, and Yurok. Like Balinese, each of these languages marks one type of pluractionality via reduplication, and a different type of pluractionality via non-reduplicative affixation. This paper serves as a proof-of-concept for applying model-theoretic approaches to language as a lens that can help us to recognize linguistic organization that is not apparent on the surface.
https://arxiv.org/abs/2405.07597
Unveiling the real appearance of retouched faces, to prevent malicious users from engaging in deceptive advertising and economic fraud, has become an increasing concern in the era of the digital economy. This article makes the first attempt to investigate the face retouching reversal (FRR) problem. We first collect an FRR dataset, named deepFRR, which contains 50,000 StyleGAN-generated high-resolution (1024*1024) facial images and their corresponding retouched counterparts produced by a commercial online API. To the best of our knowledge, deepFRR is the first FRR dataset tailored for training deep FRR models. We then propose a novel diffusion-based FRR approach (FRRffusion) for the FRR task. Our FRRffusion consists of a coarse-to-fine two-stage network: a diffusion-based Facial Morpho-Architectonic Restorer (FMAR) is constructed to generate the basic contours of low-resolution faces in the first stage, while a Transformer-based Hyperrealistic Facial Detail Generator (HFDG) is designed to create high-resolution facial details in the second stage. Tested on deepFRR, our FRRffusion surpasses the GP-UNIT and Stable Diffusion methods by a large margin on four widespread quantitative metrics. In particular, in a qualitative evaluation with 85 subjects, the de-retouched images produced by our FRRffusion are visually much closer to the raw face images than both the retouched face images and those restored by the GP-UNIT and Stable Diffusion methods. These results sufficiently validate the efficacy of our work, bridging the gap between the FRR and generic image restoration tasks. The dataset and code are available at this https URL.
https://arxiv.org/abs/2405.07582
Tattoos have been used effectively as soft biometrics to assist law enforcement in the identification of offenders and victims, as they contain discriminative information and are a useful indicator for locating members of a criminal gang or organisation. Due to various privacy issues in the acquisition of images containing tattoos, only a limited number of databases exist. This lack of databases has delayed the development of new methods to effectively retrieve a potential suspect's tattoo images from a candidate gallery. To mitigate this issue, in our work, we use an unsupervised generative approach to create a balanced database consisting of 28,550 semi-synthetic images of tattooed subjects from 571 tattoo categories. Further, we introduce a novel Tattoo Template Reconstruction Network (TattTRN), which learns to map the input tattoo sample to its respective tattoo template to enhance the distinguishing attributes of the final feature embedding. Experimental results with real data, i.e., the WebTattoo and BIVTatt databases, demonstrate the soundness of the presented approach: an accuracy of up to 99% is achieved when checking at most the first 20 entries of the candidate list.
https://arxiv.org/abs/2405.07571
The Lévy walk, a type of random walk characterized by linear step lengths that follow a power-law distribution, is observed in the migratory behaviors of various organisms, ranging from bacteria to humans. Notably, Lévy walks with power exponents close to two are frequently observed, though their underlying causes remain elusive. This study introduces a simplified, abstract random walk model designed to produce inverse square Lévy walks, also known as Cauchy walks, and explores the conditions that facilitate these phenomena. In our model, agents move toward a randomly selected destination in multi-dimensional space, and their movement strategy is parameterized by the extent to which they pursue the shortest path. When the search cost is proportional to the distance traveled, this parameter effectively reflects the emphasis on minimizing search costs. Our findings reveal that strict adherence to this cost minimization constraint results in a Brownian walk pattern. However, removing this constraint transitions the movement to an inverse square Lévy walk. Therefore, by modulating the prioritization of search costs, our model can seamlessly alternate between Brownian and Cauchy walk dynamics. This model has the potential to be utilized for exploring the parameter space of an optimization problem.
https://arxiv.org/abs/2405.07541
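The inverse-square step-length distribution central to the abstract above can be illustrated with a short sketch. This is not the paper's destination-selection model; it only shows, under the assumption of a minimum step length of 1, how inverse-transform sampling yields steps with P(l) ∝ l⁻² for a 2D walk:

```python
import math
import random

def levy_step_length(rng, l_min=1.0):
    # Inverse-transform sampling for P(l) ~ l^-2 with l >= l_min:
    # the CDF is F(l) = 1 - l_min / l, so l = l_min / (1 - u), u ~ Uniform[0, 1).
    u = rng.random()
    return l_min / (1.0 - u)

def levy_walk_2d(n_steps, seed=0):
    # 2D inverse-square (Cauchy) walk: each step pairs a power-law
    # length with a uniformly random heading.
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    path = [(x, y)]
    for _ in range(n_steps):
        l = levy_step_length(rng)
        theta = rng.uniform(0.0, 2.0 * math.pi)
        x += l * math.cos(theta)
        y += l * math.sin(theta)
        path.append((x, y))
    return path
```

The heavy tail of the sampled lengths is what distinguishes this from a Brownian walk: most steps are short, but occasional very long jumps dominate the displacement.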
To deal with the task assignment problem for multi-AUV systems under kinematic constraints, i.e., steering-capability constraints for underactuated AUVs and similar vehicles, an improved task assignment algorithm is proposed that combines the Dubins path algorithm with an improved SOM neural network algorithm. First, the target tasks are assigned to the AUVs by the improved SOM neural network method based on workload balance and a neighborhood function. When kinematic constraints or obstacles cause trajectory planning to fail, task re-assignment is implemented by changing the weights of the SOM neurons until the AUVs have paths to reach all the targets. Then, Dubins paths are generated for several constrained cases; because the AUV's yaw angle is limited, this results in new assignments of the targets. The computation flow is designed so that the algorithm, implemented in MATLAB and Python, realizes path planning to multiple targets. Finally, simulation results prove that the proposed algorithm can effectively accomplish task assignment for multi-AUV systems.
https://arxiv.org/abs/2405.07536
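The workload-balanced SOM assignment step described above can be sketched as follows. This is a hypothetical simplification, not the authors' implementation: each AUV is a single neuron whose weight is a 2D position, the winner is the nearest neuron penalized by its running workload, and the Dubins/kinematic feasibility checks and re-assignment loop are omitted:

```python
import math
import random

def som_assign(auv_positions, targets, iters=200, lr=0.5, seed=0):
    """Simplified SOM-style task assignment with workload balancing."""
    rng = random.Random(seed)
    weights = [list(p) for p in auv_positions]  # one neuron per AUV
    counts = [0] * len(weights)                 # running workload per AUV
    for t in range(iters):
        tx, ty = rng.choice(targets)            # present a random target
        # Winner: nearest neuron, penalized by its current workload share.
        def score(i):
            d = math.hypot(weights[i][0] - tx, weights[i][1] - ty)
            return d * (1.0 + counts[i] / (1 + sum(counts)))
        win = min(range(len(weights)), key=score)
        counts[win] += 1
        # Move the winner toward the target with a decaying learning rate.
        eta = lr * (1.0 - t / iters)
        weights[win][0] += eta * (tx - weights[win][0])
        weights[win][1] += eta * (ty - weights[win][1])
    # Final assignment: each target goes to its nearest trained neuron.
    return {
        j: min(range(len(weights)),
               key=lambda i: math.hypot(weights[i][0] - tx, weights[i][1] - ty))
        for j, (tx, ty) in enumerate(targets)
    }
```

In the paper's full method, an assignment that fails trajectory planning would trigger re-weighting and another pass; here the workload penalty alone stands in for that balancing pressure.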