Glass largely blurs the boundary between the real world and its reflection. Its particular transmittance and reflectance properties confuse semantic tasks in machine vision. How to sharpen the boundary formed by glass, and avoid over-capturing features as false positives in deep structures, therefore matters for constraining the segmentation of reflective surfaces and see-through glass. We propose the Fourier Boundary Features Network with Wider Catchers (FBWC), which may be the first attempt to use sufficiently wide horizontal shallow branches, without vertical deepening, to guide the fine-grained segmentation boundary through primary glass semantic information. Specifically, we design Wider Coarse-Catchers (WCC) to anchor large-area segmentation and reduce excessive extraction from a structural perspective. We embed fine-grained features with Cross Transpose Attention (CTA), which is introduced to avoid incomplete areas within the boundary caused by reflection noise. To excavate glass features and balance high- and low-layer context, a learnable Fourier Convolution Controller (FCC) is proposed to regulate information integration robustly. The proposed method has been validated on three public glass segmentation datasets. Experimental results show that it yields better segmentation performance than state-of-the-art (SOTA) methods in glass image segmentation.
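The abstract does not spell out how the Fourier Convolution Controller operates; as neutral background for the Fourier-domain viewpoint, the sketch below only illustrates the convolution theorem (convolution in the spatial domain equals pointwise multiplication in the frequency domain), which is the standard motivation for learnable spectral filters. All names and sizes are illustrative, not the authors' implementation.

```python
import numpy as np

def circular_conv_direct(x, k):
    """Direct circular convolution of two equal-length 1-D signals."""
    n = len(x)
    return np.array([sum(x[(i - j) % n] * k[j] for j in range(n)) for i in range(n)])

def circular_conv_fft(x, k):
    """Same convolution computed in the Fourier domain (convolution theorem)."""
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(k)))

rng = np.random.default_rng(0)
x = rng.standard_normal(8)   # a toy feature signal
k = rng.standard_normal(8)   # a toy filter
direct = circular_conv_direct(x, k)
spectral = circular_conv_fft(x, k)
```

The two results agree, which is what makes frequency-domain modulation of features equivalent to (and often cheaper than) large spatial convolutions.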
https://arxiv.org/abs/2405.09459
Modern democracies face a critical issue of declining citizen participation in decision-making. Online discussion forums are an important avenue for enhancing citizen participation. This thesis proposal 1) identifies the challenges involved in facilitating large-scale online discussions with Natural Language Processing (NLP), 2) suggests solutions to these challenges by incorporating hybrid human-AI technologies, and 3) investigates what these technologies can reveal about individual perspectives in online discussions. We propose a three-layered hierarchy for representing perspectives that can be obtained by a mixture of human intelligence and large language models. We illustrate how these representations can draw insights into the diversity of perspectives and allow us to investigate interactions in online discussions.
https://arxiv.org/abs/2405.09439
3D content creation plays a vital role in various applications, such as gaming, robotics simulation, and virtual reality. However, the process is labor-intensive and time-consuming, requiring skilled designers to invest considerable effort in creating a single 3D asset. To address this challenge, text-to-3D generation technologies have emerged as a promising solution for automating 3D creation. Leveraging the success of large vision language models, these techniques aim to generate 3D content based on textual descriptions. Despite recent advancements in this area, existing solutions still face significant limitations in terms of generation quality and efficiency. In this survey, we conduct an in-depth investigation of the latest text-to-3D creation methods. We provide a comprehensive background on text-to-3D creation, including discussions on datasets employed in training and evaluation metrics used to assess the quality of generated 3D models. Then, we delve into the various 3D representations that serve as the foundation for the 3D generation process. Furthermore, we present a thorough comparison of the rapidly growing literature on generative pipelines, categorizing them into feedforward generators, optimization-based generation, and view reconstruction approaches. By examining the strengths and weaknesses of these methods, we aim to shed light on their respective capabilities and limitations. Lastly, we point out several promising avenues for future research. With this survey, we hope to inspire researchers further to explore the potential of open-vocabulary text-conditioned 3D content creation.
https://arxiv.org/abs/2405.09431
A fundamental tenet of pattern recognition is that overlap between training and testing sets causes an optimistic accuracy estimate. Deep CNNs for face recognition are trained for N-way classification of the identities in the training set. Accuracy is commonly estimated as average 10-fold classification accuracy on image pairs from test sets such as LFW, CALFW, CPLFW, CFP-FP and AgeDB-30. Because train and test sets have been independently assembled, images and identities in any given test set may also be present in any given training set. In particular, our experiments reveal a surprising degree of identity and image overlap between the LFW family of test sets and the MS1MV2 training set. Our experiments also reveal identity label noise in MS1MV2. We compare accuracy achieved with same-size MS1MV2 subsets that are identity-disjoint and not identity-disjoint with LFW, to reveal the size of the optimistic bias. Using more challenging test sets from the LFW family, we find that the size of the optimistic bias is larger for more challenging test sets. Our results highlight the current lack of, and the need for, an identity-disjoint train and test methodology in face recognition research.
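A minimal sketch of how an identity-disjoint training subset could be constructed with plain set operations; the identity labels below are synthetic stand-ins, not actual MS1MV2 or LFW identities.

```python
# Hypothetical identity labels; in practice these come from dataset metadata.
train_identities = {"id_%04d" % i for i in range(1000)}      # stand-in for train-set IDs
test_identities = {"id_%04d" % i for i in range(900, 1100)}  # stand-in for test-set IDs

# Identities present in both sets inflate the accuracy estimate.
overlap = train_identities & test_identities

# Removing them yields an identity-disjoint training subset.
disjoint_train = train_identities - test_identities
```

With these synthetic sets, 100 of the 1000 training identities overlap the test set, so the disjoint subset keeps 900 identities.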
https://arxiv.org/abs/2405.09403
Current orthopedic robotic systems largely focus on navigation, aiding surgeons in positioning a guiding tube but still requiring manual drilling and screw placement. The automation of this task not only demands high precision and safety due to the intricate physical interactions between the surgical tool and bone but also poses significant risks when executed without adequate human oversight. As it involves continuous physical interaction, the robot should collaborate with the surgeon, understand the human intent, and always include the surgeon in the loop. To achieve this, this paper proposes a new cognitive human-robot collaboration framework, including the intuitive AR-haptic human-robot interface, the visual-attention-based surgeon model, and the shared interaction control scheme for the robot. User studies on a robotic platform for orthopedic surgery are presented to illustrate the performance of the proposed method. The results demonstrate that the proposed human-robot collaboration framework outperforms full robot and full human control in terms of safety and ergonomics.
https://arxiv.org/abs/2405.09359
The multi-scale receptive field and the large kernel attention (LKA) module have been shown to significantly improve performance in the lightweight image super-resolution task. However, existing lightweight super-resolution (SR) methods seldom design efficient building blocks with multi-scale receptive fields for local modeling, and their LKA modules face a quadratic increase in computational and memory footprints as the convolutional kernel size increases. To address the first issue, we propose multi-scale blueprint separable convolutions (MBSConv) as a highly efficient building block with a multi-scale receptive field; it focuses on learning multi-scale information, a vital component of discriminative representation. As for the second issue, we revisit the key properties of LKA and find that the adjacent direct interaction of local information and long-distance dependencies is crucial for remarkable performance. Taking this into account, and to mitigate the complexity of LKA, we propose a large coordinate kernel attention (LCKA) module which decomposes the 2D convolutional kernels of the depth-wise convolutional layers in LKA into horizontal and vertical 1-D kernels. LCKA enables the adjacent direct interaction of local information and long-distance dependencies not only in the horizontal direction but also in the vertical. Besides, LCKA allows the direct use of extremely large kernels in the depth-wise convolutional layers to capture more contextual information, which helps to significantly improve reconstruction performance while incurring lower computational complexity and memory footprint. Integrating MBSConv and LCKA, we propose a large coordinate kernel attention network (LCAN).
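The LCKA decomposition described above relies on a standard property of separable kernels: when a K x K kernel is the outer product of a vertical and a horizontal 1-D kernel, applying the two 1-D convolutions sequentially reproduces the 2-D result with 2K weights instead of K^2. A numpy sketch (a generic illustration, not the authors' implementation):

```python
import numpy as np

def conv2d_valid(img, kern):
    """Naive 'valid' 2-D correlation of a single-channel image with a kernel."""
    kh, kw = kern.shape
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kern)
    return out

rng = np.random.default_rng(1)
img = rng.standard_normal((16, 16))
k = 7
kv = rng.standard_normal((k, 1))   # vertical 1-D kernel
kh = rng.standard_normal((1, k))   # horizontal 1-D kernel

# One full k x k kernel (k*k weights) vs. two 1-D passes (2*k weights).
full = conv2d_valid(img, kv @ kh)
separable = conv2d_valid(conv2d_valid(img, kv), kh)
```

The outputs are identical, while the weight count drops from 49 to 14 for k = 7, which is why very large "coordinate" kernels become affordable.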
https://arxiv.org/abs/2405.09353
In this work, we present Score MUsic Graph (SMUG)-Explain, a framework for generating and visualizing explanations of graph neural networks applied to arbitrary prediction tasks on musical scores. Our system allows the user to visualize the contribution of input notes (and note features) to the network output, directly in the context of the musical score. We provide an interactive interface based on the music notation engraving library Verovio. We showcase the usage of SMUG-Explain on the task of cadence detection in classical music. All code is available on this https URL.
https://arxiv.org/abs/2405.09241
To safely navigate intricate real-world scenarios, autonomous vehicles must be able to adapt to diverse road conditions and anticipate future events. World model (WM) based reinforcement learning (RL) has emerged as a promising approach by learning and predicting the complex dynamics of various environments. Nevertheless, to the best of our knowledge, there does not exist an accessible platform for training and testing such algorithms in sophisticated driving environments. To fill this void, we introduce CarDreamer, the first open-source learning platform designed specifically for developing WM based autonomous driving algorithms. It comprises three key components: 1) World model backbone: CarDreamer has integrated some state-of-the-art WMs, which simplifies the reproduction of RL algorithms. The backbone is decoupled from the rest and communicates using the standard Gym interface, so that users can easily integrate and test their own algorithms. 2) Built-in tasks: CarDreamer offers a comprehensive set of highly configurable driving tasks which are compatible with Gym interfaces and are equipped with empirically optimized reward functions. 3) Task development suite: This suite streamlines the creation of driving tasks, enabling easy definition of traffic flows and vehicle routes, along with automatic collection of multi-modal observation data. A visualization server allows users to trace real-time agent driving videos and performance metrics through a browser. Furthermore, we conduct extensive experiments using built-in tasks to evaluate the performance and potential of WMs in autonomous driving. Thanks to the richness and flexibility of CarDreamer, we also systematically study the impact of observation modality, observability, and sharing of vehicle intentions on AV safety and efficiency. All code and documents are accessible on this https URL.
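CarDreamer's own task classes are not shown in the abstract; the toy skeleton below only illustrates the Gym-style reset/step contract the platform reportedly exposes. The observation contents and reward are hypothetical.

```python
# A minimal Gym-style environment skeleton (illustration only; not CarDreamer code).
class ToyDrivingTask:
    def __init__(self, horizon=10):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        """Return the initial observation, as in the standard Gym interface."""
        self.t = 0
        return {"speed": 0.0}

    def step(self, action):
        """Return (observation, reward, done, info) for one control step."""
        self.t += 1
        obs = {"speed": float(action)}
        reward = -abs(action - 1.0)   # hypothetical reward: track a target speed of 1.0
        done = self.t >= self.horizon
        return obs, reward, done, {}

env = ToyDrivingTask()
first_obs = env.reset()
obs, reward, done, info = env.step(1.0)
```

Because the backbone communicates only through this interface, any agent written against reset/step can be swapped in without touching the world-model code.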
https://arxiv.org/abs/2405.09111
Humans use collaborative robots as tools for accomplishing various tasks. The interaction between humans and robots happens in tight, shared workspaces. However, these machines must be safe to operate alongside humans to minimize the risk of accidental collisions. Ensuring safety imposes many constraints, such as reduced torque and velocity limits during operation, thus increasing the time needed to accomplish many tasks. For applications such as using collaborative robots as haptic interfaces with intermittent contacts in virtual reality, however, these speed limitations result in poor user experiences. This research aims to improve the efficiency of a collaborative robot while improving the safety of the human user. We used Gaussian process models to predict human hand motion and developed strategies for human intention detection based on hand motion and gaze, in order to reduce the robot's task time and improve human safety in a virtual environment. We then studied the effect of this prediction. Comparisons show that the prediction models improved the robot's task time by 3% and safety by 17%. When used alongside gaze, prediction with Gaussian process models improved the robot's task time by 2% and safety by 13%.
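As a sketch of the underlying technique, the snippet below fits a zero-mean Gaussian process with an RBF kernel to a synthetic 1-D motion trace and computes the posterior mean; the data, length scale, and noise level are illustrative, not the study's.

```python
import numpy as np

LENGTH = 0.2  # RBF length scale (illustrative)

def rbf_kernel(a, b):
    """Squared-exponential kernel between two sets of 1-D inputs."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / LENGTH ** 2)

# Hypothetical hand-position trace sampled over one second.
t_train = np.linspace(0.0, 1.0, 8)
y_train = np.sin(2 * np.pi * t_train)

K = rbf_kernel(t_train, t_train) + 1e-6 * np.eye(len(t_train))  # jitter for stability
alpha = np.linalg.solve(K, y_train)

def gp_predict(t_query):
    """Posterior mean of the zero-mean GP at the query times."""
    return rbf_kernel(np.atleast_1d(t_query), t_train) @ alpha

pred = gp_predict(t_train)  # should nearly reproduce the observations
```

The posterior mean interpolates the observed motion and can be queried at future times, which is what makes GP models usable for intention prediction.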
https://arxiv.org/abs/2405.09109
In this article, we explore the potential of using latent diffusion models, a family of powerful generative models, for the task of reconstructing naturalistic music from electroencephalogram (EEG) recordings. Unlike simpler music with limited timbres, such as MIDI-generated tunes or monophonic pieces, the focus here is on intricate music featuring a diverse array of instruments, voices, and effects, rich in harmonics and timbre. This study represents an initial foray into achieving general music reconstruction of high-quality using non-invasive EEG data, employing an end-to-end training approach directly on raw data without the need for manual pre-processing and channel selection. We train our models on the public NMED-T dataset and perform quantitative evaluation proposing neural embedding-based metrics. We additionally perform song classification based on the generated tracks. Our work contributes to the ongoing research in neural decoding and brain-computer interfaces, offering insights into the feasibility of using EEG data for complex auditory information reconstruction.
https://arxiv.org/abs/2405.09062
Although face analysis has achieved remarkable improvements in the past few years, designing a multi-task face analysis model is still challenging. Most face analysis tasks are studied as separate problems and do not benefit from the synergy among related tasks. In this work, we propose a novel task-adaptive multi-task face analysis method named Q-Face, which simultaneously performs multiple face analysis tasks with a unified model. We fuse the features from multiple layers of a large-scale pre-trained model so that the whole model can use both local and global facial information to support multiple tasks. Furthermore, we design a task-adaptive module that performs cross-attention between a set of query vectors and the fused multi-stage features, and finally adaptively extracts the desired features for each face analysis task. Extensive experiments show that our method can perform multiple tasks simultaneously and achieves state-of-the-art performance on facial expression recognition, action unit detection, face attribute analysis, age estimation, and face pose estimation. Compared to conventional methods, our method opens up new possibilities for multi-task face analysis and shows strong potential in both accuracy and efficiency.
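A minimal numpy sketch of cross-attention between task-specific query vectors and fused features, the operation the task-adaptive module is built on; shapes and values are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, features):
    """Scaled dot-product cross-attention: task queries attend over fused features.

    queries:  (num_queries, d) -- e.g. one learnable query per face-analysis task
    features: (num_tokens, d)  -- fused multi-stage features, used as keys and values
    """
    d = queries.shape[-1]
    scores = queries @ features.T / np.sqrt(d)   # (num_queries, num_tokens)
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return weights @ features                    # (num_queries, d)

rng = np.random.default_rng(0)
q = rng.standard_normal((5, 16))    # e.g. 5 tasks
f = rng.standard_normal((49, 16))   # e.g. a 7x7 grid of feature tokens
out = cross_attention(q, f)
```

Each task query produces its own convex combination of the shared features, which is how one backbone can serve several tasks at once.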
https://arxiv.org/abs/2405.09059
Modeling visual saliency in graphical user interfaces (GUIs) allows us to understand how people perceive GUI designs and which elements attract their attention. One often-overlooked aspect is that computational models depend on a series of design parameters that are not straightforward to decide. We systematically analyze how different design parameters affect scanpath evaluation metrics using a state-of-the-art computational model (DeepGaze++). We particularly focus on three design parameters: input image size, inhibition-of-return decay, and masking radius. We show that even small variations of these design parameters have a noticeable impact on standard evaluation metrics such as DTW or Eyenalysis. These effects also occur in other scanpath models, such as UMSS and ScanGAN, and in other datasets such as MASSVIS. Taken together, our results highlight the impact of design decisions on predicting users' viewing behavior on GUIs.
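Since DTW is one of the evaluation metrics at stake, a compact reference implementation for 2-D scanpaths may help; it also illustrates why, for example, a repeated fixation does not change the distance. The fixation coordinates are synthetic.

```python
import numpy as np

def dtw(path_a, path_b):
    """Dynamic time warping distance between two scanpaths (lists of 2-D fixations)."""
    n, m = len(path_a), len(path_b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(np.asarray(path_a[i - 1]) - np.asarray(path_b[j - 1]))
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

a = [(0, 0), (10, 10), (20, 5)]
b = [(0, 0), (10, 10), (10, 10), (20, 5)]   # same path with a repeated fixation
c = [(1, 0), (10, 10), (20, 5)]             # first fixation shifted by one pixel
```

DTW aligns sequences of different lengths, so `b`'s repeated fixation costs nothing, while even a one-pixel shift in `c` registers, hinting at why metric values are sensitive to preprocessing parameters.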
https://arxiv.org/abs/2405.08981
For the shape control of deformable free-form surfaces, simulation plays a crucial role in establishing the mapping between the actuation parameters and the deformed shapes. The differentiation of this forward kinematic mapping is usually employed to solve the inverse kinematic problem for determining the actuation parameters that can realize a target shape. However, the free-form surfaces obtained from simulators are always different from the physically deformed shapes due to the errors introduced by hardware and the simplification adopted in physical simulation. To fill the gap, we propose a novel deformation function based sim-to-real learning method that can map the geometric shape of a simulated model into its corresponding shape of the physical model. Unlike the existing sim-to-real learning methods that rely on completely acquired dense markers, our method accommodates sparsely distributed markers and can resiliently use all captured frames -- even for those in the presence of missing markers. To demonstrate its effectiveness, our sim-to-real method has been integrated into a neural network-based computational pipeline designed to tackle the inverse kinematic problem on a pneumatically actuated deformable mannequin.
https://arxiv.org/abs/2405.08935
With the proliferation of edge devices, there is a significant increase in the attack surface on these devices. The decentralized deployment of threat intelligence on edge devices, coupled with adaptive machine learning techniques such as the in-context learning feature of large language models (LLMs), represents a promising paradigm for enhancing cybersecurity on low-powered edge devices. This approach involves the deployment of lightweight machine learning models directly onto edge devices to analyze local data streams, such as network traffic and system logs, in real-time. Additionally, distributing computational tasks to an edge server reduces latency and improves responsiveness while also enhancing privacy by processing sensitive data locally. LLM servers can enable these edge servers to autonomously adapt to evolving threats and attack patterns, continuously updating their models to improve detection accuracy and reduce false positives. Furthermore, collaborative learning mechanisms facilitate peer-to-peer secure and trustworthy knowledge sharing among edge devices, enhancing the collective intelligence of the network and enabling dynamic threat mitigation measures such as device quarantine in response to detected anomalies. The scalability and flexibility of this approach make it well-suited for diverse and evolving network environments, as edge devices only send suspicious information such as network traffic and system log changes, offering a resilient and efficient solution to combat emerging cyber threats at the network edge. Thus, our proposed framework can improve edge computing security by strengthening cyber threat detection and mitigation and by isolating compromised edge devices from the network.
https://arxiv.org/abs/2405.08755
We present a simple algorithm for differentiable rendering of surfaces represented by Signed Distance Fields (SDF), which makes it easy to integrate rendering into gradient-based optimization pipelines. To tackle visibility-related derivatives that make rendering non-differentiable, existing physically based differentiable rendering methods often rely on elaborate guiding data structures or reparameterization with a global impact on variance. In this article, we investigate an alternative that embraces nonzero bias in exchange for low variance and architectural simplicity. Our method expands the lower-dimensional boundary integral into a thin band that is easy to sample when the underlying surface is represented by an SDF. We demonstrate the performance and robustness of our formulation in end-to-end inverse rendering tasks, where it obtains results that are competitive with or superior to existing work.
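A toy illustration of the thin-band idea: with an SDF at hand, points near the zero level set (the surface boundary) can be selected by a simple threshold on |sdf|. The circle SDF, sampling domain, and band width below are illustrative, not the paper's formulation.

```python
import numpy as np

def sdf_circle(p, center=(0.0, 0.0), radius=1.0):
    """Signed distance from points p of shape (N, 2) to a circle."""
    return np.linalg.norm(p - np.asarray(center), axis=-1) - radius

rng = np.random.default_rng(0)
points = rng.uniform(-2.0, 2.0, size=(20000, 2))   # candidate samples in a box

band = 0.05                            # half-width of the thin band around the boundary
d = sdf_circle(points)
band_points = points[np.abs(d) < band]  # easy-to-sample stand-in for the boundary integral
```

Because the SDF gives distance-to-surface directly, no auxiliary guiding structure is needed to locate boundary samples; this is the architectural simplicity the abstract trades a small bias for.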
https://arxiv.org/abs/2405.08733
This study investigates the computational speed and accuracy of two numerical integration methods, cubature and sampling-based, for integrating an integrand over a 2D polygon. Using a group of rovers searching the Martian surface with a limited sensor footprint as a test bed, we compare the relative error and computational time as the area is subdivided to improve accuracy in the sampling-based approach. The results show that the sampling-based approach exhibits a $14.75\%$ deviation in relative error compared to cubature when matched to cubature's computational performance at $100\%$. Furthermore, achieving a relative error below $1\%$ necessitates a $10000\%$ increase in computation time due to the $\mathcal{O}(N^2)$ complexity of the sampling-based method. We conclude that for enhancing reinforcement learning capabilities and other high-iteration algorithms, the cubature method is preferred over the sampling-based method.
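The trade-off can be reproduced in miniature: a degree-2 triangle cubature rule (edge midpoints, each weighted by one third of the area) integrates low-order polynomials exactly, while a uniform sampling-based estimate only converges statistically. The toy integrand below is illustrative, not the paper's sensor-footprint model.

```python
import numpy as np

# Unit right triangle with vertices (0,0), (1,0), (0,1); area = 1/2.
verts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
area = 0.5

def f(x, y):
    return x * y  # exact integral over this triangle is 1/24

# Degree-2 cubature: evaluate at the three edge midpoints, each weighted by area/3.
mids = (verts + np.roll(verts, -1, axis=0)) / 2.0
cubature = (area / 3.0) * sum(f(x, y) for x, y in mids)

# Sampling-based estimate: uniform points in the triangle via square folding.
rng = np.random.default_rng(0)
u = rng.uniform(size=(100000, 2))
flip = u.sum(axis=1) > 1.0
u[flip] = 1.0 - u[flip]            # fold points from the unit square into the triangle
mc = area * f(u[:, 0], u[:, 1]).mean()
```

Three function evaluations suffice for the cubature rule to hit 1/24 to machine precision, while 100,000 samples still leave a small statistical error, which mirrors the cost gap reported above.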
https://arxiv.org/abs/2405.08691
Depth estimation plays a crucial role in various tasks within endoscopic surgery, including navigation, surface reconstruction, and augmented reality visualization. Despite the significant achievements of foundation models in vision tasks, including depth estimation, their direct application to the medical domain often results in suboptimal performance. This highlights the need for efficient adaptation methods to adapt these models to endoscopic depth estimation. We propose Endoscopic Depth Any Camera (EndoDAC), an efficient self-supervised depth estimation framework that adapts foundation models to endoscopic scenes. Specifically, we develop Dynamic Vector-Based Low-Rank Adaptation (DV-LoRA) and employ Convolutional Neck blocks to tailor the foundational model to the surgical domain, utilizing remarkably few trainable parameters. Given that camera information is not always accessible, we also introduce a self-supervised adaptation strategy that estimates camera intrinsics using the pose encoder. Our framework can be trained solely on monocular surgical videos from any camera, ensuring minimal training costs. Experiments demonstrate that our approach obtains superior performance even with fewer training epochs and without access to the ground-truth camera intrinsics. Code is available at this https URL.
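DV-LoRA's dynamic-vector details are not given in the abstract; the sketch below shows plain low-rank adaptation, the mechanism it builds on: a frozen weight W plus a trainable rank-r update AB, with far fewer trainable parameters than W itself. All sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 256, 256, 4

W = rng.standard_normal((d_out, d_in))           # frozen pre-trained weight
A = rng.standard_normal((d_out, rank)) * 0.01    # trainable low-rank factor
B = np.zeros((rank, d_in))                       # zero-initialized so W' = W at the start

def adapted_forward(x):
    """Forward pass with a low-rank update: W' x = W x + A (B x)."""
    return W @ x + A @ (B @ x)

x = rng.standard_normal(d_in)
full_params = d_out * d_in                # trainable params if W were fine-tuned
lora_params = rank * (d_in + d_out)       # trainable params for A and B only
```

With B initialized to zero the adapted model starts exactly at the pre-trained one, and only 2,048 parameters are trained instead of 65,536 for this layer.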
https://arxiv.org/abs/2405.08672
Since the release of ChatGPT and GPT-4, large language models (LLMs) and multimodal large language models (MLLMs) have garnered significant attention due to their powerful and general capabilities in understanding, reasoning, and generation, thereby offering new paradigms for the integration of artificial intelligence with medicine. This survey comprehensively overviews the development background and principles of LLMs and MLLMs, as well as explores their application scenarios, challenges, and future directions in medicine. Specifically, this survey begins by focusing on the paradigm shift, tracing the evolution from traditional models to LLMs and MLLMs, summarizing the model structures to provide detailed foundational knowledge. Subsequently, the survey details the entire process from constructing and evaluating to using LLMs and MLLMs with a clear logic. Following this, to emphasize the significant value of LLMs and MLLMs in healthcare, we survey and summarize 6 promising applications in healthcare. Finally, the survey discusses the challenges faced by medical LLMs and MLLMs and proposes a feasible approach and direction for the subsequent integration of artificial intelligence with medicine. Thus, this survey aims to provide researchers with a valuable and comprehensive reference guide from the perspectives of the background, principles, and clinical applications of LLMs and MLLMs.
https://arxiv.org/abs/2405.08603
The rise of advanced large language models such as GPT-4, GPT-4o, and the Claude family has made fake audio detection increasingly challenging. Traditional fine-tuning methods struggle to keep pace with the evolving landscape of synthetic speech, necessitating continual learning approaches that can adapt to new audio while retaining the ability to detect older types. However, continual learning for deepfake audio detection lacks a well-constructed and user-friendly evaluation framework. To address this gap, we introduce EVDA, a benchmark for evaluating continual learning methods in deepfake audio detection. EVDA includes classic datasets from the Anti-Spoofing Voice series, Chinese fake audio detection series, and newly generated deepfake audio from models like GPT-4 and GPT-4o. It supports various continual learning techniques, such as Elastic Weight Consolidation (EWC), Learning without Forgetting (LwF), and recent methods like Regularized Adaptive Weight Modification (RAWM) and Radian Weight Modification (RWM). Additionally, EVDA facilitates the development of robust algorithms by providing an open interface for integrating new continual learning methods.
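As an illustration of one supported technique, the EWC regularizer penalizes deviation from the previous task's parameters in proportion to their estimated Fisher importance, which is what lets a detector learn new deepfake types without forgetting older ones. The parameter and Fisher values below are synthetic.

```python
import numpy as np

def ewc_penalty(theta, theta_star, fisher, lam=1.0):
    """Elastic Weight Consolidation regularizer:
    (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2.
    Parameters important to previous tasks (large F_i) are anchored most strongly."""
    return 0.5 * lam * np.sum(fisher * (theta - theta_star) ** 2)

theta_star = np.array([1.0, -2.0, 0.5])   # parameters after the previous task
fisher = np.array([10.0, 0.1, 1.0])       # per-parameter Fisher importance (synthetic)
```

The total training loss becomes the new-task loss plus this penalty, so "important" weights move only when the new task strongly demands it.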
https://arxiv.org/abs/2405.08596
Tissue tracking in echocardiography is challenging due to the complex cardiac motion and the inherent nature of ultrasound acquisitions. Although optical flow methods are considered state-of-the-art (SOTA), they struggle with long-range tracking, noise, occlusions, and drift throughout the cardiac cycle. Recently, novel learning-based point tracking techniques have been introduced to tackle some of these issues. In this paper, we build upon these techniques and introduce EchoTracker, a two-fold coarse-to-fine model that facilitates the tracking of queried points on a tissue surface across ultrasound image sequences. The architecture contains a preliminary coarse initialization of the trajectories, followed by reinforcement iterations based on fine-grained appearance changes. It is efficient, light, and can run on mid-range GPUs. Experiments demonstrate that the model outperforms SOTA methods, with an average position accuracy of 67% and a median trajectory error of 2.86 pixels. Furthermore, we show a relative improvement of 25% when using our model to calculate the global longitudinal strain (GLS) in a clinical test-retest dataset compared to other methods. This implies that learning-based point tracking can potentially improve performance and yield a higher diagnostic and prognostic value for clinical measurements than current techniques. Our source code is available at: this https URL.
https://arxiv.org/abs/2405.08587