This paper introduces the Global-Local Image Perceptual Score (GLIPS), an image metric designed to assess the photorealistic quality of AI-generated images with a high degree of alignment to human visual perception. Traditional metrics such as FID and KID do not align closely with human evaluations. The proposed metric incorporates advanced transformer-based attention mechanisms to assess local similarity and Maximum Mean Discrepancy (MMD) to evaluate global distributional similarity. To evaluate the performance of GLIPS, we conducted a human study on photorealistic image quality. Comprehensive tests across various generative models demonstrate that GLIPS consistently outperforms existing metrics like FID, SSIM, and MS-SSIM in terms of correlation with human scores. Additionally, we introduce the Interpolative Binning Scale (IBS), a refined scaling method that enhances the interpretability of metric scores by aligning them more closely with human evaluative standards. The proposed metric and scaling approach not only provide more reliable assessments of AI-generated images but also suggest pathways for future enhancements in image generation technologies.
https://arxiv.org/abs/2405.09426
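The MMD component can be made concrete. Below is a minimal, dependency-free sketch of the unbiased squared-MMD estimator with an RBF kernel, operating on feature vectors as plain Python lists; the paper applies MMD to learned image features, and the `gamma` bandwidth here is an arbitrary illustrative choice.

```python
import math

def rbf(x, y, gamma=1.0):
    """RBF kernel between two feature vectors (lists of floats)."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * d2)

def mmd2_unbiased(X, Y, gamma=1.0):
    """Unbiased estimate of squared MMD between sample sets X and Y."""
    m, n = len(X), len(Y)
    kxx = sum(rbf(X[i], X[j], gamma) for i in range(m) for j in range(m) if i != j)
    kyy = sum(rbf(Y[i], Y[j], gamma) for i in range(n) for j in range(n) if i != j)
    kxy = sum(rbf(x, y, gamma) for x in X for y in Y)
    return kxx / (m * (m - 1)) + kyy / (n * (n - 1)) - 2 * kxy / (m * n)
```

Distributionally similar feature sets yield a near-zero estimate, while well-separated sets yield a clearly positive one.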
In this article, we focus on the critical task of plant protection in arable farms, addressing a modern challenge in agriculture: integrating ecological considerations into the operational strategy of precision weeding robots like \bbot. This article presents recent advancements in weed management algorithms and the real-world performance of \bbot\ at the University of Bonn's Klein-Altendorf campus. We present a novel Rolling-view observation model for BonnBot-I's weed monitoring section, which leads to an average absolute weeding performance enhancement of $3.4\%$. Furthermore, for the first time, we show how precision weeding robots could consider biodiversity-aware concerns in challenging weeding scenarios. We carried out comprehensive weeding experiments in sugar-beet fields, covering both weed-only and mixed crop-weed situations, and introduced a new dataset compatible with precision weeding. Our real-field experiments revealed that our weeding approach is capable of handling diverse weed distributions, with a minimal loss of only $11.66\%$ attributable to intervention planning and $14.7\%$ to vision system limitations, highlighting required improvements to the vision system.
https://arxiv.org/abs/2405.09118
We introduce BEVRender, a novel learning-based approach for the localization of ground vehicles in Global Navigation Satellite System (GNSS)-denied off-road scenarios. These environments are typically challenging for conventional vision-based state estimation due to the lack of distinct visual landmarks and the instability of vehicle poses. To address this, BEVRender generates high-quality bird's eye view (BEV) images of the local terrain. Subsequently, these images are aligned with a geo-referenced aerial map via template matching to achieve accurate cross-view registration. Our approach overcomes the inherent limitations of visual-inertial odometry systems and the substantial storage requirements of image-retrieval localization strategies, which are susceptible to drift and scalability issues, respectively. Extensive experimentation validates BEVRender's advancement over existing GNSS-denied visual localization methods, demonstrating notable enhancements in both localization accuracy and update frequency. The code for BEVRender will be made available soon.
https://arxiv.org/abs/2405.09001
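Template matching of a rendered BEV image against a geo-referenced aerial map can be sketched with normalized cross-correlation (NCC). The toy implementation below slides the template over a 2-D map given as nested lists; BEVRender's actual matching criterion is not specified here, so this only illustrates the generic idea.

```python
def ncc(patch, template):
    """Normalized cross-correlation between two equal-sized 2-D arrays."""
    n = len(template) * len(template[0])
    mp = sum(sum(r) for r in patch) / n
    mt = sum(sum(r) for r in template) / n
    num = sp = st = 0.0
    for pr, tr in zip(patch, template):
        for p, t in zip(pr, tr):
            num += (p - mp) * (t - mt)
            sp += (p - mp) ** 2
            st += (t - mt) ** 2
    denom = (sp * st) ** 0.5
    return num / denom if denom else 0.0

def match(aerial, template):
    """Best (row, col) placement of the BEV template within the aerial map."""
    th, tw = len(template), len(template[0])
    best, best_pos = -2.0, (0, 0)
    for r in range(len(aerial) - th + 1):
        for c in range(len(aerial[0]) - tw + 1):
            patch = [row[c:c + tw] for row in aerial[r:r + th]]
            s = ncc(patch, template)
            if s > best:
                best, best_pos = s, (r, c)
    return best_pos, best
```

An exact sub-image match scores 1.0, so the recovered offset gives the cross-view registration.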
With recent advances in deep learning, numerous algorithms have been developed to enhance video quality, reduce visual artefacts and improve perceptual quality. However, little research has been reported on the quality assessment of enhanced content; the evaluation of enhancement methods is often based on quality metrics that were designed for compression applications. In this paper, we propose a novel blind deep video quality assessment (VQA) method specifically for enhanced video content. It employs a new Recurrent Memory Transformer (RMT) based network architecture to obtain video quality representations, which is optimised through a novel content-quality-aware contrastive learning strategy based on a new database containing 13K training patches with enhanced content. The extracted quality representations are then combined through linear regression to generate video-level quality indices. The proposed method, RMT-BVQA, has been evaluated on the VDPVE (VQA Dataset for Perceptual Video Enhancement) database through five-fold cross-validation. The results show its superior correlation performance when compared to ten existing no-reference quality metrics.
https://arxiv.org/abs/2405.08621
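The contrastive-learning ingredient can be illustrated with a generic InfoNCE loss over representation vectors; the paper's content-quality-aware strategy is more elaborate, so the sketch below shows only the standard form, with `tau` as an assumed temperature.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE loss: pull the positive pair together, push negatives apart."""
    logits = [cosine(anchor, positive) / tau] + [cosine(anchor, n) / tau for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    denom = sum(math.exp(l - m) for l in logits)
    return -(logits[0] - m - math.log(denom))
```

The loss is near zero when the positive is far more similar to the anchor than any negative, and large otherwise.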
Multi-objective combinatorial optimization (MOCO) problems are prevalent in various real-world applications. Most existing neural methods for MOCO problems rely solely on decomposition and utilize precise hypervolume to enhance diversity. However, these methods often approximate only limited regions of the Pareto front and spend excessive time on diversity enhancement because of ambiguous decomposition and time-consuming hypervolume calculation. To address these limitations, we design a Geometry-Aware Pareto set Learning algorithm named GAPL, which provides a novel geometric perspective for neural MOCO via a Pareto attention model based on hypervolume expectation maximization. In addition, we propose a hypervolume residual update strategy to enable the Pareto attention model to capture both local and non-local information of the Pareto set/front. We also design a novel inference approach to further improve the quality of the solution set and speed up hypervolume calculation and local subset selection. Experimental results on three classic MOCO problems demonstrate that our GAPL outperforms state-of-the-art neural baselines via superior decomposition and efficient diversity enhancement.
https://arxiv.org/abs/2405.08604
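For intuition on the hypervolume indicator the abstract refers to, the 2-D minimization case is simple: one sweep over the front sorted by the first objective, summing the rectangles dominated below the reference point. A minimal sketch (general d-dimensional hypervolume is much costlier, which is the bottleneck the paper targets):

```python
def hypervolume_2d(front, ref):
    """Hypervolume dominated by a 2-D Pareto front (minimization) w.r.t. a reference point."""
    pts = sorted(set(map(tuple, front)))  # ascending in the first objective
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y < prev_y:  # point is non-dominated in the sweep
            hv += (ref[0] - x) * (prev_y - y)
            prev_y = y
    return hv
```

Dominated points contribute nothing, so the sweep naturally ignores them.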
Underwater imaging often suffers from low quality due to factors affecting light propagation and absorption in water. To improve image quality, some underwater image enhancement (UIE) methods based on convolutional neural networks (CNN) and Transformer have been proposed. However, CNN-based UIE methods are limited in modeling long-range dependencies, and Transformer-based methods involve a large number of parameters and complex self-attention mechanisms, posing efficiency challenges. Considering computational complexity and severe underwater image degradation, a state space model (SSM) with linear computational complexity for UIE, named WaterMamba, is proposed. We propose spatial-channel omnidirectional selective scan (SCOSS) blocks comprising spatial-channel coordinate omnidirectional selective scan (SCCOSS) modules and a multi-scale feedforward network (MSFFN). The SCOSS block models pixel and channel information flow, addressing dependencies. The MSFFN facilitates information flow adjustment and promotes synchronized operations within SCCOSS modules. Extensive experiments showcase WaterMamba's cutting-edge performance with reduced parameters and computational resources, outperforming state-of-the-art methods on various datasets, validating its effectiveness and generalizability. The code will be released on GitHub after acceptance.
https://arxiv.org/abs/2405.08419
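The linear computational complexity of SSMs comes from their recurrent form. The sketch below runs a diagonal state-space recurrence over a 1-D sequence in time linear in the sequence length; it is a textbook recurrence for illustration only, not WaterMamba's SCOSS block.

```python
def ssm_scan(x, a, b, c):
    """Linear-time scan of a diagonal state-space model over a 1-D sequence.
    h_t = a * h_{t-1} + b * x_t ;  y_t = c . h_t  (single channel, N states)."""
    n = len(a)
    h = [0.0] * n
    ys = []
    for xt in x:
        h = [a[i] * h[i] + b[i] * xt for i in range(n)]
        ys.append(sum(c[i] * h[i] for i in range(n)))
    return ys
```

With a single state a = 0.5, an impulse input decays geometrically, showing how the scan carries long-range context at O(length) cost.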
Pretrained language models (LMs) showcase significant capabilities in processing molecular text, while concurrently, message passing neural networks (MPNNs) demonstrate resilience and versatility in the domain of molecular science. Despite these advancements, we find there are limited studies investigating the bidirectional interactions between molecular structures and their corresponding textual representations. Therefore, in this paper, we propose two strategies to evaluate whether information integration can enhance performance: contrastive learning, which involves utilizing an MPNN to supervise the training of the LM, and fusion, which exploits information from both models. Our empirical analysis reveals that the integration approaches exhibit superior performance compared to baselines when applied to smaller molecular graphs, while these integration approaches do not yield performance enhancements on large-scale graphs.
https://arxiv.org/abs/2405.08334
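A single round of sum-aggregation message passing, the core MPNN operation, can be sketched as follows; the atom features and bond list are toy stand-ins for a real molecular graph.

```python
def message_pass(features, edges, rounds=1):
    """Sum-aggregation message passing on an undirected molecular graph.
    features: {atom_id: [floats]}, edges: list of undirected (i, j) bonds."""
    adj = {v: [] for v in features}
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    h = {v: list(f) for v, f in features.items()}
    for _ in range(rounds):
        # each node adds the sum of its neighbors' current features to its own
        h = {v: [hv + sum(h[u][k] for u in adj[v]) for k, hv in enumerate(h[v])]
             for v in h}
    return h
```

On the path graph 0-1-2, one round propagates node 0's feature to node 1 but not yet to node 2.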
As an important subtopic of image enhancement, color transfer aims to enhance the color scheme of a source image according to a reference one while preserving the semantic context. To implement color transfer, the palette-based color mapping framework was proposed. It is a classical solution that does not depend on complex semantic analysis to generate a new color scheme. However, the framework usually requires manual settings, reducing its practicality. The quality of traditional palette generation depends on the degree of color separation. In this paper, we propose a new palette-based color transfer method that can automatically generate a new color scheme. With a redesigned palette-based clustering method, pixels can be classified into different segments according to color distribution with better applicability. By combining deep learning-based image segmentation and a new color mapping strategy, color transfer can be implemented on foreground and background parts independently while maintaining semantic consistency. The experimental results indicate that our method exhibits significant advantages over peer methods in terms of natural realism, color consistency, generality, and robustness.
https://arxiv.org/abs/2405.08263
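Palette generation is typically clustering in color space. Below is a tiny deterministic k-means over RGB pixels (initialized from the first k pixels); the paper's redesigned clustering method differs, so this only illustrates the baseline idea.

```python
def extract_palette(pixels, k=3, iters=20):
    """Tiny k-means over RGB pixel tuples; the cluster centers form the palette."""
    centers = [list(p) for p in pixels[:k]]  # deterministic init from first k pixels
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in pixels:
            # assign each pixel to its nearest center (squared RGB distance)
            i = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            buckets[i].append(p)
        # recompute each center as the mean of its bucket
        centers = [[sum(ch) / len(b) for ch in zip(*b)] if b else centers[i]
                   for i, b in enumerate(buckets)]
    return centers
```

On pixels drawn from two well-separated colors, the two centers converge to a red and a blue palette entry.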
In large-scale disaster events, the planning of optimal rescue routes depends on the object detection ability at the disaster scene, with one of the main challenges being the presence of dense and occluded objects. Existing methods, which are typically based on the RGB modality, struggle to distinguish targets with similar colors and textures in crowded environments and are unable to identify obscured objects. To this end, we first construct two multimodal dense and occlusion vehicle detection datasets for large-scale events, utilizing RGB and height map modalities. Based on these datasets, we propose a multimodal collaboration network for dense and occluded vehicle detection, MuDet for short. MuDet hierarchically enhances the completeness of discriminable information within and across modalities and differentiates between simple and complex samples. MuDet includes three main modules: Unimodal Feature Hierarchical Enhancement (Uni-Enh), Multimodal Cross Learning (Mul-Lea), and Hard-easy Discriminative (He-Dis) Pattern. Uni-Enh and Mul-Lea enhance the features within each modality and facilitate the cross-integration of features from two heterogeneous modalities. He-Dis effectively separates densely occluded vehicle targets with significant intra-class differences and minimal inter-class differences by defining and thresholding confidence values, thereby suppressing the complex background. Experimental results on two re-labeled multimodal benchmark datasets, the 4K-SAI-LCS dataset and the ISPRS Potsdam dataset, demonstrate the robustness and generalization of MuDet. The code for this work is openly available at \url{this https URL}.
https://arxiv.org/abs/2405.08251
Ancient murals are valuable cultural heritage with great archaeological value. Through their content, they provide insights into ancient religions, ceremonies, and folklore, among other things. However, due to long-term oxidation and inadequate protection, ancient murals have suffered continuous damage, including peeling and mold. Additionally, since ancient murals were typically painted indoors, the light intensity in images captured by digital devices is often low. The poor visibility hampers the further restoration of damaged areas. To address the escalating damage to ancient frescoes and facilitate batch restoration at archaeological sites, we propose a two-stage restoration model called MER (Mural Enhancement and Restoration net) for ancient murals that are damaged and have been captured in low light. Our two-stage model not only enhances the visual quality of restored images but also achieves commendable results in relevant metric evaluations compared with other competitors. Furthermore, we have launched a website dedicated to the restoration of ancient mural paintings, utilizing the proposed model. Code is available at this https URL.
https://arxiv.org/abs/2405.08245
Understanding degraded speech is demanding, requiring increased listening effort (LE). Evaluating processed and unprocessed speech with respect to LE can objectively indicate if speech enhancement systems benefit listeners. However, existing methods for measuring LE are complex and not widely applicable. In this study, we propose a simple method to evaluate speech intelligibility and LE simultaneously without additional strain on subjects or operators. We assess this method using results from two independent studies in Norway and Denmark, testing 76 (50+26) subjects across 9 (6+3) processing conditions. Despite differences in evaluation setups, subject recruitment, and processing systems, trends are strikingly similar, demonstrating the proposed method's robustness and ease of implementation into existing practices.
https://arxiv.org/abs/2405.07641
Assistive technologies for the visually impaired have evolved to facilitate interaction with a complex and dynamic world. In this paper, we introduce AIris, an AI-powered wearable device that provides environmental awareness and interaction capabilities to visually impaired users. AIris combines a sophisticated camera mounted on eyewear with a natural language processing interface, enabling users to receive real-time auditory descriptions of their surroundings. We have created a functional prototype system that operates effectively in real-world conditions. AIris demonstrates the ability to accurately identify objects and interpret scenes, providing users with a sense of spatial awareness previously unattainable with traditional assistive devices. The system is designed to be cost-effective and user-friendly, supporting general and specialized tasks: face recognition, scene description, text reading, object recognition, money counting, note-taking, and barcode scanning. AIris marks a transformative step, bringing AI enhancements to assistive technology, enabling rich interactions with a human-like feel.
https://arxiv.org/abs/2405.07606
Care-giving and assistive robotics, driven by advancements in AI, offer promising solutions to meet the growing demand for care, particularly in the context of increasing numbers of individuals requiring assistance. This creates a pressing need for efficient and safe assistive devices, particularly in light of heightened demand due to war-related injuries. While cost has been a barrier to accessibility, technological progress is able to democratize these solutions. Safety remains a paramount concern, especially given the intricate interactions between assistive robots and humans. This study explores the application of reinforcement learning (RL) and imitation learning in improving policy design for assistive robots. The proposed approach makes risky policies safer without additional environmental interactions. Through experimentation in simulated environments, the enhancement of conventional RL approaches in tasks related to assistive robotics is demonstrated.
https://arxiv.org/abs/2405.07603
Personalization is crucial for the widespread adoption of advanced driver assistance systems. To match each user's preferences, an online evolution capability is a must. However, conventional evolution methods learn from naturalistic driving data, which requires a lot of computing power and cannot be applied online. To address this challenge, this paper proposes a lesson-learning approach: learning from the driver's takeover interventions. By leveraging online takeover data, a driving zone is generated to ensure perceived safety using Gaussian discriminant analysis. Real-time corrections to trajectory planning rewards are enacted through apprenticeship learning. Guided by the objective of optimizing rewards within the constraints of the driving zone, this approach employs model predictive control for trajectory planning. This lesson-learning framework is highlighted for its faster evolution capability, adeptness at accumulating experience, assurance of perceived safety, and computational efficiency. Simulation results demonstrate that the proposed system consistently achieves successful customization without further takeover interventions. Accumulated experience yields a 24% enhancement in evolution efficiency. The average number of learning iterations is only 13.8. The average computation time is 0.08 seconds.
https://arxiv.org/abs/2405.07543
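The Gaussian-discriminant-analysis step can be sketched as fitting a Gaussian to takeover-free driving states and thresholding Mahalanobis distance to define the driving zone. The 2-D state (e.g. lateral offset and speed deviation) and the threshold value below are illustrative assumptions, not the paper's actual parameterization.

```python
import math

def fit_gaussian(points):
    """Mean and 2x2 covariance of 2-D state samples (e.g. lateral offset, speed)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return (mx, my), (sxx, sxy, syy)

def in_zone(p, mean, cov, thresh=3.0):
    """True if p lies within the Mahalanobis-distance threshold of the fitted Gaussian."""
    (mx, my), (sxx, sxy, syy) = mean, cov
    det = sxx * syy - sxy * sxy  # determinant of the 2x2 covariance
    dx, dy = p[0] - mx, p[1] - my
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2) <= thresh
```

States close to the observed safe-driving distribution fall inside the zone; outliers that would likely trigger a takeover fall outside.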
Robust 3D object detection remains a pivotal concern in the domain of autonomous field robotics. Despite notable enhancements in detection accuracy across standard datasets, real-world urban environments, characterized by their unstructured and dynamic nature, frequently precipitate an elevated incidence of false positives, thereby undermining the reliability of existing detection paradigms. In this context, our study introduces an advanced post-processing algorithm that modulates detection thresholds dynamically relative to the distance from the ego vehicle. Traditional perception systems typically utilize a uniform threshold, which often leads to decreased efficacy in detecting distant objects. In contrast, our proposed methodology employs a Neural Network with a self-adaptive thresholding mechanism that significantly attenuates false negatives while concurrently diminishing false positives, particularly in complex urban settings. Empirical results substantiate that our algorithm not only augments the performance of 3D object detection models in diverse urban and adverse weather scenarios but also establishes a new benchmark for adaptive thresholding techniques in field robotics.
https://arxiv.org/abs/2405.07479
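The distance-modulated thresholding idea can be sketched with a hand-set linear schedule; the paper learns the adaptation with a neural network, so the `near`/`far` endpoints and `max_range` below are purely illustrative assumptions.

```python
def adaptive_threshold(distance, near=0.6, far=0.25, max_range=100.0):
    """Confidence threshold that relaxes linearly with distance from the ego vehicle."""
    frac = min(max(distance / max_range, 0.0), 1.0)
    return near + (far - near) * frac

def filter_detections(dets, **kw):
    """Keep detections whose confidence clears the distance-dependent threshold."""
    return [d for d in dets if d["conf"] >= adaptive_threshold(d["dist"], **kw)]
```

A low-confidence detection at long range survives the relaxed threshold, while the same confidence near the ego vehicle is rejected, which is the asymmetry a uniform threshold cannot express.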
Real-SR endeavors to produce high-resolution images with rich details while mitigating the impact of multiple degradation factors. Although existing methods have achieved impressive achievements in detail recovery, they still fall short when addressing regions with complex gradient arrangements due to the intensity-based linear weighting feature extraction manner. Moreover, the stochastic artifacts introduced by degradation cues during the imaging process in real low-resolution (LR) images increase the disorder of the overall image details, further complicating the perception of intrinsic gradient arrangement. To address these challenges, we innovatively introduce kernel-wise differential operations within the convolutional kernel and develop several learnable directional gradient convolutions. These convolutions are integrated in parallel with a novel linear weighting mechanism to form an Adaptive Directional Gradient Convolution (DGConv), which adaptively weights and fuses the basic directional gradients to improve the gradient arrangement perception capability for both regular and irregular textures. Coupled with DGConv, we further devise a novel equivalent parameter fusion method for DGConv that maintains its rich representational capabilities while keeping computational costs consistent with a single Vanilla Convolution (VConv), enabling DGConv to improve the performance of existing super-resolution networks without incurring additional computational expenses. To better leverage the superiority of DGConv, we further develop an Adaptive Information Interaction Block (AIIBlock) to adeptly balance the enhancement of texture and contrast while meticulously investigating the interdependencies, culminating in the creation of a DGPNet for Real-SR through simple stacking. Comparative results with 15 SOTA methods across three public datasets underscore the effectiveness and efficiency of our proposed approach.
https://arxiv.org/abs/2405.07023
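The basic directional gradients that DGConv fuses can be illustrated with central differences along four directions and a weighted sum; the learnable weighting and the equivalent-parameter fusion of the paper are not reproduced here, so the weights below are plain illustrative numbers.

```python
def directional_gradients(img, r, c):
    """Central differences at pixel (r, c) along four basic directions."""
    return [
        img[r][c + 1] - img[r][c - 1],          # horizontal
        img[r + 1][c] - img[r - 1][c],          # vertical
        img[r + 1][c + 1] - img[r - 1][c - 1],  # main diagonal
        img[r + 1][c - 1] - img[r - 1][c + 1],  # anti-diagonal
    ]

def fused_response(img, r, c, weights):
    """Adaptive-fusion idea: weighted sum of the basic directional gradients."""
    return sum(w * g for w, g in zip(weights, directional_gradients(img, r, c)))
```

On a vertical ramp, only the vertical and diagonal components respond; a weighting that emphasizes the horizontal direction yields zero.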
This study introduces a novel data augmentation technique, ADLDA, aimed at mitigating the negative impact of data distribution shifts caused by the data augmentation process in computer vision tasks. ADLDA partitions augmented data into distinct subdomains and incorporates domain labels, combined with domain adaptation techniques, to optimize data representation in the model's feature space. Experimental results demonstrate that ADLDA significantly enhances model performance across multiple datasets, particularly in neural network architectures with complex feature extraction layers. Furthermore, ADLDA improves the model's ability to locate and recognize key features, showcasing potential in object recognition and image segmentation tasks. This paper's contribution provides an effective data augmentation regularization method for the field of computer vision, aiding the enhancement of robustness and accuracy in deep learning models.
https://arxiv.org/abs/2405.06893
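The partitioning of augmented data into labeled subdomains can be sketched directly: each augmentation type becomes its own domain label alongside the class label, and a downstream domain-adaptation loss can then consume the triples. The helper below is a hypothetical illustration, not the paper's pipeline.

```python
def adlda_expand(dataset, augmentations):
    """Expand (x, y) pairs with augmented copies tagged by a domain label.
    Domain 0 is the original data; each augmentation gets its own subdomain."""
    out = [(x, y, 0) for x, y in dataset]
    for d, aug in enumerate(augmentations, start=1):
        out += [(aug(x), y, d) for x, y in dataset]
    return out
```

With two augmentations, a 2-sample dataset expands to 6 triples spanning domains {0, 1, 2} while class labels are preserved.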
This work aims to study a scalable state-space model (SSM), Mamba, for the speech enhancement (SE) task. We exploit a Mamba-based regression model to characterize speech signals and build an SE system upon Mamba, termed SEMamba. We explore the properties of Mamba by integrating it as the core model in both basic and advanced SE systems, along with utilizing signal-level distances as well as metric-oriented loss functions. SEMamba demonstrates promising results and attains a PESQ score of 3.55 on the VoiceBank-DEMAND dataset. When combined with the perceptual contrast stretching technique, the proposed SEMamba yields a new state-of-the-art PESQ score of 3.69.
https://arxiv.org/abs/2405.06573
Continual Novel Class Discovery (CNCD) aims to continually discover novel classes without labels while maintaining the recognition capability for previously learned classes. The main challenges faced by CNCD include the feature-discrepancy problem, the inter-session confusion problem, etc. In this paper, we propose a novel Feature Enhancement and Adaptation method for the CNCD to tackle the above challenges, which consists of a guide-to-novel framework, a centroid-to-samples similarity constraint (CSS), and a boundary-aware prototype constraint (BAP). More specifically, the guide-to-novel framework is established to continually discover novel classes under the guidance of prior distribution. Afterward, the CSS is designed to constrain the relationship between centroid-to-samples similarities of different classes, thereby enhancing the distinctiveness of features among novel classes. Finally, the BAP is proposed to keep novel class features aware of the positions of other class prototypes during incremental sessions, and better adapt novel class features to the shared feature space. Experimental results on three benchmark datasets demonstrate the superiority of our method, especially in more challenging protocols with more incremental sessions.
https://arxiv.org/abs/2405.06389
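The centroid-to-samples similarities that CSS constrains can be sketched by computing class centroids and the cosine similarity of every sample to every centroid; how these similarities enter the constraint/loss is not reproduced here.

```python
import math

def centroid(samples):
    """Mean feature vector of a class."""
    n = len(samples)
    return [sum(s[k] for s in samples) / n for k in range(len(samples[0]))]

def cos_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def centroid_to_sample_sims(features_by_class):
    """For each class: cosine similarity of its samples to every class centroid."""
    cents = {c: centroid(fs) for c, fs in features_by_class.items()}
    return {c: {c2: [cos_sim(cents[c2], f) for f in fs] for c2 in cents}
            for c, fs in features_by_class.items()}
```

For distinctive features, each sample should be more similar to its own class centroid than to any other, which is the relationship the constraint enforces.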
Recent work in Video Frame Interpolation (VFI) tries to formulate VFI as a diffusion-based conditional image generation problem, synthesizing the intermediate frame given a random noise and neighboring frames. Due to the relatively high resolution of videos, Latent Diffusion Models (LDMs) are employed as the conditional generation model, where the autoencoder compresses images into latent representations for diffusion and then reconstructs images from these latent representations. Such a formulation poses a crucial challenge: VFI expects that the output is deterministically equal to the ground truth intermediate frame, but LDMs randomly generate a diverse set of different images when the model runs multiple times. The reason for the diverse generation is that the cumulative variance (variance accumulated at each step of generation) of generated latent representations in LDMs is large. This makes the sampling trajectory random, resulting in diverse rather than deterministic generations. To address this problem, we propose our unique solution: Frame Interpolation with Consecutive Brownian Bridge Diffusion. Specifically, we propose consecutive Brownian Bridge diffusion that takes a deterministic initial value as input, resulting in a much smaller cumulative variance of generated latent representations. Our experiments suggest that our method can improve together with the improvement of the autoencoder and achieve state-of-the-art performance in VFI, leaving strong potential for further enhancement.
https://arxiv.org/abs/2405.05953
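A discrete Brownian bridge pinned at both endpoints illustrates why the cumulative variance stays small: the drift pulls the trajectory toward a fixed terminal value, so the variance vanishes at both ends. A minimal 1-D simulation of the bridge SDE dX = (x1 - X)/(1 - t) dt + sigma dW (illustrative only, not the paper's latent-space formulation):

```python
import math
import random

def brownian_bridge(x0, x1, steps=100, sigma=1.0, seed=0):
    """Discrete Brownian bridge pinned at x0 (t=0) and x1 (t=1)."""
    rng = random.Random(seed)
    dt = 1.0 / steps
    xs = [x0]
    for k in range(steps):
        if k == steps - 1:
            xs.append(x1)  # the bridge is pinned at the far end
        else:
            t = k * dt
            # drift pulls the state toward the fixed endpoint x1
            drift = (x1 - xs[-1]) / (1.0 - t)
            xs.append(xs[-1] + drift * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0))
    return xs
```

Both endpoints are deterministic regardless of the noise seed, which is the property that makes the sampling trajectory far less random than an unpinned diffusion.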