With the benefit of deep learning techniques, recent research has made significant progress in image compression artifacts reduction. Despite their improved performance, prevailing methods focus only on learning a mapping from the compressed image to the original one while ignoring the intrinsic attributes of the given compressed images, which greatly harms the performance of downstream parsing tasks. Unlike these methods, we propose to decouple the intrinsic attributes into two complementary features for artifacts reduction, i.e., compression-insensitive features to regularize the high-level semantic representations during training, and compression-sensitive features to be aware of the compression degree. To achieve this, we first employ adversarial training to regularize the compressed and original encoded features so as to retain high-level semantics, and we then develop a compression quality-aware feature encoder for the compression-sensitive features. Based on these dual complementary features, we propose a Dual Awareness Guidance Network (DAGN) that uses these awareness features as transformation guidance during the decoding phase. Within DAGN, we develop a cross-feature fusion module that maintains the consistency of compression-insensitive features by fusing them into the artifacts reduction baseline. Our method achieves an average PSNR gain of 2.06 dB on BSD500, outperforming state-of-the-art methods, and requires only 29.7 ms to process one image. Experimental results on LIVE1 and LIU4K further demonstrate the efficiency, effectiveness, and superiority of the proposed method in terms of quantitative metrics, visual quality, and downstream machine vision tasks.
https://arxiv.org/abs/2405.09291
As image recognition models become more prevalent, scalable coding methods for both machines and humans gain importance. Applications of image recognition models include traffic monitoring and farm management. In these use cases, scalable coding is effective because the tasks require occasional image checking by humans. Existing image compression methods for humans and machines meet these requirements to some extent, but they are effective only for specific image recognition models. We propose a learning-based scalable image coding method for humans and machines that is compatible with numerous image recognition models. We combine an image compression model for machines with a second compression model that provides the additional information needed to decode images for humans. The features of these compression models are fused by a feature fusion network to achieve efficient image compression. The additional-information compression model is adjusted to reduce its parameter count by allowing features of different sizes to be combined in the feature fusion network. Our experiments confirm that the feature fusion network combines the image compression models efficiently while reducing the number of parameters. Furthermore, we demonstrate the effectiveness of the proposed scalable coding method by evaluating compression performance in terms of decoded image quality and bitrate.
https://arxiv.org/abs/2405.09152
Content-adaptive compression is crucial for enhancing the adaptability of a pre-trained neural codec to diverse content. Although such methods have proven very practical in neural image compression (NIC), their application to neural video compression (NVC) remains limited for two main reasons: 1) video compression relies heavily on temporal redundancy, so updating just one or a few frames can cause significant errors to accumulate over time; 2) NVC frameworks are generally more complex, with many large components that are not easy to update quickly during encoding. To address these challenges, we develop a content-adaptive NVC technique called Group-aware Parameter-Efficient Updating (GPU). First, to minimize error accumulation, we adopt a group-aware approach for updating encoder parameters: a patch-based Group of Pictures (GoP) training strategy segments a video into patch-based GoPs, which are updated to yield a globally optimized, domain-transferable solution. Second, we introduce a parameter-efficient delta-tuning strategy that integrates several lightweight adapters into each coding component, in both serial and parallel configurations. These architecture-agnostic modules stimulate the components with large parameters, thereby reducing both the update cost and the encoding time. We incorporate GPU into the latest NVC framework and conduct comprehensive experiments, whose results showcase outstanding video compression efficiency across four video benchmarks and adaptability on a medical image benchmark.
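The delta-tuning idea can be made concrete with a small sketch: a large frozen coding component is wrapped by a low-parameter bottleneck adapter, in either a parallel or a serial configuration. All sizes, the ReLU bottleneck, and the zero-initialized up-projection are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def frozen_layer(x, W):
    """Stand-in for a large frozen coding component (W is never updated)."""
    return x @ W

def adapter(x, down, up):
    """Bottleneck adapter: project down, nonlinearity, project up (few params)."""
    return np.maximum(x @ down, 0.0) @ up

# Hypothetical sizes: one 512x512 frozen layer vs. rank-8 adapter matrices.
rng = np.random.default_rng(0)
d, r = 512, 8
W = rng.standard_normal((d, d)) * 0.01                 # frozen: d*d params
down = rng.standard_normal((d, r)) * 0.01              # trainable
up = np.zeros((r, d))                                  # zero-init: no-op at start

x = rng.standard_normal((1, d))
# Parallel configuration: adapter output added alongside the frozen layer.
y_parallel = frozen_layer(x, W) + adapter(x, down, up)
# Serial configuration: adapter refines the frozen layer's output.
y_serial = frozen_layer(x, W) + adapter(frozen_layer(x, W), down, up)

trainable = down.size + up.size
print(f"trainable params: {trainable} ({trainable / W.size:.2%} of the frozen layer)")
```

With these toy sizes, only about 3% of the layer's parameters need to be updated per content, which is the kind of saving that makes per-video encoder updating affordable.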
https://arxiv.org/abs/2405.04274
In lossy image compression, the objective is to minimize signal distortion while compressing images to a specified bit rate. The increasing demand for visual analysis applications, particularly classification tasks, has emphasized the importance of considering semantic distortion in compressed images. To bridge the gap between image compression and visual analysis, we propose a Rate-Distortion-Classification (RDC) model for lossy image compression, offering a unified framework to optimize the trade-off between rate, distortion, and classification accuracy. The RDC model is analyzed both statistically, on a multi-distribution source, and experimentally, on the widely used MNIST dataset. The findings reveal that, under certain conditions, the RDC model exhibits desirable properties, including monotonically non-increasing and convex behavior. This work provides insights into the development of human-machine-friendly compression methods and Video Coding for Machines (VCM) approaches, paving the way for end-to-end image compression techniques in real-world applications.
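One common way to operationalize a three-way trade-off like this is a weighted objective of the form L = R + λ_d·D + λ_c·C; the sketch below uses that form with hypothetical weights and operating points purely for illustration (the paper's exact formulation may differ).

```python
# Hedged sketch: a rate-distortion-classification trade-off as a weighted sum
# L = R + lam_d * D + lam_c * C, where R is the bitrate (bpp), D a signal
# distortion such as MSE, and C a classification loss such as cross-entropy.
# The weights and operating points below are illustrative, not from the paper.

def rdc_objective(rate_bpp, mse, cls_loss, lam_d=0.01, lam_c=0.1):
    return rate_bpp + lam_d * mse + lam_c * cls_loss

# Two hypothetical operating points: heavier compression lowers the rate but
# raises both distortion and classification loss.
low_rate = rdc_objective(rate_bpp=0.1, mse=120.0, cls_loss=0.9)
high_rate = rdc_objective(rate_bpp=0.8, mse=20.0, cls_loss=0.2)
print(low_rate, high_rate)
```

Sweeping λ_d and λ_c traces out the rate-distortion-classification surface, which is where the monotonicity and convexity properties mentioned above would be observed.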
https://arxiv.org/abs/2405.03500
Transformer-based entropy models have gained prominence in recent years due to their superior ability to capture long-range dependencies in probability distribution estimation compared to convolution-based methods. However, previous transformer-based entropy models suffer from a sluggish coding process due to pixel-wise autoregression or duplicated computation during inference. In this paper, we propose a novel transformer-based entropy model called GroupedMixer, which enjoys both faster coding speed and better compression performance than previous transformer-based methods. Specifically, our approach builds on group-wise autoregression: we first partition the latent variables into groups along the spatial and channel dimensions, and then entropy-code the groups with the proposed transformer-based entropy model. Global causal self-attention is decomposed into more efficient group-wise interactions, implemented with inner-group and cross-group token-mixers. The inner-group token-mixer incorporates contextual elements within a group, while the cross-group token-mixer interacts with previously decoded groups. Alternating the two token-mixers enables global contextual reference. To further expedite network inference, we introduce a context cache optimization that caches attention activations in the cross-group token-mixers, avoiding complex and duplicated computation. Experimental results demonstrate that GroupedMixer yields state-of-the-art rate-distortion performance with fast compression speed.
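The group-wise autoregression can be sketched as an ordered partition of the latent positions: channel chunks, each split spatially (a two-phase checkerboard here), coded one group after another with each group free to condition on all previously decoded ones. The chunk count and checkerboard pattern are illustrative choices, not GroupedMixer's exact configuration.

```python
# Hedged sketch: order latent positions (c, i, j) of a (C, H, W) tensor into
# groups along spatial-channel dimensions for group-wise autoregressive coding.
def make_groups(C, H, W, n_chunks=4):
    chunk = C // n_chunks
    groups = []
    for k in range(n_chunks):
        channels = range(k * chunk, (k + 1) * chunk)
        for phase in (0, 1):  # checkerboard: "even" then "odd" spatial positions
            idx = [(c, i, j) for c in channels
                   for i in range(H) for j in range(W)
                   if (i + j) % 2 == phase]
            groups.append(idx)
    return groups

groups = make_groups(C=8, H=4, W=4)
# Sanity check: every latent position belongs to exactly one group.
all_pos = sorted(p for g in groups for p in g)
assert all_pos == [(c, i, j) for c in range(8) for i in range(4) for j in range(4)]
print(len(groups), "groups, first group size:", len(groups[0]))
```

Decoding group g then only needs the entropy model to attend over groups 0..g-1 (the cross-group token-mixer) plus already-decoded positions inside group g (the inner-group token-mixer), which is what makes caching the cross-group attention activations worthwhile.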
https://arxiv.org/abs/2405.01170
Compressing images at extremely low bitrates (below 0.1 bits per pixel, bpp) is a significant challenge due to substantial information loss. Existing extreme image compression methods generally suffer from heavy compression artifacts or low-fidelity reconstructions. To address this problem, we propose a novel extreme image compression framework that combines compressive VAEs and pre-trained text-to-image diffusion models in an end-to-end manner. Specifically, we introduce a latent-feature-guided compression module based on compressive VAEs that compresses images and initially decodes the compressed information into content variables. To improve the alignment between the content variables and the diffusion space, we introduce external guidance to modulate intermediate feature maps. We then develop a conditional diffusion decoding module that leverages pre-trained diffusion models to further decode these content variables. To preserve the generative capability of the pre-trained diffusion models, we keep their parameters fixed and use a control module to inject content information. We also design a space alignment loss to provide sufficient constraints for the latent-feature-guided compression module. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches in both visual quality and image fidelity at extremely low bitrates.
https://arxiv.org/abs/2404.18820
In Learned Image Compression (LIC), a model is trained to encode and decode images sampled from a source domain, often outperforming traditional codecs on natural images; yet its performance may be far from optimal on images sampled from different domains. In this work, we tackle the problem of adapting a pre-trained model to multiple target domains by plugging into the decoder an adapter module for each of them, including the source domain. Each adapter improves decoder performance on a specific domain without making the model forget the images seen at training time. A gate network computes the weights that optimally blend the adapters' contributions when the bitstream is decoded. We experimentally validate our method on two state-of-the-art pre-trained models, observing improved rate-distortion efficiency on the target domains without penalties on the source domain. Furthermore, the gate's ability to find similarities with the learned target domains also enables better coding efficiency for images outside them.
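The gating step can be sketched as a softmax over per-domain gate logits, with the decoder feature corrected by the weighted sum of adapter outputs. The linear adapters, feature size, and gate values below are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

# Hedged sketch: blend per-domain adapter outputs with gate-computed weights.
def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
d, n_domains = 64, 3
feat = rng.standard_normal(d)                      # a decoder feature vector
adapters = [rng.standard_normal((d, d)) * 0.01     # one (toy, linear) adapter
            for _ in range(n_domains)]             # per domain, incl. the source
gate_logits = np.array([2.0, 0.1, -1.0])           # gate output for this image
w = softmax(gate_logits)                           # blending weights, sum to 1

blended = feat + sum(wk * (feat @ Ak) for wk, Ak in zip(w, adapters))
print("weights:", np.round(w, 3), "blended shape:", blended.shape)
```

Because the weights form a convex combination, an image resembling several learned domains gets a mixture of their adapters rather than a hard choice, which matches the observation that out-of-domain images also benefit.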
https://arxiv.org/abs/2404.15591
This paper investigates the challenging problem of learned image compression (LIC) at extremely low bitrates. Previous LIC methods that transmit quantized continuous features often yield blurry and noisy reconstructions due to severe quantization loss, while methods based on learned codebooks that discretize the visual space usually give poor-fidelity reconstructions because the limited codewords lack the representation power to capture faithful details. We propose a novel dual-stream framework, HybridFlow, which combines a continuous-feature-based stream and a codebook-based stream to achieve both high perceptual quality and high fidelity at extremely low bitrates. The codebook-based stream benefits from high-quality learned codebook priors, providing quality and clarity in the reconstructed images, while the continuous-feature stream aims to preserve fidelity details. To reach ultra-low bitrates, we further propose a masked token-based transformer: only a masked portion of the codeword indices is transmitted, and the missing indices are recovered through token generation guided by information from the continuous-feature stream. We also develop a bridging correction network that merges the two streams during pixel decoding for the final image reconstruction, where the continuous-stream features rectify biases of the codebook-based pixel decoder to impose faithful reconstruction details. Experimental results demonstrate superior performance across several datasets at extremely low bitrates, compared with existing single-stream codebook-based or continuous-feature-based LIC methods.
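The bitrate benefit of masked-index transmission is easy to see with simple accounting: under a fixed-length code, each index of a size-K codebook costs log2(K) bits, so dropping a fraction of indices shrinks the index payload proportionally. The codebook size, token grid, and mask ratio below are illustrative, not the paper's settings (and a real system would entropy-code the indices, lowering both numbers).

```python
import math

# Hedged sketch: raw index payload with and without masked-token transmission.
K = 1024                      # codebook size -> log2(1024) = 10 bits per index
tokens = 16 * 16              # token grid for one image
mask_ratio = 0.75             # fraction of indices regenerated at the decoder

bits_full = tokens * math.log2(K)
bits_sent = tokens * (1 - mask_ratio) * math.log2(K)
print(bits_full, bits_sent)   # 2560.0 640.0
```

The regenerated 75% of indices carries no bits; its information is instead supplied by the (cheap, low-dimensional) continuous-feature stream guiding the token generation.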
https://arxiv.org/abs/2404.13372
To reduce network traffic and support resource-limited environments, a method for transmitting images with a small amount of transmission data is required. Machine-learning-based image compression methods, which compress the data size of images while preserving their features, have been proposed. However, in certain situations, reconstructing only part of the semantic information of an image at the receiver may be sufficient. To realize this concept, semantic-information-based communication, called semantic communication, has been proposed, along with an image transmission method based on it. This method transmits only the semantic information of an image, and the receiver reconstructs the image with an image-generation model. Because it uses a single type of semantic information, however, reconstructing images similar to the original is challenging. This study proposes a multi-modal image transmission method that leverages diverse semantic information for efficient semantic communication. The proposed method extracts multi-modal semantic information from an image and transmits only that information. The receiver then generates multiple images with an image-generation model and selects an output based on semantic similarity. The receiver must make this selection based only on the received features, yet evaluating semantic similarity with conventional metrics is challenging. We therefore explore new metrics for evaluating the similarity between the semantic features of images and propose two scoring procedures. The results indicate that the proposed procedures can compare semantic similarities, such as position and composition, between the semantic features of the original and generated images. The proposed method can thus facilitate the transmission and utilization of photographs over mobile networks for various service applications.
https://arxiv.org/abs/2404.11280
The burgeoning volume of digital content across diverse modalities necessitates efficient storage and retrieval methods, and conventional approaches struggle to cope with the escalating complexity and scale of multimedia data. In this paper, our proposed framework addresses this challenge by fusing AI-native multi-modal search capabilities with neural image compression. We first analyze the intricate relationship between compressibility and searchability, recognizing the pivotal role each plays in the efficiency of storage and retrieval systems. A simple adapter then bridges the features of Learned Image Compression (LIC) and Contrastive Language-Image Pretraining (CLIP) while retaining semantic fidelity and enabling retrieval of multi-modal data. Experimental evaluations on the Kodak dataset demonstrate the efficacy of our approach, showcasing significant enhancements in compression efficiency and search accuracy compared to existing methodologies. Our work marks a significant advance towards scalable and efficient multi-modal search systems in the era of big data.
https://arxiv.org/abs/2404.10234
Incorporating diffusion models in the image compression domain has the potential to produce realistic and detailed reconstructions, especially at extremely low bitrates. Previous methods focus on using diffusion models as expressive decoders that are robust to quantization errors in the conditioning signals; yet achieving competitive results in this manner requires costly training of the diffusion model and long inference times due to the iterative generative process. In this work, we formulate the removal of quantization error as a denoising task, using diffusion to recover information lost in the transmitted image latent. Our approach allows us to run less than 10% of the full diffusion generative process and requires no architectural changes to the diffusion model, enabling the use of foundation models as a strong prior without additional fine-tuning of the backbone. Our proposed codec outperforms previous methods on quantitative realism metrics, and we verify that our reconstructions are qualitatively preferred by end users, even when other methods use twice the bitrate.
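The core idea, treating quantization error as noise and starting the reverse process from the quantized latent rather than from pure noise, can be sketched with a toy example. The "denoiser" below is a plain shrinkage toward the clean latent, standing in for the real score-based update; everything here (schedule length, step rule, latent size) is a hypothetical illustration, not the paper's method.

```python
import numpy as np

# Hedged toy sketch: run only the last ~10% of a denoising schedule, starting
# from the quantized latent instead of pure noise.
rng = np.random.default_rng(0)
z_clean = rng.standard_normal(32)            # "original" latent
z_quant = np.round(z_clean * 4) / 4          # uniform quantization (step 0.25)

T, start_frac = 100, 0.1                     # <10% of the full schedule
n_steps = int(T * start_frac)
z = z_quant.copy()
for _ in range(n_steps):
    z = z + 0.3 * (z_clean - z)              # toy stand-in for a denoising step

err_before = np.abs(z_quant - z_clean).mean()
err_after = np.abs(z - z_clean).mean()
print(err_before, err_after)
```

The point of the sketch is the schedule, not the update rule: because the quantized latent is already close to the clean one, only the final low-noise steps of the diffusion process are needed, which is where the claimed inference-time saving comes from.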
https://arxiv.org/abs/2404.08580
Artificial intelligence (AI) and autonomous edge computing in space are emerging areas of interest for augmenting the capabilities of nanosatellites, where modern sensors generate orders of magnitude more data than can typically be transmitted to mission control. Here, we present the hardware and software design of an onboard AI subsystem hosted on SpIRIT. The system is optimised for onboard computer vision experiments based on visible-light and long-wave infrared cameras. This paper highlights the key design choices made to maximise the robustness of the system in harsh space conditions, and their motivation relative to key mission requirements such as limited compute resources, resilience to cosmic radiation, extreme temperature variations, distribution shifts, and very low transmission bandwidths. The payload, called Loris, consists of six visible-light cameras, three infrared cameras, a camera control board, and a graphics processing unit (GPU) system-on-module. Loris enables the execution of AI models with on-orbit fine-tuning, as well as a next-generation image compression algorithm that includes progressive coding. This innovative approach not only enhances the data processing capabilities of nanosatellites but also lays the groundwork for broader applications of remote sensing from space.
https://arxiv.org/abs/2404.08399
Food image classification systems play a crucial role in health monitoring and diet tracking through image-based dietary assessment techniques. However, existing food recognition systems rely on static datasets characterized by a pre-defined, fixed number of food classes. This contrasts drastically with the reality of food consumption, which features constantly changing data. Food image classification systems should therefore adapt to and manage data that continuously evolves, which is where continual learning plays an important role. A challenge in continual learning is catastrophic forgetting, where ML models tend to discard old knowledge upon learning new information. While memory-replay algorithms have shown promise in mitigating this problem by storing old data as exemplars, they are hampered by the limited capacity of memory buffers, leading to an imbalance between new and previously learned data. To address this, our work explores the use of neural image compression to extend buffer size and enhance data diversity. We introduce the concept of continually learning a neural compression model to adaptively improve the quality of compressed data and optimize the bits per pixel (bpp) so that more exemplars can be stored. Our extensive experiments, including evaluations on the food-specific datasets Food-101 and VFN-74 as well as the general dataset ImageNet-100, demonstrate improvements in classification accuracy. This progress is pivotal in advancing more realistic food recognition systems that are capable of adapting to continually evolving data. Moreover, the principles and methodologies we have developed hold promise for broader applications, extending their benefits to other domains of continual machine learning systems.
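The buffer-capacity argument is simple arithmetic: at a fixed buffer size, lowering the stored bpp raises the number of exemplars roughly in inverse proportion. The buffer size, image resolution, and bpp values below are illustrative, not from the paper's experiments.

```python
# Hedged sketch: how many exemplars fit in a fixed replay buffer at a given bpp.
def exemplar_capacity(buffer_mb, h, w, bpp):
    bits_per_image = h * w * bpp
    buffer_bits = buffer_mb * 8 * 1024 * 1024
    return int(buffer_bits // bits_per_image)

raw = exemplar_capacity(buffer_mb=64, h=224, w=224, bpp=24)       # raw RGB
compressed = exemplar_capacity(buffer_mb=64, h=224, w=224, bpp=0.5)  # neural codec
print(raw, "raw exemplars vs", compressed, "compressed exemplars")
```

The trade-off the continually learned codec navigates is that lower bpp stores more exemplars but degrades each one, so the replay signal per exemplar weakens; optimizing bpp jointly with compression quality targets the sweet spot.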
https://arxiv.org/abs/2404.07507
This study addresses the challenge of controlling, without training or fine-tuning, the global color aspect of images generated with a diffusion model. We rewrite the guidance equations to ensure that the outputs are closer to a known color map without hindering generation quality, leading to new guidance equations. In the color guidance context, we show that the guidance scale should not decrease but should remain high throughout the diffusion process. As a second contribution, we apply our guidance in a compression framework, combining semantic and general color information about the image to decode images at low cost. We show that our method is effective at improving the fidelity and realism of compressed images at extremely low bit rates, compared to other classical or more semantically oriented approaches.
https://arxiv.org/abs/2404.06865
Image harmonization, which adjusts the foreground of a composite image to attain visual consistency with the background, can be conceptualized as an image-to-image translation task. Diffusion models have recently driven rapid progress in image-to-image translation. However, training diffusion models from scratch is computationally intensive, and fine-tuning pre-trained latent diffusion models entails dealing with the reconstruction error induced by the image compression autoencoder, making them unsuitable for image generation tasks that involve pixel-level evaluation metrics. To deal with these issues, we first adapt a pre-trained latent diffusion model to the image harmonization task to generate harmonious but potentially blurry initial images. We then apply two strategies, using higher-resolution images during inference and adding a refinement stage, to further enhance the clarity of the initially harmonized images. Extensive experiments on the iHarmony4 dataset demonstrate the superiority of our proposed method. The code and model will be made publicly available at this https URL.
https://arxiv.org/abs/2404.06139
Images produced by diffusion models can attain excellent perceptual quality. However, it is challenging for diffusion models to guarantee distortion levels, so the integration of diffusion models and image compression models still calls for more comprehensive exploration. This paper presents a diffusion-based image compression method that employs a privileged end-to-end decoder model as a correction, achieving better perceptual quality while guaranteeing the distortion to an extent. We build a diffusion model and design a novel paradigm that combines it with an end-to-end decoder, the latter being responsible for transmitting the privileged information extracted at the encoder side. Specifically, we theoretically analyze the reconstruction process of diffusion models at the encoder side, where the original images are visible. Based on this analysis, we introduce an end-to-end convolutional decoder that provides a better approximation of the score function $\nabla_{\mathbf{x}_t}\log p(\mathbf{x}_t)$ at the encoder side and effectively transmits the combination. Experiments demonstrate the superiority of our method in both distortion and perception compared with previous perceptual compression methods.
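For context, the standard DDPM reverse update can be written in score form, which makes clear where a corrected score estimate would enter; the correction term $\Delta_\theta$ below is our hedged reading of the privileged decoder's role, not the paper's exact formulation:

```latex
% Standard reverse-time update with a corrected score estimate.
% s_\phi is the diffusion model's score network; \Delta_\theta denotes the
% correction supplied by the privileged end-to-end decoder (our hedged reading).
\mathbf{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}}\Big( \mathbf{x}_t + (1-\alpha_t)\,
  \underbrace{\big(s_\phi(\mathbf{x}_t, t) + \Delta_\theta(\mathbf{x}_t)\big)}_{\approx\, \nabla_{\mathbf{x}_t}\log p(\mathbf{x}_t)} \Big)
  + \sigma_t\,\boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})
```

Since the encoder sees the original image, it can evaluate how far $s_\phi$ is from the true score along the reconstruction trajectory and transmit a compact correction, which is the "privileged information" referred to above.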
https://arxiv.org/abs/2404.04916
In recent years, the large-scale adoption of cloud storage solutions has revolutionized the way we think about digital data storage. However, the exponential increase in data volume, especially images, has raised environmental concerns regarding power and resource consumption, as well as the growing digital carbon footprint. This research proposes a methodology for cloud-based image storage that integrates image compression with Super-Resolution Generative Adversarial Networks (SRGAN). Rather than storing images in their original format directly on the cloud, our approach first reduces the image size through compression and downsizing before storage. Upon request, these compressed images are retrieved and processed by SRGAN to regenerate the images. The efficacy of the proposed method is evaluated in terms of PSNR and SSIM metrics. Additionally, a mathematical analysis is given to calculate power consumption and carbon footprint assessment. The proposed data compression technique provides a significant solution for achieving a reasonable trade-off between environmental sustainability and industrial efficiency.
https://arxiv.org/abs/2404.04642
This paper presents a SYCL implementation of Multi-Layer Perceptrons (MLPs) targeted at and optimized for the Intel Data Center GPU Max 1550. To increase performance, our implementation minimizes slow global memory accesses by maximizing data reuse within the general register file and the shared local memory, fusing the operations in each layer of the MLP. We show with a simple roofline model that this results in a significant increase in arithmetic intensity, leading to improved performance, especially for inference. We compare our approach to a similar CUDA implementation of MLPs and show that our implementation on the Intel Data Center GPU outperforms the CUDA implementation on Nvidia's H100 GPU by a factor of up to 2.84 in inference and 1.75 in training. The paper also showcases the efficiency of our SYCL implementation in three significant areas: image compression, Neural Radiance Fields, and physics-informed machine learning. In all cases, our implementation outperforms the off-the-shelf Intel Extension for PyTorch (IPEX) implementation on the same Intel GPU by up to a factor of 30, and the CUDA PyTorch version on Nvidia's H100 GPU by up to a factor of 19. The code can be found at this https URL.
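The roofline argument can be sketched numerically: fusing layers keeps intermediate activations in registers or shared local memory, so global traffic reduces to weights plus network input/output, raising arithmetic intensity (FLOPs per byte). The batch size, layer width, and fp16 assumption below are illustrative, not the paper's benchmark configuration.

```python
# Hedged sketch: arithmetic intensity of a small MLP, fused vs. unfused layers.
def arithmetic_intensity(batch, width, layers, fused, bytes_per=2):  # fp16
    flops = 2 * batch * width * width * layers          # GEMM multiply-adds
    weights = layers * width * width * bytes_per        # read once from global
    io = 2 * batch * width * bytes_per                  # network input + output
    if fused:
        traffic = weights + io                          # activations stay on-chip
    else:
        # Unfused: each intermediate activation is written out and read back.
        traffic = weights + io + 2 * (layers - 1) * batch * width * bytes_per
    return flops / traffic

ai_fused = arithmetic_intensity(batch=2**17, width=64, layers=4, fused=True)
ai_unfused = arithmetic_intensity(batch=2**17, width=64, layers=4, fused=False)
print(f"fused: {ai_fused:.1f} FLOP/B, unfused: {ai_unfused:.1f} FLOP/B")
```

For narrow MLPs the weights are tiny, so the round-tripped activations dominate the unfused traffic; fusing them away moves the kernel several times higher on the roofline, which is the effect the paper exploits.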
https://arxiv.org/abs/2403.17607
While replacing Gaussian decoders with a conditional diffusion model enhances the perceptual quality of reconstructions in neural image compression, the diffusion model's lack of inductive bias for image data restricts its ability to achieve state-of-the-art perceptual levels. To address this limitation, we adopt a non-isotropic diffusion model at the decoder side. This model imposes an inductive bias aimed at distinguishing between frequency contents, thereby facilitating the generation of high-quality images. Moreover, our framework is equipped with a novel entropy model that accurately captures the probability distribution of the latent representation by exploiting spatio-channel correlations in latent space while accelerating the entropy decoding step. This channel-wise entropy model leverages both local and global spatial context within each channel chunk. The global spatial context is built upon a Transformer specifically designed for image compression tasks, which employs a Laplacian-shaped positional encoding whose learnable parameters are adaptively adjusted for each channel cluster. Our experiments demonstrate that the proposed framework yields better perceptual quality than cutting-edge generative codecs, and that the proposed entropy model contributes notable bitrate savings.
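A plausible reading of "Laplacian-shaped positional encoding" is a relative positional bias of the form exp(-|i-j|/b), with the scale b learnable (here per channel cluster, reduced to a scalar). This sketch is our interpretation for illustration, not the paper's implementation.

```python
import numpy as np

# Hedged sketch: Laplacian-shaped relative positional bias for attention.
def laplacian_bias(n, scale):
    idx = np.arange(n)
    return np.exp(-np.abs(idx[:, None] - idx[None, :]) / scale)

bias = laplacian_bias(6, scale=2.0)   # scale stands in for a learnable parameter
# Peaks at distance 0 (the diagonal) and decays with |i - j|.
assert np.allclose(np.diag(bias), 1.0)
assert bias[0, 5] < bias[0, 1]
print(bias.shape)
```

Adding such a bias to the attention logits makes nearby latent positions attend to each other more strongly by default, while the learnable scale lets each channel cluster choose how sharply that locality prior decays.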
https://arxiv.org/abs/2403.16258
Image compression and denoising are fundamental challenges in image processing with many real-world applications. To address practical demands, current solutions follow one of two main strategies: 1) sequential methods and 2) joint methods. However, sequential methods suffer from error accumulation, as information is lost between the multiple individual models. Recently, the academic community has begun to tackle this problem with end-to-end joint methods, but most of them ignore that different regions of a noisy image have different characteristics. To solve these problems, our proposed signal-to-noise-ratio (SNR) aware joint solution exploits local and non-local features for image compression and denoising simultaneously. We design an end-to-end trainable network comprising a main encoder branch, a guidance branch, and an SNR-aware branch. Extensive experiments on both synthetic and real-world datasets demonstrate that our joint solution outperforms existing state-of-the-art methods.
https://arxiv.org/abs/2403.14135