Internal Language Model (LM)-based methods use permutation language modeling (PLM) to address the error-correction problem caused by the conditional-independence assumption in external LM-based methods. However, the manually imposed random permutations cause fitting oscillations during model training, and the Iterative Refinement (IR) operation used to improve multimodal information decoupling introduces additional overhead. To address these issues, this paper proposes the Hierarchical Attention autoregressive Model with Adaptive Permutation (HAAP) to enhance location-context-image interaction and improve the generalization of the internal-LM autoregressive model. First, we propose Implicit Permutation Neurons (IPN) to generate adaptive attention masks that dynamically exploit token dependencies. The adaptive masks increase the diversity of the training data and prevent the model from depending on a specific order, reducing the training overhead of PLM while avoiding fitting oscillations. Second, we develop a Cross-modal Hierarchical Attention mechanism (CHA) to couple context and image features. This design establishes rich positional-semantic dependencies between context and image while avoiding IR. Extensive experiments show that the proposed HAAP achieves state-of-the-art (SOTA) performance in accuracy, complexity, and latency on several datasets.
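For illustration, the sketch below (PyTorch, assumed) builds a standard permutation-language-modeling attention mask from a single sampled factorization order; HAAP's Implicit Permutation Neurons are meant to produce such masks adaptively rather than from a hand-sampled permutation, and are not reproduced here.

```python
import torch

def permutation_attention_mask(perm: torch.Tensor) -> torch.Tensor:
    """Boolean (T, T) mask for one factorization order: position q may attend
    to position k only if k is decoded strictly earlier in the permutation."""
    T = perm.numel()
    rank = torch.empty(T, dtype=torch.long)
    rank[perm] = torch.arange(T)          # decoding step of each position
    return rank.unsqueeze(1) > rank.unsqueeze(0)

# Vanilla PLM samples a fresh random order per training step; IPN would
# instead generate the mask from learned neurons conditioned on the input.
mask = permutation_attention_mask(torch.randperm(6))
```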
https://arxiv.org/abs/2405.09125
Significant progress has been made in scene text detection models since the rise of deep learning, but scene text layout analysis, which aims to group detected text instances as paragraphs, has not kept pace. Previous works either treated text detection and grouping with separate models, or trained a unified model from scratch; none of them made full use of already well-trained text detectors and easily obtainable detection datasets. In this paper, we present Text Grouping Adapter (TGA), a module that enables various pre-trained text detectors to learn layout analysis, allowing us to adopt a well-trained text detector right off the shelf or fine-tune it efficiently. Designed to be compatible with various text detector architectures, TGA takes detected text regions and image features as universal inputs to assemble text instance features. To capture broader contextual information for layout analysis, we propose to predict text group masks from text instance features by one-to-many assignment. Our comprehensive experiments demonstrate that, even with frozen pre-trained models, incorporating TGA into various pre-trained text detectors and text spotters achieves superior layout analysis performance while inheriting generalized text detection ability from pre-training. With full parameter fine-tuning, layout analysis performance improves further.
https://arxiv.org/abs/2405.07481
In text recognition, self-supervised pre-training has emerged as a good solution for reducing dependence on expensive annotated real data. Previous studies primarily focus on local visual representations by leveraging masked image modeling or sequence contrastive learning. However, they neglect to model the linguistic information in text images, which is crucial for recognizing text. To simultaneously capture local character features and linguistic information in visual space, we propose Symmetric Superimposition Modeling (SSM). The objective of SSM is to reconstruct direction-specific pixel and feature signals from the symmetrically superimposed input. Specifically, we add the original image to its inverted views to create the symmetrically superimposed inputs. At the pixel level, we reconstruct the original and inverted images to capture character shapes and texture-level linguistic context. At the feature level, we reconstruct the features of the same original and inverted images under different augmentations to model semantic-level linguistic context and local character discrimination. In our design, we disrupt the character shapes and linguistic rules. Consequently, the dual-level reconstruction facilitates understanding character shapes and linguistic information from the perspective of visual texture and feature semantics. Experiments on various text recognition benchmarks demonstrate the effectiveness and generality of SSM, with 4.1% average performance gains and an 86.6% new state-of-the-art average word accuracy on Union14M benchmarks.
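A minimal sketch of the superimposition step and the pixel-level objective, assuming a 180-degree flip as the "inverted view" and an L1 reconstruction loss; the feature-level branch and the actual encoder/decoder are omitted.

```python
import torch
import torch.nn.functional as F

def make_superimposed_input(img: torch.Tensor):
    """Average an image with its inverted view (assumed here to be a
    180-degree flip) to form the symmetrically superimposed input."""
    inverted = torch.flip(img, dims=(-2, -1))
    return 0.5 * (img + inverted), inverted

def pixel_level_loss(pred_orig: torch.Tensor, pred_inv: torch.Tensor,
                     img: torch.Tensor, inverted: torch.Tensor) -> torch.Tensor:
    """Direction-specific pixel targets: reconstruct both the original and the
    inverted image from the single superimposed input (L1 loss assumed)."""
    return F.l1_loss(pred_orig, img) + F.l1_loss(pred_inv, inverted)
```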
https://arxiv.org/abs/2405.05841
Recent advances in diffusion-based generative modeling have led to the development of text-to-video (T2V) models that can generate high-quality videos conditioned on a text prompt. Most of these T2V models produce single-scene video clips that depict an entity performing a particular action (e.g., `a red panda climbing a tree'). However, it is pertinent to generate multi-scene videos, since they are ubiquitous in the real world (e.g., `a red panda climbing a tree' followed by `the red panda sleeps on the top of the tree'). To generate multi-scene videos from a pretrained T2V model, we introduce the Time-Aligned Captions (TALC) framework. Specifically, we enhance the text-conditioning mechanism in the T2V architecture to recognize the temporal alignment between video scenes and scene descriptions. For instance, we condition the visual features of the earlier and later scenes of the generated video on the representations of the first scene description (e.g., `a red panda climbing a tree') and the second scene description (e.g., `the red panda sleeps on the top of the tree'), respectively. As a result, we show that the T2V model can generate multi-scene videos that adhere to the multi-scene text descriptions while remaining visually consistent (e.g., in entity and background). Further, we finetune the pretrained T2V model with multi-scene video-text data using the TALC framework. We show that the TALC-finetuned model outperforms the baseline methods by 15.5 points in the overall score, which averages visual consistency and text adherence using human evaluation. The project website is this https URL.
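The conditioning idea can be pictured as follows: each temporal chunk of the video is paired with its own scene description rather than one pooled caption for the whole video. The sketch below mean-pools each description for brevity, whereas the actual framework feeds full token sequences to cross-attention; the even chunking is also an assumption.

```python
import torch

def time_aligned_conditioning(num_frames: int,
                              scene_text_embs: list[torch.Tensor]) -> torch.Tensor:
    """Pair each temporal chunk of the video with its own scene description.

    scene_text_embs: one (L_i, D) token-embedding tensor per scene description.
    Returns a (num_frames, D) per-frame conditioning tensor (mean-pooled text).
    """
    num_scenes = len(scene_text_embs)
    dim = scene_text_embs[0].shape[-1]
    bounds = torch.linspace(0, num_frames, num_scenes + 1).round().long().tolist()
    cond = torch.zeros(num_frames, dim)
    for i, emb in enumerate(scene_text_embs):
        cond[bounds[i]:bounds[i + 1]] = emb.mean(dim=0)  # scene i's description
    return cond

# e.g. two scenes over 16 frames: frames 0-7 see scene 1's text, 8-15 scene 2's.
```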
https://arxiv.org/abs/2405.04682
Scene text images contain not only style information (font, background) but also content information (character, texture). Different scene text tasks need different information, but previous representation learning methods use tightly coupled features for all tasks, resulting in sub-optimal performance. We propose a Disentangled Representation Learning framework (DARLING) aimed at disentangling these two types of features for improved adaptability in addressing various downstream tasks (choose what you really need). Specifically, we synthesize a dataset of image pairs with identical style but different content. Based on the dataset, we decouple the two types of features through the supervision design. Concretely, we directly split the visual representation into style and content features; the content features are supervised by a text recognition loss, while an alignment loss aligns the style features within each image pair. Then, the style features are employed to reconstruct the counterpart image via an image decoder with a prompt that indicates the counterpart's content. Such an operation effectively decouples the features based on their distinctive properties. To the best of our knowledge, this is the first work in the field of scene text to disentangle the inherent properties of text images. Our method achieves state-of-the-art performance in Scene Text Recognition, Removal, and Editing.
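One hedged reading of the supervision design, as a sketch: the visual feature is split into style and content parts, the content part is trained with a recognition loss, and the style parts of a same-style pair are aligned. The split point and the concrete loss functions below are illustrative assumptions, not the paper's exact formulation.

```python
import torch.nn.functional as F

def disentangling_losses(feat_a, feat_b, content_logits_a, content_logits_b,
                         labels_a, labels_b, style_dim: int):
    """feat_a / feat_b: (N, D) features of an image pair with identical style
    but different content; the first `style_dim` channels are treated as style."""
    style_a, style_b = feat_a[..., :style_dim], feat_b[..., :style_dim]
    # Content branch: the recognizer head's logits are supervised with a
    # text recognition loss (plain cross-entropy stands in here).
    recognition = F.cross_entropy(content_logits_a, labels_a) + \
                  F.cross_entropy(content_logits_b, labels_b)
    # Style branch: align the style features of the same-style pair.
    alignment = F.mse_loss(style_a, style_b)
    return recognition + alignment
```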
https://arxiv.org/abs/2405.04377
Text spotting, a task involving the extraction of textual information from image or video sequences, faces challenges in cross-domain adaptation, such as image-to-image and image-to-video generalization. In this paper, we introduce a new method, termed VimTS, which enhances the generalization ability of the model by achieving better synergy among different tasks. Specifically, we propose a Prompt Queries Generation Module and a Tasks-aware Adapter to effectively convert the original single-task model into a multi-task model suitable for both image and video scenarios with minimal additional parameters. The Prompt Queries Generation Module facilitates explicit interaction between different tasks, while the Tasks-aware Adapter helps the model dynamically learn suitable features for each task. Additionally, to further enable the model to learn temporal information at a lower cost, we propose a synthetic video text dataset (VTD-368k) built by leveraging the Content Deformation Fields (CoDeF) algorithm. Notably, our method outperforms the state-of-the-art method by an average of 2.6% on six cross-domain benchmarks such as TT-to-IC15, CTW1500-to-TT, and TT-to-CTW1500. For video-level cross-domain adaptation, our method even surpasses the previous end-to-end video spotting method on ICDAR2015 video and DSText v2 by an average of 5.5% on the MOTA metric, using only image-level data. We further demonstrate that existing Large Multimodal Models exhibit limitations in cross-domain scene text spotting, in contrast to our VimTS model, which requires significantly fewer parameters and less data. The code and datasets will be made available at this https URL.
https://arxiv.org/abs/2404.19652
Bottom-up text detection methods play an important role in arbitrary-shape scene text detection, but there are two restrictions preventing them from achieving their great potential, i.e., 1) the accumulation of false text segment detections, which affects subsequent processing, and 2) the difficulty of building reliable connections between text segments. Targeting these two problems, we propose a novel approach, named "MorphText", to capture the regularity of texts by embedding deep morphology for arbitrary-shape text detection. Towards this end, two deep morphological modules are designed to regularize text segments and determine the linkage between them. First, a Deep Morphological Opening (DMOP) module is constructed to remove false text segment detections generated in the feature extraction process. Then, a Deep Morphological Closing (DMCL) module is proposed to allow text instances of various shapes to stretch their morphology along their most significant orientation while deriving their connections. Extensive experiments conducted on four challenging benchmark datasets (CTW1500, Total-Text, MSRA-TD500 and ICDAR2017) demonstrate that our proposed MorphText outperforms both top-down and bottom-up state-of-the-art arbitrary-shape scene text detection approaches.
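Deep morphology is often approximated with max-pooling (dilation) and its dual (erosion); the sketch below shows generic soft opening and closing on a score map in that spirit. It is not the paper's learned DMOP/DMCL modules, which learn their structuring elements end to end.

```python
import torch
import torch.nn.functional as F

def dilate(x: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Grayscale dilation of a (N, C, H, W) map via max-pooling."""
    return F.max_pool2d(x, kernel_size=k, stride=1, padding=k // 2)

def erode(x: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Grayscale erosion as the dual of dilation."""
    return -F.max_pool2d(-x, kernel_size=k, stride=1, padding=k // 2)

def opening(x, k=3):   # erosion then dilation: suppresses small false responses
    return dilate(erode(x, k), k)

def closing(x, k=3):   # dilation then erosion: bridges gaps along text segments
    return erode(dilate(x, k), k)
```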
https://arxiv.org/abs/2404.17151
Text2Motion aims to generate human motions from texts. Existing datasets rely on the assumption that texts include action labels (such as "walk, bend, and pick up"), which is not flexible for practical scenarios. This paper redefines this problem with a more realistic assumption that the texts are arbitrary. Specifically, arbitrary texts include existing action texts composed of action labels (e.g., "A person walks and bends to pick up something") as well as newly introduced scene texts without explicit action labels (e.g., "A person notices his wallet on the ground ahead"). To bridge the gap between this realistic setting and existing datasets, we expand the action texts in the HumanML3D dataset with more scene texts, thereby creating a new HumanML3D++ dataset containing arbitrary texts. On this challenging dataset, we benchmark existing state-of-the-art methods and propose a novel two-stage framework that extracts action labels from arbitrary texts with a Large Language Model (LLM) and then generates motions from the action labels. Extensive experiments are conducted under different application scenarios to validate the effectiveness of the proposed framework on both the existing and the proposed datasets. The results indicate that Text2Motion in this realistic setting is very challenging, fostering new research in this practical direction. Our dataset and code will be released.
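The first stage can be pictured as prompting a language model to turn an arbitrary (possibly scene-level) text into explicit action labels for the downstream motion generator; the prompt wording and the `llm` callable below are placeholders, not the paper's exact setup.

```python
def extract_action_labels(scene_text: str, llm) -> list[str]:
    """Stage 1: ask a language model to infer explicit action labels from an
    arbitrary text; Stage 2 (not shown) generates motion from these labels."""
    prompt = (
        "Infer the physical actions a person would perform in this situation, "
        "as a comma-separated list of short action labels.\n"
        f"Situation: {scene_text}\nActions:"
    )
    response = llm(prompt)  # any text-completion callable
    return [label.strip() for label in response.split(",") if label.strip()]

# e.g. extract_action_labels("A person notices his wallet on the ground ahead", llm)
# might yield ["walk", "bend", "pick up"]
```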
https://arxiv.org/abs/2404.14745
Extremely low-light text images are common in natural scenes, making scene text detection and recognition challenging. One solution is to enhance these images using low-light image enhancement methods before text extraction. However, previous methods rarely address the significance of low-level features, which are crucial for optimal performance on downstream scene text tasks. Further research is also hindered by the lack of extremely low-light text datasets. To address these limitations, we propose a novel encoder-decoder framework with an edge-aware attention module to focus on scene text regions during enhancement. Our proposed method uses novel text detection and edge reconstruction losses to emphasize low-level scene text features, leading to successful text extraction. Additionally, we present a Supervised Deep Curve Estimation (Supervised-DCE) model to synthesize extremely low-light images based on publicly available scene text datasets such as ICDAR15 (IC15). We also labeled texts in the extremely low-light See In the Dark (SID) and ordinary LOw-Light (LOL) datasets to allow for objective assessment of extremely low-light image enhancement through scene text tasks. Extensive experiments show that our model outperforms state-of-the-art methods in terms of both image quality and scene text metrics on the widely used LOL, SID, and synthetic IC15 datasets. Code and dataset will be released publicly at this https URL.
https://arxiv.org/abs/2404.14135
In this paper, we present a method for enhancing the accuracy of scene text recognition by judging whether an image and a text match each other. While previous studies focused on generating recognition results from input images, our approach also considers the model's misrecognition results to understand its error tendencies, thus improving the text recognition pipeline. This method boosts text recognition accuracy by providing explicit feedback on the data that the model is likely to misrecognize, through predicting whether the image and the text match. Experimental results on publicly available datasets demonstrate that our proposed method outperforms the baseline and state-of-the-art methods in scene text recognition.
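One way to read the idea is as a verifier trained on correct and mis-recognized image-text pairs that scores whether a transcription matches the image; the sketch below shows how such a matcher could rerank candidate transcriptions. The encoder and match-head interfaces are placeholders.

```python
import torch

@torch.no_grad()
def rerank_by_match(image_feat: torch.Tensor, candidates: list[str],
                    text_encoder, match_head) -> str:
    """Score each (image, candidate text) pair with a binary match head and
    keep the candidate judged most likely to be correct."""
    scores = []
    for text in candidates:
        text_feat = text_encoder(text)                      # (D,)
        pair = torch.cat([image_feat, text_feat], dim=-1)   # (2D,)
        scores.append(match_head(pair).item())              # P(match)
    return candidates[int(torch.tensor(scores).argmax())]
```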
https://arxiv.org/abs/2404.05967
This paper presents a simple yet efficient ensemble learning framework for Vietnamese scene text spotting. Leveraging the power of ensemble learning, which combines multiple models to yield more accurate predictions, our approach aims to significantly enhance the performance of scene text spotting in challenging urban settings. Through experimental evaluations on the VinText dataset, our proposed method achieves a significant improvement over existing methods, with an impressive gain of 5% in accuracy. These results unequivocally demonstrate the efficacy of ensemble learning in the context of Vietnamese scene text spotting in urban environments, highlighting its potential for real-world applications, such as text detection and recognition in urban signage, advertisements, and various text-rich urban scenes.
https://arxiv.org/abs/2404.00852
Scene text image super-resolution has significantly improved the accuracy of scene text recognition. However, many existing methods emphasize performance over efficiency and ignore the practical need for lightweight solutions in deployment scenarios. Faced with these issues, our work proposes an efficient framework called SGENet to facilitate deployment on resource-limited platforms. SGENet contains two branches: a super-resolution branch and a semantic guidance branch. We apply a lightweight pre-trained recognizer as a semantic extractor to enhance the understanding of text information. Meanwhile, we design a visual-semantic alignment module to achieve bidirectional alignment between image features and semantics, resulting in the generation of high-quality prior guidance. We conduct extensive experiments on benchmark datasets, and the proposed SGENet achieves excellent performance with lower computational cost. Code is available at this https URL
https://arxiv.org/abs/2403.13330
Road surface reconstruction plays a vital role in autonomous driving systems, enabling road lane perception and high-precision mapping. Recently, neural implicit encoding has achieved remarkable results in scene representation, particularly in the realistic rendering of scene textures. However, it faces challenges in directly representing geometric information for large-scale scenes. To address this, we propose EMIE-MAP, a novel method for large-scale road surface reconstruction based on explicit mesh and implicit encoding. The road geometry is represented using explicit mesh, where each vertex stores implicit encoding representing the color and semantic information. To overcome the difficulty in optimizing road elevation, we introduce a trajectory-based elevation initialization and an elevation residual learning method based on Multi-Layer Perceptron (MLP). Additionally, by employing implicit encoding and multi-camera color MLPs decoding, we achieve separate modeling of scene physical properties and camera characteristics, allowing surround-view reconstruction compatible with different camera models. Our method achieves remarkable road surface reconstruction performance in a variety of real-world challenging scenarios.
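The elevation scheme can be sketched as a coarse elevation initialized from the vehicle trajectory plus a residual predicted by an MLP; the network size and the assumption that the trajectory-based initialization is precomputed per vertex are illustrative simplifications.

```python
import torch
import torch.nn as nn

class ElevationResidualMLP(nn.Module):
    """z(x, y) = trajectory-based initialization + learned residual."""
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, xy: torch.Tensor, z_init: torch.Tensor) -> torch.Tensor:
        # xy: (N, 2) mesh-vertex coordinates; z_init: (N,) elevation
        # interpolated from nearby trajectory poses.
        return z_init + self.mlp(xy).squeeze(-1)
```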
https://arxiv.org/abs/2403.11789
Existing scene text spotters are designed to locate and transcribe texts from images. However, it is challenging for a spotter to achieve precise detection and recognition of scene texts simultaneously. Inspired by the glimpse-focus spotting pipeline of human beings and impressive performances of Pre-trained Language Models (PLMs) on visual tasks, we ask: 1) "Can machines spot texts without precise detection just like human beings?", and if yes, 2) "Is text block another alternative for scene text spotting other than word or character?" To this end, our proposed scene text spotter leverages advanced PLMs to enhance performance without fine-grained detection. Specifically, we first use a simple detector for block-level text detection to obtain rough positional information. Then, we finetune a PLM using a large-scale OCR dataset to achieve accurate recognition. Benefiting from the comprehensive language knowledge gained during the pre-training phase, the PLM-based recognition module effectively handles complex scenarios, including multi-line, reversed, occluded, and incomplete-detection texts. Taking advantage of the fine-tuned language model on scene recognition benchmarks and the paradigm of text block detection, extensive experiments demonstrate the superior performance of our scene text spotter across multiple public benchmarks. Additionally, we attempt to spot texts directly from an entire scene image to demonstrate the potential of PLMs, even Large Language Models (LLMs).
https://arxiv.org/abs/2403.10047
Visual text rendering poses a fundamental challenge for contemporary text-to-image generation models, with the core problem lying in text encoder deficiencies. To achieve accurate text rendering, we identify two crucial requirements for text encoders: character awareness and alignment with glyphs. Our solution involves crafting a series of customized text encoders, Glyph-ByT5, by fine-tuning the character-aware ByT5 encoder using a meticulously curated paired glyph-text dataset. We present an effective method for integrating Glyph-ByT5 with SDXL, resulting in the Glyph-SDXL model for design image generation. This significantly enhances text rendering accuracy, improving it from less than 20% to nearly 90% on our design image benchmark. Noteworthy is Glyph-SDXL's newfound ability for text paragraph rendering, achieving high spelling accuracy for tens to hundreds of characters with automated multi-line layouts. Finally, by fine-tuning Glyph-SDXL with a small set of high-quality, photorealistic images featuring visual text, we showcase a substantial improvement in scene text rendering capabilities in open-domain real images. These compelling outcomes aim to encourage further exploration in designing customized text encoders for diverse and challenging tasks.
https://arxiv.org/abs/2403.09622
Scene-Text Visual Question Answering (ST-VQA) aims to understand scene text in images and answer questions related to the text content. Most existing methods heavily rely on the accuracy of Optical Character Recognition (OCR) systems, and aggressive fine-tuning based on limited spatial location information and erroneous OCR text information often leads to inevitable overfitting. In this paper, we propose a multimodal adversarial training architecture with spatial awareness capabilities. Specifically, we introduce an Adversarial OCR Enhancement (AOE) module, which leverages adversarial training in the embedding space of the OCR modality to enhance fault-tolerant representations of OCR texts, thereby reducing noise caused by OCR errors. Simultaneously, we add a Spatial-Aware Self-Attention (SASA) mechanism to help the model better capture the spatial relationships among OCR tokens. Various experiments demonstrate that our method achieves significant performance improvements on both the ST-VQA and TextVQA datasets and provides a novel paradigm for multimodal adversarial training.
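The AOE idea can be illustrated with a one-step, FGSM-style perturbation applied to OCR token embeddings so that training also sees noise-corrupted representations; the attack form and step size here are assumptions, not the paper's exact procedure.

```python
import torch

def perturb_ocr_embeddings(ocr_emb: torch.Tensor, loss: torch.Tensor,
                           epsilon: float = 1e-3) -> torch.Tensor:
    """One-step adversarial perturbation in the OCR embedding space.

    ocr_emb must have requires_grad=True and `loss` must depend on it;
    training then also minimizes the loss on the perturbed embeddings."""
    grad, = torch.autograd.grad(loss, ocr_emb, retain_graph=True)
    return ocr_emb + epsilon * grad.sign()
```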
https://arxiv.org/abs/2403.09288
The importance of Scene Text Recognition (STR) in today's increasingly digital world cannot be overstated. Given this significance, data-intensive deep learning approaches that auto-learn feature mappings have primarily driven the development of STR solutions. Several benchmark datasets and substantial work on deep learning models are available for Latin languages to meet this need. For Indian languages, which are syntactically and semantically more complex and are spoken and read by 1.3 billion people, far fewer works and datasets are available. This paper aims to address the lack of a comprehensive dataset for Indian languages by proposing the largest and most comprehensive real dataset, IndicSTR12, and benchmarking STR performance on 12 major Indian languages. A few works have addressed the same issue, but to the best of our knowledge, they focused on a small number of Indian languages. The size and complexity of the proposed dataset are comparable to those of existing Latin contemporaries, while its multilingualism will catalyse the development of robust text detection and recognition models. It was created specifically for a group of related languages with different scripts. The dataset contains over 27,000 word-images gathered from various natural scenes, with over 1,000 word-images for each language. Unlike previous datasets, the images cover a broader range of realistic conditions, including blur, illumination changes, occlusion, non-iconic texts, low resolution, perspective text, etc. Along with the new dataset, we provide high-performing baselines with three models: PARSeq, CRNN, and STARNet.
https://arxiv.org/abs/2403.08007
Scene text recognition is an important and challenging task in computer vision. However, most prior works focus on recognizing pre-defined words, while there are various out-of-vocabulary (OOV) words in real-world applications. In this paper, we propose a novel open-vocabulary text recognition framework, Pseudo-OCR, to recognize OOV words. The key challenge in this task is the lack of OOV training data. To solve this problem, we first propose a pseudo label generation module that leverages character detection and image inpainting to produce substantial pseudo OOV training data from real-world images. Unlike previous synthetic data, our pseudo OOV data contains real characters and backgrounds to simulate real-world applications. Secondly, to reduce noises in pseudo data, we present a semantic checking mechanism to filter semantically meaningful data. Thirdly, we introduce a quality-aware margin loss to boost the training with pseudo data. Our loss includes a margin-based part to enhance the classification ability, and a quality-aware part to penalize low-quality samples in both real and pseudo data. Extensive experiments demonstrate that our approach outperforms the state-of-the-art on eight datasets and achieves the first rank in the ICDAR2022 challenge.
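A hedged reading of the quality-aware margin loss: a margin-based softmax term, weighted per sample by a quality score so that low-quality real or pseudo samples contribute less. The exact margin formulation and the source of the quality score are assumptions.

```python
import torch
import torch.nn.functional as F

def quality_aware_margin_loss(logits: torch.Tensor, targets: torch.Tensor,
                              quality: torch.Tensor, margin: float = 0.35):
    """logits: (N, C) per-character class scores; quality: (N,) in [0, 1]."""
    # Margin part: make the target class harder by subtracting a margin,
    # which sharpens the classification boundary.
    margined = logits.clone()
    margined[torch.arange(logits.size(0)), targets] -= margin
    per_sample = F.cross_entropy(margined, targets, reduction="none")
    # Quality part: down-weight low-quality real or pseudo samples.
    return (quality * per_sample).mean()
```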
https://arxiv.org/abs/2403.07518
Text detection is frequently used by vision-based mobile robots when they need to interpret texts in their surroundings to perform a given task. For instance, delivery robots in multilingual cities need to be capable of multilingual text detection so that they can read traffic signs and road markings. Moreover, the target languages change from region to region, implying the need to efficiently re-train the models to recognize novel languages. However, collecting and labeling training data for novel languages is cumbersome, and the effort to re-train an existing text detector is considerable. Even worse, such a routine would repeat whenever a novel language appears. This motivates us to propose a new problem setting for tackling the aforementioned challenges in a more efficient way: "We ask for a generalizable multilingual text detection framework that detects and identifies both seen and unseen language regions inside scene images, without requiring supervised training data for unseen languages or model re-training." To this end, we propose "MENTOR", the first work to realize a learning strategy between zero-shot learning and few-shot learning for multilingual scene text detection.
https://arxiv.org/abs/2403.07286
We present TextMonkey, a large multimodal model (LMM) tailored for text-centric tasks, including document question answering (DocVQA) and scene text analysis. Our approach introduces enhancements across several dimensions: by adopting Shifted Window Attention with zero initialization, we achieve cross-window connectivity at higher input resolutions and stabilize early training; we hypothesize that images may contain redundant tokens, and by using similarity to filter them while retaining significant ones, we can not only streamline the token length but also enhance the model's performance. Moreover, by expanding our model's capabilities to encompass text spotting and grounding, and incorporating positional information into responses, we enhance interpretability and minimize hallucinations. Additionally, TextMonkey can be finetuned to gain the ability to comprehend commands for clicking on screenshots. Overall, our method notably boosts performance across various benchmark datasets, achieving increases of 5.2%, 6.9%, and 2.8% in Scene Text-Centric VQA, Document-Oriented VQA, and KIE, respectively, and in particular a score of 561 on OCRBench, surpassing prior open-sourced large multimodal models for document understanding. Code will be released at this https URL.
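The token-filtering step can be sketched as keeping, per image, the visual tokens that are least redundant under cosine similarity; the concrete selection rule below (keep the tokens with the lowest mean similarity to the others) is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def filter_redundant_tokens(tokens: torch.Tensor, keep: int) -> torch.Tensor:
    """tokens: (N, D) visual tokens. Keep the `keep` tokens that are least
    similar on average to all other tokens, i.e. the least redundant ones."""
    normed = F.normalize(tokens, dim=-1)
    sim = normed @ normed.t()                        # (N, N) cosine similarity
    redundancy = (sim.sum(dim=-1) - 1.0) / (tokens.size(0) - 1)
    keep_idx = redundancy.topk(keep, largest=False).indices
    return tokens[keep_idx.sort().values]            # preserve original order
```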
https://arxiv.org/abs/2403.04473