Relational triple extraction is a crucial task for the automatic construction of knowledge graphs. Existing methods construct only shallow representations at the token or token-pair level and ignore the local spatial dependencies of relational triples, which weakens entity-pair boundary detection. To tackle this problem, we propose a novel Region-based Table Filling method (RTF). We devise a novel region-based tagging scheme and bi-directional decoding strategy, which regard each relational triple as a region on the relation-specific table and identify triples by determining the two endpoints of each region. We also introduce convolution to construct region-level table representations from a spatial perspective, which makes triples easier to capture. In addition, we share partial tagging scores among different relations to improve the learning efficiency of the relation classifier. Experimental results show that our method achieves state-of-the-art performance with better generalization capability on three variants of two widely used benchmark datasets.
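The region idea above can be made concrete with a toy decoder. This is an illustrative sketch, not the authors' RTF code: it assumes each relational triple occupies a rectangular region on a subject-by-object table, tagged only at its upper-left ("UL") and lower-right ("LR") corners, so pairing corners recovers the boundaries of both entities at once.

```python
# Toy sketch (not the authors' implementation) of region-based decoding:
# a triple is a rectangle on one relation's subject x object table, and
# matching an upper-left corner with a lower-right corner below/right of
# it yields the token spans of both entities simultaneously.

def decode_regions(tags):
    """tags: dict mapping (row, col) -> 'UL' or 'LR' for one relation's table."""
    uls = sorted(k for k, v in tags.items() if v == "UL")
    lrs = sorted(k for k, v in tags.items() if v == "LR")
    regions = []
    used = set()
    for r1, c1 in uls:
        # nearest unmatched LR corner that closes a valid rectangle
        candidates = [(r2, c2) for (r2, c2) in lrs
                      if (r2, c2) not in used and r2 >= r1 and c2 >= c1]
        if candidates:
            r2, c2 = min(candidates)
            used.add((r2, c2))
            # subject span = rows r1..r2, object span = cols c1..c2
            regions.append(((r1, r2), (c1, c2)))
    return regions

# One region: subject tokens 2..3, object tokens 5..6.
tags = {(2, 5): "UL", (3, 6): "LR"}
print(decode_regions(tags))  # [((2, 3), (5, 6))]
```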
https://arxiv.org/abs/2404.19154
With the increasing prevalence of text generated by large language models (LLMs), there is a growing concern about distinguishing between LLM-generated and human-written texts in order to prevent the misuse of LLMs, such as the dissemination of misleading information and academic dishonesty. Previous research has primarily focused on classifying text as either entirely human-written or LLM-generated, neglecting the detection of mixed texts that contain both types of content. This paper explores LLMs' ability to identify boundaries in human-written and machine-generated mixed texts. We approach this task by transforming it into a token classification problem, regarding the label turning point as the boundary. Notably, our ensemble model of LLMs achieved first place in the 'Human-Machine Mixed Text Detection' sub-task of the SemEval'24 Competition Task 8. Additionally, we investigate factors that influence the capability of LLMs in detecting boundaries within mixed texts, including the incorporation of extra layers on top of LLMs, the combination of segmentation loss, and the impact of pretraining. Our findings aim to provide valuable insights for future research in this area.
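The "label turning point" formulation above reduces to a small post-processing step over per-token predictions. A minimal sketch, assuming a token classifier that outputs 0 for human-written and 1 for machine-generated tokens (real systems smooth noisy predictions first):

```python
def boundary_from_labels(labels):
    """Return the index of the first human-to-machine label turning point.

    labels: per-token predictions, 0 = human-written, 1 = machine-generated.
    A toy illustration of casting boundary detection as token classification.
    """
    for i in range(1, len(labels)):
        if labels[i - 1] == 0 and labels[i] == 1:
            return i
    return len(labels)  # no machine-generated suffix found

labels = [0, 0, 0, 0, 1, 1, 1]
print(boundary_from_labels(labels))  # 4
```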
https://arxiv.org/abs/2404.00899
Sign Language Recognition (SLR) has garnered significant attention from researchers in recent years, particularly the intricate domain of Continuous Sign Language Recognition (CSLR), which presents heightened complexity compared to Isolated Sign Language Recognition (ISLR). One of the prominent challenges in CSLR pertains to accurately detecting the boundaries of isolated signs within a continuous video stream. Additionally, the reliance on handcrafted features in existing models poses a challenge to achieving optimal accuracy. To surmount these challenges, we propose a novel approach utilizing a Transformer-based model. Unlike traditional models, our approach focuses on enhancing accuracy while eliminating the need for handcrafted features. The Transformer model is employed for both ISLR and CSLR. The training process involves using isolated sign videos, where hand keypoint features extracted from the input video are enriched using the Transformer model. Subsequently, these enriched features are forwarded to the final classification layer. The trained model, coupled with a post-processing method, is then applied to detect isolated sign boundaries within continuous sign videos. The evaluation of our model, conducted on two distinct datasets that include both continuous signs and their corresponding isolated signs, demonstrates promising results.
https://arxiv.org/abs/2402.14720
The Generic Event Boundary Detection (GEBD) task aims to recognize generic, taxonomy-free boundaries that segment a video into meaningful events. Current methods typically involve a neural model trained on a large volume of data, demanding substantial computational power and storage space. We explore two pivotal questions pertaining to GEBD: Can non-parametric algorithms outperform unsupervised neural methods? Does motion information alone suffice for high performance? This inquiry drives us to algorithmically harness motion cues for identifying generic event boundaries in videos. In this work, we propose FlowGEBD, a non-parametric, unsupervised technique for GEBD. Our approach entails two algorithms utilizing optical flow: (i) Pixel Tracking and (ii) Flow Normalization. Through thorough experimentation on the challenging Kinetics-GEBD and TAPOS datasets, our results establish FlowGEBD as the new state-of-the-art (SOTA) among unsupervised methods. FlowGEBD exceeds the neural models on the Kinetics-GEBD dataset, obtaining an F1@0.05 score of 0.713 with an absolute gain of 31.7% over the unsupervised baseline, and achieves an average F1 score of 0.623 on the TAPOS validation set.
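The flow-normalization idea above can be sketched in a few lines. This is a hedged illustration, not the paper's exact algorithm: it assumes per-frame optical flow is already available (synthetic arrays here), scores each frame by its mean flow magnitude normalized to [0, 1], and treats thresholded local peaks as candidate event boundaries.

```python
import numpy as np

def flow_boundary_scores(flows):
    """flows: array (T, H, W, 2) of per-frame optical flow vectors.

    Illustrative sketch: per-frame motion score from mean flow magnitude,
    min-max normalized; local maxima above a threshold become candidate
    generic event boundaries.
    """
    mag = np.linalg.norm(flows, axis=-1).mean(axis=(1, 2))  # (T,)
    mag = (mag - mag.min()) / (mag.max() - mag.min() + 1e-8)
    peaks = [t for t in range(1, len(mag) - 1)
             if mag[t] > 0.5 and mag[t] >= mag[t - 1] and mag[t] >= mag[t + 1]]
    return mag, peaks

rng = np.random.default_rng(0)
flows = rng.normal(0, 0.1, size=(10, 8, 8, 2))  # 10 frames of near-zero motion
flows[5] += 2.0  # inject a burst of motion at frame 5
_, peaks = flow_boundary_scores(flows)
print(peaks)  # [5]
```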
https://arxiv.org/abs/2404.18935
This article investigates the possibility of using the class entropy of the output of a connectionist phoneme recogniser to predict time boundaries between phonetic classes. The rationale is that the value of the entropy, being a measure of uncertainty, should increase in the proximity of a transition between two segments that are well modelled (known) by the recognition network. The advantage of this measure is its simplicity, as the posterior probabilities of each class are available in connectionist phoneme recognition. The entropy and a number of measures based on differentiation of the entropy are used in isolation and in combination. The decision methods for predicting the boundaries range from simple thresholds to neural-network-based procedures. The different methods are compared with respect to their precision, measured as the ratio between the number C of predicted boundaries within 10 or 20 msec of a reference boundary and the total number of predicted boundaries, and recall, measured as the ratio between C and the total number of reference boundaries.
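Both the entropy measure and the tolerance-based precision/recall defined above are easy to make concrete. A minimal illustration (the threshold value and toy posteriors are invented for the example; the paper's decision methods are more elaborate):

```python
import math

def frame_entropy(posteriors):
    """Entropy of one frame's posterior distribution over phoneme classes."""
    return -sum(p * math.log(p) for p in posteriors if p > 0)

def predict_boundaries(posterior_seq, threshold):
    """Toy threshold detector: frames whose entropy exceeds `threshold`."""
    return [t for t, post in enumerate(posterior_seq)
            if frame_entropy(post) > threshold]

def precision_recall(predicted, reference, tol):
    """Precision/recall as defined in the abstract: C counts predicted
    boundaries within `tol` frames of some reference boundary."""
    c = sum(1 for p in predicted if any(abs(p - r) <= tol for r in reference))
    precision = c / len(predicted) if predicted else 0.0
    recall = c / len(reference) if reference else 0.0
    return precision, recall

# Confident frames inside segments, an uncertain frame at the transition.
seq = [[0.9, 0.05, 0.05], [0.34, 0.33, 0.33], [0.05, 0.9, 0.05]]
pred = predict_boundaries(seq, threshold=1.0)
print(pred)                                # [1]
print(precision_recall(pred, [1], tol=1))  # (1.0, 1.0)
```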
https://arxiv.org/abs/2401.05717
Recent advancements in Large Language Models (LLMs) have led to high-quality Machine-Generated Text (MGT), giving rise to countless new use cases and applications. However, easy access to LLMs is posing new challenges due to misuse. To address malicious usage, researchers have released datasets to effectively train models on MGT-related tasks. Similar strategies are used to compile these datasets, but no tool currently unifies them. In this scenario, we introduce TextMachina, a modular and extensible Python framework, designed to aid in the creation of high-quality, unbiased datasets to build robust models for MGT-related tasks such as detection, attribution, or boundary detection. It provides a user-friendly pipeline that abstracts away the inherent intricacies of building MGT datasets, such as LLM integrations, prompt templating, and bias mitigation. The quality of the datasets generated by TextMachina has been assessed in previous works, including shared tasks where more than one hundred teams trained robust MGT detectors.
https://arxiv.org/abs/2401.03946
In this work, we investigate the use of curiosity on replay buffers to improve offline multi-task continual reinforcement learning when tasks, which are defined by the non-stationarity in the environment, are unlabeled and not evenly exposed to the learner in time. In particular, we investigate the use of curiosity both as a tool for task boundary detection and as a priority metric for retaining old transition tuples, which we respectively use to propose two different buffers. Firstly, we propose a Hybrid Reservoir Buffer with Task Separation (HRBTS), where curiosity is used to detect task boundaries that are not known due to the task-agnostic nature of the problem. Secondly, by using curiosity as a priority metric for retaining old transition tuples, a Hybrid Curious Buffer (HCB) is proposed. We ultimately show that these buffers, in conjunction with regular reinforcement learning algorithms, can be used to alleviate the catastrophic forgetting issue suffered by state-of-the-art replay buffers when the agent's exposure to tasks is not equal over time. We evaluate catastrophic forgetting and the efficiency of our proposed buffers against the latest works, such as the Hybrid Reservoir Buffer (HRB) and the Multi-Time Scale Replay Buffer (MTR), in three different continual reinforcement learning settings. Experiments were done on classical control tasks and the Metaworld environment. Experiments show that our proposed replay buffers display better immunity to catastrophic forgetting compared to existing works in most settings.
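Curiosity as a retention priority, the second idea above, can be sketched with a toy buffer. This is not the paper's HCB, just an illustration of the eviction rule: when the buffer is full, the transition with the lowest curiosity score is replaced, so surprising transitions survive longer.

```python
class CuriousBuffer:
    """Toy sketch (not the paper's HCB) of curiosity-based retention."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []  # list of (curiosity_score, transition)

    def add(self, curiosity, transition):
        if len(self.items) < self.capacity:
            self.items.append((curiosity, transition))
            return
        # Buffer full: evict the least-curious stored transition,
        # but only if the newcomer is more curious than it.
        lowest = min(range(len(self.items)), key=lambda i: self.items[i][0])
        if curiosity > self.items[lowest][0]:
            self.items[lowest] = (curiosity, transition)

buf = CuriousBuffer(capacity=2)
buf.add(0.1, "t1")
buf.add(0.9, "t2")
buf.add(0.5, "t3")  # evicts t1, the lowest-curiosity transition
print(sorted(t for _, t in buf.items))  # ['t2', 't3']
```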
https://arxiv.org/abs/2312.03177
Early weakly supervised video grounding (WSVG) methods often struggle with incomplete boundary detection due to the absence of temporal boundary annotations. To bridge the gap between video-level and boundary-level annotation, explicit-supervision methods, i.e., those generating pseudo-temporal boundaries for training, have achieved great success. However, data augmentations in these methods might disrupt critical temporal information, yielding poor pseudo boundaries. In this paper, we propose a new perspective that maintains the integrity of the original temporal content while introducing more valuable information for expanding the incomplete boundaries. To this end, we propose EtC (Expand then Clarify), which first uses additional information to expand the initial incomplete pseudo boundaries and subsequently refines these expanded ones to achieve precise boundaries. Motivated by video continuity, i.e., visual similarity across adjacent frames, we use powerful multimodal large language models (MLLMs) to annotate each frame within the initial pseudo boundaries, yielding more comprehensive descriptions for expanded boundaries. To further reduce the noise of the expanded boundaries, we combine mutual learning with a tailored proposal-level contrastive objective, using a learnable approach to strike a balance between incomplete yet clean (initial) and comprehensive yet noisy (expanded) boundaries and obtain more precise ones. Experiments demonstrate the superiority of our method on two challenging WSVG datasets.
https://arxiv.org/abs/2312.02483
Precise and rapid delineation of sharp boundaries and robust semantics is essential for numerous downstream robotic tasks, such as robot grasping and manipulation, real-time semantic mapping, and online sensor calibration performed on edge computing units. Although boundary detection and semantic segmentation are complementary tasks, most studies focus on lightweight models for semantic segmentation but overlook the critical role of boundary detection. In this work, we introduce Mobile-Seed, a lightweight, dual-task framework tailored for simultaneous semantic segmentation and boundary detection. Our framework features a two-stream encoder, an active fusion decoder (AFD) and a dual-task regularization approach. The encoder is divided into two pathways: one captures category-aware semantic information, while the other discerns boundaries from multi-scale features. The AFD module dynamically adapts the fusion of semantic and boundary information by learning channel-wise relationships, allowing for precise weight assignment of each channel. Furthermore, we introduce a regularization loss to mitigate the conflicts in dual-task learning and deep diversity supervision. Compared to existing methods, the proposed Mobile-Seed offers a lightweight framework to simultaneously improve semantic segmentation performance and accurately locate object boundaries. Experiments on the Cityscapes dataset have shown that Mobile-Seed achieves notable improvement over the state-of-the-art (SOTA) baseline by 2.2 percentage points (pp) in mIoU and 4.2 pp in mF-score, while maintaining an online inference speed of 23.9 frames-per-second (FPS) with 1024x2048 resolution input on an RTX 2080 Ti GPU. Additional experiments on CamVid and PASCAL Context datasets confirm our method's generalizability. Code and additional results are publicly available at \url{this https URL}.
https://arxiv.org/abs/2311.12651
Due to the rapid development of text generation models, people increasingly often encounter texts that may start out as written by a human but then continue as machine-generated results of large language models. Detecting the boundary between human-written and machine-generated parts of such texts is a very challenging problem that has not received much attention in the literature. In this work, we consider and compare a number of different approaches for this artificial text boundary detection problem, comparing several predictors over features of different natures. We show that supervised fine-tuning of the RoBERTa model works well for this task in general but fails to generalize in important cross-domain and cross-generator settings, demonstrating a tendency to overfit to spurious properties of the data. Then, we propose novel approaches based on features extracted from a frozen language model's embeddings that are able to outperform both the human accuracy level and previously considered baselines on the Real or Fake Text benchmark. Moreover, we adapt perplexity-based approaches for the boundary detection task and analyze their behaviour. We analyze the robustness of all proposed classifiers in cross-domain and cross-model settings, discovering important properties of the data that can negatively influence the performance of artificial text boundary detection algorithms.
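A perplexity-style boundary detector of the kind adapted above can be sketched from per-token surprisals. This is a hedged illustration, not the paper's method: it assumes per-token negative log-probabilities under some language model are already computed, and picks the split that maximizes the drop in mean surprisal, since machine-generated continuations tend to be more predictable than human prose.

```python
def boundary_by_surprisal(surprisals):
    """Pick the split k maximizing (mean surprisal before k) - (mean after).

    surprisals: per-token negative log-probabilities under a language
    model (assumed precomputed). Toy sketch of a perplexity-based
    boundary detector for human-then-machine mixed text.
    """
    best_k, best_gap = None, float("-inf")
    for k in range(1, len(surprisals)):
        before = sum(surprisals[:k]) / k
        after = sum(surprisals[k:]) / (len(surprisals) - k)
        if before - after > best_gap:
            best_k, best_gap = k, before - after
    return best_k

# Human-written prefix (high surprisal) then machine suffix (low).
s = [5.1, 4.8, 5.4, 1.2, 1.0, 1.1]
print(boundary_by_surprisal(s))  # 3
```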
https://arxiv.org/abs/2311.08349
Holistic scene understanding includes semantic segmentation, surface normal estimation, object boundary detection, depth estimation, etc. The key aspect of this problem is to learn representations effectively, as each subtask builds upon attributes that are not only correlated but also distinct. Inspired by visual-prompt tuning, we propose a Task-Specific Prompts Transformer, dubbed TSP-Transformer, for holistic scene understanding. It features a vanilla transformer in the early stage and a task-specific prompts transformer encoder in the later stage, where task-specific prompts are augmented. By doing so, the transformer layer learns generic information from the shared parts and is endowed with task-specific capacity. First, the task-specific prompts serve as effective induced priors for each task. Moreover, the task-specific prompts can be seen as switches that favor task-specific representation learning for different tasks. Extensive experiments on NYUD-v2 and PASCAL-Context show that our method achieves state-of-the-art performance, validating its effectiveness for holistic scene understanding. We also provide our code at the following link: this https URL.
https://arxiv.org/abs/2311.03427
One of the primary obstacles in the advancement of Natural Language Processing (NLP) technologies for low-resource languages is the lack of annotated datasets for training and testing machine learning models. In this paper, we present Antarlekhaka, a tool for manual annotation of a comprehensive set of tasks relevant to NLP. The tool is Unicode-compatible, language-agnostic, Web-deployable and supports distributed annotation by multiple simultaneous annotators. The system sports user-friendly interfaces for 8 categories of annotation tasks. These, in turn, enable the annotation of a considerably larger set of NLP tasks. The task categories include two linguistic tasks not handled by any other tool, namely, sentence boundary detection and deciding canonical word order, which are important tasks for text that is in the form of poetry. We propose the idea of sequential annotation based on small text units, where an annotator performs several tasks related to a single text unit before proceeding to the next unit. The research applications of the proposed mode of multi-task annotation are also discussed. Antarlekhaka outperforms other annotation tools in objective evaluation. It has also been used for two real-life annotation tasks in two different languages, namely, Sanskrit and Bengali. The tool is available at this https URL.
https://arxiv.org/abs/2310.07826
Generic event boundary detection aims to localize the generic, taxonomy-free event boundaries that segment videos into chunks. Existing methods typically require video frames to be decoded before being fed into the network, which contains significant spatio-temporal redundancy and demands considerable computational power and storage space. To remedy these issues, we propose a novel compressed video representation learning method for event boundary detection that is fully end-to-end, leveraging the rich information in the compressed domain, i.e., RGB, motion vectors, residuals, and the internal group of pictures (GOP) structure, without fully decoding the video. Specifically, we use lightweight ConvNets to extract features of the P-frames in the GOPs, and a spatial-channel attention module (SCAM) is designed to refine the feature representations of the P-frames based on the compressed information with bidirectional information flow. To learn a suitable representation for boundary detection, we construct a local frames bag for each candidate frame and use a long short-term memory (LSTM) module to capture temporal relationships. We then compute frame differences with group similarities in the temporal domain. This module is applied only within a local window, which is critical for event boundary detection. Finally, a simple classifier is used to determine the event boundaries of video sequences based on the learned feature representation. To remedy the ambiguities of annotations and speed up the training process, we use a Gaussian kernel to preprocess the ground-truth event boundaries. Extensive experiments conducted on the Kinetics-GEBD and TAPOS datasets demonstrate that the proposed method achieves considerable improvements over the previous end-to-end approach while running at the same speed. The code is available at this https URL.
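The Gaussian preprocessing of ground-truth boundaries mentioned above can be shown directly. A minimal sketch, assuming hard boundary indices are softened into per-frame targets (sigma is a free parameter, not a value from the paper):

```python
import numpy as np

def soften_boundaries(length, boundaries, sigma=1.0):
    """Turn hard boundary indices into soft per-frame training targets.

    Each annotated boundary b contributes exp(-(t - b)^2 / (2 * sigma^2))
    to frame t, tolerating annotation ambiguity and smoothing the targets.
    """
    t = np.arange(length, dtype=float)
    target = np.zeros(length)
    for b in boundaries:
        target = np.maximum(target, np.exp(-((t - b) ** 2) / (2 * sigma ** 2)))
    return target

soft = soften_boundaries(length=9, boundaries=[4], sigma=1.0)
print(np.round(soft, 3))  # peak 1.0 at frame 4, decaying toward 0 elsewhere
```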
https://arxiv.org/abs/2309.15431
Recent span-based joint extraction models have demonstrated significant advantages in both entity recognition and relation extraction. These models treat text spans as candidate entities and span pairs as candidate relation tuples, achieving state-of-the-art results on datasets like ADE. However, these models encounter a significant number of non-entity spans or irrelevant span pairs during these tasks, which impairs model performance significantly. To address this issue, this paper introduces a span-based multitask entity-relation joint extraction model. This approach employs multitask learning to alleviate the impact of negative samples on the entity and relation classifiers. Additionally, we leverage the Intersection over Union (IoU) concept to introduce positional information into the entity classifier, achieving span boundary detection. Furthermore, by incorporating the entity logits predicted by the entity classifier into the embedded representation of entity pairs, the semantic input for the relation classifier is enriched. Experimental results demonstrate that our proposed this http URL model can effectively mitigate the adverse effects of excessive negative samples on model performance. Furthermore, the model demonstrates commendable F1 scores of 73.61%, 53.72%, and 83.72% on three widely employed public datasets, namely CoNLL04, SciERC, and ADE, respectively.
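The span IoU used above for positional information is a one-liner worth seeing. A minimal sketch (the end-inclusive convention and the idea of treating high-IoU candidates as soft positives are illustrative assumptions, not details from the paper):

```python
def span_iou(a, b):
    """IoU between two token spans (start, end), end-inclusive.

    Illustrates how positional information can enter an entity
    classifier: candidate spans with high IoU against a gold entity
    can carry a soft boundary signal instead of being pure negatives.
    """
    inter = min(a[1], b[1]) - max(a[0], b[0]) + 1
    if inter <= 0:
        return 0.0
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union

print(span_iou((2, 5), (4, 7)))  # 2 shared tokens of 6 total -> ~0.333
```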
https://arxiv.org/abs/2309.09713
Accurate polyp delineation in colonoscopy is crucial for assisting in diagnosis, guiding interventions, and treatments. However, current deep-learning approaches fall short due to integrity deficiency, which often manifests as missing lesion parts. This paper introduces the integrity concept in polyp segmentation at both macro and micro levels, aiming to alleviate integrity deficiency. Specifically, the model should distinguish entire polyps at the macro level and identify all components within polyps at the micro level. Our Integrity Capturing Polyp Segmentation (IC-PolypSeg) network utilizes lightweight backbones and three key components for improving integrity: 1) a pixel-wise feature redistribution (PFR) module captures global spatial correlations across channels in the final semantic-rich encoder features; 2) a cross-stage pixel-wise feature redistribution (CPFR) module dynamically fuses high-level semantics and low-level spatial features to capture contextual information; and 3) a coarse-to-fine calibration module combines the PFR and CPFR modules to achieve precise boundary detection. Extensive experiments on five public datasets demonstrate that the proposed IC-PolypSeg outperforms eight state-of-the-art methods, with higher precision and significantly improved computational efficiency. IC-PolypSeg-EF0 employs 300 times fewer parameters than PraNet while achieving a real-time processing speed of 235 FPS. Importantly, IC-PolypSeg reduces the false-negative ratio on five datasets, meeting clinical requirements.
https://arxiv.org/abs/2309.08234
Temporal action segmentation is typically achieved by discovering the dramatic variances in global visual descriptors. In this paper, we explore the merits of local features by proposing the unsupervised framework of Object-centric Temporal Action Segmentation (OTAS). Broadly speaking, OTAS consists of self-supervised global and local feature extraction modules as well as a boundary selection module that fuses the features and detects salient boundaries for action segmentation. As a second contribution, we discuss the pros and cons of existing frame-level and boundary-level evaluation metrics. Through extensive experiments, we find OTAS is superior to the previous state-of-the-art method by $41\%$ on average in terms of our recommended F1 score. Surprisingly, OTAS even outperforms the ground-truth human annotations in the user study. Moreover, OTAS is efficient enough to allow real-time inference.
https://arxiv.org/abs/2309.06276
Music Structure Analysis (MSA) is the task of identifying the musical segments that compose a music track and possibly labeling them based on their similarity. In this paper we propose a supervised approach for the task of music boundary detection. In our approach we simultaneously learn features and convolution kernels. For this, we jointly optimize (i) a loss based on the Self-Similarity Matrix (SSM) obtained with the learned features, denoted SSM-loss, and (ii) a loss based on the novelty score obtained by applying the learned kernels to the estimated SSM, denoted novelty-loss. We also demonstrate that relative feature learning, through self-attention, is beneficial for the task of MSA. Finally, we compare the performance of our approach to previously proposed approaches on the standard RWC-Pop dataset and various subsets of SALAMI.
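The novelty score mentioned above has a classical, non-learned variant: sliding Foote's fixed checkerboard kernel along the SSM diagonal. The learned kernels in the paper generalize this; the sketch below shows only the fixed-kernel baseline on a toy block-diagonal SSM.

```python
import numpy as np

def novelty_curve(ssm, half):
    """Foote-style novelty: slide a checkerboard kernel along the SSM diagonal.

    A sketch of the classical, non-learned novelty score; segment
    boundaries appear as peaks where two homogeneous blocks meet.
    """
    size = 2 * half
    kernel = np.ones((size, size))
    kernel[:half, half:] = -1  # penalize cross-segment similarity
    kernel[half:, :half] = -1
    n = ssm.shape[0]
    nov = np.zeros(n)
    for t in range(half, n - half):
        patch = ssm[t - half:t + half, t - half:t + half]
        nov[t] = (kernel * patch).sum()
    return nov

# Block-diagonal SSM: two homogeneous segments with a boundary at frame 4.
ssm = np.zeros((8, 8))
ssm[:4, :4] = 1.0
ssm[4:, 4:] = 1.0
nov = novelty_curve(ssm, half=2)
print(int(np.argmax(nov)))  # 4
```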
https://arxiv.org/abs/2309.02243
This paper presents a set of industrial-grade text processing models for Hungarian that achieve near state-of-the-art performance while balancing resource efficiency and accuracy. Models have been implemented in the spaCy framework, extending the HuSpaCy toolkit with several improvements to its architecture. Compared to existing NLP tools for Hungarian, all of our pipelines feature all basic text processing steps including tokenization, sentence-boundary detection, part-of-speech tagging, morphological feature tagging, lemmatization, dependency parsing and named entity recognition with high accuracy and throughput. We thoroughly evaluated the proposed enhancements, compared the pipelines with state-of-the-art tools and demonstrated the competitive performance of the new models in all text preprocessing steps. All experiments are reproducible and the pipelines are freely available under a permissive license.
https://arxiv.org/abs/2308.12635
We propose Boundary-RL, a novel weakly supervised segmentation method that utilises only patch-level labels for training. We envision segmentation as a boundary detection problem, rather than a pixel-level classification as in previous works. This outlook on segmentation may allow for boundary delineation under challenging scenarios, such as where noise artefacts are present within the region-of-interest (ROI) boundaries, where traditional pixel-level classification-based weakly supervised methods may not be able to effectively segment the ROI. Of particular interest, ultrasound images, where intensity values represent acoustic impedance differences between boundaries, may also benefit from the boundary delineation approach. Our method uses reinforcement learning to train a controller function to localise the boundaries of ROIs using a reward derived from a pre-trained boundary-presence classifier. The classifier indicates when an object boundary is encountered within a patch, as the controller modifies the patch location in a sequential Markov decision process. The classifier itself is trained using only binary patch-level labels of object presence, which are the only labels used during training of the entire boundary delineation framework, and serves as a weak signal to inform the boundary delineation. The use of a controller function ensures that a sliding window over the entire image is not necessary. It also prevents possible false-positive or -negative cases by minimising the number of patches passed to the boundary-presence classifier. We evaluate our proposed approach on the clinically relevant task of prostate gland segmentation in trans-rectal ultrasound images. We show improved performance compared to other tested weakly supervised methods that use the same labels, e.g., multiple instance learning.
https://arxiv.org/abs/2308.11376
This paper introduces our system designed for Track 2, which focuses on locating manipulated regions, in the second Audio Deepfake Detection Challenge (ADD 2023). Our approach involves the utilization of multiple detection systems to identify splicing regions and determine their authenticity. Specifically, we train and integrate two frame-level systems: one for boundary detection and the other for deepfake detection. Additionally, we employ a third VAE model trained exclusively on genuine data to determine the authenticity of a given audio clip. Through the fusion of these three systems, our top-performing solution for the ADD challenge achieves an impressive 82.23% sentence accuracy and an F1 score of 60.66%. This results in a final ADD score of 0.6713, securing the first rank in Track 2 of ADD 2023.
https://arxiv.org/abs/2308.10281