Esophageal cancer is one of the most common types of cancer worldwide and ranks sixth in cancer-related mortality. Accurate computer-assisted diagnosis of cancer progression can help physicians effectively customize personalized treatment plans. Currently, CT-based cancer diagnosis methods have received much attention for their comprehensive ability to examine patients' conditions. However, multi-modal methods are likely to introduce information redundancy, leading to underperformance. In addition, efficient and effective interactions between multi-modal representations need to be further explored, as existing work lacks an insightful exploration of the prognostic correlations among multi-modality features. In this work, we introduce a multi-modal heterogeneous graph-based conditional feature-guided diffusion model for lymph node metastasis diagnosis based on CT images as well as clinical measurements and radiomics data. To explore the intricate relationships between multi-modal features, we construct a heterogeneous graph. Following this, a conditional feature-guided diffusion approach is applied to eliminate information redundancy. Moreover, we propose a masked relational representation learning strategy, aiming to uncover the latent prognostic correlations and priorities of primary tumor and lymph node image representations. Various experimental results validate the effectiveness of our proposed method. The code is available at this https URL.
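As a rough illustration of the multi-modal heterogeneous graph described above, the sketch below assembles one patient graph with PyTorch Geometric's HeteroData; the node types, feature sizes, and relations are illustrative assumptions, not the authors' actual schema.

```python
import torch
from torch_geometric.data import HeteroData

# Hypothetical multi-modal heterogeneous graph for one patient.
# Node types, feature dimensions, and relations are illustrative only.
graph = HeteroData()
graph['tumor'].x      = torch.randn(1, 256)   # primary-tumor CT embedding
graph['lymph_node'].x = torch.randn(6, 256)   # candidate lymph-node CT embeddings
graph['clinical'].x   = torch.randn(1, 32)    # clinical measurements
graph['radiomics'].x  = torch.randn(1, 128)   # radiomics features

# Edges connecting modalities (tumor <-> lymph nodes, tumor <-> tabular data).
graph['tumor', 'adjacent_to', 'lymph_node'].edge_index = torch.tensor(
    [[0, 0, 0, 0, 0, 0], [0, 1, 2, 3, 4, 5]])
graph['tumor', 'described_by', 'clinical'].edge_index  = torch.tensor([[0], [0]])
graph['tumor', 'described_by', 'radiomics'].edge_index = torch.tensor([[0], [0]])

print(graph)
```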
https://arxiv.org/abs/2405.09539
"How does the person in the bounding box feel?" Achieving human-level recognition of the apparent emotion of a person in real world situations remains an unsolved task in computer vision. Facial expressions are not enough: body pose, contextual knowledge, and commonsense reasoning all contribute to how humans perform this emotional theory of mind task. In this paper, we examine two major approaches enabled by recent large vision language models: 1) image captioning followed by a language-only LLM, and 2) vision language models, under zero-shot and fine-tuned setups. We evaluate the methods on the Emotions in Context (EMOTIC) dataset and demonstrate that a vision language model, fine-tuned even on a small dataset, can significantly outperform traditional baselines. The results of this work aim to help robots and agents perform emotionally sensitive decision-making and interaction in the future.
https://arxiv.org/abs/2405.08992
Transformer-based long context generative models power emerging AI applications such as hour-long video understanding and project-level coding agents. Deploying long context transformers (e.g., 100K to 10M tokens) is prohibitively expensive compared to short context (e.g., 4K tokens) model variants. Reducing the cost of long-context transformers has become a pressing research and engineering challenge starting in 2024. This work describes a concurrent programming framework for quantitatively analyzing the efficiency challenges in serving multiple long-context requests under a limited GPU high-bandwidth memory (HBM) budget. We give a detailed analysis of how all additional computational costs, compared to 4K context, trace back to one single source: the large size of the KV cache. We use a 34B GPT-3.5-level model with 50K context on A100 NVLink as a running example, and describe how its large KV cache causes four types of deployment challenges: (1) prefilling long inputs takes much more compute time and GPU memory than short inputs; (2) after prefilling, the large KV cache residing on the GPU HBM substantially restricts the number of concurrent users that can be served; (3) during decoding, repeatedly reading the KV cache from HBM to the SMs largely increases latency; (4) when the KV cache overflows GPU memory, swapping it from HBM to DDR causes significant context-switching latency. We use this framework to analyze existing works and identify possibilities of combining them to build end-to-end systems. Overall, this work offers a foundational framework for analyzing long-context transformer deployment and identifies directions towards reducing the inference cost of 1M context to be as cheap as 4K.
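To make the single-source claim concrete, the sketch below does the back-of-the-envelope KV-cache arithmetic; the 34B-class configuration (60 layers, 56 query heads of dimension 128, optional 8-head grouped-query attention) is an illustrative assumption rather than the paper's exact model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Total KV-cache size for one request: K and V, per layer, per token, in fp16."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# Illustrative 34B-class config (assumed): 60 layers, 56 attention heads of dim 128,
# comparing full multi-head KV against grouped-query attention with 8 KV heads.
ctx = 50_000
full_mha = kv_cache_bytes(60, 56, 128, ctx)   # ~86 GB: exceeds one 80 GB A100 HBM
gqa_8    = kv_cache_bytes(60, 8, 128, ctx)    # ~12 GB: leaves room for a few users
print(f"MHA: {full_mha / 1e9:.1f} GB, GQA-8: {gqa_8 / 1e9:.1f} GB per 50K-token request")
```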
https://arxiv.org/abs/2405.08944
CLIP models perform remarkably well on zero-shot classification and retrieval tasks. But recent studies have shown that the learnt representations in CLIP are not well suited for dense prediction tasks like object detection, semantic segmentation or depth estimation. More recently, multi-stage training methods for CLIP models were introduced to mitigate the weak performance of CLIP on downstream tasks. In this work, we find that simply improving the quality of captions in image-text datasets improves the quality of CLIP's visual representations, resulting in significant improvement on downstream dense prediction vision tasks. In fact, we find that CLIP pretraining with good quality captions can surpass recent supervised, self-supervised and weakly supervised pretraining methods. We show that when a CLIP model with ViT-B/16 as the image encoder is trained on well-aligned image-text pairs, it obtains 12.1% higher mIoU and 11.5% lower RMSE on semantic segmentation and depth estimation tasks over recent state-of-the-art Masked Image Modeling (MIM) pretraining methods like Masked Autoencoder (MAE). We find that mobile architectures also benefit significantly from CLIP pretraining. A recent mobile vision architecture, MCi2, with CLIP pretraining obtains performance similar to Swin-L, pretrained on ImageNet-22k, on the semantic segmentation task while being 6.1$\times$ smaller. Moreover, we show that improving caption quality results in $10\times$ data efficiency when finetuning for dense prediction tasks.
https://arxiv.org/abs/2405.08911
Current video summarization methods primarily depend on supervised computer vision techniques, which demand time-consuming manual annotations. Further, the annotations are always subjective, which makes this task more challenging. To address these issues, we analyze the feasibility of transforming video summarization into a text summarization task and leverage Large Language Models (LLMs) to boost video summarization. This paper proposes a novel self-supervised framework for video summarization guided by LLMs. Our method begins by generating captions for video frames, which are then synthesized into text summaries by LLMs. Subsequently, we measure the semantic distance between the frame captions and the text summary. Notably, we propose a novel loss function that optimizes our model according to the diversity of the video. Finally, the summarized video is generated by selecting the frames whose captions are most similar to the text summary. Our model achieves competitive results against other state-of-the-art methods and paves a novel pathway in video summarization.
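A minimal sketch of the final selection step is shown below; the frame captioner, LLM summarizer, and text encoder are stubbed out with random embeddings so the snippet runs, and the diversity-aware loss is not included.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_keyframes(frame_caption_embs, summary_emb, k=5):
    """Pick the k frames whose caption embeddings are closest to the LLM text summary."""
    scores = [cosine(e, summary_emb) for e in frame_caption_embs]
    return sorted(np.argsort(scores)[-k:].tolist())   # chronological order

# Stand-ins for real components: a captioner, an LLM summarizer, and a text encoder
# would produce these embeddings; random vectors are used here only to run the sketch.
rng = np.random.default_rng(0)
frame_embs  = [rng.standard_normal(384) for _ in range(100)]  # one per video frame
summary_emb = rng.standard_normal(384)                        # embedding of the LLM summary
print(select_keyframes(frame_embs, summary_emb, k=8))
```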
https://arxiv.org/abs/2405.08890
Current datasets for long-form video understanding often fall short of providing genuine long-form comprehension challenges, as many tasks derived from these datasets can be successfully tackled by analyzing just one or a few random frames from a video. To address this issue, we present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding. This paper details our innovative approach for creating a question-answer dataset, utilizing advanced LLMs with human-in-the-loop and building upon human-generated raw data. Our comprehensive dataset comprises 305,000 multiple-choice questions (MCQs), covering various visual and multimodal aspects, including temporal comprehension, understanding human-object interactions, and reasoning about events or actions within a scene. Additionally, we evaluate recent video-centric LLMs, both open-source and proprietary, on the test split of our dataset. The findings reveal that even state-of-the-art video-centric LLMs significantly lag behind human performance in these tasks, highlighting the complexity and challenge inherent in video understanding. The dataset is available at this https URL
https://arxiv.org/abs/2405.08813
Large multimodal models (LMMs) have proven flexible and generalisable across many tasks and fields. Although they have strong potential to aid scientific research, their capabilities in this domain are not well characterised. A key aspect of scientific research is the ability to understand and interpret figures, which serve as a rich, compressed source of complex information. In this work, we present SciFIBench, a scientific figure interpretation benchmark. Our main benchmark consists of a 1000-question gold set of multiple-choice questions split between two tasks across 12 categories. The questions are curated from CS arXiv paper figures and captions, using adversarial filtering to find hard negatives and human verification for quality control. We evaluate 26 LMMs on SciFIBench, finding it to be a challenging benchmark. Finally, we investigate the alignment and reasoning faithfulness of the LMMs on augmented question sets from our benchmark. We release SciFIBench to encourage progress in this domain.
https://arxiv.org/abs/2405.08807
We present Hunyuan-DiT, a text-to-image diffusion transformer with fine-grained understanding of both English and Chinese. To construct Hunyuan-DiT, we carefully design the transformer structure, text encoder, and positional encoding. We also build from scratch a whole data pipeline to update and evaluate data for iterative model optimization. For fine-grained language understanding, we train a Multimodal Large Language Model to refine the captions of the images. Finally, Hunyuan-DiT can perform multi-turn multimodal dialogue with users, generating and refining images according to the context. Through our holistic human evaluation protocol with more than 50 professional human evaluators, Hunyuan-DiT sets a new state-of-the-art in Chinese-to-image generation compared with other open-source models. Code and pretrained models are publicly available at this http URL
https://arxiv.org/abs/2405.08748
Current architectures for video understanding mainly build upon 3D convolutional blocks or 2D convolutions with additional operations for temporal modeling. However, these methods all regard the temporal axis as a separate dimension of the video sequence, which requires large computation and memory budgets and thus limits their usage on mobile devices. In this paper, we propose to squeeze the time axis of a video sequence into the channel dimension and present a lightweight video recognition network, termed \textit{SqueezeTime}, for mobile video understanding. To enhance the temporal modeling capability of the proposed network, we design a Channel-Time Learning (CTL) Block to capture the temporal dynamics of the sequence. This module has two complementary branches: one for temporal importance learning, and another with temporal position restoring capability that enhances inter-temporal object modeling. The proposed SqueezeTime is lightweight and fast, with high accuracy for mobile video understanding. Extensive experiments on various video recognition and action detection benchmarks, i.e., Kinetics400, Kinetics600, HMDB51, AVA2.1 and THUMOS14, demonstrate the superiority of our model. For example, our SqueezeTime achieves $+1.2\%$ accuracy and $+80\%$ GPU throughput gain on Kinetics400 over prior methods. Codes are publicly available at this https URL and this https URL.
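The sketch below illustrates only the basic channel-time squeeze, reshaping (B, C, T, H, W) into (B, C*T, H, W) before a plain 2D convolution; the actual CTL block with its two branches is not reproduced here.

```python
import torch
import torch.nn as nn

class SqueezeTimeStem(nn.Module):
    """Fold the temporal axis into channels so a plain 2D conv sees all frames at once.
    This only illustrates the channel-time squeeze; the paper's CTL block adds
    temporal-importance and position-restoring branches on top of it."""
    def __init__(self, in_channels=3, num_frames=16, out_channels=64):
        super().__init__()
        self.conv = nn.Conv2d(in_channels * num_frames, out_channels,
                              kernel_size=3, padding=1)

    def forward(self, video):              # video: (B, C, T, H, W)
        b, c, t, h, w = video.shape
        x = video.reshape(b, c * t, h, w)  # squeeze time into the channel dimension
        return self.conv(x)                # (B, out_channels, H, W)

clip = torch.randn(2, 3, 16, 112, 112)
print(SqueezeTimeStem()(clip).shape)       # torch.Size([2, 64, 112, 112])
```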
https://arxiv.org/abs/2405.08344
Can web-based image processing and visualization tools easily integrate into existing websites without significant time and effort? Our Boostlet.js library addresses this challenge by providing an open-source, JavaScript-based web framework that enables additional image processing functionalities. Boostlet examples include kernel filtering, image captioning, data visualization, segmentation, and web-optimized machine-learning models. To achieve this, Boostlet.js uses a browser bookmark to inject a user-friendly plugin selection tool called PowerBoost into any host website. Boostlet also provides on-site access to a standard API for pixel data and scene manipulation, independent of any visualization framework. Web-based Boostlets provide a modular architecture and client-side processing capabilities to apply advanced image-processing techniques using consumer-level hardware. The code is open-source and available.
https://arxiv.org/abs/2405.07868
The application of Automatic Speech Recognition (ASR) technology in soccer offers numerous opportunities for sports analytics. Specifically, extracting audio commentaries with ASR provides valuable insights into the events of the game, and opens the door to several downstream applications such as automatic highlight generation. This paper presents SoccerNet-Echoes, an augmentation of the SoccerNet dataset with automatically generated transcriptions of audio commentaries from soccer game broadcasts, enhancing video content with rich layers of textual information derived from the game audio using ASR. These textual commentaries, generated using the Whisper model and translated with Google Translate, extend the usefulness of the SoccerNet dataset in diverse applications such as enhanced action spotting, automatic caption generation, and game summarization. By incorporating textual data alongside visual and auditory content, SoccerNet-Echoes aims to serve as a comprehensive resource for the development of algorithms specialized in capturing the dynamics of soccer games. We detail the methods involved in the curation of this dataset and the integration of ASR. We also highlight the implications of a multimodal approach in sports analytics, and how the enriched dataset can support diverse applications, thus broadening the scope of research and development in the field of sports analytics.
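A minimal sketch of the transcription step with the open-source whisper package is shown below; the audio path and model size are placeholders, and the Google Translate step used by the dataset is omitted.

```python
# Minimal sketch of the transcription step with the open-source `whisper` package.
# The audio path and model size are placeholders, not the dataset's exact settings.
import whisper

model = whisper.load_model("medium")
result = model.transcribe("broadcast_audio_half1.mp3", task="transcribe")

for segment in result["segments"]:
    # Each segment carries start/end timestamps that can be aligned with video events.
    print(f"[{segment['start']:.1f}s - {segment['end']:.1f}s] {segment['text']}")
```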
https://arxiv.org/abs/2405.07354
We present SLIP (SAM+CLIP), an enhanced architecture for zero-shot object segmentation. SLIP combines the Segment Anything Model (SAM) \cite{kirillov2023segment} with the Contrastive Language-Image Pretraining (CLIP) \cite{radford2021learning}. By incorporating text prompts into SAM using CLIP, SLIP enables object segmentation without prior training on specific classes or categories. We fine-tune CLIP on a Pokemon dataset, allowing it to learn meaningful image-text representations. SLIP demonstrates the ability to recognize and segment objects in images based on contextual information from text prompts, expanding the capabilities of SAM for versatile object segmentation. Our experiments demonstrate the effectiveness of the SLIP architecture in segmenting objects in images based on textual cues. The integration of CLIP's text-image understanding capabilities into SAM expands the capabilities of the original architecture and enables more versatile and context-aware object segmentation.
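One plausible way to combine the two models is to let SAM propose class-agnostic masks and let CLIP rank them against the text prompt, as sketched below; this is an assumption about the pipeline rather than the authors' code, and the checkpoint names are the public defaults instead of the fine-tuned Pokemon weights.

```python
# Rough SAM+CLIP sketch (not the authors' code): SAM proposes masks, CLIP ranks them
# against the text prompt. Checkpoint names are the public defaults.
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator
from transformers import CLIPModel, CLIPProcessor

image = np.array(Image.open("pokemon.jpg").convert("RGB"))   # placeholder input

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
proposals = SamAutomaticMaskGenerator(sam).generate(image)    # class-agnostic masks

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompt = "a photo of Pikachu"
crops = []
for m in proposals:
    x, y, w, h = (int(v) for v in m["bbox"])                  # XYWH box of each mask
    crops.append(Image.fromarray(image[y:y + h, x:x + w]))

inputs = processor(text=[prompt], images=crops, return_tensors="pt", padding=True)
scores = clip_model(**inputs).logits_per_text[0]              # one score per proposal
best_mask = proposals[int(scores.argmax())]["segmentation"]   # binary mask for the prompt
```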
https://arxiv.org/abs/2405.07284
Despite the significant progress of fully-supervised video captioning, zero-shot methods remain much less explored. In this paper, we propose to take advantage of existing pre-trained large-scale vision and language models to directly generate captions with test time adaptation. Specifically, we bridge video and text using three key models: a general video understanding model XCLIP, a general image understanding model CLIP, and a text generation model GPT-2, due to their source-code availability. The main challenge is how to enable the text generation model to be sufficiently aware of the content in a given video so as to generate corresponding captions. To address this problem, we propose using learnable tokens as a communication medium between frozen GPT-2 and frozen XCLIP as well as frozen CLIP. Differing from the conventional way to train these tokens with training data, we update these tokens with pseudo-targets of the inference data under several carefully crafted loss functions which enable the tokens to absorb video information catered for GPT-2. This procedure can be done in just a few iterations (we use 16 iterations in the experiments) and does not require ground truth data. Extensive experimental results on three widely used datasets, MSR-VTT, MSVD, and VATEX, show 4% to 20% improvements in terms of the main metric CIDEr compared to the existing state-of-the-art methods.
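The sketch below shows the general shape of test-time token learning with a frozen GPT-2: only a small set of prefix embeddings is optimized for 16 steps. The prefix length, the pseudo-target, and the loss are placeholders rather than the paper's crafted objectives, and the XCLIP/CLIP side is omitted.

```python
# Simplified sketch of test-time token learning (prefix length, loss, and the
# pseudo-target below are placeholders, not the paper's exact recipe).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

gpt2 = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2Tokenizer.from_pretrained("gpt2")
for p in gpt2.parameters():
    p.requires_grad_(False)                      # GPT-2 stays frozen

# Learnable "communication" tokens living in GPT-2's embedding space.
prefix = torch.nn.Parameter(torch.randn(1, 8, gpt2.config.n_embd) * 0.02)
opt = torch.optim.Adam([prefix], lr=1e-2)

# Placeholder pseudo-target: token ids a frozen CLIP/XCLIP pipeline would favour
# for this video (a fixed sentence here, only so the sketch runs).
target_ids = tok("a person is playing guitar", return_tensors="pt").input_ids

for step in range(16):                           # the paper reports ~16 iterations
    tgt_emb = gpt2.transformer.wte(target_ids)   # embed pseudo-target tokens
    out = gpt2(inputs_embeds=torch.cat([prefix, tgt_emb], dim=1))
    # Predict the pseudo-target tokens from positions shifted by the prefix length.
    logits = out.logits[:, prefix.size(1) - 1:-1, :]
    loss = torch.nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), target_ids.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
```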
https://arxiv.org/abs/2405.07046
In this paper, we show that transferring knowledge from other domains of video understanding, combined with large-scale learning, can improve the robustness of Video Object Segmentation (VOS) under complex circumstances. Namely, we focus on integrating scene global motion knowledge to improve large-scale semi-supervised Video Object Segmentation. Prior works on VOS mostly rely on direct comparison of semantic and contextual features to perform dense matching between current and past frames, passing over the actual motion structure. On the other hand, the Optical Flow Estimation task aims to approximate the scene motion field, exposing global motion patterns which are typically undiscoverable during all-pairs similarity search. We present WarpFormer, an architecture for semi-supervised Video Object Segmentation that exploits existing knowledge in motion understanding to conduct smoother propagation and more accurate matching. Our framework employs a generic pretrained Optical Flow Estimation network whose prediction is used to warp both past frames and instance segmentation masks to the current frame domain. Consequently, the warped segmentation masks are refined and fused together, aiming to inpaint occluded regions and eliminate artifacts caused by flow field imperfections. Additionally, we employ the novel large-scale MOSE 2023 dataset to train the model on various complex scenarios. Our method demonstrates strong performance on DAVIS 2016/2017 validation (93.0% and 85.9%), DAVIS 2017 test-dev (80.6%) and YouTube-VOS 2019 validation (83.8%) that is competitive with alternative state-of-the-art methods while using a much simpler memory mechanism and instance understanding logic.
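A minimal sketch of the flow-based warping step is given below, using grid_sample to backward-warp a past segmentation mask into the current frame; the refinement and fusion stages are not shown.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(tensor, flow):
    """Backward-warp a past frame or mask (B, C, H, W) to the current frame using a
    dense flow field (B, 2, H, W) given in pixels (flow from current -> past frame)."""
    b, _, h, w = tensor.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0).to(tensor)  # (1, 2, H, W)
    coords = base + flow                     # where each current pixel samples the past frame
    # Normalise to [-1, 1] for grid_sample, which expects (B, H, W, 2) in (x, y) order.
    coords[:, 0] = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords[:, 1] = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = coords.permute(0, 2, 3, 1)
    return F.grid_sample(tensor, grid, mode="bilinear", align_corners=True)

mask = torch.zeros(1, 1, 64, 64)
mask[:, :, 20:40, 20:40] = 1.0               # toy past-frame instance mask
flow = torch.full((1, 2, 64, 64), 5.0)       # pretend the object moved 5 px right and down
warped = warp_with_flow(mask, flow)
```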
https://arxiv.org/abs/2405.07031
In Magnetic Resonance Imaging (MRI), image acquisitions are often undersampled in the measurement domain to accelerate the scanning process, at the expense of image quality. However, image quality is a crucial factor that influences the accuracy of clinical diagnosis; hence, high-quality image reconstruction from undersampled measurements has been a key area of research. Recently, deep learning (DL) methods have emerged as the state-of-the-art for MRI reconstruction, typically involving deep neural networks to transform undersampled MRI images into high-quality MRI images through data-driven processes. Nevertheless, there is clear and significant room for improvement in undersampled DL MRI reconstruction to meet the high standards required for clinical diagnosis, in terms of eliminating aliasing artifacts and reducing image noise. In this paper, we introduce a self-supervised pretraining procedure using contrastive learning to improve the accuracy of undersampled DL MRI reconstruction. We use contrastive learning to transform the MRI image representations into a latent space that maximizes mutual information among different undersampled representations and optimizes the information content at the input of the downstream DL reconstruction models. Our experiments demonstrate improved reconstruction accuracy across a range of acceleration factors and datasets, both quantitatively and qualitatively. Furthermore, our extended experiments validate the proposed framework's robustness under adversarial conditions, such as measurement noise, different k-space sampling patterns, and pathological abnormalities, and also prove the transfer learning capabilities on MRI datasets with completely different anatomy. Additionally, we conducted experiments to visualize and analyze the properties of the proposed MRI contrastive learning latent space.
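The sketch below shows a generic InfoNCE-style objective of the kind used to pull together embeddings of two differently undersampled versions of the same scan; the temperature and batch size are illustrative, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """NT-Xent-style loss: embeddings of two differently undersampled versions of the
    same scan are pulled together; all other pairs in the batch are pushed apart."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature          # (B, B) cosine-similarity matrix
    targets = torch.arange(z1.size(0))          # positives sit on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage: an encoder would map two undersampled reconstructions of the same slice
# to these embeddings; random tensors are used here only so the sketch runs.
z_view1, z_view2 = torch.randn(8, 128), torch.randn(8, 128)
print(info_nce(z_view1, z_view2).item())
```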
https://arxiv.org/abs/2306.00530
Action recognition is a key technology in building interactive metaverses. With the rapid development of deep learning, methods in action recognition have also achieved great advancement. Researchers design and implement backbones from multiple standpoints, which leads to a diversity of methods and new challenges. This paper reviews several action recognition methods based on deep neural networks. We introduce these methods in three parts: 1) Two-stream networks and their variants, which, specifically in this paper, use RGB video frames and the optical flow modality as input; 2) 3D convolutional networks, which exploit the RGB modality directly so that extracting separate motion information is no longer necessary; 3) Transformer-based methods, which introduce models from natural language processing into computer vision and video understanding. We offer an objective view in this review and hopefully provide a reference for future research.
https://arxiv.org/abs/2405.05584
An all-too-present bottleneck for text classification model development is the need to annotate training data and this need is multiplied for multilingual classifiers. Fortunately, contemporary machine translation models are both easily accessible and have dependable translation quality, making it possible to translate labeled training data from one language into another. Here, we explore the effects of using machine translation to fine-tune a multilingual model for a classification task across multiple languages. We also investigate the benefits of using a novel technique, originally proposed in the field of image captioning, to account for potential negative effects of tuning models on translated data. We show that translated data are of sufficient quality to tune multilingual classifiers and that this novel loss technique is able to offer some improvement over models tuned without it.
https://arxiv.org/abs/2405.05478
In this paper, we present LOC-ZSON, a novel Language-driven Object-Centric image representation for object navigation task within complex scenes. We propose an object-centric image representation and corresponding losses for visual-language model (VLM) fine-tuning, which can handle complex object-level queries. In addition, we design a novel LLM-based augmentation and prompt templates for stability during training and zero-shot inference. We implement our method on Astro robot and deploy it in both simulated and real-world environments for zero-shot object navigation. We show that our proposed method can achieve an improvement of 1.38 - 13.38% in terms of text-to-image recall on different benchmark settings for the retrieval task. For object navigation, we show the benefit of our approach in simulation and real world, showing 5% and 16.67% improvement in terms of navigation success rate, respectively.
https://arxiv.org/abs/2405.05363
The diagnosis and treatment of chest diseases play a crucial role in maintaining human health. X-ray examination has become the most common clinical examination means due to its efficiency and cost-effectiveness. Artificial intelligence analysis methods for chest X-ray images are limited by insufficient annotation data and varying levels of annotation, resulting in weak generalization ability and difficulty in clinical dissemination. Here we present EVA-X, an innovative foundational model based on X-ray images with broad applicability to various chest disease detection tasks. EVA-X is the first X-ray image based self-supervised learning method capable of capturing both semantic and geometric information from unlabeled images for universal X-ray image representation. Through extensive experimentation, EVA-X has demonstrated exceptional performance in chest disease analysis and localization, becoming the first model capable of spanning over 20 different chest diseases and achieving leading results in over 11 different detection tasks in the medical field. Additionally, EVA-X significantly reduces the burden of data annotation in the medical AI field, showcasing strong potential in the domain of few-shot learning. The emergence of EVA-X will greatly propel the development and application of foundational medical models, bringing about revolutionary changes in future medical research and clinical practice. Our codes and models are available at: this https URL.
https://arxiv.org/abs/2405.05237
Text-to-image person re-identification (ReID) retrieves pedestrian images according to textual descriptions. Manually annotating textual descriptions is time-consuming, restricting the scale of existing datasets and therefore the generalization ability of ReID models. As a result, we study the transferable text-to-image ReID problem, where we train a model on our proposed large-scale database and directly deploy it to various datasets for evaluation. We obtain substantial training data via Multi-modal Large Language Models (MLLMs). Moreover, we identify and address two key challenges in utilizing the obtained textual descriptions. First, an MLLM tends to generate descriptions with similar structures, causing the model to overfit specific sentence patterns. Thus, we propose a novel method that uses MLLMs to caption images according to various templates. These templates are obtained using a multi-turn dialogue with a Large Language Model (LLM). Therefore, we can build a large-scale dataset with diverse textual descriptions. Second, an MLLM may produce incorrect descriptions. Hence, we introduce a novel method that automatically identifies words in a description that do not correspond with the image. This method is based on the similarity between one text and all patch token embeddings in the image. Then, we mask these words with a larger probability in the subsequent training epoch, alleviating the impact of noisy textual descriptions. The experimental results demonstrate that our methods significantly boost the direct transfer text-to-image ReID performance. Benefiting from the pre-trained model weights, we also achieve state-of-the-art performance in the traditional evaluation settings.
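A rough sketch of the word-image correspondence check is shown below: each caption word is scored by its best similarity to any image patch token, and poorly matching words receive a higher masking probability in the next epoch. The threshold and probabilities are illustrative assumptions, not the paper's values.

```python
import torch

def noisy_word_mask_probs(word_embs, patch_embs, base_p=0.15, extra_p=0.5):
    """For each caption word, take its best cosine similarity to any image patch token;
    words that match no patch well get a higher masking probability in the next epoch.
    The median split and the probabilities here are illustrative only."""
    w = torch.nn.functional.normalize(word_embs, dim=1)    # (num_words, D)
    p = torch.nn.functional.normalize(patch_embs, dim=1)   # (num_patches, D)
    best_sim = (w @ p.t()).max(dim=1).values               # (num_words,)
    suspicious = best_sim < best_sim.median()               # crude split for the sketch
    return torch.where(suspicious, torch.tensor(base_p + extra_p), torch.tensor(base_p))

words   = torch.randn(12, 512)   # token embeddings of one generated caption
patches = torch.randn(196, 512)  # ViT patch token embeddings of the matching image
print(noisy_word_mask_probs(words, patches))
```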
https://arxiv.org/abs/2405.04940