The systematic evaluation and understanding of computer vision models under varying conditions require large amounts of data with comprehensive and customized labels, which real-world vision datasets rarely satisfy. While current synthetic data generators offer a promising alternative, particularly for embodied AI tasks, they often fall short for computer vision tasks due to low asset and rendering quality, limited diversity, and unrealistic physical properties. We introduce the BEHAVIOR Vision Suite (BVS), a set of tools and assets to generate fully customized synthetic data for systematic evaluation of computer vision models, based on the newly developed embodied AI benchmark, BEHAVIOR-1K. BVS supports a large number of adjustable parameters at the scene level (e.g., lighting, object placement), the object level (e.g., joint configuration, attributes such as "filled" and "folded"), and the camera level (e.g., field of view, focal length). Researchers can arbitrarily vary these parameters during data generation to perform controlled experiments. We showcase three example application scenarios: systematically evaluating the robustness of models across different continuous axes of domain shift, evaluating scene understanding models on the same set of images, and training and evaluating simulation-to-real transfer for a novel vision task: unary and binary state prediction. Project website: this https URL
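To make the controlled-experiment workflow concrete, here is a minimal sketch of a single-axis parameter sweep; the parameter names and the `generate_image` interface are hypothetical stand-ins, not the actual BVS API:

```python
# Hypothetical parameter grid spanning the three levels BVS exposes
# (scene, object, camera); names are illustrative, not the real BVS API.
scene_params = {"lighting_intensity": [0.3, 0.6, 1.0]}
object_params = {"cabinet_joint_angle_deg": [0.0, 45.0, 90.0]}
camera_params = {"field_of_view_deg": [60, 90, 120]}

def generate_image(lighting_intensity, cabinet_joint_angle_deg, field_of_view_deg):
    """Placeholder for a renderer call; a generator like BVS would return
    an image plus its customized labels for these settings."""
    return {"lighting": lighting_intensity,
            "joint_deg": cabinet_joint_angle_deg,
            "fov_deg": field_of_view_deg}

# Vary one axis at a time while holding the others fixed: the pattern used
# to evaluate robustness along a single continuous axis of domain shift.
for fov in camera_params["field_of_view_deg"]:
    print(generate_image(lighting_intensity=0.6,
                         cabinet_joint_angle_deg=45.0,
                         field_of_view_deg=fov))
```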
https://arxiv.org/abs/2405.09546
Aerial imagery is increasingly used in Earth science and natural resource management as a complement to labor-intensive ground-based surveys. Aerial systems can collect overlapping images that provide multiple views of each location from different perspectives. However, most prediction approaches (e.g. for tree species classification) use a single, synthesized top-down "orthomosaic" image as input that contains little to no information about the vertical aspects of objects and may include processing artifacts. We propose an alternate approach that generates predictions directly on the raw images and accurately maps these predictions into geospatial coordinates using semantic meshes. This method, released as a user-friendly open-source toolkit, enables analysts to use the highest quality data for predictions, capture information about the sides of objects, and leverage multiple viewpoints of each location for added robustness. We demonstrate the value of this approach on a new benchmark dataset of four forest sites in the western U.S. that consists of drone images, photogrammetry results, predicted tree locations, and species classification data derived from manual surveys. We show that our proposed multiview method improves classification accuracy from 53% to 75% relative to an orthomosaic baseline on a challenging cross-site tree species classification task.
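The multiview fusion step reduces to a simple pattern: aggregate per-image predictions on shared mesh faces whose geospatial coordinates are known from photogrammetry. The data layout and voting rule below are illustrative assumptions, not the toolkit's actual API:

```python
from collections import defaultdict

# Hypothetical inputs: per-image class predictions for pixels that a
# photogrammetric camera model has already associated with mesh faces.
# Each record is (face_id, predicted_class, confidence).
predictions = [
    (101, "ponderosa_pine", 0.9),
    (101, "ponderosa_pine", 0.7),
    (101, "douglas_fir", 0.4),
    (102, "douglas_fir", 0.8),
]

# Fuse multiple viewpoints per mesh face by confidence-weighted voting;
# the face's geospatial coordinates (known from photogrammetry) then
# carry the fused label into map space.
votes = defaultdict(lambda: defaultdict(float))
for face_id, cls, conf in predictions:
    votes[face_id][cls] += conf

fused = {face: max(cls_scores, key=cls_scores.get)
         for face, cls_scores in votes.items()}
print(fused)  # {101: 'ponderosa_pine', 102: 'douglas_fir'}
```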
https://arxiv.org/abs/2405.09544
Esophageal cancer is one of the most common types of cancer worldwide and ranks sixth in cancer-related mortality. Accurate computer-assisted diagnosis of cancer progression can help physicians effectively customize personalized treatment plans. Currently, CT-based cancer diagnosis methods have received much attention for their comprehensive ability to examine patients' conditions. However, multi-modal methods may introduce information redundancy, leading to underperformance. In addition, efficient and effective interactions between multi-modal representations remain underexplored, and existing work offers little insight into the prognostic correlations among multi-modality features. In this work, we introduce a multi-modal heterogeneous graph-based conditional feature-guided diffusion model for lymph node metastasis diagnosis based on CT images as well as clinical measurements and radiomics data. To explore the intricate relationships between multi-modal features, we construct a heterogeneous graph. Following this, a conditional feature-guided diffusion approach is applied to eliminate information redundancy. Moreover, we propose a masked relational representation learning strategy, aiming to uncover the latent prognostic correlations and priorities of primary tumor and lymph node image representations. Various experimental results validate the effectiveness of our proposed method. The code is available at this https URL.
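As a rough illustration of the graph construction step (the node and edge types below are our own illustrative choices, not the paper's schema):

```python
import networkx as nx

# Sketch of a multi-modal heterogeneous graph for one patient: nodes carry
# CT image features, clinical measurements, or radiomics features, and
# typed edges connect the modalities.
g = nx.Graph()
g.add_node("ct_primary_tumor", modality="ct")
g.add_node("ct_lymph_node", modality="ct")
g.add_node("clinical", modality="clinical")
g.add_node("radiomics", modality="radiomics")
for u in ["ct_primary_tumor", "ct_lymph_node"]:
    for v in ["clinical", "radiomics"]:
        g.add_edge(u, v, etype=f"{g.nodes[u]['modality']}-{g.nodes[v]['modality']}")
g.add_edge("ct_primary_tumor", "ct_lymph_node", etype="ct-ct")
print(g.number_of_nodes(), g.number_of_edges())  # 4 5
```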
https://arxiv.org/abs/2405.09539
This study evaluates the performance of Recurrent Neural Networks (RNNs) and Transformers in replicating cross-language structural priming: a key indicator of abstract grammatical representations in human language processing. Focusing on Chinese-English priming, which involves two typologically distinct languages, we examine how these models handle the robust phenomenon of structural priming, where exposure to a particular sentence structure increases the likelihood of selecting a similar structure subsequently. Additionally, we utilize large language models (LLMs) to measure the cross-lingual structural priming effect. Our findings indicate that Transformers outperform RNNs in generating primed sentence structures, challenging the conventional belief that human sentence processing primarily involves recurrent and immediate processing and suggesting a role for cue-based retrieval mechanisms. Overall, this work contributes to our understanding of how computational models may reflect human cognitive processes in multilingual contexts.
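A minimal sketch of how such a priming effect can be measured with an autoregressive model: compare the score of a target structure after a structurally matching prime versus a baseline prime. The sentences and the toy scorer below are purely illustrative:

```python
def sentence_logprob(model, context, sentence):
    # A real implementation would sum token log-probabilities from an
    # autoregressive LM conditioned on `context`; `model` is any callable
    # with that behavior.
    return model(context, sentence)

# Illustrative Chinese primes and an English passive target.
passive_prime = "花瓶被男孩打碎了。"   # passive prime (Chinese)
active_prime = "男孩打碎了花瓶。"      # active baseline prime (Chinese)
passive_target = "The window was broken by the storm."

def priming_effect(model):
    primed = sentence_logprob(model, passive_prime, passive_target)
    baseline = sentence_logprob(model, active_prime, passive_target)
    # Positive difference = exposure to a passive raised the probability
    # of producing a passive, i.e. a structural priming effect.
    return primed - baseline

toy_scorer = lambda ctx, s: -len(s) * (0.9 if "被" in ctx else 1.0)  # fake LM
print(priming_effect(toy_scorer))  # positive under the fake scorer
```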
https://arxiv.org/abs/2405.09508
We present QueryNER, a manually-annotated dataset and accompanying model for e-commerce query segmentation. Prior work in sequence labeling for e-commerce has largely addressed aspect-value extraction, which focuses on extracting portions of a product title or query for narrowly defined aspects. Our work instead focuses on the goal of dividing a query into meaningful chunks with broadly applicable types. We report baseline tagging results and conduct experiments comparing token and entity dropping for null and low-recall query recovery. Challenging test sets are created using automatic transformations and show how simple data augmentation techniques can make the models more robust to noise. We make the QueryNER dataset publicly available.
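Entity dropping is easy to picture on BIO-tagged output; a sketch with a hypothetical tag set (not QueryNER's actual schema):

```python
# Given BIO tags over a segmented query, drop the least essential chunks
# and retry retrieval: a simple recovery strategy for null/low-recall
# queries. The tag names here are hypothetical.
def drop_entities(tokens, tags, droppable={"MODIFIER", "OCCASION"}):
    kept, i = [], 0
    while i < len(tokens):
        if tags[i].startswith("B-") and tags[i][2:] in droppable:
            i += 1                      # skip the chunk's B- token...
            while i < len(tags) and tags[i].startswith("I-"):
                i += 1                  # ...and its I- continuations
        else:
            kept.append(tokens[i])
            i += 1
    return kept

tokens = ["red", "running", "shoes", "for", "christmas"]
tags   = ["B-MODIFIER", "B-PRODUCT", "I-PRODUCT", "O", "B-OCCASION"]
print(drop_entities(tokens, tags))  # ['running', 'shoes', 'for']
```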
https://arxiv.org/abs/2405.09507
We introduce ParaNames, a massively multilingual parallel name resource consisting of 140 million names spanning over 400 languages. Names are provided for 16.8 million entities, and each entity is mapped from a complex type hierarchy to a standard type (PER/LOC/ORG). Using Wikidata as a source, we create the largest resource of this type to date. We describe our approach to filtering and standardizing the data to provide the best quality possible. ParaNames is useful for multilingual language processing, both in defining tasks for name translation/transliteration and as supplementary data for tasks such as named entity recognition and linking. We demonstrate the usefulness of ParaNames on two tasks. First, we perform canonical name translation between English and 17 other languages. Second, we use it as a gazetteer for multilingual named entity recognition, obtaining performance improvements on all 10 languages evaluated.
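One common way such a resource feeds into NER is as a gazetteer feature; a small sketch under assumed entry and tag formats (not ParaNames' actual data format):

```python
def gazetteer_features(tokens, gazetteer, max_len=3):
    """Mark token spans that match a name list with their entity type,
    preferring longer matches. Entries below are illustrative."""
    feats = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    for i in range(len(tokens)):
        for n in range(max_len, 0, -1):
            etype = gazetteer.get(tuple(lowered[i:i + n]))
            if etype:
                for j in range(i, i + n):
                    feats[j] = etype
                break
    return feats

gazetteer = {("angela", "merkel"): "PER", ("göteborg",): "LOC"}
print(gazetteer_features(["Angela", "Merkel", "visited", "Göteborg"], gazetteer))
# ['PER', 'PER', 'O', 'LOC']
```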
https://arxiv.org/abs/2405.09496
The primary color profile of the same identity is assumed to remain consistent in typical Person Re-identification (Person ReID) tasks. However, this assumption may not hold in real-world situations, where images exhibit varying color profiles because of cross-modality cameras or the same identity wearing different clothing. To address this issue, we propose Color Space Learning (CSL) for these Cross-Color Person ReID problems. Specifically, CSL guides the model to be less color-sensitive with two modules: Image-level Color-Augmentation and Pixel-level Color-Transformation. The first module increases the color diversity of the inputs and guides the model to focus more on non-color information. The second module projects every pixel of the input images onto a new color space. In addition, we introduce a new Person ReID benchmark across RGB and Infrared modalities, NTU-Corridor, which is the first with privacy agreements from all participants. To evaluate the effectiveness and robustness of our proposed CSL, we evaluate it on several Cross-Color Person ReID benchmarks. Our method consistently surpasses state-of-the-art methods. The code and benchmark are available at: this https URL
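A shape-level sketch of plausible forms for the two modules (a random channel shuffle for the image-level augmentation and a learnable 1x1 convolution for the pixel-level projection); this is an assumption, not the paper's exact design:

```python
import torch
import torch.nn as nn

class PixelColorTransform(nn.Module):
    """Pixel-level color transformation sketch: project every RGB pixel
    into a learned 3-channel color space with a 1x1 convolution."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Conv2d(3, 3, kernel_size=1)

    def forward(self, x):
        return self.proj(x)

def image_color_augment(x):
    # Image-level color augmentation sketch: randomly permute RGB channels
    # so the model is pushed toward color-insensitive (shape/texture) cues.
    perm = torch.randperm(3)
    return x[:, perm, :, :]

x = torch.rand(2, 3, 64, 64)             # a batch of RGB crops
x_new_space = PixelColorTransform()(image_color_augment(x))
print(x_new_space.shape)                  # torch.Size([2, 3, 64, 64])
```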
https://arxiv.org/abs/2405.09487
Using large language models (LLMs) for educational applications like dialogue-based teaching is a hot topic. Effective teaching, however, requires teachers to adapt the difficulty of content and explanations to the education level of their students. Even the best LLMs today struggle to do this well. If we want to improve LLMs on this adaptation task, we need to be able to measure adaptation success reliably. However, current Static metrics for text difficulty, like the Flesch-Kincaid Reading Ease score, are known to be crude and brittle. We therefore introduce and evaluate a new set of Prompt-based metrics for text difficulty. Based on a user study, we create Prompt-based metrics as inputs for LLMs. They leverage LLMs' general language understanding capabilities to capture more abstract and complex features than Static metrics. Regression experiments show that adding our Prompt-based metrics significantly improves text difficulty classification over Static metrics alone. Our results demonstrate the promise of using LLMs to evaluate text adaptation to different education levels.
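To illustrate what a Prompt-based metric looks like in practice, a minimal sketch with an illustrative feature and wording; `query_llm` is a placeholder for any chat-completion client:

```python
def query_llm(prompt: str) -> str:
    # Placeholder: wire this to a real chat-completion API. The canned
    # reply below only keeps the sketch self-contained and runnable.
    return "3"

def vocabulary_difficulty(text: str) -> int:
    """One Prompt-based metric: an LLM rates a single abstract feature of
    the text on a fixed scale; the rating then serves as a feature in a
    downstream difficulty classifier."""
    prompt = (
        "On a scale from 1 (elementary school) to 5 (university), rate how "
        "advanced the vocabulary of the following text is. Answer with a "
        f"single digit.\n\nText: {text}"
    )
    return int(query_llm(prompt).strip()[0])

print(vocabulary_difficulty("Photosynthesis converts light into chemical energy."))
```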
https://arxiv.org/abs/2405.09482
With the advent of image super-resolution (SR) algorithms, how to evaluate the quality of generated SR images has become an urgent task. Although full-reference methods perform well in SR image quality assessment (SR-IQA), their reliance on high-resolution (HR) images limits their practical applicability. Leveraging available reconstruction information as much as possible, such as low-resolution (LR) images and the scale factors, is a promising way to enhance SR-IQA performance without HR images for reference. In this letter, we attempt to evaluate the perceptual quality and reconstruction fidelity of SR images considering LR images and scale factors. Specifically, we propose a novel dual-branch reduced-reference SR-IQA network, i.e., Perception- and Fidelity-aware SR-IQA (PFIQA). The perception-aware branch evaluates the perceptual quality of SR images by leveraging the merits of the global modeling of Vision Transformers (ViT) and the local relations of ResNet, and incorporates the scale factor to enable comprehensive visual perception. Meanwhile, the fidelity-aware branch assesses the reconstruction fidelity between LR and SR images through their visual perception. The combination of the two branches aligns substantially with the human visual system, enabling a comprehensive SR image evaluation. Experimental results indicate that our PFIQA outperforms current state-of-the-art models across three widely-used SR-IQA benchmarks. Notably, PFIQA excels in assessing the quality of real-world SR images.
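A shape-level sketch of the dual-branch idea (not the paper's architecture: the real model uses ViT and ResNet backbones, while a tiny CNN stands in here):

```python
import torch
import torch.nn as nn

class DualBranchSketch(nn.Module):
    """Sketch: a perception branch scores the SR image conditioned on the
    scale factor, and a fidelity branch compares LR and SR features."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, feat_dim, 3, padding=1),
                                      nn.ReLU(),
                                      nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.percep_head = nn.Linear(feat_dim + 1, 1)   # +1 for scale factor
        self.fidelity_head = nn.Linear(feat_dim, 1)

    def forward(self, sr, lr, scale):
        f_sr, f_lr = self.backbone(sr), self.backbone(lr)
        percep = self.percep_head(torch.cat([f_sr, scale[:, None]], dim=1))
        fidelity = self.fidelity_head(f_sr - f_lr)      # LR/SR feature gap
        return (percep + fidelity).squeeze(1)           # combined quality

sr = torch.rand(2, 3, 128, 128)
lr = torch.rand(2, 3, 128, 128)   # LR upsampled to SR size for simplicity
print(DualBranchSketch()(sr, lr, torch.tensor([4.0, 2.0])).shape)  # (2,)
```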
https://arxiv.org/abs/2405.09472
In light of the widespread application of Automatic Speech Recognition (ASR) systems, their security concerns have received much more attention than ever before, primarily due to the susceptibility of Deep Neural Networks to adversarial perturbations. Previous studies have illustrated that surreptitiously crafted adversarial perturbations enable the manipulation of speech recognition systems, resulting in the production of malicious commands. These attack methods mostly require adding noise perturbations under $\ell_p$ norm constraints, inevitably leaving behind artifacts of manual modification. Recent research has alleviated this limitation by manipulating style vectors to synthesize adversarial examples based on Text-to-Speech (TTS) synthesis audio. However, style modifications based on optimization objectives significantly reduce the controllability and editability of audio styles. In this paper, we propose an attack on ASR systems based on user-customized style transfer. We first test the effect of the Style Transfer Attack (STA), which combines style transfer and adversarial attack sequentially. Then, as an improvement, we propose an iterative Style Code Attack (SCA) to maintain audio quality. Experimental results show that our method can meet the need for user-customized styles and achieve an attack success rate of 82%, while preserving sound naturalness, as confirmed by our user study.
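In spirit, the iterative attack optimizes a bounded perturbation of the style code; a hedged sketch assuming differentiable TTS and ASR-loss interfaces (stand-ins, not the paper's implementation):

```python
import torch

def iterative_style_code_attack(tts, asr_loss, style_code, text,
                                target_cmd, steps=50, lr=0.01, eps=0.05):
    """Sketch of an SCA-like loop under our assumptions: nudge the style
    vector so the synthesized audio decodes to the target command, with a
    small perturbation bound to preserve audio quality. `tts` and
    `asr_loss` must be differentiable callables."""
    delta = torch.zeros_like(style_code, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        audio = tts(text, style_code + delta)
        loss = asr_loss(audio, target_cmd)  # low when ASR hears target_cmd
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():               # bound the style perturbation
            delta.clamp_(-eps, eps)
    return style_code + delta.detach()
```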
https://arxiv.org/abs/2405.09470
Accurate detection of vulvovaginal candidiasis is critical for women's health, yet its sparse distribution and visually ambiguous characteristics pose significant challenges for accurate identification by pathologists and neural networks alike. Our eye-tracking data reveals that areas garnering sustained attention - yet not marked by experts after deliberation - are often aligned with false positives of neural networks. Leveraging this finding, we introduce Gaze-DETR, a pioneering method that integrates gaze data to enhance neural network precision by diminishing false positives. Gaze-DETR incorporates a universal gaze-guided warm-up protocol applicable across various detection methods and a gaze-guided rectification strategy specifically designed for DETR-based models. Our comprehensive tests confirm that Gaze-DETR surpasses existing leading methods, showcasing remarkable improvements in detection accuracy and generalizability.
https://arxiv.org/abs/2405.09463
Glass largely blurs the boundary between the real world and its reflection. Its special transmittance and reflectance properties confound semantic tasks in machine vision. Therefore, clearing the boundary created by glass, and avoiding over-captured features that act as false-positive information in deep structures, matters for constraining the segmentation of reflective surfaces and penetrating glass. We propose the Fourier Boundary Features Network with Wider Catchers (FBWC), which may be the first attempt to use sufficiently wide, horizontal, shallow branches without vertical deepening to guide fine-granularity segmentation boundaries through primary glass semantic information. Specifically, we design Wider Coarse-Catchers (WCC) to anchor large-area segmentation and reduce excessive extraction from a structural perspective. We embed fine-grained features via Cross Transpose Attention (CTA), which is introduced to avoid incomplete areas within the boundary caused by reflection noise. To excavate glass features and balance high-low layer context, a learnable Fourier Convolution Controller (FCC) is proposed to regulate information integration robustly. The proposed method has been validated on three public glass segmentation datasets. Experimental results show that it yields better segmentation performance than state-of-the-art (SOTA) methods in glass image segmentation.
https://arxiv.org/abs/2405.09459
This paper presents a comprehensive analysis of explainable fact-checking through a series of experiments, focusing on the ability of large language models to verify public health claims and provide explanations or justifications for their veracity assessments. We examine the effectiveness of zero/few-shot prompting and parameter-efficient fine-tuning across various open and closed-source models, assessing their performance in both isolated and joint tasks of veracity prediction and explanation generation. Importantly, we employ a dual evaluation approach comprising previously established automatic metrics and a novel set of criteria through human evaluation. Our automatic evaluation indicates that, within the zero-shot scenario, GPT-4 emerges as the standout performer, but in few-shot and parameter-efficient fine-tuning contexts, open-source models demonstrate their capacity to not only bridge the performance gap but, in some instances, surpass GPT-4. Human evaluation reveals yet more nuance and indicates potential problems with the gold explanations.
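For the joint task, a zero-shot prompt might look like the following; the wording and label set are illustrative, not the paper's exact prompts:

```python
# Sketch of a zero-shot prompt for joint veracity prediction and
# explanation generation on public health claims.
def build_prompt(claim: str) -> str:
    return (
        "You are a public-health fact-checker.\n"
        f"Claim: {claim}\n"
        "1. Label the claim as one of: true, false, mixture, unproven.\n"
        "2. Then justify the label in 2-3 sentences, citing the evidence "
        "you relied on.\n"
        "Answer:"
    )

print(build_prompt("Vitamin C cures the common cold."))
```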
https://arxiv.org/abs/2405.09454
Modern democracies face a critical issue of declining citizen participation in decision-making. Online discussion forums are an important avenue for enhancing citizen participation. This thesis proposal 1) identifies the challenges involved in facilitating large-scale online discussions with Natural Language Processing (NLP), 2) suggests solutions to these challenges by incorporating hybrid human-AI technologies, and 3) investigates what these technologies can reveal about individual perspectives in online discussions. We propose a three-layered hierarchy for representing perspectives that can be obtained by a mixture of human intelligence and large language models. We illustrate how these representations can draw insights into the diversity of perspectives and allow us to investigate interactions in online discussions.
https://arxiv.org/abs/2405.09439
3D content creation plays a vital role in various applications, such as gaming, robotics simulation, and virtual reality. However, the process is labor-intensive and time-consuming, requiring skilled designers to invest considerable effort in creating a single 3D asset. To address this challenge, text-to-3D generation technologies have emerged as a promising solution for automating 3D creation. Leveraging the success of large vision language models, these techniques aim to generate 3D content based on textual descriptions. Despite recent advancements in this area, existing solutions still face significant limitations in terms of generation quality and efficiency. In this survey, we conduct an in-depth investigation of the latest text-to-3D creation methods. We provide a comprehensive background on text-to-3D creation, including discussions on datasets employed in training and evaluation metrics used to assess the quality of generated 3D models. Then, we delve into the various 3D representations that serve as the foundation for the 3D generation process. Furthermore, we present a thorough comparison of the rapidly growing literature on generative pipelines, categorizing them into feedforward generators, optimization-based generation, and view reconstruction approaches. By examining the strengths and weaknesses of these methods, we aim to shed light on their respective capabilities and limitations. Lastly, we point out several promising avenues for future research. With this survey, we hope to inspire researchers further to explore the potential of open-vocabulary text-conditioned 3D content creation.
https://arxiv.org/abs/2405.09431
Recent advances in aerial robotics have enabled the use of multirotor vehicles for autonomous payload transportation. Resorting only to classical methods to reliably model a quadrotor carrying a cable-slung load poses significant challenges. On the other hand, purely data-driven learning methods do not comply by design with the problem's physical constraints, especially in states that are not densely represented in the training data. In this work, we explore the use of physics-informed neural networks to learn an end-to-end model of the multirotor-slung-load system and, at a given time, estimate a sequence of future system states. An LSTM encoder-decoder with an attention mechanism is used to capture the dynamics of the system. To guarantee cohesiveness between the multiple predicted states of the system, we propose the use of a physics-based term in the loss function, which includes a discretized physical model derived from first principles together with slack variables that allow for a small mismatch between expected and predicted values. To train the model, a dataset using a real-world quadrotor carrying a slung load was curated and is made available. Prediction results are presented and corroborate the feasibility of the approach. The proposed method outperforms both the first-principles physical model and a comparable neural network model trained without the proposed physics regularization.
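A sketch of the loss structure this describes, under our reading (a data term plus a discretized physics residual relaxed by penalized slack variables); the toy constant-velocity residual is purely illustrative:

```python
import torch

def physics_informed_loss(pred_states, true_states, physics_residual,
                          slack, lam=1.0, mu=0.1):
    """Data term + discretized physics residual, with slack variables
    absorbing a small model-reality mismatch and a penalty keeping the
    slack small. Our reading of the structure, not the paper's exact form."""
    data_term = torch.mean((pred_states - true_states) ** 2)
    physics_term = torch.mean((physics_residual(pred_states) - slack) ** 2)
    slack_penalty = torch.mean(slack ** 2)
    return data_term + lam * physics_term + mu * slack_penalty

# Toy usage: residual of x_{t+1} = x_t + v*dt for a constant-velocity model.
def residual(states, dt=0.1, v=1.0):
    return states[1:] - states[:-1] - v * dt

states_pred = torch.rand(10, requires_grad=True)
states_true = torch.rand(10)
slack = torch.zeros(9, requires_grad=True)
loss = physics_informed_loss(states_pred, states_true, residual, slack)
loss.backward()
print(float(loss))
```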
https://arxiv.org/abs/2405.09428
This paper introduces the Global-Local Image Perceptual Score (GLIPS), an image metric designed to assess the photorealistic image quality of AI-generated images with a high degree of alignment to human visual perception. Traditional metrics such as FID and KID scores do not align closely with human evaluations. The proposed metric incorporates advanced transformer-based attention mechanisms to assess local similarity and Maximum Mean Discrepancy (MMD) to evaluate global distributional similarity. To evaluate the performance of GLIPS, we conducted a human study on photorealistic image quality. Comprehensive tests across various generative models demonstrate that GLIPS consistently outperforms existing metrics like FID, SSIM, and MS-SSIM in terms of correlation with human scores. Additionally, we introduce the Interpolative Binning Scale (IBS), a refined scaling method that enhances the interpretability of metric scores by aligning them more closely with human evaluative standards. The proposed metric and scaling approach not only provide more reliable assessments of AI-generated images but also suggest pathways for future enhancements in image generation technologies.
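The global component is standard enough to sketch: an RBF-kernel MMD between feature sets of real and generated images (a simple biased estimator; the attention-based local term is omitted here):

```python
import torch

def gaussian_mmd(x, y, sigma=1.0):
    """Simple (biased) Maximum Mean Discrepancy estimate with an RBF
    kernel; feature extraction is assumed to have happened upstream."""
    def kernel(a, b):
        d2 = torch.cdist(a, b) ** 2
        return torch.exp(-d2 / (2 * sigma ** 2))
    return (kernel(x, x).mean() + kernel(y, y).mean()
            - 2 * kernel(x, y).mean())

real_feats = torch.randn(256, 128)   # features of real images
gen_feats = torch.randn(256, 128)    # features of generated images
print(float(gaussian_mmd(real_feats, gen_feats)))
```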
https://arxiv.org/abs/2405.09426
Objective: Federated Learning (FL) enables collaborative model training while keeping data local. Currently, most FL studies in radiology are conducted in simulated environments due to numerous hurdles impeding its translation into practice. The few existing real-world FL initiatives rarely communicate the specific measures taken to overcome these hurdles, leaving behind a significant knowledge gap. Despite efforts to implement real-world FL, there is a notable lack of comprehensive assessments comparing FL to less complex alternatives. Materials & Methods: We extensively reviewed the FL literature, categorizing insights, along with our own findings, by their nature and by the phase of establishing an FL initiative, and summarized them into a comprehensive guide. We developed our own FL infrastructure within the German Radiological Cooperative Network (RACOON) and demonstrated its functionality by training FL models on lung pathology segmentation tasks across six university hospitals. We extensively evaluated FL against less complex alternatives in three distinct evaluation scenarios. Results: The proposed guide outlines the essential steps, identified hurdles, and proposed solutions for establishing successful FL initiatives that conduct real-world experiments. Our experimental results show that FL outperforms less complex alternatives in all evaluation scenarios, justifying the effort required to translate FL into real-world applications. Discussion & Conclusion: Our proposed guide aims to aid future FL researchers in circumventing pitfalls and accelerating the translation of FL into radiological applications. Our results underscore the value of the efforts needed to translate FL into real-world applications by demonstrating advantageous performance over alternatives, and emphasize the importance of strategic organization and robust management of distributed data and infrastructure in real-world settings.
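The training loop behind such an initiative is typically some variant of federated averaging; a generic one-round sketch (this says nothing about the RACOON stack's actual implementation):

```python
import torch

def federated_average(client_states, client_sizes):
    """Generic FedAvg round: average client model weights, weighted by
    local dataset size, without any raw data leaving the clients."""
    total = sum(client_sizes)
    avg = {}
    for key in client_states[0]:
        avg[key] = sum(state[key] * (n / total)
                       for state, n in zip(client_states, client_sizes))
    return avg

# Toy usage with two "hospitals" holding the same tiny model.
m1 = {"w": torch.tensor([1.0, 2.0])}
m2 = {"w": torch.tensor([3.0, 4.0])}
print(federated_average([m1, m2], client_sizes=[100, 300]))
# {'w': tensor([2.5000, 3.5000])}
```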
https://arxiv.org/abs/2405.09409
Contrastive pretraining provides robust representations by ensuring their invariance to different image transformations while simultaneously preventing representational collapse. Equivariant contrastive learning, on the other hand, provides representations sensitive to specific image transformations while remaining invariant to others. By introducing equivariance to time-induced transformations, such as disease-related anatomical changes in longitudinal imaging, the model can effectively capture such changes in the representation space. In this work, we propose a Time-equivariant Contrastive Learning (TC) method. First, an encoder embeds two unlabeled scans from different time points of the same patient into the representation space. Next, a temporal equivariance module is trained to predict the representation of a later visit based on the representation from one of the previous visits and the corresponding time interval with a novel regularization loss term while preserving the invariance property to irrelevant image transformations. On a large longitudinal dataset, our model clearly outperforms existing equivariant contrastive methods in predicting progression from intermediate age-related macular degeneration (AMD) to advanced wet-AMD within a specified time-window.
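A shape-level sketch of the temporal equivariance module as described (our reading of the mechanism, not the paper's exact architecture or loss):

```python
import torch
import torch.nn as nn

class TemporalEquivarianceSketch(nn.Module):
    """Given the representation of an earlier visit and the time interval,
    predict the later visit's representation."""
    def __init__(self, dim=128):
        super().__init__()
        self.predictor = nn.Sequential(nn.Linear(dim + 1, dim),
                                       nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, z_early, dt):
        return self.predictor(torch.cat([z_early, dt[:, None]], dim=1))

z_early = torch.randn(4, 128)              # encoder output, earlier scan
z_late = torch.randn(4, 128)               # encoder output, later scan
dt = torch.tensor([0.5, 1.0, 2.0, 0.25])   # time intervals (e.g. years)
pred = TemporalEquivarianceSketch()(z_early, dt)
loss = nn.functional.mse_loss(pred, z_late)  # equivariance regularizer
print(float(loss))
```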
https://arxiv.org/abs/2405.09404
A fundamental tenet of pattern recognition is that overlap between training and testing sets causes an optimistic accuracy estimate. Deep CNNs for face recognition are trained for N-way classification of the identities in the training set. Accuracy is commonly estimated as average 10-fold classification accuracy on image pairs from test sets such as LFW, CALFW, CPLFW, CFP-FP and AgeDB-30. Because train and test sets have been independently assembled, images and identities in any given test set may also be present in any given training set. In particular, our experiments reveal a surprising degree of identity and image overlap between the LFW family of test sets and the MS1MV2 training set. Our experiments also reveal identity label noise in MS1MV2. We compare accuracy achieved with same-size MS1MV2 subsets that are identity-disjoint and not identity-disjoint with LFW, to reveal the size of the optimistic bias. Using more challenging test sets from the LFW family, we find that the size of the optimistic bias is larger for more challenging test sets. Our results highlight the lack of, and the need for, identity-disjoint train and test methodology in face recognition research.
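The audit the paper calls for is straightforward to sketch once identity labels are normalized to a shared naming scheme (the identities below are placeholders):

```python
# Measure identity overlap between a training set and a test set, then
# carve out an identity-disjoint training subset for a controlled
# comparison of the optimistic bias.
train_ids = {"A_Person": 120, "B_Person": 80, "C_Person": 300}  # id -> #images
test_ids = {"B_Person", "D_Person"}

overlap = set(train_ids) & test_ids
print(f"Overlapping identities: {overlap}")                     # {'B_Person'}

disjoint = {i: n for i, n in train_ids.items() if i not in test_ids}
print(f"Identity-disjoint training images: {sum(disjoint.values())}")  # 420
```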
https://arxiv.org/abs/2405.09403