Video try-on is a challenging task and has not been well tackled in previous works. The main obstacle lies in preserving the details of the clothing and modeling the coherent motions simultaneously. Faced with those difficulties, we address video try-on by proposing a diffusion-based framework named "Tunnel Try-on." The core idea is excavating a "focus tunnel" in the input video that gives close-up shots around the clothing regions. We zoom in on the region in the tunnel to better preserve the fine details of the clothing. To generate coherent motions, we first leverage the Kalman filter to construct smooth crops in the focus tunnel and inject the position embedding of the tunnel into attention layers to improve the continuity of the generated videos. In addition, we develop an environment encoder to extract the context information outside the tunnels as supplementary cues. Equipped with these techniques, Tunnel Try-on keeps the fine details of the clothing and synthesizes stable and smooth videos. Demonstrating significant advancements, Tunnel Try-on could be regarded as the first attempt toward the commercial-level application of virtual try-on in videos.
https://arxiv.org/abs/2404.17571
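The Kalman-smoothed focus tunnel can be pictured with a minimal sketch: a constant-velocity Kalman filter that smooths the noisy per-frame centers of the clothing-region crops. The function name, state model, and noise parameters below are illustrative assumptions; the abstract does not specify the paper's actual filter design.

```python
import numpy as np

def kalman_smooth_centers(centers, process_var=1.0, meas_var=25.0):
    """Smooth noisy per-frame crop centers with a constant-velocity Kalman filter."""
    x = np.array([centers[0], 0.0])            # state: [position, velocity]
    P = np.eye(2) * 1e3                        # initial state covariance
    F = np.array([[1.0, 1.0], [0.0, 1.0]])     # constant-velocity transition
    Q = np.eye(2) * process_var                # process noise
    H = np.array([[1.0, 0.0]])                 # we observe position only
    R = np.array([[meas_var]])                 # measurement noise
    smoothed = []
    for z in centers:
        x = F @ x                              # predict
        P = F @ P @ F.T + Q
        S = H @ P @ H.T + R                    # update with the measured center
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ (np.array([z]) - H @ x)
        P = (np.eye(2) - K @ H) @ P
        smoothed.append(x[0])
    return np.array(smoothed)

# jittery detected crop centers around a slowly moving subject
raw = 100 + np.linspace(0, 20, 50) + np.random.default_rng(0).normal(0, 5, 50)
smooth = kalman_smooth_centers(raw)
```

Cropping each frame at the smoothed center rather than at the raw detection is what keeps the zoomed-in tunnel from jittering between frames.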
This paper aims to generate materials for 3D meshes from text descriptions. Unlike existing methods that synthesize texture maps, we propose to generate segment-wise procedural material graphs as the appearance representation, which supports high-quality rendering and provides substantial flexibility in editing. Instead of relying on extensive paired data, i.e., 3D meshes with material graphs and corresponding text descriptions, to train a material graph generative model, we propose to leverage the pre-trained 2D diffusion model as a bridge to connect the text and material graphs. Specifically, our approach decomposes a shape into a set of segments and designs a segment-controlled diffusion model to synthesize 2D images that are aligned with mesh parts. Based on generated images, we initialize parameters of material graphs and fine-tune them through the differentiable rendering module to produce materials in accordance with the textual description. Extensive experiments demonstrate the superior performance of our framework in photorealism, resolution, and editability over existing methods. Project page: this https URL
https://arxiv.org/abs/2404.17569
Change detection (CD) is a fundamental task in remote sensing (RS) which aims to detect the semantic changes between the same geographical regions at different time stamps. Existing convolutional neural network (CNN) based approaches often struggle to capture long-range dependencies, whereas recent transformer-based methods are prone to dominant global representations and may be limited in their capability to capture subtle change regions due to the complexity of the objects in the scene. To address these limitations, we propose an effective Siamese-based framework to encode the semantic changes occurring in the bi-temporal RS images. The main focus of our design is to introduce a change encoder that leverages local and global feature representations to capture both subtle and large change feature information from multi-scale features to precisely estimate the change regions. Our experimental study on two challenging CD datasets reveals the merits of our approach and obtains state-of-the-art performance.
https://arxiv.org/abs/2404.17565
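As a rough illustration of the Siamese design, the sketch below applies one shared (weight-tied) toy feature extractor to both temporal images and fuses multi-scale absolute feature differences into a single change map. The average-pooling pyramid stands in for the paper's actual encoder and is purely illustrative.

```python
import numpy as np

def encoder(img, n_scales=3):
    """Shared Siamese branch: a toy average-pooling feature pyramid.
    Assumes image sides divisible by 2 ** (n_scales - 1)."""
    feats, x = [], img.astype(float)
    for _ in range(n_scales):
        feats.append(x)
        h, w = x.shape
        x = x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    return feats

def change_map(img_t1, img_t2):
    """Fuse absolute feature differences across scales into one change map."""
    f1, f2 = encoder(img_t1), encoder(img_t2)
    H, W = img_t1.shape
    fused = np.zeros((H, W))
    for a, b in zip(f1, f2):
        d = np.abs(a - b)
        # upsample each scale back to full resolution and accumulate
        fused += np.kron(d, np.ones((H // d.shape[0], W // d.shape[1])))
    return fused / len(f1)
```

Because both branches share the same extractor, any nonzero response in the fused map comes from genuine differences between the two time stamps, at whatever scale they occur.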
This paper presents a semi-automatic approach to create a diachronic corpus of voices balanced for speaker age, gender, and recording period, according to 32 categories (2 genders, 4 age ranges, and 4 recording periods). Corpora were selected at the French National Institute of Audiovisual (INA) to obtain at least 30 speakers per category (a total of 960 speakers; only 874 have been found so far). For each speaker, speech excerpts were extracted from audiovisual documents using an automatic pipeline consisting of speech detection, background-music and overlapped-speech removal, and speaker diarization, used to present clean speaker segments to human annotators identifying target speakers. This pipeline proved highly effective, cutting down manual processing by a factor of ten. An evaluation of the quality of the automatic processing and of the final output is provided. It shows that the automatic processing is comparable to state-of-the-art methods, and that the output provides high-quality speech for most of the selected excerpts. This method shows promise for creating large corpora of known target speakers.
https://arxiv.org/abs/2404.17552
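The balancing constraint (2 genders × 4 age ranges × 4 recording periods = 32 categories, at least 30 speakers each) is easy to audit with a small helper. The concrete age-range and period labels below are hypothetical placeholders; the abstract does not list the paper's actual boundaries.

```python
from collections import Counter
from itertools import product

GENDERS = ["F", "M"]
AGES = ["20-35", "36-50", "51-65", "65+"]                  # placeholder labels
PERIODS = ["1955-70", "1971-85", "1986-2000", "2001-15"]   # placeholder labels
TARGET = 30  # minimum speakers per category

def category_shortfall(speakers):
    """speakers: iterable of (gender, age_range, period) tuples.
    Returns {category: speakers still missing} for under-filled categories."""
    counts = Counter(speakers)
    return {cat: TARGET - counts[cat]
            for cat in product(GENDERS, AGES, PERIODS)
            if counts[cat] < TARGET}
```

Running this after each annotation pass shows exactly which of the 32 cells still need speakers, which matches the paper's reported state of 874 of 960 target speakers found.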
Real-world testing is of vital importance to the success of automated driving. While many players in the industry design purpose-built testing vehicles, we designed and built a modular platform that offers high flexibility for any kind of scenario. CoCar NextGen is equipped with next-generation hardware that addresses all future use cases. Its extensive, redundant sensor setup allows the development of cross-domain, data-driven approaches that manage the transfer to other sensor setups. Together with the possibility of being deployed on public roads, this creates a unique research platform that supports the road to automated driving at SAE Level 5.
https://arxiv.org/abs/2404.17550
Numerous capability and safety techniques of Large Language Models (LLMs), including RLHF, automated red-teaming, prompt engineering, and infilling, can be cast as sampling from an unnormalized target distribution defined by a given reward or potential function over the full sequence. In this work, we leverage the rich toolkit of Sequential Monte Carlo (SMC) for these probabilistic inference problems. In particular, we use learned twist functions to estimate the expected future value of the potential at each timestep, which enables us to focus inference-time computation on promising partial sequences. We propose a novel contrastive method for learning the twist functions, and establish connections with the rich literature of soft reinforcement learning. As a complementary application of our twisted SMC framework, we present methods for evaluating the accuracy of language model inference techniques using novel bidirectional SMC bounds on the log partition function. These bounds can be used to estimate the KL divergence between the inference and target distributions in both directions. We apply our inference evaluation techniques to show that twisted SMC is effective for sampling undesirable outputs from a pretrained model (a useful component of harmlessness training and automated red-teaming), generating reviews with varied sentiment, and performing infilling tasks.
https://arxiv.org/abs/2404.17546
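The mechanics of twisted SMC can be made concrete with a toy example: particles are extended token by token from a base model, reweighted by twist ratios that estimate the expected future value of the potential, and resampled so computation concentrates on promising partial sequences; the running product of mean incremental weights estimates the partition function. The Bernoulli base model and exponential potential below are illustrative choices, not the paper's language-model setup.

```python
import numpy as np

def twist(seq, beta=2.0):
    """Toy twist, proportional to E[potential | prefix] for an i.i.d. base model
    with full-sequence potential exp(beta * sum(tokens))."""
    return np.exp(beta * sum(seq))

def twisted_smc(n_particles=500, T=8, beta=2.0, seed=0):
    rng = np.random.default_rng(seed)
    particles = [[] for _ in range(n_particles)]
    logZ = 0.0
    for _ in range(T):
        tokens = rng.integers(0, 2, n_particles)   # propose from the base model
        # incremental weight = ratio of twists of the extended vs. current prefix
        weights = np.array([twist(p + [int(t)], beta) / twist(p, beta)
                            for p, t in zip(particles, tokens)])
        particles = [p + [int(t)] for p, t in zip(particles, tokens)]
        logZ += np.log(weights.mean())             # accumulates a log-Z estimate
        # multinomial resampling toward high-twist partial sequences
        idx = rng.choice(n_particles, size=n_particles, p=weights / weights.sum())
        particles = [particles[i] for i in idx]
    return particles, logZ
```

For this toy target the exact log partition function is `T * log(0.5 * (1 + exp(beta)))`, so the SMC estimate can be checked directly; in the paper, bidirectional bounds on the same quantity are used to evaluate inference quality.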
The recently introduced class of architectures known as Neural Operators has emerged as a highly versatile tool applicable to a wide range of tasks in the field of Scientific Machine Learning (SciML), including data representation and forecasting. In this study, we investigate the capabilities of Neural Implicit Flow (NIF), a recently developed mesh-agnostic neural operator, for representing the latent dynamics of canonical systems such as the Kuramoto-Sivashinsky (KS), forced Korteweg-de Vries (fKdV), and Sine-Gordon (SG) equations, as well as for extracting dynamically relevant information from them. Finally, we assess the applicability of NIF as a dimensionality reduction algorithm and conduct a comparative analysis with another widely recognized family of neural operators, known as Deep Operator Networks (DeepONets).
https://arxiv.org/abs/2404.17535
Large Vision-Language Models (LVLMs) are gaining traction for their remarkable ability to process and integrate visual and textual data. Despite their popularity, the capacity of LVLMs to generate precise, fine-grained textual descriptions has not been fully explored. This study addresses this gap by focusing on \textit{distinctiveness} and \textit{fidelity}, assessing how models like Open-Flamingo, IDEFICS, and MiniGPT-4 can distinguish between similar objects and accurately describe visual features. We propose the Textual Retrieval-Augmented Classification (TRAC) framework, which, by leveraging its generative capabilities, allows us to delve deeper into analyzing fine-grained visual description generation. This research provides valuable insights into the generation quality of LVLMs, enhancing the understanding of multimodal language models. Notably, MiniGPT-4 stands out for its better ability to generate fine-grained descriptions, outperforming the other two models in this aspect. The code is provided at \url{https://anonymous.4open.science/r/Explore_FGVDs-E277}.
https://arxiv.org/abs/2404.17534
Generalizable NeRF aims to synthesize novel views for unseen scenes. Common practices involve constructing variance-based cost volumes for geometry reconstruction and encoding 3D descriptors for decoding novel views. However, existing methods show limited generalization ability in challenging conditions due to inaccurate geometry, sub-optimal descriptors, and decoding strategies. We address these issues point by point. First, we find the variance-based cost volume exhibits failure patterns as the features of pixels corresponding to the same point can be inconsistent across different views due to occlusions or reflections. We introduce an Adaptive Cost Aggregation (ACA) approach to amplify the contribution of consistent pixel pairs and suppress inconsistent ones. Unlike previous methods that solely fuse 2D features into descriptors, our approach introduces a Spatial-View Aggregator (SVA) to incorporate 3D context into descriptors through spatial and inter-view interaction. When decoding the descriptors, we observe the two existing decoding strategies excel in different areas, which are complementary. A Consistency-Aware Fusion (CAF) strategy is proposed to leverage the advantages of both. We incorporate the above ACA, SVA, and CAF into a coarse-to-fine framework, termed Geometry-aware Reconstruction and Fusion-refined Rendering (GeFu). GeFu attains state-of-the-art performance across multiple datasets. Code is available at this https URL .
https://arxiv.org/abs/2404.17528
Conventional mechanical design paradigms rely on experts systematically refining concepts through experience-guided modification and FEA to meet specific requirements. However, this approach can be time-consuming and heavily dependent on prior knowledge and experience. While numerous machine learning models have been developed to streamline this intensive and expert-driven iterative process, these methods typically demand extensive training data and considerable computational resources. Furthermore, methods based on deep learning are usually restricted to the specific domains and tasks for which they were trained, limiting their applicability across different tasks. This creates a trade-off between the efficiency of automation and the demand for resources. In this study, we present a novel approach that integrates pre-trained LLMs with a FEM module. The FEM module evaluates each design and provides essential feedback, guiding the LLMs to continuously learn, plan, generate, and optimize designs without the need for domain-specific training. We demonstrate the effectiveness of our proposed framework in managing the iterative optimization of truss structures, showcasing its capability to reason about and refine designs according to structured feedback and criteria. Our results reveal that these LLM-based agents can successfully generate truss designs that comply with natural language specifications with a success rate of up to 90%, which varies according to the applied constraints. By employing prompt-based optimization techniques, we show that LLM-based agents exhibit optimization behavior when provided with solution-score pairs to iteratively refine designs to meet specifications. This ability of LLM agents to produce viable designs and optimize them based on their inherent reasoning capabilities highlights their potential to develop and implement effective design strategies autonomously.
https://arxiv.org/abs/2404.17525
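The generate-evaluate-refine loop can be sketched abstractly. Below, a toy one-dimensional scoring function stands in for the FEM module, and a random perturbation of the best design seen so far stands in for the LLM proposer conditioned on solution-score pairs; both stand-ins are purely illustrative.

```python
import random

def evaluate(design):
    """Stand-in for the FEM module: score a candidate design.
    (A toy 1-D objective; the paper evaluates truss structures.)"""
    return -abs(design - 42.0)

def optimize(n_rounds=30, pool=5, seed=0):
    """Generate-evaluate-refine: the proposer perturbs the best-scored design,
    mimicking iterative refinement from solution-score feedback."""
    rng = random.Random(seed)
    best = rng.uniform(0, 100)
    best_score = evaluate(best)
    for _ in range(n_rounds):
        candidates = [best + rng.gauss(0, 5) for _ in range(pool)]
        score, cand = max((evaluate(c), c) for c in candidates)
        if score > best_score:
            best_score, best = score, cand
    return best, best_score
```

The point of the schematic is the feedback structure, not the proposer: in the paper, the proposal step is an LLM that reasons over the accumulated solution-score pairs instead of drawing Gaussian perturbations.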
Capability ontologies are increasingly used to model functionalities of systems or machines. The creation of such ontological models with all properties and constraints of capabilities is very complex and can only be done by ontology experts. However, Large Language Models (LLMs) have shown that they can generate machine-interpretable models from natural language text input and thus support engineers / ontology experts. Therefore, this paper investigates how LLMs can be used to create capability ontologies. We present a study with a series of experiments in which capabilities with varying complexities are generated using different prompting techniques and with different LLMs. Errors in the generated ontologies are recorded and compared. To analyze the quality of the generated ontologies, a semi-automated approach based on RDF syntax checking, OWL reasoning, and SHACL constraints is used. The results of this study are very promising because even for complex capabilities, the generated ontologies are almost free of errors.
https://arxiv.org/abs/2404.17524
This research explores the application of Large Language Models (LLMs) for automating the extraction of requirement-related legal content in the food safety domain and checking legal compliance of regulatory artifacts. With Industry 4.0 revolutionizing the food industry and with the General Data Protection Regulation (GDPR) reshaping privacy policies and data processing agreements, there is a growing gap between regulatory analysis and recent technological advancements. This study aims to bridge this gap by leveraging LLMs, namely BERT and GPT models, to accurately classify legal provisions and automate compliance checks. Our findings demonstrate promising results, indicating LLMs' significant potential to enhance legal compliance and regulatory analysis efficiency, notably by reducing manual workload and improving accuracy within reasonable time and financial constraints.
https://arxiv.org/abs/2404.17522
Autonomous robotic systems capable of learning novel manipulation tasks are poised to transform industries from manufacturing to service automation. However, modern methods (e.g., VIP and R3M) still face significant hurdles, notably the domain gap among robotic embodiments and the sparsity of successful task executions within specific action spaces, resulting in misaligned and ambiguous task representations. We introduce Ag2Manip (Agent-Agnostic representations for Manipulation), a framework aimed at surmounting these challenges through two key innovations: a novel agent-agnostic visual representation derived from human manipulation videos, with the specifics of embodiments obscured to enhance generalizability; and an agent-agnostic action representation abstracting a robot's kinematics to a universal agent proxy, emphasizing crucial interactions between end-effector and object. Ag2Manip's empirical validation across simulated benchmarks like FrankaKitchen, ManiSkill, and PartManip shows a 325% increase in performance, achieved without domain-specific demonstrations. Ablation studies underline the essential contributions of the visual and action representations to this success. Extending our evaluations to the real world, Ag2Manip significantly improves imitation learning success rates from 50% to 77.5%, demonstrating its effectiveness and generalizability across both simulated and physical environments.
https://arxiv.org/abs/2404.17521
As autonomous driving technology progresses, the need for precise trajectory prediction models becomes paramount. This paper introduces an innovative model that infuses cognitive insights into trajectory prediction, focusing on perceived safety and dynamic decision-making. Distinct from traditional approaches, our model excels in analyzing interactions and behavior patterns in mixed autonomy traffic scenarios. It represents a significant leap forward, achieving marked performance improvements on several key datasets. Specifically, it surpasses existing benchmarks with gains of 16.2% on the Next Generation Simulation (NGSIM), 27.4% on the Highway Drone (HighD), and 19.8% on the Macao Connected Autonomous Driving (MoCAD) dataset. Our proposed model shows exceptional proficiency in handling corner cases, essential for real-world applications. Moreover, its robustness is evident in scenarios with missing or limited data, outperforming most of the state-of-the-art baselines. This adaptability and resilience position our model as a viable tool for real-world autonomous driving systems, heralding a new standard in vehicle trajectory prediction for enhanced safety and efficiency.
https://arxiv.org/abs/2404.17520
Event reasoning is a fundamental ability that underlies many applications. It requires event schema knowledge to perform global reasoning and needs to deal with the diversity of inter-event relations and reasoning paradigms. How well LLMs accomplish event reasoning across various relations and reasoning paradigms remains unknown. To close this gap, we comprehensively evaluate the event reasoning abilities of LLMs. We introduce a novel benchmark, EV2, for EValuation of EVent reasoning. EV2 consists of two levels of evaluation, schema and instance, and is comprehensive in relations and reasoning paradigms. We conduct extensive experiments on EV2. We find that LLMs are able to accomplish event reasoning, but their performance is far from satisfactory. We also notice an imbalance of event reasoning abilities in LLMs. Moreover, although LLMs possess event schema knowledge, they are not aligned with humans on how to utilize that knowledge. Based on these findings, we introduce two methods to guide the LLMs to utilize event schema knowledge. Both methods achieve improvements.
https://arxiv.org/abs/2404.17513
In an era where the volume of data drives the effectiveness of self-supervised learning, the specificity and clarity of data semantics play a crucial role in model training. Addressing this, we introduce HYPerbolic Entailment filtering (HYPE), a novel methodology designed to meticulously extract modality-wise meaningful and well-aligned data from extensive, noisy image-text pair datasets. Our approach leverages hyperbolic embeddings and the concept of entailment cones to evaluate and filter out samples with meaningless or underspecified semantics, focusing on enhancing the specificity of each data sample. HYPE not only demonstrates a significant improvement in filtering efficiency but also sets a new state-of-the-art in the DataComp benchmark when combined with existing filtering techniques. This breakthrough showcases the potential of HYPE to refine the data selection process, thereby contributing to the development of more accurate and efficient self-supervised learning models. Additionally, the image specificity $\epsilon_{i}$ can be independently applied to induce an image-only dataset from an image-text or image-only data pool for training image-only self-supervised models and showed superior performance when compared to the dataset induced by CLIP score.
https://arxiv.org/abs/2404.17507
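A crude sketch of norm-based filtering in hyperbolic space: generic concepts tend to embed near the origin of the Poincaré ball, so geodesic distance to the origin can serve as a rough specificity proxy and underspecified samples can be dropped. The paper's actual $\epsilon_{i}$ is defined via entailment cones, which this sketch does not reproduce.

```python
import numpy as np

def poincare_dist_to_origin(x):
    """Geodesic distance from the origin in the unit Poincare ball:
    d(0, x) = 2 * artanh(||x||)."""
    n = np.linalg.norm(x, axis=-1)
    return 2 * np.arctanh(np.clip(n, 0.0, 1.0 - 1e-7))

def filter_by_specificity(embeddings, keep_frac=0.5):
    """Keep the most 'specific' samples: those farthest from the origin."""
    spec = poincare_dist_to_origin(embeddings)
    k = int(len(embeddings) * keep_frac)
    return np.argsort(-spec)[:k]
```

Because the hyperbolic metric stretches distances near the boundary, even small norm differences among specific samples translate into large specificity gaps, making the ranking sharper than a Euclidean-norm cutoff.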
Imaging through fog significantly impacts fields such as object detection and recognition. In conditions of extremely low visibility, essential image information can be obscured, rendering standard extraction methods ineffective. Traditional digital processing techniques, such as histogram stretching, aim to mitigate fog effects by enhancing the object light contrast diminished by atmospheric scattering. However, these methods often suffer reduced effectiveness under inhomogeneous illumination. This paper introduces a novel approach that adaptively filters background illumination under extremely low visibility and preserves only the essential signal information. Additionally, we employ a visual optimization strategy based on image gradients to eliminate grayscale banding. Finally, the image is transformed to achieve high contrast and maintain fidelity to the original information through maximum histogram equalization. Our proposed method significantly enhances signal clarity in conditions of extremely low visibility and outperforms existing algorithms.
https://arxiv.org/abs/2404.17503
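The final contrast-restoration step can be illustrated with ordinary global histogram equalization on an 8-bit image; this is a stand-in for the paper's "maximum histogram equalization", whose exact variant the abstract does not specify.

```python
import numpy as np

def histogram_equalize(img):
    """Global histogram equalization of a non-constant 8-bit grayscale image."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[np.nonzero(cdf)][0]   # CDF value at the darkest pixel present
    lut = (cdf - cdf_min) / (img.size - cdf_min) * 255.0
    lut = np.clip(np.round(lut), 0, 255).astype(np.uint8)
    return lut[img]
```

A low-contrast input spanning only a narrow band of gray levels gets remapped onto the full [0, 255] range, which is the effect relied on once the background illumination has been filtered out.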
We describe a protocol to study text-to-video retrieval training with unlabeled videos, where we assume (i) no access to labels for any videos, i.e., no access to the set of ground-truth captions, but (ii) access to labeled images in the form of text. Using image expert models is a realistic scenario given that annotating images is cheaper therefore scalable, in contrast to expensive video labeling schemes. Recently, zero-shot image experts such as CLIP have established a new strong baseline for video understanding tasks. In this paper, we make use of this progress and instantiate the image experts from two types of models: a text-to-image retrieval model to provide an initial backbone, and image captioning models to provide supervision signal into unlabeled videos. We show that automatically labeling video frames with image captioning allows text-to-video retrieval training. This process adapts the features to the target domain at no manual annotation cost, consequently outperforming the strong zero-shot CLIP baseline. During training, we sample captions from multiple video frames that best match the visual content, and perform a temporal pooling over frame representations by scoring frames according to their relevance to each caption. We conduct extensive ablations to provide insights and demonstrate the effectiveness of this simple framework by outperforming the CLIP zero-shot baselines on text-to-video retrieval on three standard datasets, namely ActivityNet, MSR-VTT, and MSVD.
https://arxiv.org/abs/2404.17498
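The caption-scored temporal pooling can be sketched as a softmax-weighted average of frame features, with each frame weighted by its cosine relevance to a caption. The function name and the temperature value are illustrative assumptions.

```python
import numpy as np

def caption_weighted_pool(frame_feats, caption_feat, temperature=0.07):
    """Pool T x d frame features, weighting frames by relevance to the caption."""
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    c = caption_feat / np.linalg.norm(caption_feat)
    scores = f @ c                                  # cosine relevance per frame
    scores = scores / temperature
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()               # softmax over frames
    return weights @ frame_feats, weights
```

Frames whose content best matches the (automatically generated) caption dominate the pooled video representation, while irrelevant frames are down-weighted rather than averaged in uniformly.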
Multi-armed bandits (MAB) and causal MABs (CMAB) are established frameworks for decision-making problems. The majority of prior work typically studies and solves individual MAB and CMAB in isolation for a given problem and associated data. However, decision-makers are often faced with multiple related problems and multi-scale observations where joint formulations are needed in order to efficiently exploit the problem structures and data dependencies. Transfer learning for CMABs addresses the situation where models are defined on identical variables, although causal connections may differ. In this work, we extend transfer learning to setups involving CMABs defined on potentially different variables, with varying degrees of granularity, and related via an abstraction map. Formally, we introduce the problem of causally abstracted MABs (CAMABs) by relying on the theory of causal abstraction in order to express a rigorous abstraction map. We propose algorithms to learn in a CAMAB, and study their regret. We illustrate the limitations and the strengths of our algorithms on a real-world scenario related to online advertising.
https://arxiv.org/abs/2404.17493
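One way to picture the transfer step is to warm-start a UCB learner on the fine-grained bandit with value estimates mapped through the abstraction from a coarser bandit. This sketch is only a schematic of that transfer; it ignores the causal structure and the regret analysis that the paper actually develops.

```python
import numpy as np

def ucb_with_abstraction(fine_means, abstraction, coarse_est, T=2000, c=2.0, seed=0):
    """UCB on a fine-grained bandit, warm-started through the abstraction map
    fine_arm -> coarse_arm using estimates learned on the abstracted bandit."""
    rng = np.random.default_rng(seed)
    n = len(fine_means)
    counts = np.ones(n)                      # one pseudo-pull per arm for the prior
    values = np.array([coarse_est[abstraction[a]] for a in range(n)], dtype=float)
    total = 0.0
    for t in range(1, T + 1):
        ucb = values + c * np.sqrt(np.log(t) / counts)
        a = int(np.argmax(ucb))
        r = rng.normal(fine_means[a], 0.1)   # noisy reward from the fine bandit
        total += r
        counts[a] += 1
        values[a] += (r - values[a]) / counts[a]
    return total / T
```

With informative coarse estimates, the learner spends fewer early pulls on arms whose abstracted counterpart already looked poor, which is the intuition behind exploiting the abstraction map.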
The open-source CARFAC (Cascade of Asymmetric Resonators with Fast-Acting Compression) cochlear model is upgraded to version 2, with improvements to the Matlab implementation, and with new Python/NumPy and JAX implementations -- but C++ version changes are still pending. One change addresses the DC (direct current, or zero frequency) quadratic distortion anomaly previously reported; another reduces the neural synchrony at high frequencies; the others have little or no noticeable effect in the default configuration. A new feature allows modeling a reduction of cochlear amplifier function, as a step toward a differentiable parameterized model of hearing impairment. In addition, the integration into the Auditory Model Toolbox (AMT) has been extensively improved, as the prior integration had bugs that made it unsuitable for including CARFAC in multi-model comparisons.
https://arxiv.org/abs/2404.17490