Accurate detection of vulvovaginal candidiasis is critical for women's health, yet its sparse distribution and visually ambiguous characteristics pose significant challenges for accurate identification by pathologists and neural networks alike. Our eye-tracking data reveals that areas garnering sustained attention, yet ultimately not marked by experts after deliberation, often align with the false positives of neural networks. Leveraging this finding, we introduce Gaze-DETR, a pioneering method that integrates gaze data to enhance neural network precision by diminishing false positives. Gaze-DETR incorporates a universal gaze-guided warm-up protocol applicable across various detection methods and a gaze-guided rectification strategy specifically designed for DETR-based models. Our comprehensive tests confirm that Gaze-DETR surpasses existing leading methods, showcasing remarkable improvements in detection accuracy and generalizability.
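The gaze-guided warm-up idea can be made concrete with a small sketch: a gaze-fixation heatmap is converted into a coarse pseudo bounding box that could supervise a detector before ground-truth-only training. The thresholding scheme and function names here are illustrative assumptions, not the authors' exact protocol.

```python
import numpy as np

def gaze_pseudo_box(heatmap, threshold=0.5):
    """Turn a normalized gaze-fixation heatmap into one coarse pseudo
    bounding box (x_min, y_min, x_max, y_max) covering all pixels whose
    dwell intensity exceeds `threshold`. Returns None if nothing fires."""
    ys, xs = np.nonzero(heatmap >= threshold)
    if len(xs) == 0:
        return None
    return (xs.min(), ys.min(), xs.max(), ys.max())

# Toy heatmap: a pathologist's gaze dwelled on a small patch.
hm = np.zeros((32, 32))
hm[10:13, 5:9] = 0.9
box = gaze_pseudo_box(hm)
```

Such pseudo boxes are deliberately coarse; the point of a warm-up phase is only to bias the detector's early training toward regions experts actually inspected.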
https://arxiv.org/abs/2405.09463
The discovery of a linear embedding is key to synthesizing linear control techniques for nonlinear systems. Although Koopman operator theory has in recent years become a prominent approach for learning these linear embeddings through data-driven methods, the resulting algorithms often exhibit limited generalizability beyond the distribution captured by the training data and are not robust to changes in the nominal system dynamics induced by intrinsic or environmental factors. To overcome these limitations, this study presents an adaptive Koopman architecture capable of responding to changes in system dynamics online. The proposed framework initially employs an autoencoder-based neural network that utilizes input-output information from the nominal system to learn the corresponding Koopman embedding offline. Subsequently, we augment this nominal Koopman architecture with a feed-forward neural network that learns to modify the nominal dynamics in response to any deviation between the predicted and observed lifted states, leading to improved generalization and robustness to a wide range of uncertainties and disturbances compared with contemporary methods. Extensive tracking control simulations, undertaken by integrating the proposed scheme within a Model Predictive Control framework, highlight its robustness against measurement noise, disturbances, and parametric variations in system dynamics.
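A minimal numerical sketch of the Koopman-embedding idea: for a toy nonlinear system, a hand-chosen polynomial lifting makes the dynamics exactly linear, and the finite-dimensional Koopman matrix can be fit by least squares (EDMD). The paper instead learns the lifting with an autoencoder and adapts the model online; this example only shows the lifted-linear-prediction mechanics.

```python
import numpy as np

# Toy nonlinear system that is exactly linear in the lifted coordinates
# z = [x1, x2, x1^2]:  x1' = a*x1,  x2' = b*x2 + c*x1^2.
a, b, c = 0.9, 0.5, 0.2

def step(x):
    return np.array([a * x[0], b * x[1] + c * x[0] ** 2])

def lift(x):
    # Hand-chosen dictionary; the paper learns this lifting with an
    # autoencoder instead of fixing it by hand.
    return np.array([x[0], x[1], x[0] ** 2])

# Collect snapshot pairs (z_k, z_{k+1}) from short random trajectories.
rng = np.random.default_rng(0)
Z, Zn = [], []
for _ in range(50):
    x = rng.uniform(-1, 1, size=2)
    for _ in range(10):
        xn = step(x)
        Z.append(lift(x))
        Zn.append(lift(xn))
        x = xn
Z, Zn = np.array(Z), np.array(Zn)

# EDMD: least-squares fit Z @ W ~= Zn, so that z_{k+1} ~= W.T @ z_k.
W, *_ = np.linalg.lstsq(Z, Zn, rcond=None)

# Multi-step prediction purely in the lifted linear space.
x = np.array([0.7, -0.3])
z = lift(x)
for _ in range(5):
    x = step(x)      # ground-truth simulation
    z = W.T @ z      # linear Koopman prediction
```

Because the lifting is exact for this toy system, the linear rollout matches the nonlinear simulation to numerical precision; with a learned embedding the residual error is exactly what the paper's online correction network is meant to absorb.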
https://arxiv.org/abs/2405.09101
The detection and tracking of small targets in passive optical remote sensing (PORS) has broad applications. However, most previously proposed methods seldom utilize the abundant temporal features formed by target motion, resulting in poor detection and tracking performance for low signal-to-clutter ratio (SCR) targets. In this article, we analyze why effective detection is difficult using spatial features alone and why it becomes feasible using temporal features. Guided by this analysis, we use a multi-frame window as the detection unit and propose a detection method based on temporal energy selective scaling (TESS). Specifically, we investigate the composition of the intensity temporal profiles (ITPs) formed by pixels over a multi-frame detection unit. For a target-present pixel, the target passing through the pixel introduces a weak transient disturbance on the ITP and changes its statistical properties. We use a well-designed function to amplify the transient disturbance, suppress the background and noise components, and output the trajectory of the target over the multi-frame detection unit. Subsequently, to resolve the trade-off between detection rate and false alarm rate inherent in traditional threshold segmentation, we associate the temporal and spatial features of the output trajectory and propose a trajectory extraction method based on the 3D Hough transform. Finally, we model the trajectory of the target and propose a trajectory-based multi-target tracking method. Experiments in multiple scenarios demonstrate that the proposed methods outperform various state-of-the-art detection and tracking methods.
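To make the ITP idea concrete, here is a hedged sketch: a single pixel's intensity temporal profile with a weak 3-frame transient, plus a simple energy-selective scaling that amplifies deviations from a robust sliding-median baseline while suppressing background drift and noise. The thresholded squared z-score is a stand-in; the paper's TESS function is designed more carefully.

```python
import numpy as np

rng = np.random.default_rng(1)

# Intensity temporal profile (ITP) of one pixel over 100 frames: slowly
# drifting background + noise + a weak transient at frames 60..62 caused
# by a dim target crossing the pixel (+2.0 on a background of ~50).
t = np.arange(100)
itp = 50 + 0.02 * t + rng.normal(0, 0.3, 100)
itp[60:63] += 2.0

def energy_selective_scale(profile, win=15, k=4.0):
    """Amplify transient deviations from a robust sliding-median baseline
    and suppress background drift and noise. A thresholded squared
    z-score stands in for the paper's TESS function."""
    n = len(profile)
    half = win // 2
    resid = np.empty(n)
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        resid[i] = profile[i] - np.median(profile[lo:hi])
    sigma = np.median(np.abs(resid)) * 1.4826 + 1e-9  # robust noise scale
    z = resid / sigma
    return np.where(z > k, z ** 2, 0.0)

response = energy_selective_scale(itp)
```

The sliding median tracks the background drift, so only the short transient survives the threshold; on a full frame stack, the surviving responses trace out the target's trajectory across the detection unit.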
https://arxiv.org/abs/2405.09054
We investigate the problem of pixelwise correspondence for deformable objects, namely cloth and rope, by comparing both classical and learning-based methods. We choose cloth and rope because they are traditionally some of the most difficult deformable objects to model analytically, owing to their large configuration spaces, and they are meaningful in the context of robotic tasks like cloth folding, rope knot-tying, T-shirt folding, curtain closing, etc. The correspondence problem is heavily motivated in robotics, with wide-ranging applications including semantic grasping, object tracking, and manipulation policies built on top of correspondences. We present an exhaustive survey of existing classical methods for doing correspondence via feature-matching, including SIFT, SURF, and ORB, and two recently published learning-based methods, TimeCycle and Dense Object Nets. We make three main contributions: (1) a framework for simulating and rendering synthetic images of deformable objects, with qualitative results demonstrating transfer between our simulated and real domains; (2) a new learning-based correspondence method extending Dense Object Nets; and (3) a standardized comparison across state-of-the-art correspondence methods. Our proposed method provides a flexible, general formulation for learning temporally and spatially continuous correspondences for nonrigid (and rigid) objects. We report root mean squared error statistics for all methods and find that Dense Object Nets outperforms baseline classical methods for correspondence, and our proposed extension of Dense Object Nets performs similarly.
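The classical feature-matching baselines surveyed above reduce to nearest-neighbor descriptor matching with Lowe's ratio test. A minimal sketch for ORB-style binary descriptors under Hamming distance, using toy descriptors rather than ones extracted from real images:

```python
import numpy as np

def hamming(a, b):
    # Hamming distance between two binary descriptors stored as uint8 arrays.
    return int(np.unpackbits(a ^ b).sum())

def match_descriptors(desc1, desc2, ratio=0.75):
    """Brute-force matching with Lowe's ratio test: accept a match only if
    the best distance is clearly smaller than the second best, which
    discards ambiguous correspondences on repetitive texture."""
    matches = []
    for i, d in enumerate(desc1):
        dists = sorted((hamming(d, e), j) for j, e in enumerate(desc2))
        if len(dists) >= 2 and dists[0][0] < ratio * dists[1][0]:
            matches.append((i, dists[0][1]))
    return matches

# Toy 256-bit descriptors; desc2[1] is desc1[0] with a single bit flipped,
# mimicking the same keypoint observed in a second frame.
rng = np.random.default_rng(2)
desc1 = rng.integers(0, 256, size=(2, 32), dtype=np.uint8)
desc2 = rng.integers(0, 256, size=(3, 32), dtype=np.uint8)
desc2[1] = desc1[0]
desc2[1, 0] ^= 0x01

matches = match_descriptors(desc1, desc2)
```

On heavily self-similar cloth texture, the ratio test is exactly what fails: many patches look alike, so few matches survive, which is part of why the learned dense descriptors compared in the paper do better.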
https://arxiv.org/abs/2405.08996
Many query-based approaches for 3D Multi-Object Tracking (MOT) adopt the tracking-by-attention paradigm, utilizing track queries for identity-consistent detection and object queries for identity-agnostic track spawning. Tracking-by-attention, however, entangles detection and tracking queries in one embedding for both the detection and tracking task, which is sub-optimal. Other approaches resemble the tracking-by-detection paradigm, detecting objects using decoupled track and detection queries followed by a subsequent association. These methods, however, do not leverage synergies between the detection and association task. Combining the strengths of both paradigms, we introduce ADA-Track, a novel end-to-end framework for 3D MOT from multi-view cameras. We introduce a learnable data association module based on edge-augmented cross-attention, leveraging appearance and geometric features. Furthermore, we integrate this association module into the decoder layer of a DETR-based 3D detector, enabling simultaneous DETR-like query-to-image cross-attention for detection and query-to-query cross-attention for data association. By stacking these decoder layers, queries are refined for the detection and association task alternately, effectively harnessing the task dependencies. We evaluate our method on the nuScenes dataset and demonstrate the advantage of our approach compared to the two previous paradigms. Code is available at this https URL.
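The edge-augmented cross-attention used for data association can be sketched as ordinary scaled dot-product attention whose logits receive an additive bias computed from per-pair edge features (e.g., geometric distance and appearance similarity). This is a single-head NumPy sketch with assumed shapes; the paper's module is a full transformer-style layer that also updates the edge features.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def edge_augmented_attention(Q, K, V, E, w_e):
    """Cross-attention between track queries and detection queries with
    edge-biased logits:  A[i, j] = q_i . k_j / sqrt(d) + e_ij . w_e.
    Q: (n_track, d), K/V: (n_det, d), E: (n_track, n_det, d_e)."""
    d = Q.shape[1]
    logits = Q @ K.T / np.sqrt(d) + E @ w_e   # edge features bias each pair
    A = softmax(logits, axis=-1)              # soft association matrix
    return A @ V, A

rng = np.random.default_rng(3)
n_track, n_det, d, d_e = 2, 4, 8, 3
Q = rng.normal(size=(n_track, d))
K = rng.normal(size=(n_det, d))
V = rng.normal(size=(n_det, d))
E = rng.normal(size=(n_track, n_det, d_e))  # e.g. distance, IoU, appearance sim
w_e = rng.normal(size=d_e)

out, A = edge_augmented_attention(Q, K, V, E, w_e)
```

Each row of `A` is a soft assignment of one track to the candidate detections, which is what lets association be trained end-to-end inside the decoder rather than as a separate post-hoc step.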
https://arxiv.org/abs/2405.08909
Non-prehensile manipulation enables fast interactions with objects by circumventing the need to grasp and ungrasp as well as handling objects that cannot be grasped through force closure. Current approaches to non-prehensile manipulation focus on static contacts, avoiding the underactuation that comes with sliding. However, the ability to control sliding contact, essentially removing the no-slip constraint, opens up new possibilities in dynamic manipulation. In this paper, we explore a challenging dynamic non-prehensile manipulation task that requires the consideration of the full spectrum of hybrid contact modes. We leverage recent methods in contact-implicit MPC to handle the multi-modal planning aspect of the task. We demonstrate, with careful consideration of integration between the simple model used for MPC and the low-level tracking controller, how contact-implicit MPC can be adapted to dynamic tasks. Surprisingly, despite the known inaccuracies of frictional rigid contact models, our method is able to react to these inaccuracies while still quickly performing the task. Moreover, we do not use common aids such as reference trajectories or motion primitives, highlighting the generality of our approach. To the best of our knowledge, this is the first application of contact-implicit MPC to a dynamic manipulation task in three dimensions.
https://arxiv.org/abs/2405.08731
Tissue tracking in echocardiography is challenging due to the complex cardiac motion and the inherent nature of ultrasound acquisitions. Although optical flow methods are considered state-of-the-art (SOTA), they struggle with long-range tracking, noise occlusions, and drift throughout the cardiac cycle. Recently, novel learning-based point tracking techniques have been introduced to tackle some of these issues. In this paper, we build upon these techniques and introduce EchoTracker, a two-fold coarse-to-fine model that facilitates the tracking of queried points on a tissue surface across ultrasound image sequences. The architecture contains a preliminary coarse initialization of the trajectories, followed by reinforcement iterations based on fine-grained appearance changes. It is efficient, light, and can run on mid-range GPUs. Experiments demonstrate that the model outperforms SOTA methods, with an average position accuracy of 67% and a median trajectory error of 2.86 pixels. Furthermore, we show a relative improvement of 25% when using our model to calculate the global longitudinal strain (GLS) in a clinical test-retest dataset compared to other methods. This implies that learning-based point tracking can potentially improve performance and yield a higher diagnostic and prognostic value for clinical measurements than current techniques. Our source code is available at: this https URL.
https://arxiv.org/abs/2405.08587
Safe maneuvering capability is critical for mobile robots in complex environments. However, robotic system dynamics are often time-varying, uncertain, or even unknown during the motion planning and control process. Therefore, many existing model-based reinforcement learning (RL) methods cannot achieve satisfactory reliability in guaranteeing safety. To address this challenge, we propose a two-level Vector Field-guided Learning Predictive Control (VF-LPC) approach that guarantees safe maneuverability. The first level, the guiding level, generates safe desired trajectories using the designed kinodynamic guiding vector field, enabling safe motion in obstacle-dense environments. The second level, the Integrated Motion Planning and Control (IMPC) level, first uses the deep Koopman operator to learn a nominal dynamics model offline and then updates the model uncertainties online using sparse Gaussian processes (GPs). The learned dynamics and a game-based safe barrier function are then incorporated into the learning predictive control framework to generate near-optimal control sequences. We conducted tests comparing the performance of VF-LPC with existing advanced planning methods in an obstacle-dense environment; the simulation results show that it can generate feasible trajectories quickly. VF-LPC is then evaluated against motion planning methods that employ model predictive control (MPC) and RL in the high-fidelity CarSim software. The results show that VF-LPC outperforms them in completion time, route length, and average solution time. We also carried out path-tracking control tests on a racing road to validate the model-uncertainty learning capability. Finally, we conducted real-world experiments on a Hongqi E-HS3 vehicle, further validating the effectiveness of the VF-LPC approach.
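The guiding level can be illustrated with a minimal vector field that attracts toward the goal and repels from a circular obstacle inside an influence region. This is a generic potential-field-style sketch under assumed parameters, not the paper's kinodynamic guiding vector field, which additionally encodes dynamic feasibility.

```python
import numpy as np

def guiding_field(p, goal, obstacle, r_obs, k_rep=1.0):
    """Attractive-plus-repulsive guiding field (sketch). Unit attraction
    toward the goal; repulsion grows near a circular obstacle of radius
    r_obs and vanishes outside a 3*r_obs influence region."""
    to_goal = goal - p
    v = to_goal / (np.linalg.norm(to_goal) + 1e-9)
    away = p - obstacle
    dist = np.linalg.norm(away)
    if dist < 3 * r_obs:  # only repel inside the influence region
        v = v + k_rep * (1.0 / dist - 1.0 / (3 * r_obs)) * away / dist
    return v

goal = np.array([10.0, 0.0])
obs = np.array([5.0, 0.1])

far = guiding_field(np.array([0.0, 0.0]), goal, obs, r_obs=1.0)   # pure attraction
near = guiding_field(np.array([4.5, 0.1]), goal, obs, r_obs=1.0)  # repulsion dominates
```

Following such a field yields the "safe desired trajectories" role of the guiding level; the IMPC level then tracks them with the learned dynamics model.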
https://arxiv.org/abs/2405.08283
An authentic 3D hand avatar that carries all identifiable information, such as hand shape and texture, is necessary for immersive experiences in AR/VR. In this paper, we present a universal hand model (UHM), which 1) can universally represent high-fidelity 3D hand meshes of arbitrary identities (IDs) and 2) can be adapted to each person with a short phone scan to produce an authentic hand avatar. For effective universal hand modeling, we perform tracking and modeling at the same time, whereas previous 3D hand models perform them separately. The conventional separate pipeline suffers from accumulated errors in the tracking stage, which cannot be recovered in the modeling stage. In contrast, ours does not suffer from accumulated errors while having a much more concise overall pipeline. We additionally introduce a novel image matching loss function to address skin sliding during tracking and modeling, an issue existing works have not focused on much. Finally, using learned priors from our UHM, we effectively adapt our UHM to each person's short phone scan for an authentic hand avatar.
https://arxiv.org/abs/2405.07933
Many aging individuals encounter challenges in effectively tracking their dietary intake, exacerbating their susceptibility to nutrition-related health complications. Self-reporting methods are often inaccurate and suffer from substantial bias; however, leveraging intelligent prediction methods can automate and enhance precision in this process. Recent work has explored using computer vision prediction systems to predict nutritional information from food images. Still, these methods are often tailored to specific situations, require other inputs in addition to a food image, or do not provide comprehensive nutritional information. This paper aims to enhance the efficacy of dietary intake estimation by leveraging various neural network architectures to directly predict a meal's nutritional content from its image. Through comprehensive experimentation and evaluation, we present NutritionVerse-Direct, a model utilizing a vision transformer base architecture with three fully connected layers that lead to five regression heads predicting calories (kcal), mass (g), protein (g), fat (g), and carbohydrates (g) present in a meal. NutritionVerse-Direct yields a combined mean average error score on the NutritionVerse-Real dataset of 412.6, an improvement of 25.5% over the Inception-ResNet model, demonstrating its potential for improving dietary intake estimation accuracy.
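The described head structure (a ViT embedding feeding three fully connected layers and five scalar regression heads for kcal, mass, protein, fat, and carbohydrates) can be sketched as a plain forward pass. The dimensions and random initialization below are assumptions for illustration; the real model is trained end-to-end.

```python
import numpy as np

rng = np.random.default_rng(4)

def relu(x):
    return np.maximum(x, 0.0)

class NutritionHeadsSketch:
    """Forward pass of the head structure described in the abstract: a
    shared trunk of three fully connected layers on top of a ViT
    embedding, then five scalar regression heads (kcal, mass, protein,
    fat, carbs). Weights are random here; the real model is trained."""
    def __init__(self, d_in=768, d_hidden=256, n_heads=5):
        dims = [d_in, d_hidden, d_hidden, d_hidden]
        self.trunk = [(rng.normal(0, 0.02, (a, b)), np.zeros(b))
                      for a, b in zip(dims[:-1], dims[1:])]
        self.heads = [(rng.normal(0, 0.02, (d_hidden, 1)), np.zeros(1))
                      for _ in range(n_heads)]

    def forward(self, x):
        for Wt, bt in self.trunk:
            x = relu(x @ Wt + bt)
        return np.concatenate([x @ Wh + bh for Wh, bh in self.heads])

vit_embedding = rng.normal(size=768)  # stand-in for the ViT [CLS] token
preds = NutritionHeadsSketch().forward(vit_embedding)
```

Using separate heads over a shared trunk lets each nutrient get its own output scale while the trunk learns a common food representation, which matches the multi-task framing in the abstract.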
https://arxiv.org/abs/2405.07814
Modern information querying systems are progressively incorporating multimodal inputs like vision and audio. However, the integration of gaze -- a modality deeply linked to user intent and increasingly accessible via gaze-tracking wearables -- remains underexplored. This paper introduces a novel gaze-facilitated information querying paradigm, named G-VOILA, which synergizes users' gaze, visual field, and voice-based natural language queries to facilitate a more intuitive querying process. In a user-enactment study involving 21 participants in 3 daily scenarios (p = 21, scene = 3), we revealed the ambiguity in users' query language and a gaze-voice coordination pattern in users' natural query behaviors with G-VOILA. Based on the quantitative and qualitative findings, we developed a design framework for the G-VOILA paradigm, which effectively integrates the gaze data with the in-situ querying context. Then we implemented a G-VOILA proof-of-concept using cutting-edge deep learning techniques. A follow-up user study (p = 16, scene = 2) demonstrates its effectiveness by achieving both higher objective score and subjective score, compared to a baseline without gaze data. We further conducted interviews and provided insights for future gaze-facilitated information querying systems.
https://arxiv.org/abs/2405.07652
This work aims to tackle the intent recognition problem in Human-Robot Collaborative assembly scenarios. Precisely, we consider an interactive assembly of a wooden stool where the robot fetches the pieces in the correct order and the human builds the parts following the instruction manual. Intent recognition is limited to idle-state estimation and is needed to ensure better synchronization between the two agents. We carried out a comparison between two distinct solutions involving wearable sensors and eye tracking integrated into the perception pipeline of a flexible planning architecture based on Hierarchical Task Networks. At runtime, the wearable sensing module exploits the raw measurements from four 9-axis Inertial Measurement Units positioned on the wrists and hands of the user as input for a Long Short-Term Memory network. The eye tracking, on the other hand, relies on a Head Mounted Display and Unreal Engine. We tested the effectiveness of the two approaches with 10 participants, each of whom explored both options in alternate order. We collected explicit metrics on the attractiveness and efficiency of the two techniques through User Experience Questionnaires, as well as implicit criteria regarding the classification time and the overall assembly time. The results of our work show that the two methods reach comparable performance in terms of both effectiveness and user preference. Future development could aim at joining the two approaches to allow the recognition of more complex activities and to anticipate the user's actions.
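As a point of reference for what the LSTM must learn from the 9-axis IMU streams, a naive idle-state estimator can simply threshold the per-window variance of the acceleration magnitude. Window size and threshold below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def idle_by_energy(accel, win=20, thresh=0.05):
    """Baseline idle-state estimator: per-window variance of the
    acceleration magnitude from the wrist IMUs. The paper feeds raw
    9-axis measurements to an LSTM instead of hand-thresholding; this
    shows the kind of signal the network has to pick up."""
    mag = np.linalg.norm(accel, axis=1)
    n = len(mag) // win
    var = mag[:n * win].reshape(n, win).var(axis=1)
    return var < thresh  # True = idle window

rng = np.random.default_rng(5)
idle = rng.normal([0, 0, 9.81], 0.01, size=(100, 3))    # hands at rest (gravity only)
moving = rng.normal([0, 0, 9.81], 1.0, size=(100, 3))   # active assembly motion
states = idle_by_energy(np.vstack([idle, moving]))
```

A learned classifier improves on this mainly at the boundaries (slow deliberate motions, brief pauses mid-action), which is where hand-tuned thresholds misfire.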
https://arxiv.org/abs/2405.07570
In the high-stakes world of baseball, every nuance of a pitcher's mechanics holds the key to maximizing performance and minimizing runs. Traditional analysis methods often rely on pre-recorded offline numerical data, hindering their application in the dynamic environment of live games. Broadcast video analysis, while seemingly ideal, faces significant challenges due to factors like motion blur and low resolution. To address these challenges, we introduce PitcherNet, an end-to-end automated system that analyzes pitcher kinematics directly from live broadcast video, thereby extracting valuable pitch statistics including velocity, release point, pitch position, and release extension. This system leverages three key components: (1) player tracking and identification by decoupling actions from player kinematics; (2) distribution- and depth-aware 3D human modeling; and (3) kinematic-driven pitch statistics. Experimental validation demonstrates that PitcherNet achieves robust analysis results, with 96.82% accuracy in pitcher tracklet identification, a 1.8 mm reduction in joint position error, and superior analytics compared to baseline methods. By enabling performance-critical kinematic analysis from broadcast video, PitcherNet paves the way for the future of baseball analytics by optimizing pitching strategies, preventing injuries, and unlocking a deeper understanding of pitcher mechanics, forever transforming the game.
https://arxiv.org/abs/2405.07407
Accurate and robust camera tracking in dynamic environments presents a significant challenge for visual SLAM (Simultaneous Localization and Mapping). Recent progress in this field often involves the use of deep learning techniques to generate masks for dynamic objects, which usually require GPUs to operate in real time (30 fps). Therefore, this paper proposes a novel visual SLAM system for dynamic environments that obtains real-time performance on CPU by incorporating a mask prediction mechanism, which allows the deep learning method and the camera tracking to run entirely in parallel at different frequencies such that neither waits for the result from the other. Based on this, it further introduces a dual-stage optical flow tracking approach and employs a hybrid usage of optical flow and ORB features, which significantly enhances the efficiency and robustness of the system. Compared with state-of-the-art methods, this system maintains high localization accuracy in dynamic environments while achieving a tracking frame rate of 56 fps on a single laptop CPU without any hardware acceleration, thus proving that deep learning methods are still feasible for dynamic SLAM even without GPU support. Based on the available information, this is the first SLAM system to achieve this.
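The decoupling described above can be sketched as a deterministic simulation: the tracker runs at its own rate and always consumes the newest finished mask (possibly several frames stale) instead of waiting for the network. Rates and latency below are illustrative, not the paper's measured numbers.

```python
# Sketch: the tracker never blocks on the (slow) segmentation network;
# it always uses the latest mask that has finished, which may be
# several frames old. Rates here are illustrative assumptions.
TRACK_HZ, SEG_HZ = 56, 5
SEG_LATENCY = 1.0 / SEG_HZ

def run(seconds=1.0):
    latest_mask_frame = None   # frame id the current mask was computed on
    inflight = None            # (start_time, frame_id) of running inference
    used = []                  # mask frame id consumed at each tracked frame
    for i in range(int(seconds * TRACK_HZ)):
        t = i / TRACK_HZ
        # Collect a finished segmentation result, if any (non-blocking).
        if inflight is not None and t - inflight[0] >= SEG_LATENCY:
            latest_mask_frame = inflight[1]
            inflight = None
        # Launch a new inference whenever the network is free.
        if inflight is None:
            inflight = (t, i)
        # Track this frame immediately with whatever mask is available.
        used.append(latest_mask_frame)
    return used

used = run()
```

Early frames run with no mask at all, and every later frame runs with a mask roughly one inference-latency old; the system's design bets that dynamic-object masks change slowly enough for this staleness to be acceptable.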
https://arxiv.org/abs/2405.07392
Animatable clothing transfer, aiming at dressing and animating garments across characters, is a challenging problem. Most human avatar works entangle the representations of the human body and clothing, which leads to difficulties for virtual try-on across identities. Worse, the entangled representations usually fail to exactly track the sliding motion of garments. To overcome these limitations, we present Layered Gaussian Avatars (LayGA), a new representation that formulates body and clothing as two separate layers for photorealistic animatable clothing transfer from multi-view videos. Our representation is built upon the Gaussian map-based avatar for its excellent representation power of garment details. However, the Gaussian map produces unstructured 3D Gaussians distributed around the actual surface. The absence of a smooth explicit surface raises challenges in accurate garment tracking and collision handling between body and garments. Therefore, we propose two-stage training involving single-layer reconstruction and multi-layer fitting. In the single-layer reconstruction stage, we propose a series of geometric constraints to reconstruct smooth surfaces and simultaneously obtain the segmentation between body and clothing. Next, in the multi-layer fitting stage, we train two separate models to represent body and clothing and utilize the reconstructed clothing geometries as 3D supervision for more accurate garment tracking. Furthermore, we propose geometry and rendering layers for both high-quality geometric reconstruction and high-fidelity rendering. Overall, the proposed LayGA realizes photorealistic animations and virtual try-on, and outperforms other baseline methods. Our project page is this https URL.
https://arxiv.org/abs/2405.07319
In the field of transportation, it is of paramount importance to address and mitigate illegal actions committed by both motor and non-motor vehicles. Among those actions, wrong-way cycling (i.e., riding a bicycle or e-bike in the opposite direction of the designated traffic flow) poses significant risks to both cyclists and other road users. To this end, this paper formulates the problem of estimating the wrong-way cycling ratio in CCTV videos. Specifically, we propose a sparse sampling method called WWC-Predictor to solve this problem efficiently, addressing the inefficiencies of direct tracking methods. Our approach leverages both detection-based information, which utilizes the information from bounding boxes, and orientation-based information, which provides insights into the image itself, to enhance instantaneous information capture capability. On our proposed benchmark dataset, consisting of 35 minutes of video sequences with minute-level annotation, our method achieves an average error rate of a mere 1.475% while taking only 19.12% of the GPU time of straightforward tracking methods under the same detection model. This remarkable performance demonstrates the effectiveness of our approach in identifying and predicting instances of wrong-way cycling.
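The ratio-estimation formulation can be reduced to a small sketch: classify the travel direction of each detection in a sparse subset of frames and estimate the wrong-way ratio from those counts, instead of tracking every cyclist through the whole video. The data layout here is an assumption for illustration, not the paper's interface.

```python
def wrong_way_ratio(sampled_detections):
    """sampled_detections: one list per sparsely sampled frame, each
    containing booleans from the orientation head (True = the detection
    is classified as riding wrong-way). Returns the estimated ratio."""
    total = sum(len(frame) for frame in sampled_detections)
    wrong = sum(sum(frame) for frame in sampled_detections)
    return wrong / total if total else 0.0

# Four sampled frames: 6 detections total, 2 classified wrong-way.
frames = [[False, False, True], [False], [], [True, False]]
ratio = wrong_way_ratio(frames)
```

The appeal of the sparse formulation is visible even in this toy: the cost scales with the number of sampled frames rather than with maintaining identities across every frame, which is where the reported GPU-time savings come from.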
https://arxiv.org/abs/2405.07293
With the advancement of video analysis technology, the multi-object tracking (MOT) problem in complex scenes involving pedestrians is gaining increasing importance. This challenge primarily involves two key tasks: pedestrian detection and re-identification. While significant progress has been achieved in pedestrian detection tasks in recent years, enhancing the effectiveness of re-identification tasks remains a persistent challenge. This difficulty arises from the large total number of pedestrian samples in multi-object tracking datasets and the scarcity of individual instance samples. Motivated by recent rapid advancements in meta-learning techniques, we introduce MAML MOT, a meta-learning-based training approach for multi-object tracking. This approach leverages the rapid learning capability of meta-learning to tackle the issue of sample scarcity in pedestrian re-identification tasks, aiming to improve the model's generalization performance and robustness. Experimental results demonstrate that the proposed method achieves high accuracy on mainstream datasets in the MOT Challenge. This offers new perspectives and solutions for research in the field of pedestrian multi-object tracking.
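The meta-learning recipe can be illustrated with first-order MAML on a toy 1-D regression problem, where each task stands in for one pedestrian identity with only a handful of samples: adapt with a single inner gradient step per task, then update the meta-parameters so that one-step adaptation works well on new tasks. This is a generic FOMAML sketch, not the paper's exact training setup.

```python
import numpy as np

rng = np.random.default_rng(6)
alpha, beta = 0.05, 0.05   # inner / outer learning rates

def task_batch(w_task, n=8):
    # A "task" is linear regression y = w_task * x with few samples,
    # standing in for an identity with scarce instance samples.
    x = rng.uniform(-1, 1, n)
    return x, w_task * x

def grad(w, x, y):
    # d/dw of the mean squared error mean((w*x - y)^2).
    return 2 * np.mean((w * x - y) * x)

w = 0.0  # meta-parameters
for _ in range(500):
    meta_grad = 0.0
    for _ in range(4):                         # sample a few tasks
        w_task = rng.uniform(0.5, 1.5)
        xs, ys = task_batch(w_task)            # support set
        w_inner = w - alpha * grad(w, xs, ys)  # one inner adaptation step
        xq, yq = task_batch(w_task)            # query set
        meta_grad += grad(w_inner, xq, yq)     # first-order approximation
    w -= beta * meta_grad / 4

# After meta-training, one gradient step adapts toward a new task.
xs, ys = task_batch(1.2)
w_adapted = w - alpha * grad(w, xs, ys)
```

The meta-parameters settle near the center of the task distribution, so a single low-data gradient step moves the model toward any new identity — the same few-shot behavior the paper exploits for re-identification.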
https://arxiv.org/abs/2405.07272
Monitoring dietary intake is a crucial aspect of promoting healthy living. In recent years, advances in computer vision technology have facilitated dietary intake monitoring through the use of images and depth cameras. However, the current state-of-the-art image-based food portion estimation algorithms assume that users take images of their meals one or two times, which can be inconvenient and fail to capture food items that are not visible from a top-down perspective, such as ingredients submerged in a stew. To address these limitations, we introduce an innovative solution that utilizes stationary user-facing cameras to track food items on utensils, not requiring any change of camera perspective after installation. The shallow depth of utensils provides a more favorable angle for capturing food items, and tracking them on the utensil's surface offers a significantly more accurate estimation of dietary intake without the need for post-meal image capture. The system reliably estimates the nutritional content of liquid-solid heterogeneous mixtures such as soups and stews. Through a series of experiments, we demonstrate the exceptional potential of our method as a non-invasive, user-friendly, and highly accurate dietary intake monitoring tool.
https://arxiv.org/abs/2405.08717
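As a toy illustration of the utensil-tracking idea above, the sketch below accumulates intake from per-frame volume estimates: a bite is counted when a utensil that was observed loaded with food later returns to view (approximately) empty. The class name, threshold, and numbers are hypothetical, not the paper's actual pipeline:

```python
from dataclasses import dataclass

@dataclass
class IntakeEstimator:
    """Accumulates dietary intake from per-bite utensil observations.

    Each observation is the estimated food volume (ml) on the utensil in
    the current frame; a bite is counted when the utensil later returns
    below the empty threshold, meaning its load was consumed.
    """
    empty_threshold_ml: float = 0.5
    _pending_ml: float = 0.0
    total_ml: float = 0.0

    def observe(self, utensil_volume_ml: float) -> None:
        if utensil_volume_ml > self.empty_threshold_ml:
            # Food on the utensil; keep the most recent estimate.
            self._pending_ml = utensil_volume_ml
        elif self._pending_ml > 0.0:
            # Utensil came back empty: the pending food was consumed.
            self.total_ml += self._pending_ml
            self._pending_ml = 0.0

est = IntakeEstimator()
for vol in [0.0, 12.0, 11.5, 0.2, 0.0, 8.0, 0.1]:
    est.observe(vol)
print(est.total_ml)  # 19.5  (11.5 from the first bite + 8.0 from the second)
```

Because the running total is updated bite by bite, no post-meal image capture is needed, matching the abstract's claim.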
Recent works on Video Object Segmentation have achieved remarkable results by matching dense semantic and instance-level features between the current and previous frames for long-term propagation. Nevertheless, global feature matching ignores scene motion context and fails to ensure temporal consistency. Even though some methods introduce a local matching branch to achieve smooth propagation, they fail to model complex appearance changes due to the constraints of the local window. In this paper, we present DeVOS (Deformable VOS), an architecture for Video Object Segmentation that combines memory-based matching with motion-guided propagation, resulting in stable long-term modeling and strong temporal consistency. For short-term local propagation, we propose a novel attention mechanism, ADVA (Adaptive Deformable Video Attention), which adapts the similarity search region to query-specific semantic features, ensuring robust tracking under complex shape and scale changes. DeVOS employs optical flow to obtain scene motion features, which are injected into the deformable attention as strong priors on the learnable offsets. Our method achieves top-rank performance on DAVIS 2017 val and test-dev (88.1%, 83.0%) and YouTube-VOS 2019 val (86.6%), while featuring consistent run-time speed and stable memory consumption.
https://arxiv.org/abs/2405.08715
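The flow-guided deformable attention described in the DeVOS abstract above can be sketched in the spirit of ADVA: per-query sampling offsets are predicted relative to an optical-flow prior, so the similarity search region follows scene motion. The shapes, the linear "offset head", and the toy flow value are all assumptions for demonstration, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C, K = 8, 8, 4, 4                   # feature map size, channels, samples

feat = rng.standard_normal((H, W, C))     # previous-frame feature map
query = rng.standard_normal(C)            # one query feature
flow = np.array([1.0, -1.0])              # motion prior at the query (dy, dx)
W_off = 0.1 * rng.standard_normal((C, K * 2))  # hypothetical offset head

def bilinear(fmap, y, x):
    """Bilinearly sample fmap at a fractional (y, x) location."""
    y = np.clip(y, 0, H - 1 - 1e-6); x = np.clip(x, 0, W - 1 - 1e-6)
    y0, x0 = int(y), int(x); dy, dx = y - y0, x - x0
    return ((1 - dy) * (1 - dx) * fmap[y0, x0] + (1 - dy) * dx * fmap[y0, x0 + 1]
            + dy * (1 - dx) * fmap[y0 + 1, x0] + dy * dx * fmap[y0 + 1, x0 + 1])

# Query-conditioned offsets, centered on the flow prior: the search
# region deforms with scene motion instead of a fixed local window.
offsets = (query @ W_off).reshape(K, 2) + flow
ref = np.array([4.0, 4.0])                # query's reference point (y, x)
samples = np.stack([bilinear(feat, *(ref + o)) for o in offsets])

# Attention over the K sampled values.
logits = samples @ query / np.sqrt(C)
attn = np.exp(logits - logits.max()); attn /= attn.sum()
out = attn @ samples                      # aggregated value, shape (C,)
```

Using the flow as an additive prior (rather than replacing the offsets) keeps the offsets learnable while biasing the search toward the motion-consistent region.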
This paper explores the possibilities of the current generation of Large Language Models for incorporating Machine Learning Operations (MLOps) functionality into ML training code bases. We evaluate the performance of OpenAI (gpt-3.5-turbo) and WizardCoder (open-source, 15B parameters) models on the automated accomplishment of various MLOps functionalities in different settings. We perform a benchmarking study that assesses the ability of these models to: (1) adapt existing code samples (Inlining) with component-specific MLOps functionality, such as MLflow and Weights & Biases for experiment tracking, Optuna for hyperparameter optimization, etc.; and (2) perform Translation from one component of an MLOps functionality to another, e.g., translating existing GitPython-based version control code to Data Version Control (DVC) library-based code. We also propose three approaches that teach LLMs to use the API documentation of the components as a reference while accomplishing the Translation tasks. In our evaluations, the gpt-3.5-turbo model significantly outperforms WizardCoder, achieving higher average Pass@3 accuracy in their best settings for model optimization (55% vs. 0%), experiment tracking (100% vs. 62.5%), model registration (92% vs. 42%), and hyperparameter optimization (83% vs. 58%), showcasing its superior code adaptability in complex MLOps tasks.
https://arxiv.org/abs/2405.06835
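The abstract above reports Pass@3 accuracy; the paper's exact sampling protocol is not given here, but the standard unbiased Pass@k estimator — the probability that at least one of k samples drawn from n generations, of which c pass, is correct — can be computed as follows:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator.

    n: total generations sampled, c: generations that pass, k: budget.
    Returns 1 - C(n-c, k) / C(n, k), the chance that a random size-k
    subset of the n generations contains at least one passing sample.
    """
    if n - c < k:
        return 1.0  # every size-k subset must include a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=3, c=1, k=3))             # 1.0: with k == n, any correct sample counts
print(round(pass_at_k(n=10, c=2, k=3), 4))  # 0.5333
```

With k equal to the number of generations (as in Pass@3 over 3 samples), the metric reduces to "at least one of the generations passes".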