Abstract
The ability to generate sentiment-controlled feedback in response to multimodal inputs comprising both text and images addresses a critical gap in human-computer interaction, enabling systems to provide empathetic, accurate, and engaging responses. This capability has profound applications in healthcare, marketing, and education. To this end, we construct a large-scale Controllable Multimodal Feedback Synthesis (CMFeed) dataset and propose a controllable feedback synthesis system. The proposed system includes an encoder, a decoder, and a controllability block for textual and visual inputs. It extracts textual and visual features using transformer and Faster R-CNN networks, respectively, and combines them to generate feedback. The CMFeed dataset encompasses images, text, reactions to the post, human comments with relevance scores, and reactions to the comments. The reactions to the posts and comments are used to train the proposed model to produce feedback with a particular (positive or negative) sentiment. A sentiment classification accuracy of 77.23% has been achieved, 18.82% higher than the accuracy without the controllability block. Moreover, the system incorporates a similarity module for assessing feedback relevance through rank-based metrics, and implements an interpretability technique to analyze the contributions of textual and visual features during the generation of uncontrolled and controlled feedback.
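The fusion described above — transformer-based text features and Faster R-CNN region features combined under a sentiment control signal — can be sketched as follows. This is a minimal illustrative model, not the paper's implementation: the text encoder is a small generic transformer, the visual branch is a linear projection standing in for precomputed Faster R-CNN region features, and all dimensions and the decoder head are assumptions for demonstration.

```python
import torch
import torch.nn as nn

class ControllableFeedbackSketch(nn.Module):
    """Toy sketch of sentiment-controlled multimodal fusion.

    Assumptions (not from the paper): feature sizes, a linear stand-in
    for Faster R-CNN region features, and a single linear decoder head.
    """

    def __init__(self, vocab_size=1000, d_model=128, visual_dim=2048):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.visual_proj = nn.Linear(visual_dim, d_model)  # stands in for Faster R-CNN features
        self.sentiment_emb = nn.Embedding(2, d_model)      # 0 = negative, 1 = positive control
        self.decoder = nn.Linear(d_model, vocab_size)      # toy feedback-generation head

    def forward(self, text_ids, region_feats, sentiment):
        t = self.text_encoder(self.tok_emb(text_ids))      # (B, T, d) textual features
        v = self.visual_proj(region_feats)                 # (B, R, d) visual features
        s = self.sentiment_emb(sentiment).unsqueeze(1)     # (B, 1, d) control signal
        fused = torch.cat([t, v], dim=1) + s               # broadcast control over all positions
        return self.decoder(fused)                         # per-position vocabulary logits

model = ControllableFeedbackSketch()
logits = model(torch.randint(0, 1000, (2, 10)),   # 10 text tokens
               torch.randn(2, 36, 2048),          # 36 image regions
               torch.tensor([1, 0]))              # desired sentiment per sample
print(tuple(logits.shape))  # (2, 46, 1000): 10 text + 36 region positions
```

Adding the sentiment embedding to every fused position is one simple way to condition generation on the desired sentiment; the actual system's controllability block may condition the decoder differently.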
URL
https://arxiv.org/abs/2402.07640