
Natural Language Processing Academic Digest [12.9]

By the WeChat official account arXiv每日学术速递
Published 2021-12-09 20:25:00

cs.CL: 21 papers today

Transformer (2 papers)

【1】 Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval
Link: https://arxiv.org/abs/2112.04446

Authors: Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Hilde Kuehne
Abstract: Multi-modal learning from video data has seen increased attention recently, as it makes it possible to train semantically meaningful embeddings without human annotation, enabling tasks like zero-shot retrieval and classification. In this work, we present a multi-modal, modality-agnostic fusion transformer approach that learns to exchange information between multiple modalities, such as video, audio, and text, and integrates them into a joint multi-modal representation to obtain an embedding that aggregates multi-modal temporal information. We propose to train the system with a combinatorial loss on everything at once -- single modalities as well as pairs of modalities -- explicitly leaving out any add-ons such as position or modality encoding. At test time, the resulting model can process and fuse any number of input modalities. Moreover, the implicit properties of the transformer allow it to process inputs of different lengths. To evaluate the proposed approach, we train the model on the large-scale HowTo100M dataset and evaluate the resulting embedding space on four challenging benchmark datasets, obtaining state-of-the-art results in zero-shot video retrieval and zero-shot video action localization.
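A minimal sketch of the combinatorial-loss idea, assuming a shared fusion transformer `embed` that maps one modality (or a concatenation of two) into the joint space; the function names, the InfoNCE formulation, and the exact set of combinations are illustrative assumptions, not the authors' released code:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of embeddings (matched by row)."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def combinatorial_loss(embed, video, audio, text):
    """Contrast single modalities and modality pairs, all at once."""
    v, a, t = embed(video), embed(audio), embed(text)
    va = embed(torch.cat([video, audio], dim=1))   # fused pair embeddings
    vt = embed(torch.cat([video, text], dim=1))
    at = embed(torch.cat([audio, text], dim=1))
    pairs = [(v, a), (v, t), (a, t),               # single vs. single
             (va, t), (vt, a), (at, v)]            # fused pair vs. remaining single
    return sum(contrastive_loss(x, y) for x, y in pairs) / len(pairs)
```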

【2】 Transformer-Based Approach for Joint Handwriting and Named Entity Recognition in Historical Documents
Link: https://arxiv.org/abs/2112.04189

Authors: Ahmed Cheikh Rouhoua, Marwa Dhiaf, Yousri Kessentini, Sinda Ben Salem
Abstract: The extraction of relevant information carried by named entities in handwritten documents is still a challenging task. Unlike traditional information extraction approaches, which usually tackle text transcription and named entity recognition as separate subsequent tasks, we propose in this paper an end-to-end transformer-based approach that performs the two tasks jointly. The proposed approach operates at the paragraph level, which brings two main benefits. First, it allows the model to avoid unrecoverable early errors due to line segmentation. Second, it allows the model to exploit larger two-dimensional context information to identify the semantic categories, reaching a higher final prediction accuracy. We also explore different training scenarios to show their effect on performance, and we demonstrate that a two-stage learning strategy can make the model reach a higher final prediction accuracy. As far as we know, this work presents the first approach that adopts transformer networks for named entity recognition in handwritten documents. We achieve new state-of-the-art performance on the complete task of the ICDAR 2017 Information Extraction competition using the Esposalles database, even though the proposed technique does not use any dictionaries, language modeling, or post-processing.

BERT (1 paper)

【1】 JABER: Junior Arabic BERt
Link: https://arxiv.org/abs/2112.04329

Authors: Abbas Ghaddar, Yimeng Wu, Ahmad Rashid, Khalil Bibi, Mehdi Rezagholizadeh, Chao Xing, Yasheng Wang, Duan Xinyu, Zhefeng Wang, Baoxing Huai, Xin Jiang, Qun Liu, Philippe Langlais
Notes: Technical report
Abstract: Language-specific pre-trained models have proven to be more accurate than multilingual ones in a monolingual evaluation setting, and Arabic is no exception. However, we found that previously released Arabic BERT models were significantly under-trained. In this technical report, we present JABER, Junior Arabic BERt, our pre-trained language model prototype dedicated to Arabic. We conduct an empirical study to systematically evaluate the performance of models across a diverse set of existing Arabic NLU tasks. Experimental results show that JABER achieves state-of-the-art performance on ALUE, a new benchmark for Arabic Language Understanding Evaluation, as well as on a well-established NER benchmark.

QA|VQA|Question Answering|Dialogue (3 papers)

【1】 Prompting Visual-Language Models for Efficient Video Understanding
Link: https://arxiv.org/abs/2112.04478

Authors: Chen Ju, Tengda Han, Kunhao Zheng, Ya Zhang, Weidi Xie
Notes: Project page: this https URL
Abstract: Visual-language pre-training has shown great success in learning joint visual-textual representations from large-scale web data, demonstrating remarkable ability for zero-shot generalisation. This paper presents a simple method to efficiently adapt one pre-trained visual-language model to novel tasks with minimal training; here, we consider video understanding tasks. Specifically, we propose to optimise a few random vectors, termed continuous prompt vectors, that convert the novel tasks into the same format as the pre-training objectives. In addition, to bridge the gap between static images and videos, temporal information is encoded with lightweight Transformers stacked on top of frame-wise visual features. Experimentally, we conduct extensive ablation studies to analyse the critical components. On 9 public benchmarks of action recognition, action localisation, and text-video retrieval, across closed-set, few-shot, and open-set scenarios, we achieve competitive or state-of-the-art performance compared to existing methods, despite training significantly fewer parameters.
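The core trick, learning only a handful of continuous prompt vectors while the pre-trained model stays frozen, can be sketched as follows; `text_encoder`, `embed_dim`, and `n_prompts` are placeholder assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class ContinuousPrompt(nn.Module):
    """Learnable prompt vectors prepended to a frozen encoder's token
    embeddings; only the prompts receive gradients during adaptation."""
    def __init__(self, text_encoder, embed_dim=512, n_prompts=8):
        super().__init__()
        self.encoder = text_encoder
        for p in self.encoder.parameters():
            p.requires_grad = False            # keep the pre-trained model frozen
        self.prompts = nn.Parameter(torch.randn(n_prompts, embed_dim) * 0.02)

    def forward(self, token_embeddings):       # (batch, seq, dim)
        batch = token_embeddings.size(0)
        prompts = self.prompts.unsqueeze(0).expand(batch, -1, -1)
        return self.encoder(torch.cat([prompts, token_embeddings], dim=1))
```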

【2】 SNEAK: Synonymous Sentences-Aware Adversarial Attack on Natural Language Video Localization
Link: https://arxiv.org/abs/2112.04154

Authors: Wenbo Gou, Wen Shi, Jian Lou, Lijie Huang, Pan Zhou, Ruixuan Li
Abstract: Natural language video localization (NLVL) is an important task in the vision-language understanding area, which calls for an in-depth understanding not only of the computer vision and natural language sides alone, but more importantly of the interplay between the two. Adversarial vulnerability has been well recognized as a critical security issue of deep neural network models, which requires prudent investigation. Despite extensive yet separate studies in video and language tasks, the current understanding of adversarial robustness in vision-language joint tasks like NLVL is less developed. This paper therefore aims to comprehensively investigate the adversarial robustness of NLVL models by examining three facets of vulnerabilities from both attack and defense aspects. To achieve the attack goal, we propose a new adversarial attack paradigm called synonymous sentences-aware adversarial attack on NLVL (SNEAK), which captures the cross-modality interplay between the vision and language sides.

【3】 FreeTalky: Don't Be Afraid! Conversations Made Easier by a Humanoid Robot using Persona-based Dialogue
Link: https://arxiv.org/abs/2112.04126

Authors: Chanjun Park, Yoonna Jang, Seolhwa Lee, Sungjin Park, Heuiseok Lim
Notes: Accepted to the Artificial Intelligence for Education (AI4EDU) workshop at AAAI 2022
Abstract: We propose a deep learning-based foreign language learning platform, named FreeTalky, for people who experience anxiety dealing with foreign languages, by employing the humanoid robot NAO and various deep learning models. A persona-based dialogue system embedded in NAO provides interesting and consistent multi-turn dialogue for users. Also, a grammar error correction system promotes improvement of the users' grammar skills. Thus, our system enables personalized learning based on persona dialogue and facilitates grammar learning using grammar error feedback. Furthermore, we verified through human evaluation whether FreeTalky provides practical help in alleviating xenoglossophobia by replacing the real human in the conversation with a NAO robot.

Semantic Parsing (1 paper)

【1】 Multinational Address Parsing: A Zero-Shot Evaluation
Link: https://arxiv.org/abs/2112.04008

Authors: Marouane Yassine, David Beauchemin, François Laviolette, Luc Lamontagne
Notes: Accepted in the International Journal of Information Science and Technology (iJIST). arXiv admin note: text overlap with arXiv:2006.16152
Abstract: Address parsing consists of identifying the segments that make up an address, such as a street name or a postal code. Because of its importance for tasks like record linkage, address parsing has been approached with many techniques, the latest relying on neural networks. While these models yield notable results, previous work on neural networks has only focused on parsing addresses from a single source country. This paper explores the possibility of transferring the address parsing knowledge acquired by training deep learning models on some countries' addresses to others, with no further training, in a zero-shot transfer learning setting. We also experiment with an attention mechanism and a domain adversarial training algorithm in the same zero-shot transfer setting to improve performance. Both methods yield state-of-the-art performance for most of the tested countries while giving good results for the remaining ones. We also explore the effect of incomplete addresses on our best model, and we evaluate the impact of using incomplete addresses during training. In addition, we propose an open-source Python implementation of some of our trained models.
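Domain adversarial training of this kind is classically implemented with a gradient-reversal layer (as in DANN); a minimal PyTorch sketch, with the surrounding tag and country heads left as assumptions:

```python
import torch
from torch.autograd import Function

class GradReverse(Function):
    """Identity on the forward pass; flips the gradient sign on backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Usage sketch (heads are hypothetical): features feed the tag classifier
# directly and the country discriminator through the reversal layer, which
# pushes the features to become country-invariant.
#   tag_loss = criterion(tag_head(features), tags)
#   country_loss = criterion(country_head(grad_reverse(features)), country_ids)
#   (tag_loss + country_loss).backward()
```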

Reasoning|Analysis|Understanding|Explanation (1 paper)

【1】 Sentiment Analysis and Effect of COVID-19 Pandemic using College SubReddit Data
Link: https://arxiv.org/abs/2112.04351

Authors: Tian Yan, Fang Liu
Abstract: The COVID-19 pandemic has affected societies and human health and well-being in various ways. In this study, we collected Reddit data from 2019 (pre-pandemic) and 2020 (pandemic) from the subreddit communities associated with 8 universities, applied natural language processing (NLP) techniques, and trained graph neural networks with social media data, to study how the pandemic has affected people's emotions and psychological states compared to the pre-pandemic era. Specifically, we first applied a pre-trained Robustly Optimized BERT pre-training approach (RoBERTa) to learn embeddings from the semantic information of Reddit messages and trained a graph attention network (GAT) for sentiment classification. The usage of GAT allows us to leverage the relational information among the messages during training. We then applied subgroup-adaptive model stacking to combine the prediction probabilities from RoBERTa and GAT to yield the final sentiment classification. With the manually labeled and model-predicted sentiment labels on the collected data, we applied a generalized linear mixed-effects model to estimate the effects of the pandemic and of online teaching on people's sentiment in a statistically significant manner. The results suggest that the odds of negative sentiment in 2020 are 14.6% higher than in 2019 (p-value < 0.001), and the odds of negative sentiment are 41.6% higher with in-person teaching than with online teaching in 2020 (p-value = 0.037) in the studied population.
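For readers unfamiliar with mixed-effects logistic models: the reported percentages are changes in odds, i.e. exp(beta) - 1 for a coefficient beta on the log-odds scale. A quick arithmetic check, reconstructing the coefficients from the reported odds ratios:

```python
import math

# A logistic mixed model reports coefficients on the log-odds scale; a
# coefficient beta corresponds to a multiplicative change exp(beta) in odds.
beta_year = math.log(1.146)    # 2020 vs. 2019 -> odds ratio 1.146 (+14.6%)
beta_teach = math.log(1.416)   # in-person vs. online in 2020 -> 1.416 (+41.6%)

def odds_change(beta):
    """Percent change in odds implied by a log-odds coefficient."""
    return (math.exp(beta) - 1) * 100

print(f"{odds_change(beta_year):.1f}% higher odds of negative sentiment in 2020")
print(f"{odds_change(beta_teach):.1f}% higher odds with in-person teaching")
```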

GAN|Adversarial|Attack|Generation (1 paper)

【1】 Does Structure Matter? Leveraging Data-to-Text Generation for Answering Complex Information Needs
Link: https://arxiv.org/abs/2112.04344

Authors: Hanane Djeddal, Thomas Gerald, Laure Soulier, Karen Pinel-Sauvagnat, Lynda Tamine
Notes: 8 pages, 1 figure, ECIR 2022 short paper
Abstract: In this work, our aim is to provide a structured answer in natural language to a complex information need. In particular, we envision using generative models from the perspective of data-to-text generation. We propose the use of a content selection and planning pipeline, which aims at structuring the answer by generating intermediate plans. The experimental evaluation is performed using the TREC Complex Answer Retrieval (CAR) dataset. We evaluate both the generated answer and its corresponding structure and show the effectiveness of planning-based models in comparison to a text-to-text model.
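A generic plan-then-realise pipeline of this kind can be sketched with any off-the-shelf seq2seq model; the prefixes, the model choice, and the inputs below are illustrative assumptions, not the authors' system:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

def generate(prefix, text, max_new_tokens=128):
    ids = tok(prefix + text, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=max_new_tokens)
    return tok.decode(out[0], skip_special_tokens=True)

query = "effects of caffeine on sleep"
passages = "...retrieved paragraphs for the query..."
plan = generate("plan: ", query + " " + passages)    # stage 1: intermediate plan
answer = generate("write: ", plan + " " + passages)  # stage 2: surface realisation
```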

Semi/Weakly/Un-supervised|Uncertainty (1 paper)

【1】 Learning music audio representations via weak language supervision
Link: https://arxiv.org/abs/2112.04214

Authors: Ilaria Manco, Emmanouil Benetos, Elio Quinton, Gyorgy Fazekas
Notes: 5 pages, 5 figures
Abstract: Audio representations for music information retrieval are typically learned via supervised learning in a task-specific fashion. Although effective at producing state-of-the-art results, this scheme lacks flexibility with respect to the range of applications a model can have and requires extensively annotated datasets. In this work, we pose the question of whether it may be possible to exploit weakly aligned text as the only supervisory signal to learn general-purpose music audio representations. To address this question, we design a multimodal architecture for music and language pre-training (MuLaP) optimised via a set of proxy tasks. Weak supervision is provided in the form of noisy natural language descriptions conveying the overall musical content of the track. After pre-training, we transfer the audio backbone of the model to a set of music audio classification and regression tasks. We demonstrate the usefulness of our approach by comparing the performance of audio representations produced by the same audio backbone with different training strategies, and show that our pre-training method consistently achieves comparable or higher scores on all tasks and datasets considered. Our experiments also confirm that MuLaP effectively leverages audio-caption pairs to learn representations that are competitive with audio-only and cross-modal self-supervised methods in the literature.

Recognition/Classification (1 paper)

【1】 A study on native American English speech recognition by Indian listeners with varying word familiarity level
Link: https://arxiv.org/abs/2112.04151

Authors: Abhayjeet Singh, Achuth Rao MV, Rakesh Vaideeswaran, Chiranjeevi Yarra, Prasanta Kumar Ghosh
Notes: 6 pages, 5 figures, COCOSDA 2021
Abstract: In this study, listeners of varied Indian nativities are asked to listen to and recognize TIMIT utterances spoken by American speakers. We have three kinds of responses from each listener while they recognize an utterance: 1. sentence difficulty ratings, 2. speaker difficulty ratings, and 3. a transcription of the utterance. From these transcriptions, word error rate (WER) is calculated and used as a metric to evaluate the similarity between the recognized and the original sentences. The sentences selected in this study are categorized into three groups -- easy, medium and hard -- based on the frequency of occurrence of the words in them. We observe that the sentence and speaker difficulty ratings and the WERs increase from the easy to the hard category of sentences. We also compare the human speech recognition (HSR) performance with that of three automatic speech recognition (ASR) systems under the following three combinations of acoustic model (AM) and language model (LM): ASR1) AM trained with recordings from speakers of Indian origin and LM built on TIMIT text; ASR2) AM using recordings from native American speakers and LM built on text from the LibriSpeech corpus; and ASR3) AM using recordings from native American speakers and LM built on LibriSpeech and TIMIT text. We observe that HSR performance is similar to that of ASR1, whereas ASR3 achieves the best performance. Speaker-nativity-wise analysis shows that utterances from speakers of some nativities are more difficult for Indian listeners to recognize than those from a few other nativities.
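WER, the metric used throughout the study, is the word-level edit distance between the listener's transcription and the reference, normalised by the reference length; a self-contained sketch:

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) divided
    by the number of reference words, via Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("she had your dark suit", "she had a dark suit"))  # 0.2
```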

Zero/Few/One-Shot|Transfer|Adaptation (1 paper)

【1】 Zero-Shot Recommendation as Language Modeling
Link: https://arxiv.org/abs/2112.04184

Authors: Damien Sileo, Wout Vossen, Robbe Raymaekers
Notes: Accepted at ECIR 2022

Representation (1 paper)

【1】 VIRT: Improving Representation-based Models for Text Matching through Virtual Interaction
Link: https://arxiv.org/abs/2112.04195

Authors: Dan Li, Yang Yang, Hongyin Tang, Jingang Wang, Tong Xu, Wei Wu, Enhong Chen
Abstract: With the booming of pre-trained transformers, remarkable progress has been made in textual pair modeling to support relevant natural language applications. Two lines of approaches have been developed for text matching: interaction-based models performing full interactions over the textual pair, and representation-based models encoding the pair independently with siamese encoders. The former achieves compelling performance due to its deep interaction modeling ability, yet at a sacrifice in inference latency. The latter is efficient and widely adopted for practical use; however, it suffers from severe performance degradation due to the lack of interactions. Though some prior works attempt to integrate interactive knowledge into representation-based models, considering the computational cost, they only perform late interaction or knowledge transfer at the top layers. Interactive information in the lower layers is still missing, which limits the performance of representation-based solutions. To remedy this, we propose a novel Virtual InteRacTion mechanism, termed VIRT, to enable full and deep interaction modeling in representation-based models without actual inference computations. Concretely, VIRT asks representation-based encoders to conduct virtual interactions to mimic the behaviors of interaction-based models. In addition, the knowledge distilled from interaction-based encoders is taken as a supervision signal to ensure the effectiveness of the virtual interactions. Since virtual interactions only happen at the training stage, VIRT does not increase the inference cost. Furthermore, we design a VIRT-adapted late interaction strategy to fully utilize the learned virtual interactive knowledge.
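One way to read the virtual-interaction idea is as attention-map distillation: the siamese encoders' independently produced token states simulate a cross-attention map, which is pushed toward the map of a full interaction-based teacher. The sketch below is an interpretation under that assumption, not the paper's exact objective:

```python
import math
import torch
import torch.nn.functional as F

def virtual_interaction_loss(h_a, h_b, teacher_attn):
    """h_a: (B, La, D) and h_b: (B, Lb, D) are token states from the two
    independent siamese encoders; teacher_attn: (B, La, Lb) is the
    cross-attention (probabilities) of an interaction-based teacher.

    The simulated student attention is distilled toward the teacher's, so
    interaction knowledge is injected with zero inference-time cost."""
    d = h_a.size(-1)
    scores = h_a @ h_b.transpose(1, 2) / math.sqrt(d)   # (B, La, Lb)
    student_log_attn = F.log_softmax(scores, dim=-1)
    return F.kl_div(student_log_attn, teacher_attn, reduction="batchmean")
```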

Other Neural Networks|Deep Learning|Models|Modeling (5 papers)

【1】 FLAVA: A Foundational Language And Vision Alignment Model
Link: https://arxiv.org/abs/2112.04482

Authors: Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, Douwe Kiela
Notes: 18 pages
Abstract: State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety of downstream tasks. Generally, such models are often either cross-modal (contrastive) or multi-modal (with earlier fusion) but not both, and they often only target specific modalities or tasks. A promising direction would be to use a single holistic universal model, as a "foundation", that targets all modalities at once -- a true vision and language foundation model should be good at vision tasks, language tasks, and cross- and multi-modal vision and language tasks. We introduce FLAVA as such a model and demonstrate impressive performance on a wide range of 35 tasks spanning these target modalities.

【2】 Improving language models by retrieving from trillions of tokens
Link: https://arxiv.org/abs/2112.04426

Authors: Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Oriol Vinyals, Simon Osindero, Karen Simonyan, Jack W. Rae, Erich Elsen, Laurent Sifre
Abstract: We enhance auto-regressive language models by conditioning on document chunks retrieved from a large corpus, based on local similarity with preceding tokens. With a 2-trillion-token database, our Retrieval-Enhanced Transformer (RETRO) obtains performance comparable to GPT-3 and Jurassic-1 on the Pile, despite using 25x fewer parameters. After fine-tuning, RETRO's performance translates to downstream knowledge-intensive tasks such as question answering. RETRO combines a frozen BERT retriever, a differentiable encoder, and a chunked cross-attention mechanism to predict tokens based on an order of magnitude more data than what is typically consumed during training. We typically train RETRO from scratch, yet can also rapidly "RETROfit" pre-trained transformers with retrieval and still achieve good performance. Our work opens up new avenues for improving language models through explicit memory at unprecedented scale.
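At the heart of RETRO is nearest-neighbour chunk retrieval with a frozen BERT encoder. A brute-force sketch of just the retrieval step (variable names are assumptions; a real system would use an approximate index over the trillions-of-tokens database rather than a dense matrix product):

```python
import numpy as np

def retrieve_neighbors(chunk_embs, db_embs, db_chunks, k=2):
    """For each input chunk embedding, return the k nearest database chunks
    by cosine similarity. Embeddings are assumed to come from a frozen BERT
    encoder, as in RETRO.

    chunk_embs: (Q, D) query chunk embeddings
    db_embs:    (N, D) database chunk embeddings
    db_chunks:  list of N text chunks aligned with db_embs
    """
    q = chunk_embs / np.linalg.norm(chunk_embs, axis=-1, keepdims=True)
    d = db_embs / np.linalg.norm(db_embs, axis=-1, keepdims=True)
    scores = q @ d.T                            # (Q, N) cosine similarities
    top = np.argsort(-scores, axis=-1)[:, :k]   # indices of k nearest chunks
    return [[db_chunks[j] for j in row] for row in top]
```

The retrieved chunks are then attended to through the model's chunked cross-attention layers while the retriever itself stays frozen.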

【3】 Ethical and social risks of harm from Language Models
Link: https://arxiv.org/abs/2112.04359

Authors: Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, Zac Kenton, Sasha Brown, Will Hawkins, Tom Stepleton, Courtney Biles, Abeba Birhane, Julia Haas, Laura Rimell, Lisa Anne Hendricks, William Isaac, Sean Legassick, Geoffrey Irving, Iason Gabriel
Abstract: This paper aims to help structure the risk landscape associated with large-scale Language Models (LMs). In order to foster advances in responsible innovation, an in-depth understanding of the potential risks posed by these models is needed. A wide range of established and anticipated risks are analysed in detail, drawing on multidisciplinary expertise and literature from computer science, linguistics, and the social sciences. We outline six specific risk areas: I. Discrimination, Exclusion and Toxicity; II. Information Hazards; III. Misinformation Harms; IV. Malicious Uses; V. Human-Computer Interaction Harms; VI. Automation, Access, and Environmental Harms. The first area concerns the perpetuation of stereotypes, unfair discrimination, exclusionary norms, toxic language, and lower performance by social group for LMs. The second focuses on risks from private data leaks or LMs correctly inferring sensitive information. The third addresses risks arising from poor, false or misleading information, including in sensitive domains, and knock-on risks such as the erosion of trust in shared information. The fourth considers risks from actors who try to use LMs to cause harm. The fifth focuses on risks specific to LMs used to underpin conversational agents that interact with human users, including unsafe use, manipulation or deception. The sixth discusses the risk of environmental harm, job automation, and other challenges that may have a disparate effect on different social groups or communities. In total, we review 21 risks in depth. We discuss the points of origin of different risks and point to potential mitigation approaches. Lastly, we discuss organisational responsibilities in implementing mitigations, and the role of collaboration and participation. We highlight directions for further research, particularly on expanding the toolkit for assessing and evaluating the outlined risks in LMs.

【4】 Contrastive Instruction-Trajectory Learning for Vision-Language Navigation
Link: https://arxiv.org/abs/2112.04138

Authors: Xiwen Liang, Fengda Zhu, Yi Zhu, Bingqian Lin, Bing Wang, Xiaodan Liang
Notes: Accepted by AAAI 2022
Abstract: The vision-language navigation (VLN) task requires an agent to reach a target with the guidance of natural language instructions. Previous works learn to navigate step-by-step following an instruction. However, these works may fail to discriminate the similarities and discrepancies across instruction-trajectory pairs and ignore the temporal continuity of sub-instructions. These problems hinder agents from learning distinctive vision-and-language representations, harming the robustness and generalizability of the navigation policy. In this paper, we propose a Contrastive Instruction-Trajectory Learning (CITL) framework that explores invariance across similar data samples and variance across different ones to learn distinctive representations for robust navigation. Specifically, we propose: (1) a coarse-grained contrastive learning objective to enhance vision-and-language representations by contrasting semantics of full trajectory observations and instructions, respectively; (2) a fine-grained contrastive learning objective to perceive instructions by leveraging the temporal information of the sub-instructions; (3) a pairwise sample-reweighting mechanism for contrastive learning to mine hard samples and hence mitigate the influence of data sampling bias in contrastive learning. Our CITL can be easily integrated with VLN backbones to form a new learning paradigm and achieve better generalizability in unseen environments. Extensive experiments show that the model with CITL surpasses the previous state-of-the-art methods on R2R, R4R, and RxR.
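A plausible form of the pairwise sample-reweighting in (3) is a weighted InfoNCE in which harder negatives (those more similar to the anchor) receive larger weights; the specific weighting scheme below is an assumption for illustration:

```python
import torch
import torch.nn.functional as F

def reweighted_info_nce(anchor, positive, negatives, temperature=0.1, beta=1.0):
    """Weighted InfoNCE over L2-normalised embeddings.

    anchor, positive: (B, D); negatives: (B, N, D).
    Negatives with higher anchor similarity get larger weights, so hard
    samples dominate the denominator and the gradient."""
    pos = (anchor * positive).sum(-1, keepdim=True) / temperature       # (B, 1)
    neg = torch.einsum("bd,bnd->bn", anchor, negatives) / temperature   # (B, N)
    # Hard-negative weights (mean 1); detached so weights carry no gradient.
    w = torch.softmax(beta * neg.detach(), dim=-1) * neg.size(-1)
    # exp(log w + s) = w * exp(s): weighting folded into the logits.
    logits = torch.cat([pos, torch.log(w + 1e-8) + neg], dim=-1)
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)
```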

【5】 Learning to Select the Next Reasonable Mention for Entity Linking
Link: https://arxiv.org/abs/2112.04104

Authors: Jian Sun, Yu Zhou, Chengqing Zong
Notes: Accepted to the AAAI-2022 Workshop on Knowledge Discovery from Unstructured Data in Financial Services
Abstract: Entity linking aims to establish a link between entity mentions in a document and the corresponding entities in knowledge graphs (KGs). Previous work has shown the effectiveness of global coherence for entity linking. However, most existing global linking methods based on sequential decisions focus on how to utilize previously linked entities to enhance later decisions. In those methods, the order of mentions is fixed, making the model unable to adjust the subsequent linking targets according to the previously linked results, which causes the previous information to be utilized unreasonably. To address this problem, we propose a novel model, called DyMen, that dynamically adjusts the subsequent linking target based on the previously linked entities via reinforcement learning, enabling the model to select a linking target that can make full use of previously linked information. We sample mentions with a sliding window to reduce the action sampling space of reinforcement learning and maintain the semantic coherence of mentions. Experiments conducted on several benchmark datasets show the effectiveness of the proposed model.

Others (3 papers)

【1】 MLP Architectures for Vision-and-Language Modeling: An Empirical Study
Link: https://arxiv.org/abs/2112.04453

Authors: Yixin Nie, Linjie Li, Zhe Gan, Shuohang Wang, Chenguang Zhu, Michael Zeng, Zicheng Liu, Mohit Bansal, Lijuan Wang
Notes: 15 pages
Abstract: We initiate the first empirical study on the use of MLP architectures for vision-and-language (VL) fusion. Through extensive experiments on 5 VL tasks and 5 robust VQA benchmarks, we find that: (i) without pre-training, using MLPs for multimodal fusion has a noticeable performance gap compared to transformers; (ii) however, VL pre-training can help close the performance gap; (iii) instead of heavy multi-head attention, adding tiny one-head attention to MLPs is sufficient to achieve performance comparable to transformers. Moreover, we also find that the performance gap between MLPs and transformers is not widened when evaluated on the harder robust VQA benchmarks, suggesting that using MLPs for VL fusion generalizes roughly to a similar degree as using transformers. These results hint that MLPs can effectively learn to align vision and text features extracted from lower-level encoders without heavy reliance on self-attention. Based on this, we ask an even bolder question: can we have an all-MLP architecture for VL modeling, where both VL fusion and the vision encoder are replaced with MLPs? Our results show that an all-MLP VL model is sub-optimal compared to state-of-the-art full-featured VL models when both are pre-trained. However, pre-training an all-MLP model can surprisingly achieve a better average score than full-featured transformer models without pre-training. This indicates the potential of large-scale pre-training of MLP-like architectures for VL modeling and inspires future research on simplifying well-established VL modeling with less inductive design bias. Our code is publicly available at: https://github.com/easonnie/mlp-vil
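Finding (iii), MLP fusion plus a tiny single-head attention, might look like the following block; the dimensions and the placement of normalisation are assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class MLPFusionBlock(nn.Module):
    """An MLP fusion block with one single-head attention layer added,
    echoing the finding that tiny one-head attention on top of MLPs is
    enough to approach transformer-level fusion."""
    def __init__(self, dim=768, hidden=3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, hidden),
                                 nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, vision_tokens, text_tokens):
        # Concatenate both modalities into one joint sequence and fuse.
        x = torch.cat([vision_tokens, text_tokens], dim=1)
        h = self.norm(x)
        x = x + self.attn(h, h, h)[0]   # tiny one-head attention
        return x + self.mlp(x)          # MLP does the heavy lifting
```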

【2】 Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand
Link: https://arxiv.org/abs/2112.04139

Authors: Jungo Kasai, Keisuke Sakaguchi, Ronan Le Bras, Lavinia Dunagan, Jacob Morrison, Alexander R. Fabbri, Yejin Choi, Noah A. Smith
Notes: Project website: this https URL

【3】 Emotion-Cause Pair Extraction in Customer Reviews
Link: https://arxiv.org/abs/2112.03984

Authors: Arpit Mittal, Jeel Tejaskumar Vaishnav, Aishwarya Kaliki, Nathan Johns, Wyatt Pease
Notes: 7 pages, 8 figures
Abstract: Emotion-Cause Pair Extraction (ECPE) is a complex yet popular area in Natural Language Processing, due to its importance and potential applications in various domains. In this report, we aim to present our work on ECPE in the domain of online reviews. With a manually annotated dataset, we explore an algorithm to extract emotion-cause pairs using a neural network. In addition, we propose a model that builds on previous reference materials and combines emotion-cause pair extraction with research in the domain of emotion-aware word embeddings, sending these embeddings into a Bi-LSTM layer, which gives us the emotionally relevant clauses. With the constraint of a limited dataset, we achieved . The overall scope of our report comprises a comprehensive literature review, implementation of referenced methods for dataset construction and initial model training, modification of previous work in ECPE by proposing an improvement to the pipeline, and algorithm development and implementation for the specific domain of reviews.

