cs.CL 方向,今日共计47篇
Transformer(2篇)
【1】 Efficient Dialogue State Tracking by Masked Hierarchical Transformer 标题:基于屏蔽分层变换器的高效对话状态跟踪
作者:Min Mao,Jiasheng Liu,Jingyao Zhou,Haipang Wu 机构:Hithink RoyalFlush Information Network Co.,Ltd., Zhejiang, China 备注:6 pages, 3 figures 链接:https://arxiv.org/abs/2106.14433 摘要:本文描述了我们针对DSTC9第2赛道(跨语言多领域对话状态跟踪)的方法。该任务的目标是构建一个跨语言对话状态跟踪系统,其训练集为高资源语言、测试集为低资源语言。我们提出了槽操作分类任务和状态跟踪任务的联合学习方法。此外,我们设计了一种新的掩码机制来融合对话的上下文信息。实验结果表明,该模型在DSTC挑战赛上取得了良好的性能,在MultiWOZ(en-zh)数据集和CrossWOZ(zh-en)数据集上的联合准确率分别为62.37%和23.96%。 摘要:This paper describes our approach to DSTC 9 Track 2: Cross-lingual Multi-domain Dialog State Tracking, where the task goal is to build a cross-lingual dialog state tracker with a training set in a rich-resource language and a testing set in a low-resource language. We formulate a method for joint learning of the slot operation classification task and the state tracking task. Furthermore, we design a novel mask mechanism for fusing contextual information about the dialogue. The results show the proposed model achieves excellent performance on DSTC Challenge II with a joint accuracy of 62.37% and 23.96% on the MultiWOZ (en-zh) and CrossWOZ (zh-en) datasets, respectively.
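下面给出一个极简的示意性代码,展示"槽操作分类 + 槽值预测"联合学习的常见做法:两个任务头共享编码器输出,损失相加,且槽值损失只在操作为"更新"的槽上计算。其中的维度、操作类别数与掩码方式均为演示用的假设,并非论文原始实现。

```python
import torch
import torch.nn as nn

class JointDSTHead(nn.Module):
    """在共享(掩码分层)编码器之上,联合学习槽操作分类与槽值预测的示意结构。"""
    def __init__(self, hidden=256, n_ops=4, n_values=100):
        super().__init__()
        self.op_clf = nn.Linear(hidden, n_ops)        # 槽操作分类头(如 carryover/update/delete/dontcare)
        self.value_clf = nn.Linear(hidden, n_values)  # 槽值预测头(此处简化为分类)

    def forward(self, slot_repr):                     # slot_repr: [batch, n_slots, hidden]
        return self.op_clf(slot_repr), self.value_clf(slot_repr)

def joint_loss(op_logits, val_logits, op_labels, val_labels, update_mask):
    ce = nn.CrossEntropyLoss(reduction="none")
    op_loss = ce(op_logits.flatten(0, 1), op_labels.flatten()).mean()
    val_loss = ce(val_logits.flatten(0, 1), val_labels.flatten())
    # 仅对操作为 update 的槽累计槽值损失
    val_loss = (val_loss * update_mask.flatten()).sum() / update_mask.sum().clamp(min=1)
    return op_loss + val_loss

head = JointDSTHead()
op_logits, val_logits = head(torch.randn(2, 30, 256))
loss = joint_loss(op_logits, val_logits,
                  torch.randint(0, 4, (2, 30)), torch.randint(0, 100, (2, 30)),
                  (torch.rand(2, 30) > 0.7).float())
loss.backward()
```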
【2】 SymbolicGPT: A Generative Transformer Model for Symbolic Regression 标题:SymbolicGPT:一种符号回归的产生式变换模型
作者:Mojtaba Valipour,Bowen You,Maysum Panju,Ali Ghodsi 机构:University of Waterloo 备注:11 pages, 4 figures 链接:https://arxiv.org/abs/2106.14131 摘要:符号回归的任务是识别一个数学表达式,该表达式最适合所提供的输入和输出值数据集。由于数学表达式空间的丰富性,符号回归通常是一个具有挑战性的问题。传统的基于遗传进化算法的方法已经使用了几十年,而基于深度学习的方法是一个相对较新的和活跃的研究领域。在这项工作中,我们提出了一个新的基于Transformer的符号回归语言模型SymbolicGPT。该模型利用了概率语言模型(如GPT)的优点,包括性能上的优势和灵活性。通过综合实验,我们发现我们的模型在精确度、运行时间和数据效率方面都比竞争模型有很强的表现。 摘要:Symbolic regression is the task of identifying a mathematical expression that best fits a provided dataset of input and output values. Due to the richness of the space of mathematical expressions, symbolic regression is generally a challenging problem. While conventional approaches based on genetic evolution algorithms have been used for decades, deep learning-based methods are relatively new and an active research area. In this work, we present SymbolicGPT, a novel transformer-based language model for symbolic regression. This model exploits the advantages of probabilistic language models like GPT, including strength in performance and flexibility. Through comprehensive experiments, we show that our model performs strongly compared to competing models with respect to the accuracy, running time, and data efficiency.
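符号回归任务的核心是"在给定数据点上评估候选表达式的拟合优度"。下面是一个与论文实现无关的假设性小例子,演示如何对(例如由生成式模型解码得到的)表达式字符串在数据集上打分并选优;候选表达式与变量命名均为虚构,仅用于说明任务形式,不涉及SymbolicGPT的模型结构。

```python
import numpy as np

def fit_score(expr, X, y):
    """在受限命名空间中求值表达式字符串,返回均方误差;非法表达式返回无穷大。"""
    env = {"x1": X[:, 0], "x2": X[:, 1], "sin": np.sin, "cos": np.cos,
           "exp": np.exp, "log": np.log}
    try:
        pred = eval(expr, {"__builtins__": {}}, env)
        return float(np.mean((pred - y) ** 2))
    except Exception:
        return float("inf")

# 玩具数据:真实关系为 y = sin(x1) + 0.5 * x2
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]

# 假设这些是以数据点为条件、由语言模型采样出的候选表达式
candidates = ["sin(x1) + 0.5*x2", "x1 + x2", "cos(x1)*x2"]
best = min(candidates, key=lambda e: fit_score(e, X, y))
print(best, fit_score(best, X, y))
```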
BERT(3篇)
【1】 AMU-EURANOVA at CASE 2021 Task 1: Assessing the stability of multilingual BERT 标题:2021年CASE的AMU-EURANOVA任务1:评估多语种BERT的稳定性
作者:Léo Bouscarrat,Antoine Bonnefoy,Cécile Capponi,Carlos Ramisch 机构:EURA NOVA, Marseille, France, Aix Marseille Univ, Université de Toulon, CNRS, LIS, Marseille, France 备注:None 链接:https://arxiv.org/abs/2106.14625 摘要:本文介绍了我们参与CASE 2021共享任务中任务1的工作。该任务是从新闻中进行多语言事件抽取。我们重点研究了子任务4"事件信息抽取"。这个子任务的训练数据集较小,我们微调了一个多语言BERT来解决它。我们还研究了该数据集上的不稳定性问题,并尝试加以缓解。 摘要:This paper explains our participation in task 1 of the CASE 2021 shared task. This task is about multilingual event extraction from news. We focused on sub-task 4, event information extraction. This sub-task has a small training dataset and we fine-tuned a multilingual BERT to solve this sub-task. We studied the instability problem on the dataset and tried to mitigate it.
【2】 A Closer Look at How Fine-tuning Changes BERT 标题:仔细观察微调如何改变BERT
作者:Yichu Zhou,Vivek Srikumar 机构:University of Utah 链接:https://arxiv.org/abs/2106.14282 摘要:鉴于预训练的语境化表示在当今自然语言处理中非常普遍,人们已经做出了一些努力来理解这种表示所包含的信息。使用这种表示的一个常见策略是针对最终任务对其进行微调。然而,针对任务的微调如何改变底层空间的研究较少。在这项工作中,我们研究了英文BERT系列模型,并使用两种探测技术来分析微调如何改变表示空间。我们的实验表明,微调之所以能提高性能,是因为它会将与某一标签相关的点推离其他标签。通过比较微调前后的表示,我们还发现微调并不会任意改变表示;相反,它在保留原始结构的同时,将表示调整到下游任务。最后,利用精心构造的实验,我们证明了微调可以将训练集编码进表示中,这表明存在一种新的过拟合问题。 摘要:Given the prevalence of pre-trained contextualized representations in today's NLP, there have been several efforts to understand what information such representations contain. A common strategy to use such representations is to fine-tune them for an end task. However, how fine-tuning for a task changes the underlying space is less studied. In this work, we study the English BERT family and use two probing techniques to analyze how fine-tuning changes the space. Our experiments reveal that fine-tuning improves performance because it pushes points associated with a label away from other labels. By comparing the representations before and after fine-tuning, we also discover that fine-tuning does not change the representations arbitrarily; instead, it adjusts the representations to downstream tasks while preserving the original structure. Finally, using carefully constructed experiments, we show that fine-tuning can encode training sets in a representation, suggesting an overfitting problem of a new kind.
【3】 Benchmarking Differential Privacy and Federated Learning for BERT Models 标题:BERT模型的差分隐私与联邦学习基准评测
作者:Priyam Basu,Tiasa Singha Roy,Rakshit Naidu,Zumrut Muftuoglu,Sahib Singh,Fatemehsadat Mireshghallah 机构:Manipal Institute of Technology, Carnegie Mellon University, OpenMined, Yildiz Technical University, Ford Motor Company, University of California 备注:4 pages, 3 tables, 1 figure 链接:https://arxiv.org/abs/2106.13973 摘要:自然语言处理(NLP)技术可以利用一个人的话语集合来辅助诊断抑郁症等疾病。抑郁症是一种严重的疾病,会对人的感受、思维和行为产生不良影响,从而导致情绪和身体问题。由于此类数据的敏感性,处理这些数据并用其训练模型时需要采取隐私保护措施。在这项工作中,我们研究了在集中式和联邦学习(FL)两种设置下,应用差分隐私(DP)对训练语境化语言模型(BERT、ALBERT、RoBERTa和DistilBERT)的影响。我们就如何以保护隐私的方式训练NLP模型,以及哪些架构和设置能提供更理想的隐私-效用权衡给出了见解。我们期望这项工作能用于未来的医疗和心理健康研究,以保护病史隐私。为此,我们提供了这项工作的开源实现。 摘要:Natural Language Processing (NLP) techniques can be applied to help with the diagnosis of medical conditions such as depression, using a collection of a person's utterances. Depression is a serious medical illness that can have adverse effects on how one feels, thinks, and acts, which can lead to emotional and physical problems. Due to the sensitive nature of such data, privacy measures need to be taken for handling and training models with such data. In this work, we study the effects that the application of Differential Privacy (DP) has, in both a centralized and a Federated Learning (FL) setup, on training contextualized language models (BERT, ALBERT, RoBERTa and DistilBERT). We offer insights on how to privately train NLP models and what architectures and setups provide more desirable privacy utility trade-offs. We envisage this work to be used in future healthcare and mental health studies to keep medical history private. Therefore, we provide an open-source implementation of this work.
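差分隐私训练最常用的做法是DP-SGD:逐样本裁剪梯度范数,再向梯度和中加入高斯噪声。下面是一个与论文实现无关的手写概念示例(实际训练BERT这类模型通常会借助Opacus等库;此处的模型结构、裁剪阈值与噪声系数均为演示用的假设取值)。

```python
import torch
import torch.nn as nn

def dp_sgd_step(model, batch_x, batch_y, optimizer, clip_norm=1.0, noise_mult=1.0):
    """DP-SGD 单步示意:逐样本梯度裁剪 + 高斯噪声(概念演示,非论文实现)。"""
    criterion = nn.CrossEntropyLoss()
    grads = [torch.zeros_like(p) for p in model.parameters()]
    for x, y in zip(batch_x, batch_y):                       # 逐样本计算梯度
        model.zero_grad()
        loss = criterion(model(x.unsqueeze(0)), y.unsqueeze(0))
        loss.backward()
        norm = torch.sqrt(sum((p.grad ** 2).sum() for p in model.parameters()))
        scale = min(1.0, clip_norm / (norm.item() + 1e-12))   # 把每个样本的梯度裁剪到 clip_norm
        for g, p in zip(grads, model.parameters()):
            g += p.grad * scale
    model.zero_grad()
    for g, p in zip(grads, model.parameters()):               # 加噪并取平均
        noise = torch.randn_like(g) * noise_mult * clip_norm
        p.grad = (g + noise) / len(batch_x)
    optimizer.step()

# 用一个极小的分类器演示;真实场景中替换为 BERT 等编码器的可训练参数
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
dp_sgd_step(model, torch.randn(8, 16), torch.randint(0, 2, (8,)), opt)
```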
QA|VQA|问答|对话(2篇)
【1】 Overview of BioASQ 2020: The eighth BioASQ challenge on Large-Scale Biomedical Semantic Indexing and Question Answering 标题:BioASQ 2020综述:第八届BioASQ关于大规模生物医学语义索引和问题回答的挑战
作者:Anastasios Nentidis,Anastasia Krithara,Konstantinos Bougiatiotis,Martin Krallinger,Carlos Rodriguez-Penagos,Marta Villegas,Georgios Paliouras 机构:National Center for Scientific Research “Demokritos”, Athens, Greece, Aristotle University of Thessaloniki, Thessaloniki, Greece, National and Kapodistrian University of Athens, Athens, Greece, Barcelona Supercomputing Center, Barcelona, Spain 备注:None 链接:https://arxiv.org/abs/2106.14618 摘要:在本文中,我们概述了第八届BioASQ挑战赛,它作为2020年评估论坛会议与实验室(CLEF)中的一个评测实验室举办。BioASQ是一系列旨在推动大规模生物医学语义索引与问答系统及方法发展的挑战赛。为此,自2012年起每年都会组织共享任务,不同团队开发的系统在相同的高要求基准数据集上竞争,这些数据集代表了生物医学领域专家的真实信息需求。今年,挑战赛新增了一项西班牙语医学语义索引任务。共有34支队伍、100多个系统参加了挑战赛的三项任务。与往年一样,评估结果显示表现最好的系统成功超越了强基线,这表明最先进的系统通过持续改进不断推进研究前沿。 摘要:In this paper, we present an overview of the eighth edition of the BioASQ challenge, which ran as a lab in the Conference and Labs of the Evaluation Forum (CLEF) 2020. BioASQ is a series of challenges aiming at the promotion of systems and methodologies for large-scale biomedical semantic indexing and question answering. To this end, shared tasks are organized yearly since 2012, where different teams develop systems that compete on the same demanding benchmark datasets that represent the real information needs of experts in the biomedical domain. This year, the challenge has been extended with the introduction of a new task on medical semantic indexing in Spanish. In total, 34 teams with more than 100 systems participated in the three tasks of the challenge. As in previous years, the results of the evaluation reveal that the top-performing systems managed to outperform the strong baselines, which suggests that state-of-the-art systems keep pushing the frontier of research through continuous improvements.
【2】 PeCoQ: A Dataset for Persian Complex Question Answering over Knowledge Graph 标题:PeCoQ:基于知识图的波斯复杂问答数据集
作者:Romina Etezadi,Mehrnoush Shamsfard 机构:Natural Language Processing Lab, Shahid Beheshti University, Tehran, Iran 备注:5 pages, 4 figures 链接:https://arxiv.org/abs/2106.14167 摘要:问答系统可以从非结构化文本或结构化数据(如知识图)中找到用户问题的答案。使用包括深度学习模型在内的监督学习方法回答问题需要大规模的训练数据集。近年来,一些数据集被提出用于知识图上的问答任务,这也是本文的关注点。虽然已有许多英语数据集被提出,但波斯语的问答数据集很少。本文介绍了波斯语问答数据集PeCoQ。这个数据集包含10000个复杂的问题和答案,它们是从波斯语知识图FarsBase中提取出来的。对于每个问题,还提供了SPARQL查询和两个由语言学家编写的释义。数据集中有不同类型的复杂性,如多关系、多实体、顺序和时间约束。在本文中,我们讨论了数据集的特点,并描述了我们构建数据集的方法。 摘要:Question answering systems may find the answers to users' questions from either unstructured texts or structured data such as knowledge graphs. Answering questions using supervised learning approaches including deep learning models need large training datasets. In recent years, some datasets have been presented for the task of Question answering over knowledge graphs, which is the focus of this paper. Although many datasets in English were proposed, there have been a few question-answering datasets in Persian. This paper introduces PeCoQ, a dataset for Persian question answering. This dataset contains 10,000 complex questions and answers extracted from the Persian knowledge graph, FarsBase. For each question, the SPARQL query and two paraphrases that were written by linguists are provided as well. There are different types of complexities in the dataset, such as multi-relation, multi-entity, ordinal, and temporal constraints. In this paper, we discuss the dataset's characteristics and describe our methodology for building it.
机器翻译(1篇)
【1】 KGRefiner: Knowledge Graph Refinement for Improving Accuracy of Translational Link Prediction Methods 标题:KGRefiner:通过知识图谱精化提升平移式链接预测方法的准确率
作者:Mohammad Javad Saeedizade,Najmeh Torabian,Behrouz Minaei-Bidgoli 机构: Iran University of Science and Technology 备注:6 pages, 1 figure 链接:https://arxiv.org/abs/2106.14233 摘要:链接预测是根据知识图中包含的事实,推断知识图中实体之间缺失关系的任务。近期的链接预测工作试图通过在神经网络结构中使用更多的层,或采用会增加模型计算复杂度的方法来提高链接预测精度。本文提出了一种精化知识图的方法,使知识图的信息量更大,从而可以使用相对快速的平移式模型更准确地完成链接预测。平移式链接预测模型(如TransE、TransH、TransD等)的复杂度远低于深度学习方法。该方法利用知识图中关系的层次结构和实体的层次结构,将层次信息作为新实体加入图中,并将其与在自身层次中包含该信息的节点相连。实验表明,该方法能在Hits@10、MR和MRR指标上显著提升平移式链接预测方法的性能。 摘要:Link prediction is the task of predicting missing relations between entities of the knowledge graph by inferring from the facts contained in it. Recent work in link prediction has attempted to provide a model for increasing link prediction accuracy by using more layers in neural network architecture or methods that add to the computational complexity of models. In this paper, we propose a method for refining the knowledge graph, which makes the knowledge graph more informative, and link prediction operations can be performed more accurately using relatively fast translational models. Translational link prediction models, such as TransE, TransH, TransD, etc., have much less complexity than deep learning approaches. This method uses the hierarchy of relationships and also the hierarchy of entities in the knowledge graph to add the entity information as a new entity to the graph and connect it to the nodes which contain this information in their hierarchy. Our experiments show that our method can significantly increase the performance of translational link prediction methods in H@10, MR, MRR.
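作为背景,平移式模型中最典型的TransE用 h + r ≈ t 来建模三元组,打分函数为 -||h + r - t||,并以带间隔的排序损失配合负采样训练。下面是一个极简的示意实现(嵌入维度、负采样与损失细节均为常见设置的假设,并非KGRefiner本身的代码):

```python
import torch
import torch.nn as nn

class TransE(nn.Module):
    """平移式链接预测示意:score(h, r, t) = -||h + r - t||_1。"""
    def __init__(self, n_ent, n_rel, dim=50):
        super().__init__()
        self.ent = nn.Embedding(n_ent, dim)
        self.rel = nn.Embedding(n_rel, dim)

    def score(self, h, r, t):
        return -(self.ent(h) + self.rel(r) - self.ent(t)).norm(p=1, dim=-1)

def margin_loss(model, pos, neg, margin=1.0):
    """带间隔的排序损失:正三元组得分应高于随机替换实体得到的负三元组。"""
    return torch.relu(margin - model.score(*pos) + model.score(*neg)).mean()

model = TransE(n_ent=1000, n_rel=20)
h = torch.randint(0, 1000, (64,))
r = torch.randint(0, 20, (64,))
t = torch.randint(0, 1000, (64,))
neg_t = torch.randint(0, 1000, (64,))            # 负采样:随机替换尾实体
loss = margin_loss(model, (h, r, t), (h, r, neg_t))
loss.backward()
```

评测时对每个测试三元组替换头/尾实体并按得分排序,即可计算摘要中提到的Hits@10、MR和MRR指标。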
语义分析(1篇)
【1】 Semantic Parsing Natural Language into Relational Algebra 标题:自然语言到关系代数的语义解析
作者:Ruiyang Xu,Ayush Singh 机构:CS, NLP Fall , Northeastern University 备注:Semester project report for NLP course 链接:https://arxiv.org/abs/2106.13858 摘要:自然数据库接口(NLIDB)是近几十年来研究的热点。NLIDB的核心是一个用于将自然语言转换为SQL的语义解析器。传统NLP方法的解决方案侧重于语法规则模式的学习和通过中间逻辑形式的配对。尽管这些方法在某些特定的数据库和解析任务中提供了可接受的性能,但是它们很难推广和扩展。另一方面,神经深度学习的最新进展似乎为建立一个通用的NLIDB系统提供了一个有希望的方向。与传统方法不同的是,这些神经方法将句法分析问题看作是一个序列对序列的学习问题。本文对几种序列间学习模型进行了实验,并对其在一般数据库解析任务中的性能进行了评价。 摘要:Natural interface to database (NLIDB) has been researched a lot during the past decades. In the core of NLIDB, is a semantic parser used to convert natural language into SQL. Solutions from traditional NLP methodology focuses on grammar rule pattern learning and pairing via intermediate logic forms. Although those methods give an acceptable performance on certain specific database and parsing tasks, they are hard to generalize and scale. On the other hand, recent progress in neural deep learning seems to provide a promising direction towards building a general NLIDB system. Unlike the traditional approach, those neural methodologies treat the parsing problem as a sequence-to-sequence learning problem. In this paper, we experimented on several sequence-to-sequence learning models and evaluate their performance on general database parsing task.
Graph|知识图谱|Knowledge(2篇)
【1】 An evaluation of template and ML-based generation of user-readable text from a knowledge graph 标题:从知识图生成用户可读文本的模板和基于ML的评估
作者:Zola Mahlaza,C. Maria Keet,Jarryd Dunn,Matthew Poulter 机构:University of Cape Town, University Avenue, Cape Town, South Africa 备注:15 pages, 6 figures 链接:https://arxiv.org/abs/2106.14613 摘要:典型的用户友好的知识图呈现是可视化和自然语言文本。在后一种HCI解决方案中,数据驱动的自然语言生成系统受到越来越多的关注,但由于受到内容丢失、幻觉或重复等错误的影响,它们的性能往往优于基于模板的系统。目前还不清楚这些错误中的哪些与文本所针对的人的低质量判断显著相关,这妨碍了根据错误对改进人类评估的影响来解决错误。我们通过一个实验评估了它们之间的可能联系,该实验利用了专家和众包的方法来评估人类编写的文本、模板生成的文本和序列到序列模型生成的文本。结果表明,有错误的文本与人类对自然性和质量的低判断之间没有显著的关联。在机器学习生成的文本中,有丢失或产生幻觉的插槽与人类对自然性和质量的低判断之间也没有显著的联系。因此,这两种方法似乎都是为知识图设计自然语言接口的可行选择。 摘要:Typical user-friendly renderings of knowledge graphs are visualisations and natural language text. Within the latter HCI solution approach, data-driven natural language generation systems receive increased attention, but they are often outperformed by template-based systems due to suffering from errors such as content dropping, hallucination, or repetition. It is unknown which of those errors are associated significantly with low quality judgements by humans who the text is aimed for, which hampers addressing errors based on their impact on improving human evaluations. We assessed their possible association with an experiment availing of expert and crowdsourced evaluations of human authored text, template generated text, and sequence-to-sequence model generated text. The results showed that there was no significant association between human authored texts with errors and the low human judgements of naturalness and quality. There was also no significant association between machine learning generated texts with dropped or hallucinated slots and the low human judgements of naturalness and quality. Thus, both approaches appear to be viable options for designing a natural language interface for knowledge graphs.
【2】 A Knowledge-Grounded Dialog System Based on Pre-Trained Language Models 标题:基于预训练语言模型的基于知识的对话系统
作者:Weijie Zhang,Jiaoxuan Chen,Haipang Wu,Sanhui Wan,Gongfeng Li 机构:Hithink RoyalFlush Information Network Co.,Ltd., Zhejiang, China 备注:7 pages, 1 figures 链接:https://arxiv.org/abs/2106.14444 摘要:我们提出了一个知识为基础的对话系统开发的第九对话系统技术挑战(DSTC9)轨道1-超越领域API:面向任务的会话建模与非结构化知识访问。我们利用迁移学习和现有的语言模型来完成这个挑战轨道上的任务。具体来说,我们将任务分为四个子任务,并对每个子任务上的几个Transformer模型进行了微调。我们还做了一些改进,在性能和效率上都有所提高,包括将模型与传统的实体匹配技术相结合,以及在语言模型的输出层添加指针网络。 摘要:We present a knowledge-grounded dialog system developed for the ninth Dialog System Technology Challenge (DSTC9) Track 1 - Beyond Domain APIs: Task-oriented Conversational Modeling with Unstructured Knowledge Access. We leverage transfer learning with existing language models to accomplish the tasks in this challenge track. Specifically, we divided the task into four sub-tasks and fine-tuned several Transformer models on each of the sub-tasks. We made additional changes that yielded gains in both performance and efficiency, including the combination of the model with traditional entity-matching techniques, and the addition of a pointer network to the output layer of the language model.
摘要|信息提取(3篇)
【1】 Key Information Extraction From Documents: Evaluation And Generator 标题:文档关键信息抽取:评价与生成器
作者:Oliver Bensch,Mirela Popa,Constantin Spille 机构: Maastricht University, MD Maastricht, The Netherlands, KI Group GmbH 备注:7 pages, 1 figure, accepted at the 2nd International Deep Learning meets Ontologies and Natural Language Processing workshop at ESWC 2021, Hersonissos, Greece 链接:https://arxiv.org/abs/2106.14624 摘要:从文档中提取信息通常依赖于处理一维文本序列的自然语言处理方法。在某些情况下,例如,对于从半结构化文档(如发票文档)中提取关键信息,文本的空间和格式信息对于理解上下文意义至关重要。卷积神经网络已经普遍应用于计算机视觉模型中,用于处理和提取多维数据中的关系。因此,自然语言处理模型在过去已经与计算机视觉模型相结合,以受益于例如位置信息和改进这些关键信息提取模型的性能。现有的模型要么是在未发布的数据集上训练,要么是在收据的注释集合上训练,而不是集中在类似PDF的文档上。因此,在本研究计画中,我们建立了一个模版式文件产生器来比较资讯撷取的最新模式。重建了现有的信息提取模型“Chargrid”(Katti等人,2019),并评估了边界盒回归解码器以及NLP预处理步骤对文档信息提取的影响。结果表明,基于NLP的预处理有利于提高模型的性能。但是,使用边界框回归解码器仅对不遵循矩形形状的字段提高模型性能。 摘要:Extracting information from documents usually relies on natural language processing methods working on one-dimensional sequences of text. In some cases, for example, for the extraction of key information from semi-structured documents, such as invoice-documents, spatial and formatting information of text are crucial to understand the contextual meaning. Convolutional neural networks are already common in computer vision models to process and extract relationships in multidimensional data. Therefore, natural language processing models have already been combined with computer vision models in the past, to benefit from e.g. positional information and to improve performance of these key information extraction models. Existing models were either trained on unpublished data sets or on an annotated collection of receipts, which did not focus on PDF-like documents. Hence, in this research project a template-based document generator was created to compare state-of-the-art models for information extraction. An existing information extraction model "Chargrid" (Katti et al., 2019) was reconstructed and the impact of a bounding box regression decoder, as well as the impact of an NLP pre-processing step was evaluated for information extraction from documents. The results have shown that NLP based pre-processing is beneficial for model performance. However, the use of a bounding box regression decoder increases the model performance only for fields that do not follow a rectangular shape.
【2】 A Training-free and Reference-free Summarization Evaluation Metric via Centrality-weighted Relevance and Self-referenced Redundancy 标题:一种基于中心度加权相关性和自引用冗余度的免训练、免参考摘要评价指标
作者:Wang Chen,Piji Li,Irwin King 机构:Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong, Tencent AI Lab 备注:ACL 2021 long paper 链接:https://arxiv.org/abs/2106.13945 摘要:近年来,基于参考摘要和有监督的摘要评价指标得到了广泛研究。然而,收集人工标注的参考摘要和评分既昂贵又耗时。为了避免这些局限,我们提出了一种无需训练、无需参考摘要的评价指标。我们的指标由一个中心度加权的相关性得分和一个自参考的冗余度得分组成。相关性得分在由源文档构建的伪参考与给定摘要之间计算,其中伪参考内容按句子中心度加权,以提供重要性指导。除了基于$F_1$的相关性得分外,我们还设计了一个更关注召回率的基于$F_\beta$的变体。至于摘要的冗余度,我们通过计算摘要自身的自掩蔽相似度来评估摘要中的冗余信息。最后,我们结合相关性和冗余度得分,得到给定摘要的最终评价得分。大量实验表明,该方法在多文档和单文档摘要评价上均显著优于现有方法。 摘要:In recent years, reference-based and supervised summarization evaluation metrics have been widely explored. However, collecting human-annotated references and ratings are costly and time-consuming. To avoid these limitations, we propose a training-free and reference-free summarization evaluation metric. Our metric consists of a centrality-weighted relevance score and a self-referenced redundancy score. The relevance score is computed between the pseudo reference built from the source document and the given summary, where the pseudo reference content is weighted by the sentence centrality to provide importance guidance. Besides an $F_1$-based relevance score, we also design an $F_\beta$-based variant that pays more attention to the recall score. As for the redundancy score of the summary, we compute a self-masked similarity score with the summary itself to evaluate the redundant information in the summary. Finally, we combine the relevance and redundancy scores to produce the final evaluation score of the given summary. Extensive experiments show that our methods can significantly outperform existing methods on both multi-document and single-document summarization evaluation.
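下面用一个极简的示意脚本说明"中心度加权相关性 + 自掩蔽冗余度"的思路:这里用TF-IDF向量代替论文中的上下文表示,得分的组合方式也是演示用的假设,并非论文的$F_1$/$F_\beta$公式。

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def evaluate_summary(doc_sents, summary_sents):
    """相关性 = 摘要与伪参考(源文句子,按中心度加权)的相似度;
    冗余度 = 摘要句子之间的自相似度(排除对角线,对应 self-masked)。"""
    vec = TfidfVectorizer().fit(doc_sents + summary_sents)
    D, S = vec.transform(doc_sents), vec.transform(summary_sents)

    centrality = cosine_similarity(D).mean(axis=1)            # 句子中心度
    centrality = centrality / centrality.sum()
    relevance = float((centrality[:, None] * cosine_similarity(D, S)).sum(axis=0).mean())

    sim = cosine_similarity(S)
    n = len(summary_sents)
    redundancy = float((sim.sum() - np.trace(sim)) / max(n * (n - 1), 1))

    return relevance - redundancy    # 最终得分的组合方式为演示假设

doc = ["the cat sat on the mat", "a dog barked at the cat", "the mat was red"]
summ = ["a cat sat on a red mat", "the cat sat on the mat again"]
print(evaluate_summary(doc, summ))
```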
【3】 XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages 标题:XL-SUM:44种语言的大规模多语言摘要
作者:Tahmid Hasan,Abhik Bhattacharjee,Md Saiful Islam,Kazi Samin,Yuan-Fang Li,Yong-Bin Kang,M. Sohel Rahman,Rifat Shahriyar 机构:Bangladesh University of Engineering and Technology (BUET), University of Rochester, Monash University, Swinburne University of Technology 备注:Findings of the Association for Computational Linguistics, ACL 2021 (camera-ready) 链接:https://arxiv.org/abs/2106.13822 摘要:现有的生成式文本摘要研究主要集中在英语等高资源语言上,这主要是由于中低资源语言的数据集有限。在这项工作中,我们提出了XL-Sum,一个全面且多样化的数据集,包含从BBC抽取、经过专业标注的100万对文章-摘要,抽取过程使用了一套精心设计的启发式规则。数据集涵盖从低资源到高资源的44种语言,其中许多语言目前没有公开数据集。人工评价和内在评价表明,XL-Sum具有高度的抽象性、简洁性和高质量。我们使用XL-Sum对最先进的预训练多语言模型mT5进行了微调,并在多语言和低资源摘要任务上进行了实验。与使用类似的单语数据集得到的结果相比,XL-Sum带来了有竞争力的结果:通过多语言训练,我们在10种基准语言上取得了高于11的ROUGE-2分数,其中一些超过了15。此外,在低资源语言上单独训练也能取得有竞争力的表现。据我们所知,就单一来源收集的样本数量和涵盖的语言数量而言,XL-Sum是最大的生成式摘要数据集。我们将发布数据集和模型,以鼓励未来对多语言生成式摘要的研究。相关资源见 https://github.com/csebuetnlp/xl-sum 。 摘要:Contemporary works on abstractive text summarization have focused primarily on high-resource languages like English, mostly due to the limited availability of datasets for low/mid-resource ones. In this work, we present XL-Sum, a comprehensive and diverse dataset comprising 1 million professionally annotated article-summary pairs from BBC, extracted using a set of carefully designed heuristics. The dataset covers 44 languages ranging from low to high-resource, for many of which no public dataset is currently available. XL-Sum is highly abstractive, concise, and of high quality, as indicated by human and intrinsic evaluation. We fine-tune mT5, a state-of-the-art pretrained multilingual model, with XL-Sum and experiment on multilingual and low-resource summarization tasks. XL-Sum induces competitive results compared to the ones obtained using similar monolingual datasets: we show higher than 11 ROUGE-2 scores on 10 languages we benchmark on, with some of them exceeding 15, as obtained by multilingual training. Additionally, training on low-resource languages individually also provides competitive performance. To the best of our knowledge, XL-Sum is the largest abstractive summarization dataset in terms of the number of samples collected from a single source and the number of languages covered. We are releasing our dataset and models to encourage future research on multilingual abstractive summarization. The resources can be found at https://github.com/csebuetnlp/xl-sum.
推理|分析|理解|解释(3篇)
【1】 An Adversarial Learning based Multi-Step Spoken Language Understanding System through Human-Computer Interaction 标题:一种基于对抗性学习的人机交互多步骤口语理解系统
作者:Yu Wang,Yilin Shen,Hongxia Jin 机构:AI Center, Samsung Research America, Mountain View, CA, USA 备注:5 Pages, original work published at ICASSP 2021 链接:https://arxiv.org/abs/2106.14611 摘要:现有的大多数口语理解系统只能基于单轮用户查询进行语义框架解析。他们不能通过与用户的多轮交互来获取用户的反馈来更新/添加/删除槽值。本文介绍了一种新的基于对抗式学习的多步口语理解系统,该系统可以利用多轮用户反馈信息来更新时隙值。我们在基准ATIS数据集上进行了两个实验,结果表明,新系统只需一轮反馈,就F1而言,至少可以提高2.5\%$的解析性能。当反馈轮的数量增加时,改进会变得更大。此外,我们还将新系统与最新的对话状态跟踪系统进行了比较,结果表明,新的交互系统在多轮口语理解任务中,在时隙和句子水平上的准确率都有更好的表现。 摘要:Most of the existing spoken language understanding systems can perform only semantic frame parsing based on a single-round user query. They cannot take users' feedback to update/add/remove slot values through multiround interactions with users. In this paper, we introduce a novel multi-step spoken language understanding system based on adversarial learning that can leverage the multiround user's feedback to update slot values. We perform two experiments on the benchmark ATIS dataset and demonstrate that the new system can improve parsing performance by at least $2.5\%$ in terms of F1, with only one round of feedback. The improvement becomes even larger when the number of feedback rounds increases. Furthermore, we also compare the new system with state-of-the-art dialogue state tracking systems and demonstrate that the new interactive system can perform better on multiround spoken language understanding tasks in terms of slot- and sentence-level accuracy.
【2】 A Case Study of LLVM-Based Analysis for Optimizing SIMD Code Generation 标题:基于LLVM的SIMD代码生成优化实例研究
作者:Joseph Huber,Weile Wei,Giorgis Georgakoudis,Johannes Doerfert,Oscar Hernandez 机构:Oak Ridge National Laboratory, Louisiana State University, Lawrence Livermore National Laboratory, Argonne National Laboratory 链接:https://arxiv.org/abs/2106.14332 摘要:本文介绍了一种使用基于LLVM的工具来调优DCA++(动态团簇近似)应用程序的方法,目标平台是新的ARM A64FX处理器。其目的是描述适配新体系结构所需的修改,并生成面向新的可伸缩向量扩展(SVE)指令集的高效单指令多数据(SIMD)指令。在手动调优过程中,作者使用LLVM工具,通过OpenMP SIMD改进代码并行化,重构代码并应用能够启用SIMD优化的变换,并确保使用正确的库以获得最佳性能。通过应用这些代码修改,代码速度提高到原来的1.98倍,并在A64FX处理器上达到了78 GFlops。作者的目标是在OpenMP Advisor工具(构建在现有及新引入的LLVM工具之上)中将这些工作部分自动化。 摘要:This paper presents a methodology for using LLVM-based tools to tune the DCA++ (dynamical cluster approximation) application that targets the new ARM A64FX processor. The goal is to describe the changes required for the new architecture and generate efficient single instruction/multiple data (SIMD) instructions that target the new Scalable Vector Extension instruction set. During manual tuning, the authors used the LLVM tools to improve code parallelization by using OpenMP SIMD, refactored the code and applied transformations that enabled SIMD optimizations, and ensured that the correct libraries were used to achieve optimal performance. By applying these code changes, code speed was increased by 1.98X and 78 GFlops were achieved on the A64FX processor. The authors aim to automate parts of the efforts in the OpenMP Advisor tool, which is built on top of existing and newly introduced LLVM tooling.
【3】 Rationale-Inspired Natural Language Explanations with Commonsense 标题:理性启发的常识自然语言解释
作者:Bodhisattwa Prasad Majumder,Oana-Maria Camburu,Thomas Lukasiewicz,Julian McAuley 机构:Department of Computer Science and Engineering, UC San Diego, USA, Department of Computer Science, University of Oxford, UK, Alan Turing Institute, London, UK 链接:https://arxiv.org/abs/2106.13876 摘要:可解释的机器学习模型主要使用提取原理(即输入特征的子集)或自由文本自然语言解释(NLEs)作为抽象证明来证明预测标签的正确性。虽然NLE比提取理论更全面,但机器生成的NLE有时缺乏常识性知识。在这里,我们表明,常识知识可以作为一个桥梁之间的提取原理和自然语言,使这两种类型的解释更好。更准确地说,我们引入了一个统一的框架,称为RExC(理性启发的常识解释),它(1)将原理提取为一组负责机器预测的特征,(2)使用可用的常识资源扩展提取原理,利用扩展知识生成自然语言解释。我们的框架在自然语言处理和视觉语言理解的五个任务中生成NLE,大大超过了以前的最新水平,人类注释者一致认为RExC生成的解释更全面,基于常识,与以前的先进车型相比,总体上更受欢迎。此外,我们的工作表明,常识性的基础解释可以提高任务绩效和基本原理提取能力。 摘要:Explainable machine learning models primarily justify predicted labels using either extractive rationales (i.e., subsets of input features) or free-text natural language explanations (NLEs) as abstractive justifications. While NLEs can be more comprehensive than extractive rationales, machine-generated NLEs have been shown to sometimes lack commonsense knowledge. Here, we show that commonsense knowledge can act as a bridge between extractive rationales and NLEs, rendering both types of explanations better. More precisely, we introduce a unified framework, called RExC (Rationale-Inspired Explanations with Commonsense), that (1) extracts rationales as a set of features responsible for machine predictions, (2) expands the extractive rationales using available commonsense resources, and (3) uses the expanded knowledge to generate natural language explanations. Our framework surpasses by a large margin the previous state-of-the-art in generating NLEs across five tasks in both natural language processing and vision-language understanding, with human annotators consistently rating the explanations generated by RExC to be more comprehensive, grounded in commonsense, and overall preferred compared to previous state-of-the-art models. Moreover, our work shows that commonsense-grounded explanations can enhance both task performance and rationales extraction capabilities.
GAN|对抗|攻击|生成相关(2篇)
【1】 Keyphrase Generation for Scientific Document Retrieval 标题:面向科技文献检索的关键词生成
作者:Florian Boudin,Ygor Gallina,Akiko Aizawa 机构:LS,N, Universit´e de Nantes, France, National Institute of Informatics, Tokyo, Japan 备注:Accepted at ACL 2020 链接:https://arxiv.org/abs/2106.14726 摘要:序列到序列模型在关键词生成方面取得了显著的进展,但它们是否可靠,是否有利于文档检索,目前尚不清楚。这项研究提供了经验证据,证明这样的模型可以显著提高检索性能,并引入了一个新的外部评价框架,允许更好地理解关键短语生成模型的局限性。使用这个框架,我们指出并讨论了用关键字短语(不在文本中)补充文档以及跨域概括模型时遇到的困难。我们的代码在https://github.com/boudinfl/ir-using-kg 摘要:Sequence-to-sequence models have lead to significant progress in keyphrase generation, but it remains unknown whether they are reliable enough to be beneficial for document retrieval. This study provides empirical evidence that such models can significantly improve retrieval performance, and introduces a new extrinsic evaluation framework that allows for a better understanding of the limitations of keyphrase generation models. Using this framework, we point out and discuss the difficulties encountered with supplementing documents with -- not present in text -- keyphrases, and generalizing models across domains. Our code is available at https://github.com/boudinfl/ir-using-kg
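该工作的核心思想是:把生成模型预测出的关键短语(包括未出现在原文中的短语)追加到文档后再建索引,作为一种文档扩展来提升检索效果。下面是一个与论文代码无关的假设性小例子,用TF-IDF检索演示这一做法(文档、关键短语与 repeat 权重参数均为虚构):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def expand(doc, keyphrases, repeat=2):
    """把(生成的)关键短语追加到文档末尾;repeat 粗略控制扩展词的权重。"""
    return doc + (" " + " ".join(keyphrases)) * repeat

docs = ["we study neural approaches to sequence transduction",
        "a survey of keyphrase extraction datasets"]
# 假想的生成关键短语,其中 "machine translation" 并未出现在第一篇文档原文中
generated = [["machine translation", "seq2seq models"], ["keyphrase generation"]]

index_docs = [expand(d, k) for d, k in zip(docs, generated)]
vec = TfidfVectorizer().fit(index_docs)
query = "machine translation models"
scores = cosine_similarity(vec.transform([query]), vec.transform(index_docs))[0]
print(scores)   # 扩展后,第一篇文档能被 "machine translation" 查询召回
```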
【2】 Progressive Open-Domain Response Generation with Multiple Controllable Attributes 标题:具有多个可控属性的渐进式开放域响应生成
作者:Haiqin Yang,Xiaoyuan Yao,Yiqun Duan,Jianping Shen,Jie Zhong,Kun Zhang 机构:Ping An Life Insurance Company of China, Carnegie Mellon University 备注:7 pages, 2 figures, 3 tables, in IJCAI'21 链接:https://arxiv.org/abs/2106.14614 摘要:在开放域对话系统中,为了增强生成响应的多样性,需要包含更多的可控属性。然而,现有的方法只能生成一个可控属性的响应,或者缺乏一种灵活的方法来生成多个可控属性的响应。在本文中,我们提出了一个逐步训练的分层编码器-解码器(PHED)来解决这个问题。更具体地说,PHED在Transformer上部署了条件变分自动编码器(CVAE),以便在一个阶段包含属性的一个方面。CVAE的一个重要特点是将每个阶段的潜在变量分为两类:一类是捕获共同语义特征的全局变量,另一类是吸收该阶段属性信息的特定变量。然后,PHED将CVAE潜变量与Transformer编码器耦合,并通过最小化新导出的ELBO和受控损耗来训练,以产生下一阶段的输入并根据需要产生响应。最后,我们进行了广泛的评估,以表明PHED显著优于最先进的神经生成模型,并产生更多样化的反应。 摘要:It is desirable to include more controllable attributes to enhance the diversity of generated responses in open-domain dialogue systems. However, existing methods can generate responses with only one controllable attribute or lack a flexible way to generate them with multiple controllable attributes. In this paper, we propose a Progressively trained Hierarchical Encoder-Decoder (PHED) to tackle this task. More specifically, PHED deploys Conditional Variational AutoEncoder (CVAE) on Transformer to include one aspect of attributes at one stage. A vital characteristic of the CVAE is to separate the latent variables at each stage into two types: a global variable capturing the common semantic features and a specific variable absorbing the attribute information at that stage. PHED then couples the CVAE latent variables with the Transformer encoder and is trained by minimizing a newly derived ELBO and controlled losses to produce the next stage's input and produce responses as required. Finally, we conduct extensive evaluations to show that PHED significantly outperforms the state-of-the-art neural generation models and produces more diverse responses as expected.
检测相关(3篇)
【1】 Neural Models for Offensive Language Detection 标题:用于攻击性语言检测的神经模型
作者:Ehab Hamdy 机构:Prüfer: Prof. Dr. Jelena Mitrović, Prof. Dr. Michael Granitzer 链接:https://arxiv.org/abs/2106.14609 摘要:攻击性语言检测是一个不断发展的自然语言处理应用。这种增长主要源于社交网络的广泛使用,社交网络已成为人们交流、工作和消费娱乐内容的主流渠道。许多分享攻击性和冒犯性内容的事件在很大程度上对社会产生了负面影响。我们相信,为对抗此类有害内容而改进并比较不同的机器学习模型,是本论文一个重要且富有挑战性的目标。我们针对攻击性语言检测问题,构建高效的自动化检测模型。NLP模型近年来取得了长足进展,特别是Transformer模型解决了标准序列到序列技术的许多缺点。BERT模型已经在许多NLP任务上取得了最先进的结果,尽管文献仍在探究BERT在自然语言处理领域取得成功的原因。在标准BERT基础上还发展出了其他高效的变体,例如RoBERTa和ALBERT。此外,由于社交媒体文本的多语言性质可能影响模型对给定推文的判断,因此有必要考察多语言模型,例如在100种语言上训练的XLM-RoBERTa,以及它与单语模型的比较。基于RoBERTa的模型被证明是能力最强的模型,在各项任务上取得了最高的F1分数。一个完善的攻击性语言检测系统的另一个关键方面是模型训练和推理的速度。在这方面,我们考虑了模型运行时间,并对FastText的高效实现BlazingText进行了微调,它取得了不错的效果,而且比基于BERT的模型快得多。 摘要:Offensive language detection is an ever-growing natural language processing (NLP) application. This growth is mainly because of the widespread usage of social networks, which becomes a mainstream channel for people to communicate, work, and enjoy entertainment content. Many incidents of sharing aggressive and offensive content negatively impacted society to a great extent. We believe contributing to improving and comparing different machine learning models to fight such harmful contents is an important and challenging goal for this thesis. We targeted the problem of offensive language detection for building efficient automated models for offensive language detection. With the recent advancements of NLP models, specifically the Transformer model, many shortcomings of the standard seq-to-seq techniques have been tackled. The BERT model has shown state-of-the-art results on many NLP tasks, although the literature is still exploring the reasons for the BERT achievements in the NLP field. Other efficient variants have been developed to improve upon the standard BERT, such as RoBERTa and ALBERT. Moreover, due to the multilingual nature of text on social media that could affect the model decision on a given tweet, it is necessary to examine multilingual models such as XLM-RoBERTa trained on 100 languages and how they compare to unilingual models. The RoBERTa-based model proved to be the most capable model and achieved the highest F1 score for the tasks. Another critical aspect of a well-rounded offensive language detection system is the speed at which a model can be trained and make inferences. In that respect, we have considered the model run-time and fine-tuned the very efficient implementation of FastText called BlazingText that achieved good results, which is much faster than BERT-based models.
【2】 Enhancing the Generalization for Intent Classification and Out-of-Domain Detection in SLU 标题:增强SLU中意图分类和域外检测的泛化能力
作者:Yilin Shen,Yen-Chang Hsu,Avik Ray,Hongxia Jin 机构:Samsung Research America 链接:https://arxiv.org/abs/2106.14464 摘要:意图分类是口语理解中的一项重要任务。由于大多数模型都是用预先收集的域内(IND)训练话语建立的,因此它们检测不受支持的域外(OOD)话语的能力在实际应用中起着至关重要的作用。最近的研究表明,使用额外的数据和标签可以提高OOD的检测性能,但是收集这些数据的成本可能会很高。该文提出了一种只训练IND数据的模型,同时支持IND-intent分类和OOD检测。我们的方法设计了一个新的域正则化模块(DRM)来减少香草分类器的过度自信现象,在这两种情况下都取得了较好的泛化效果。此外,DRM还可以作为基于神经网络的意图分类器中最后一层的替代品,提供了一种低成本的改进策略。对四个数据集的评估表明,我们基于BERT和RoBERTa模型的方法与现有方法和我们为比较创建的强大基线相比,达到了最先进的性能。 摘要:Intent classification is a major task in spoken language understanding (SLU). Since most models are built with pre-collected in-domain (IND) training utterances, their ability to detect unsupported out-of-domain (OOD) utterances has a critical effect in practical use. Recent works have shown that using extra data and labels can improve the OOD detection performance, yet it could be costly to collect such data. This paper proposes to train a model with only IND data while supporting both IND intent classification and OOD detection. Our method designs a novel domain-regularized module (DRM) to reduce the overconfident phenomenon of a vanilla classifier, achieving a better generalization in both cases. Besides, DRM can be used as a drop-in replacement for the last layer in any neural network-based intent classifier, providing a low-cost strategy for a significant improvement. The evaluation on four datasets shows that our method built on BERT and RoBERTa models achieves state-of-the-art performance against existing approaches and the strong baselines we created for the comparisons.
【3】 Persian Causality Corpus (PerCause) and the Causality Detection Benchmark 标题:波斯因果语料库(PerCause)与因果关系检测基准
作者:Zeinab Rahimi,Mehrnoush ShamsFard 机构:NLP Research Lab., Shahid Beheshti University, Tehran, Iran 备注:20 pages, 6 figures and 10 tables 链接:https://arxiv.org/abs/2106.14165 摘要:识别文本中的因果要素和因果关系是自然语言处理中具有挑战性的问题之一,在波斯语等低资源语言中尤其如此。在这项研究中,我们为波斯语构建了一个人工标注的因果语料库,包括4446个句子和5128个因果关系,并在可能的情况下为每个关系标注了原因、结果和因果标记三类标签。我们使用这个语料库训练了一个检测因果要素边界的系统。此外,基于该语料库,我们给出了三种机器学习方法和两个深度学习系统的因果关系检测基准。性能评估表明,最佳的F值0.76由CRF分类器取得,而最高的准确率91.4%由Bi-LSTM-CRF深度学习方法取得。 摘要:Recognizing causal elements and causal relations in text is one of the challenging issues in natural language processing; specifically, in low-resource languages such as Persian. In this research we prepare a causality human-annotated corpus for the Persian language which consists of 4446 sentences and 5128 causal relations, and three labels of cause, effect and causal mark -- if possible -- are specified for each relation. We have used this corpus to train a system for detecting causal element boundaries. Also, we present a causality detection benchmark for three machine learning methods and two deep learning systems based on this corpus. Performance evaluations indicate that our best overall result is obtained through the CRF classifier, which has an F-measure of 0.76, and the best accuracy is obtained through the Bi-LSTM-CRF deep learning method with accuracy equal to 91.4%.
识别/分类(2篇)
【1】 Classification of Contract-Amendment Relationships 标题:合同变更关系分类
作者:Fuqi Song 机构:Data Science, Hyperlex, Rue de la Grange Batelière, Paris, France 链接:https://arxiv.org/abs/2106.14619 摘要:在合同生命周期管理(CLM)中,管理和跟踪主协议及其相关修订至关重要,以便随时了解不同的到期日和义务。一个自动化的解决方案可以方便日常工作,提高法律从业人员的工作效率。本文提出了一种基于机器学习和自然语言处理的文档修正关系检测方法。该算法以OCR(Optical Character Recognition)和NER(Named Entity Recognition)预处理的两个PDF文档为输入,建立每个文档对的特征,并对它们之间的关系进行分类。我们在一个由1124对英文和法文合同修订文件组成的数据集上进行了不同配置的实验。最好的结果得到了91%的F1分数,与基于启发式的基线相比,该分数超过了23%。 摘要:In Contract Life-cycle Management (CLM), managing and tracking the master agreements and their associated amendments is essential, in order to be kept informed with different due dates and obligations. An automatic solution can facilitate the daily jobs and improve the efficiency of legal practitioners. In this paper, we propose an approach based on machine learning (ML) and Natural Language Processing (NLP) to detect the amendment relationship between two documents. The algorithm takes two PDF documents preprocessed by OCR (Optical Character Recognition) and NER (Named Entity Recognition) as input, and then it builds the features of each document pair and classifies the relationship. We experimented with different configurations on a dataset consisting of 1124 pairs of contract-amendment documents in English and French. The best result obtained a F1-score of 91%, which outperformed 23% compared to a heuristic-based baseline.
【2】 A Span-Based Model for Joint Overlapped and Discontinuous Named Entity Recognition 标题:一种基于跨度的重叠与不连续命名实体联合识别模型
作者:Fei Li,Zhichao Lin,Meishan Zhang,Donghong Ji 机构:. Department of Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, China, . School of New Media and Communication, Tianjin University, China 备注:Accepted in the main conference of ACL 2021 链接:https://arxiv.org/abs/2106.14373 摘要:重叠和不连续命名实体识别的研究越来越受到人们的关注。以往的工作主要集中在重叠或不连续的实体上。在本文中,我们提出了一个新的基于跨度的模型,可以同时识别重叠实体和不连续实体。该模型包括两个主要步骤。首先,通过遍历所有可能的文本跨度来识别实体片段,从而可以识别重叠的实体。其次,我们进行关系分类来判断给定的一对实体片段是重叠的还是连续的。这样不仅可以识别不连续实体,而且可以对重叠实体进行双重检测。总体而言,我们的模型本质上可以看作是一种关系抽取范式。在多个基准数据集(CLEF、GENIA和ACE05)上的实验结果表明,我们的模型对于重叠和不连续的NER具有很强的竞争力。 摘要:Research on overlapped and discontinuous named entity recognition (NER) has received increasing attention. The majority of previous work focuses on either overlapped or discontinuous entities. In this paper, we propose a novel span-based model that can recognize both overlapped and discontinuous entities jointly. The model includes two major steps. First, entity fragments are recognized by traversing over all possible text spans, thus, overlapped entities can be recognized. Second, we perform relation classification to judge whether a given pair of entity fragments to be overlapping or succession. In this way, we can recognize not only discontinuous entities, and meanwhile doubly check the overlapped entities. As a whole, our model can be regarded as a relation extraction paradigm essentially. Experimental results on multiple benchmark datasets (i.e., CLEF, GENIA and ACE05) show that our model is highly competitive for overlapped and discontinuous NER.
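下面用一个玩具例子示意这两步流程:先枚举所有候选跨度来识别(可重叠的)实体片段,再对片段两两做关系分类,用"相继(succession)"关系把片段组装成不连续实体。其中片段分类器和关系分类器用简单的占位函数代替,均为假设性接口,并非论文的实现。

```python
from itertools import combinations

def enumerate_spans(n_tokens, max_len=6):
    """第一步:遍历所有可能的文本跨度作为候选实体片段(天然支持重叠)。"""
    return [(i, j) for i in range(n_tokens)
                   for j in range(i + 1, min(i + max_len, n_tokens) + 1)]

def decode_entities(tokens, is_fragment, relation):
    """第二步:片段识别后做两两关系分类,'succession' 把两个片段拼成不连续实体。"""
    frags = [s for s in enumerate_spans(len(tokens)) if is_fragment(tokens, s)]
    entities = [[f] for f in frags]                      # 连续(可能重叠)的实体
    for a, b in combinations(frags, 2):
        if relation(tokens, a, b) == "succession":       # 不连续实体:两片段相继
            entities.append(sorted([a, b]))
    return entities

tokens = "pain in left and right shoulder".split()
# 用玩具规则代替真实分类器,仅作演示
frag_set = {(0, 1), (2, 3), (4, 6), (5, 6)}              # "pain"、"left"、"right shoulder"、"shoulder"
ents = decode_entities(tokens,
                       lambda t, s: s in frag_set,
                       lambda t, a, b: "succession" if (a, b) == ((2, 3), (5, 6)) else "none")
print(ents)   # 既包含重叠实体 (4,6)/(5,6),也包含不连续实体 [(2,3),(5,6)] ≈ "left ... shoulder"
```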
Zero/Few/One-Shot|迁移|自适应(1篇)
【1】 Multimodal Few-Shot Learning with Frozen Language Models 标题:基于冻结语言模型的多模态少发式学习
作者:Maria Tsimpoukelli,Jacob Menick,Serkan Cabi,S. M. Ali Eslami,Oriol Vinyals,Felix Hill 机构:DeepMind, University College London 链接:https://arxiv.org/abs/2106.13884 摘要:当以足够大的规模训练时,自回归语言模型只需给出少量示例作为提示,就能表现出学习新语言任务的显著能力。在这里,我们提出了一种简单而有效的方法,将这种少样本学习能力迁移到多模态设置(视觉与语言)中。我们使用对齐的图像和字幕数据训练一个视觉编码器,将每幅图像表示为一段连续嵌入序列,使得一个预训练且参数冻结的语言模型在以该前缀为提示时能生成相应的字幕。由此得到的系统是一个多模态少样本学习器:当以多个图文交织的嵌入序列作为示例条件时,它能够出人意料地学会各种新任务。我们在一系列既有基准和新基准上评测同一个模型,证明它可以快速学习新对象和新视觉类别对应的词汇,仅凭少量示例就能进行视觉问答,并能利用外部知识。 摘要:When trained at sufficient scale, auto-regressive language models exhibit the notable ability to learn a new language task after being prompted with just a few examples. Here, we present a simple, yet effective, approach for transferring this few-shot learning ability to a multimodal setting (vision and language). Using aligned image and caption data, we train a vision encoder to represent each image as a sequence of continuous embeddings, such that a pre-trained, frozen language model prompted with this prefix generates the appropriate caption. The resulting system is a multimodal few-shot learner, with the surprising ability to learn a variety of new tasks when conditioned on examples, represented as a sequence of multiple interleaved image and text embeddings. We demonstrate that it can rapidly learn words for new objects and novel visual categories, do visual question-answering with only a handful of examples, and make use of outside knowledge, by measuring a single model on a variety of established and new benchmarks.
检索(1篇)
【1】 A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for Information Retrieval Techniques 标题:浅谈深度影响、COIL和信息检索技术的概念框架
作者:Jimmy Lin,Xueguang Ma 机构:David R. Cheriton School of Computer Science, University of Waterloo 链接:https://arxiv.org/abs/2106.14807 摘要:信息检索中表示学习的最新进展可以组织在一个概念框架中,该框架建立了两组对比:稀疏表示与稠密表示,以及无监督表示与学习得到的表示。稀疏学习表示可以进一步分解为扩展和词项加权两个组成部分。这个框架使我们能够理解DPR、ANCE、DeepCT、DeepImpact和COIL等近期技术之间的关系;此外,我们的分析所揭示的空白表明,在尚未探索的技术中还存在"唾手可得的果实"。我们提出了一种称为"uniCOIL"的新技术,它是COIL的一个简单扩展,据我们所知,在流行的MS MARCO段落排序数据集上取得了当前最优的稀疏检索效果。我们基于Anserini IR工具包的实现构建在Lucene搜索库之上,因此与标准倒排索引完全兼容。 摘要:Recent developments in representational learning for information retrieval can be organized in a conceptual framework that establishes two pairs of contrasts: sparse vs. dense representations and unsupervised vs. learned representations. Sparse learned representations can further be decomposed into expansion and term weighting components. This framework allows us to understand the relationship between recently proposed techniques such as DPR, ANCE, DeepCT, DeepImpact, and COIL, and furthermore, gaps revealed by our analysis point to "low hanging fruit" in terms of techniques that have yet to be explored. We present a novel technique dubbed "uniCOIL", a simple extension of COIL that achieves to our knowledge the current state-of-the-art in sparse retrieval on the popular MS MARCO passage ranking dataset. Our implementation using the Anserini IR toolkit is built on the Lucene search library and thus fully compatible with standard inverted indexes.
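这类"稀疏学习表示"的做法是:模型为文档(经扩展后)的每个词项打一个权重,把权重写入倒排索引,查询时按命中词项的权重求和打分。下面是一个与uniCOIL实现无关的玩具示例,用假设的 term_weight 函数代替学习得到的打分器:

```python
from collections import defaultdict

def build_impact_index(docs, term_weight):
    """把每个文档的词项权重写入倒排索引(即"扩展 + 词项加权"框架中的加权部分)。"""
    index = defaultdict(list)                     # term -> [(doc_id, impact)]
    for doc_id, doc in enumerate(docs):
        for term in set(doc.split()):
            w = term_weight(term, doc)
            if w > 0:
                index[term].append((doc_id, w))
    return index

def search(index, query, k=3):
    scores = defaultdict(float)
    for term in query.split():                    # 查询侧也可用学习到的权重,这里简化为 1
        for doc_id, impact in index.get(term, []):
            scores[doc_id] += impact
    return sorted(scores.items(), key=lambda x: -x[1])[:k]

docs = ["deep passage retrieval with learned sparse terms",
        "dense retrieval uses bi-encoders",
        "classic bm25 ranks passages by term frequency"]
# 玩具权重:用词长代替模型打分,仅作演示
index = build_impact_index(docs, lambda t, d: len(t) / 10.0)
print(search(index, "sparse passage retrieval"))
```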
表征(1篇)
【1】 Word2Box: Learning Word Representation Using Box Embeddings 标题:Word2Box:使用Box嵌入学习单词表示
作者:Shib Sankar Dasgupta,Michael Boratko,Shriya Atmakuri,Xiang Lorraine Li,Dhruvesh Patel,Andrew McCallum 机构:College of Information and Computer Sciences, University of Massachusetts, Amherst 备注:Work in progress 链接:https://arxiv.org/abs/2106.14361 摘要:词汇的学习向量表示是自然语言处理中最基本的主题之一,它能够捕捉在各种自然语言处理任务中有用的句法和语义关系。然而,向量表示可能会受到限制,因为点积相似性等典型的评分将向量在空间中的位置和大小交织在一起。表征学习领域的令人兴奋的创新已经提出了替代的基本表征,如分布、双曲向量或区域。我们的模型Word2Box采用了一种基于区域的方法来解决单词表示问题,将单词表示为$n$维的矩形。这些表示独立地编码位置和宽度,并提供额外的几何操作,如交集和包含,这使得它们能够对向量难以处理的共现模式进行建模。我们展示了在各种单词相似性任务上的改进性能,特别是在不太常见的单词上,并对Word2Box提供的额外独特表达能力进行了定性分析。 摘要:Learning vector representations for words is one of the most fundamental topics in NLP, capable of capturing syntactic and semantic relationships useful in a variety of downstream NLP tasks. Vector representations can be limiting, however, in that typical scoring such as dot product similarity intertwines position and magnitude of the vector in space. Exciting innovations in the space of representation learning have proposed alternative fundamental representations, such as distributions, hyperbolic vectors, or regions. Our model, Word2Box, takes a region-based approach to the problem of word representation, representing words as $n$-dimensional rectangles. These representations encode position and breadth independently and provide additional geometric operations such as intersection and containment which allow them to model co-occurrence patterns vectors struggle with. We demonstrate improved performance on various word similarity tasks, particularly on less common words, and perform a qualitative analysis exploring the additional unique expressivity provided by Word2Box.
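作为对"把词表示为 n 维矩形"的直观说明,下面给出一个极简的盒子(box)嵌入示意:盒子用最小/最大角点参数化,支持体积、交集和包含等几何运算,交集体积与自身体积之比可以刻画非对称的共现模式。示例中的具体数值纯属虚构,且为硬盒实现(实际训练中通常使用可微的软盒),并非论文代码。

```python
import numpy as np

class Box:
    """将词表示为 n 维轴对齐矩形,以 min/max 两个角点参数化(示意性实现)。"""
    def __init__(self, low, high):
        self.low, self.high = np.asarray(low, float), np.asarray(high, float)

    def volume(self):
        return float(np.prod(np.clip(self.high - self.low, 0.0, None)))

    def intersect(self, other):
        return Box(np.maximum(self.low, other.low), np.minimum(self.high, other.high))

    def contains(self, other):
        return bool(np.all(self.low <= other.low) and np.all(other.high <= self.high))

def overlap_score(a, b):
    """交集体积 / a 的体积:一个非对称的相似度,可近似条件共现概率。"""
    return a.intersect(b).volume() / max(a.volume(), 1e-12)

bank  = Box([0.0, 0.0], [2.0, 1.0])
river = Box([1.0, 0.2], [3.0, 0.8])
money = Box([1.5, 0.5], [1.8, 0.7])
print(overlap_score(bank, river), overlap_score(river, bank), bank.contains(money))
```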
Word2Vec|文本|单词(2篇)
【1】 Traditional Machine Learning and Deep Learning Models for Argumentation Mining in Russian Texts 标题:俄语文本议论挖掘的传统机器学习和深度学习模型
作者:Irina Fishcheva,Valeriya Goloviznina,Evgeny Kotelnikov 机构: Vyatka State University, Russia; ITMO University 备注:13 pages, 6 tables, 4 figures. Accepted to Dialogue-2021 conference 链接:https://arxiv.org/abs/2106.14438 摘要:议论文挖掘是计算语言学的一个研究领域,它致力于从文本中提取论据,对论据和论据之间的关系进行分类,构建议论文结构。俄语这一领域研究的一个重要障碍是缺乏带注释的俄语文本语料库。本文在劝说性论文语料库(PersEssays)机器翻译的基础上,对俄语版的议论文微文本语料库(ArgMicro)进行扩展,探讨了提高议论文挖掘质量的可能性。为了使这两个语料库结合使用成为可能,我们在ArgMicro和PersEssays中使用的方案的基础上提出了一种联合参数标注方案。我们利用传统的机器学习技术(SVM、Bagging和XGBoost)和深层神经网络(BERT模型)将议论文语篇单元(adu)分为两类:pro和opp。提出了一个XGBoost模型和BERT模型的集成,这两个模型都显示了ADUs分类的最高性能。 摘要:Argumentation mining is a field of computational linguistics that is devoted to extracting from texts and classifying arguments and relations between them, as well as constructing an argumentative structure. A significant obstacle to research in this area for the Russian language is the lack of annotated Russian-language text corpora. This article explores the possibility of improving the quality of argumentation mining using the extension of the Russian-language version of the Argumentative Microtext Corpus (ArgMicro) based on the machine translation of the Persuasive Essays Corpus (PersEssays). To make it possible to use these two corpora combined, we propose a Joint Argument Annotation Scheme based on the schemes used in ArgMicro and PersEssays. We solve the problem of classifying argumentative discourse units (ADUs) into two classes - "pro" ("for") and "opp" ("against") using traditional machine learning techniques (SVM, Bagging and XGBoost) and a deep neural network (BERT model). An ensemble of XGBoost and BERT models was proposed, which showed the highest performance of ADUs classification for both corpora.
【2】 Integrating topic modeling and word embedding to characterize violent deaths 标题:整合主题建模和词语嵌入来刻画暴力死亡
作者:Alina Arseniev-Koehler,Susan D. Cochran,Vickie M. Mays,Kai-Wei Chang,Jacob Gates Foster 机构:Department of Sociology, University of California-Los Angeles, Los Angeles, CA ,; bDepartment of Epidemiology, UCLA Fielding School of Public Health and 链接:https://arxiv.org/abs/2106.14365 摘要:人们越来越需要在许多领域的文本数据中识别潜在模式的方法。本文提出了一种新的方法来识别语料库中的主题,并将文档表示为主题序列。话语原子主题建模借鉴了机器学习理论的进步,将主题建模和单词嵌入结合起来,充分利用了两者的不同功能。我们首先识别一组向量(“话语原子”),它们提供了嵌入空间的稀疏表示。原子向量可以解释为潜在的主题:通过生成模型,原子映射到词的分布上;人们还可以推断出产生一系列单词的主题。我们用一个未充分利用的文本的突出例子来说明我们的方法:美国国家暴力死亡报告系统(NVDRS)。NVDRS用结构化变量和非结构化叙述总结了暴力死亡事件。我们在叙述中确定了225个潜在的主题(例如,死亡准备和身体攻击);这些主题中的许多并没有被现有的结构化变量所捕获。基于已知的性别自杀和杀人模式,以及最近关于语义空间中性别偏见的研究,我们确定了主题的性别偏见(例如,关于止痛药的主题是女性的)。然后,我们比较了性别偏见的话题,他们的流行率在叙述中的女性与男性受害者。研究结果提供了有关致命暴力及其性别性质报告的详细定量图片。我们的方法为文本数据中的主题建模提供了一种灵活且广泛适用的方法。 摘要:There is an escalating need for methods to identify latent patterns in text data from many domains. We introduce a new method to identify topics in a corpus and represent documents as topic sequences. Discourse Atom Topic Modeling draws on advances in theoretical machine learning to integrate topic modeling and word embedding, capitalizing on the distinct capabilities of each. We first identify a set of vectors ("discourse atoms") that provide a sparse representation of an embedding space. Atom vectors can be interpreted as latent topics: Through a generative model, atoms map onto distributions over words; one can also infer the topic that generated a sequence of words. We illustrate our method with a prominent example of underutilized text: the U.S. National Violent Death Reporting System (NVDRS). The NVDRS summarizes violent death incidents with structured variables and unstructured narratives. We identify 225 latent topics in the narratives (e.g., preparation for death and physical aggression); many of these topics are not captured by existing structured variables. Motivated by known patterns in suicide and homicide by gender, and recent research on gender biases in semantic space, we identify the gender bias of our topics (e.g., a topic about pain medication is feminine). We then compare the gender bias of topics to their prevalence in narratives of female versus male victims. Results provide a detailed quantitative picture of reporting about lethal violence and its gendered nature. Our method offers a flexible and broadly applicable approach to model topics in text data.
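"话语原子"的核心操作是对词嵌入空间做稀疏字典学习:每个原子是字典中的一个向量,可解释为一个潜在主题,并可映射为与其最相近的一组词。下面用随机向量代替真实词嵌入,给出一个基于 scikit-learn 的假设性小示例(原子数、稀疏系数等超参数均为演示取值,并非论文设置):

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

# 用随机向量代替真实词向量(实际应使用在叙事语料上训练的词嵌入)
rng = np.random.default_rng(0)
vocab = [f"word_{i}" for i in range(300)]
emb = rng.normal(size=(300, 50))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

# 学习一组"话语原子":嵌入空间的稀疏字典,每个原子可解释为一个潜在主题
dl = DictionaryLearning(n_components=10, transform_algorithm="lasso_lars",
                        transform_alpha=0.1, random_state=0, max_iter=50)
codes = dl.fit_transform(emb)          # 每个词在各原子上的稀疏系数
atoms = dl.components_                 # [10, 50],每行是一个话语原子向量

# 将原子映射到词分布:取与原子余弦相似度最高的词作为该主题的代表词
for k, atom in enumerate(atoms):
    sims = emb @ atom / (np.linalg.norm(atom) + 1e-12)
    top = [vocab[i] for i in np.argsort(-sims)[:5]]
    print(f"atom {k}: {top}")
```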
其他神经网络|深度学习|模型|建模(6篇)
【1】 A Theory of Language Learning 标题:一种语言学习理论
作者:Robert Worden 机构:Wellcome Centre for Human Neuroimaging, Institute of Neurology, University College, London, London, United Kingdom 链接:https://arxiv.org/abs/2106.14612 摘要:描述了一种语言学习理论,它利用贝叶斯归纳法对特征结构(脚本)和脚本函数进行归纳。语言中的每一个词义都由一个m-script来表示,m-script是一个脚本函数,它体现了单词的所有语法和语义。M-scripts形成了一个完全词汇化的统一语法,可以支持成人语言。每一个单词m-script都可以从大约六个学习例子中学习到。这个理论已经被实现为一个计算机模型,它可以从零词汇中引导学习一门语言。贝叶斯学习机制(1)能够学习任意复杂的语义和句法结构(2) 快速:从每个例子中学习这些结构(3) 鲁棒性:在存在大量无关噪声的情况下进行学习;(4)自我修复:能够获得内隐的负面证据,利用它来学习异常。学习语言的孩子显然都是(1)-(4),而联结主义理论在(1)和(2)上失败,符号理论在(3)和(4)上失败。这一理论与语言习得的许多关键事实,包括对其他理论有疑问的事实,是一致的。它与100多个关于词汇习得、短语结构、词形、补语和控制、助词、动词论元结构、间隙和移动的关键跨语言研究结果进行了比较,几乎所有的案例都给出了非强迫一致性,而没有额外的假设。 摘要:A theory of language learning is described, which uses Bayesian induction of feature structures (scripts) and script functions. Each word sense in a language is mentally represented by an m-script, a script function which embodies all the syntax and semantics of the word. M-scripts form a fully-lexicalised unification grammar, which can support adult language. Each word m-script can be learnt robustly from about six learning examples. The theory has been implemented as a computer model, which can bootstrap-learn a language from zero vocabulary. The Bayesian learning mechanism is (1) Capable: to learn arbitrarily complex meanings and syntactic structures; (2) Fast: learning these structures from a few examples each; (3) Robust: learning in the presence of much irrelevant noise, and (4) Self-repairing: able to acquire implicit negative evidence, using it to learn exceptions. Children learning language are clearly all of (1) - (4), whereas connectionist theories fail on (1) and (2), and symbolic theories fail on (3) and (4). The theory is in good agreement with many key facts of language acquisition, including facts which are problematic for other theories. It is compared with over 100 key cross-linguistic findings about acquisition of the lexicon, phrase structure, morphology, complementation and control, auxiliaries, verb argument structures, gaps and movement - in nearly all cases giving unforced agreement without extra assumptions.
【2】 Sparsely Overlapped Speech Training in the Time Domain: Joint Learning of Target Speech Separation and Personal VAD Benefits 标题:时域稀疏重叠语音训练:目标语音分离和个人VAD收益的联合学习
作者:Qingjian Lin,Lin Yang,Xuyang Wang,Luyuan Xie,Chen Jia,Junjie Wang 机构:AI Lab, Lenovo Research, Beijing, China 备注:Rejected by Interspeech 2021. Plan to commit to ICASSP 2022 链接:https://arxiv.org/abs/2106.14371 摘要:目标语音分离是根据提供的附加说话人身份信息,从混合语音中过滤出特定说话人的语音的过程。最近的工作通过直接在时域处理信号已经取得了相当大的进步。其中大部分采用完全重叠的混合语音进行训练。然而,由于现实生活中的大多数会话都是随机发生的,并且很少重叠,因此我们认为,使用不同重叠率的数据进行训练是有益的。要做到这一点,一个不可避免的问题是,普遍使用的SI-SNR损失没有定义沉默的来源。本文提出了加权信噪比损失,并结合目标语音分离和个人VAD的联合学习。加权的SI-SNR损失施加了一个与目标说话人的持续时间成比例的权重因子,当目标说话人不在时,该权重因子返回零。同时,个人VAD生成面具并将非目标语音设置为静默。实验表明,该方法在完全重叠语音上的SDR比基线提高了1.73db,在干净噪声条件下的稀疏重叠语音上的SDR比基线提高了4.17db和0.9db。此外,在性能略有下降的情况下,我们的模型可以减少推理的时间开销。 摘要:Target speech separation is the process of filtering a certain speaker's voice out of speech mixtures according to the additional speaker identity information provided. Recent works have made considerable improvement by processing signals in the time domain directly. The majority of them take fully overlapped speech mixtures for training. However, since most real-life conversations occur randomly and are sparsely overlapped, we argue that training with different overlap ratio data benefits. To do so, an unavoidable problem is that the popularly used SI-SNR loss has no definition for silent sources. This paper proposes the weighted SI-SNR loss, together with the joint learning of target speech separation and personal VAD. The weighted SI-SNR loss imposes a weight factor that is proportional to the target speaker's duration and returns zero when the target speaker is absent. Meanwhile, the personal VAD generates masks and sets non-target speech to silence. Experiments show that our proposed method outperforms the baseline by 1.73 dB in terms of SDR on fully overlapped speech, as well as by 4.17 dB and 0.9 dB on sparsely overlapped speech of clean and noisy conditions. Besides, with slight degradation in performance, our model could reduce the time costs in inference.
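摘要中的加权SI-SNR损失可以概括为:按目标说话人时长给每条样本的SI-SNR加权,目标说话人缺席时权重为0。下面是一个示意实现,其中的归一化方式与时长的来源均为假设,并非论文的精确定义:

```python
import torch

def si_snr(est, ref, eps=1e-8):
    """标准 SI-SNR(逐样本计算,单位 dB)。"""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    s_target = (est * ref).sum(-1, keepdim=True) / ((ref ** 2).sum(-1, keepdim=True) + eps) * ref
    e_noise = est - s_target
    return 10 * torch.log10(((s_target ** 2).sum(-1) + eps) / ((e_noise ** 2).sum(-1) + eps))

def weighted_si_snr_loss(est, ref, target_dur, total_dur):
    """权重与目标说话人时长成正比;目标说话人缺席(时长为 0)时该样本损失为 0。"""
    w = target_dur / total_dur                      # [batch]
    return -(w * si_snr(est, ref)).sum() / (w.sum() + 1e-8)

est, ref = torch.randn(4, 16000), torch.randn(4, 16000)
dur = torch.tensor([1.0, 0.5, 0.0, 2.0])            # 各条样本中目标说话人的有效时长(秒)
loss = weighted_si_snr_loss(est, ref, dur, total_dur=2.0)
print(loss)
```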
【3】 Effective Cascade Dual-Decoder Model for Joint Entity and Relation Extraction 标题:一种有效的联合实体和关系提取的级联双解码器模型
作者:Lianbo Ma,Huimin Ren,Xiliang Zhang 链接:https://arxiv.org/abs/2106.14163 摘要:从文本中提取关系三元组是知识图构造中的一项基本任务。现有方法普遍采用单一模型联合提取实体和关系的方法,这种方法往往存在三重重叠问题。也就是说,在一个句子中有多个关系三元组共享相同的实体。在这项工作中,我们提出了一个有效的级联双解码器的方法来提取重叠的关系三元组,其中包括一个文本特定的关系解码器和一个关系对应的实体解码器。我们的方法很简单:文本特定关系解码器根据句子的文本语义检测句子中的关系,并将其作为额外的特征来指导实体提取;对于每个具有可训练嵌入的提取关系,关系对应实体解码器使用基于跨度的标记方案来检测对应的头部和尾部实体。这样就很自然地解决了重叠三重问题。在两个公共数据集上的实验表明,在严格的评价指标下,该方法优于现有的方法,并获得了较好的F1成绩。我们的实现可在https://github.com/prastunlp/DualDec. 摘要:Extracting relational triples from texts is a fundamental task in knowledge graph construction. The popular way of existing methods is to jointly extract entities and relations using a single model, which often suffers from the overlapping triple problem. That is, there are multiple relational triples that share the same entities within one sentence. In this work, we propose an effective cascade dual-decoder approach to extract overlapping relational triples, which includes a text-specific relation decoder and a relation-corresponded entity decoder. Our approach is straightforward: the text-specific relation decoder detects relations from a sentence according to its text semantics and treats them as extra features to guide the entity extraction; for each extracted relation, which is with trainable embedding, the relation-corresponded entity decoder detects the corresponding head and tail entities using a span-based tagging scheme. In this way, the overlapping triple problem is tackled naturally. Experiments on two public datasets demonstrate that our proposed approach outperforms state-of-the-art methods and achieves better F1 scores under the strict evaluation metric. Our implementation is available at https://github.com/prastunlp/DualDec.
【4】 Visual Conceptual Blending with Large-scale Language and Vision Models 标题:大规模语言和视觉模型的视觉概念融合
作者:Songwei Ge,Devi Parikh 机构:University of Maryland, Facebook AI Research & Georgia Tech 链接:https://arxiv.org/abs/2106.14127 摘要:我们提出这样一个问题:最近的大规模语言和图像生成模型能在多大程度上融合视觉概念?给定一个任意的对象,我们识别一个相关的对象,并使用一个语言模型生成两个对象混合的一个句子描述。然后,我们使用基于文本的图像生成模型生成混合的视觉描述。定量和定性评价表明,语言模型优于传统的概念整合方法,最近的大规模图像生成模型优于先前的视觉描述模型。 摘要:We ask the question: to what extent can recent large-scale language and image generation models blend visual concepts? Given an arbitrary object, we identify a relevant object and generate a single-sentence description of the blend of the two using a language model. We then generate a visual depiction of the blend using a text-based image generation model. Quantitative and qualitative evaluations demonstrate the superiority of language models over classical methods for conceptual blending, and of recent large-scale image generation models over prior models for the visual depiction.
【5】 PhyCRNet: Physics-informed Convolutional-Recurrent Network for Solving Spatiotemporal PDEs 标题:PhyCRNet:求解时空偏微分方程的物理信息卷积-递归网络
作者:Pu Ren,Chengping Rao,Yang Liu,Jianxun Wang,Hao Sun 机构:Department of Civil and Environmental Engineering, Northeastern University, Boston, MA, USA, Department of Mechanical and Industrial Engineering, Northeastern University, Boston, MA, USA 备注:22 pages 链接:https://arxiv.org/abs/2106.14103 摘要:偏微分方程(PDE)在众多学科的建模与仿真问题中起着基础性作用。深度学习的最新进展显示出物理信息神经网络(PINN)求解偏微分方程的巨大潜力,可作为数据驱动建模和逆分析的基础。然而,现有的PINN方法大多基于全连接神经网络,本质上局限于低维的时空参数化。此外,由于初始/边界条件(I/BCs)是通过惩罚项软施加的,解的质量在很大程度上依赖于超参数调节。为此,我们提出了新的物理信息卷积-循环学习架构(PhyCRNet和PhyCRNet-s),在不依赖任何标注数据的情况下求解偏微分方程。具体地,我们提出一种编码器-解码器卷积长短时记忆网络,用于低维空间特征提取和时间演化学习。损失函数被定义为聚合的离散化PDE残差,而I/BCs在网络中被硬编码以确保强制满足(例如周期性边界填充)。网络进一步通过显式模拟时间推进的自回归连接和残差连接得到增强。我们通过求解三个非线性偏微分方程(二维Burgers方程、$\lambda$-$\omega$和FitzHugh-Nagumo反应扩散方程)评估了所提方法的性能,并与最新的基线算法进行了比较。数值结果表明,该方法在求解精度、可外推性和泛化能力等方面具有优越性。 摘要:Partial differential equations (PDEs) play a fundamental role in modeling and simulating problems across a wide range of disciplines. Recent advances in deep learning have shown the great potential of physics-informed neural networks (PINNs) to solve PDEs as a basis for data-driven modeling and inverse analysis. However, the majority of existing PINN methods, based on fully-connected NNs, pose intrinsic limitations to low-dimensional spatiotemporal parameterizations. Moreover, since the initial/boundary conditions (I/BCs) are softly imposed via penalty, the solution quality heavily relies on hyperparameter tuning. To this end, we propose the novel physics-informed convolutional-recurrent learning architectures (PhyCRNet and PhyCRNet-s) for solving PDEs without any labeled data. Specifically, an encoder-decoder convolutional long short-term memory network is proposed for low-dimensional spatial feature extraction and temporal evolution learning. The loss function is defined as the aggregated discretized PDE residuals, while the I/BCs are hard-encoded in the network to ensure forcible satisfaction (e.g., periodic boundary padding). The networks are further enhanced by autoregressive and residual connections that explicitly simulate time marching. The performance of our proposed methods has been assessed by solving three nonlinear PDEs (e.g., 2D Burgers' equations, the $\lambda$-$\omega$ and FitzHugh Nagumo reaction-diffusion equations), and compared against the state-of-the-art baseline algorithms. The numerical results demonstrate the superiority of our proposed methodology in the context of solution accuracy, extrapolability and generalizability.
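为说明"以离散化PDE残差作为损失、用周期性边界填充硬编码边界条件"的做法,下面给出二维Burgers方程残差的一个有限差分最小示意(numpy实现;网格、时间步与粘性系数均为假设,仅作示意,并非PhyCRNet的实际代码):

import numpy as np

def burgers_residual_loss(u, v, u_prev, v_prev, dt=0.01, dx=0.02, dy=0.02, nu=0.005):
    # u, v: 当前时刻速度场 [H, W];u_prev, v_prev: 上一时刻速度场
    # 周期性边界用 np.roll 实现,对应摘要中的 periodic boundary padding
    ddx = lambda f: (np.roll(f, -1, axis=1) - np.roll(f, 1, axis=1)) / (2 * dx)
    ddy = lambda f: (np.roll(f, -1, axis=0) - np.roll(f, 1, axis=0)) / (2 * dy)
    lap = lambda f: ((np.roll(f, -1, axis=1) - 2 * f + np.roll(f, 1, axis=1)) / dx**2 +
                     (np.roll(f, -1, axis=0) - 2 * f + np.roll(f, 1, axis=0)) / dy**2)
    # 二维Burgers方程:u_t + u*u_x + v*u_y = nu*(u_xx + u_yy),v 分量同理
    r_u = (u - u_prev) / dt + u * ddx(u) + v * ddy(u) - nu * lap(u)
    r_v = (v - v_prev) / dt + u * ddx(v) + v * ddy(v) - nu * lap(v)
    return np.mean(r_u**2) + np.mean(r_v**2)   # 聚合残差即为无标注数据下的物理损失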
【6】 UMIC: An Unreferenced Metric for Image Captioning via Contrastive Learning 标题:UMIC:一种基于对比学习的无参考图像字幕度量
作者:Hwanhee Lee,Seunghyun Yoon,Franck Dernoncourt,Trung Bui,Kyomin Jung 机构:Dept. of Electrical and Computer Engineering, Seoul National University, Seoul, Korea, Adobe Research, San Jose, CA, USA 备注:ACL 2021 链接:https://arxiv.org/abs/2106.14019 摘要:尽管BERTScore等各种文本生成度量方法取得了成功,但由于描述的多样性,在缺乏足够参考字幕的情况下仍然很难评价图像字幕。在本文中,我们介绍了一种新的度量UMIC(Unreferenced Metric for Image Captioning),它无需参考字幕即可评价图像字幕。UMIC基于视觉-语言BERT,通过对比学习训练以区分负例字幕。同时,我们发现以往图像字幕度量基准数据集(即人工标注)存在关键问题,并针对生成的字幕引入了一组新的人工标注。我们在包括新数据集在内的四个数据集上验证了UMIC,结果表明UMIC比此前所有需要多条参考的度量具有更高的相关性。我们发布了该基准数据集以及用于计算UMIC的预训练模型。 摘要:Despite the success of various text generation metrics such as BERTScore, it is still difficult to evaluate the image captions without enough reference captions due to the diversity of the descriptions. In this paper, we introduce a new metric UMIC, an Unreferenced Metric for Image Captioning which does not require reference captions to evaluate image captions. Based on Vision-and-Language BERT, we train UMIC to discriminate negative captions via contrastive learning. Also, we observe critical problems of the previous benchmark dataset (i.e., human annotations) on image captioning metric, and introduce a new collection of human annotations on the generated captions. We validate UMIC on four datasets, including our new dataset, and show that UMIC has a higher correlation than all previous metrics that require multiple references. We release the benchmark dataset and pre-trained models to compute the UMIC.
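下面用一个极简的PyTorch片段示意"通过对比学习让无参考度量给负例字幕打更低分"的训练目标(间隔margin与打分器结构均为假设,仅作示意,并非UMIC的真实实现):

import torch
import torch.nn.functional as F

def contrastive_caption_loss(pos_score, neg_score, margin=0.2):
    # pos_score / neg_score: 模型对(图像, 正确字幕)与(图像, 负例字幕)给出的匹配得分张量
    # 目标:正例得分至少比负例高出 margin
    return F.relu(margin - pos_score + neg_score).mean()

# 用法示意:scorer 为任意输出标量得分的图文匹配模型(假设名)
# loss = contrastive_caption_loss(scorer(img, good_cap), scorer(img, bad_cap))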
其他(12篇)
【1】 What's in a Measurement? Using GPT-3 on SemEval 2021 Task 8 -- MeasEval 标题:测量中有什么?在SemEval 2021任务8(MeasEval)上使用GPT-3
作者:Curt Kohler,Ron Daniel Jr 机构:Elsevier Labs 链接:https://arxiv.org/abs/2106.14720 摘要:2020年夏天,OpenAI大张旗鼓地发布了GPT-3自回归语言模型。尽管该模型在多个领域的任务上展现了潜力,但人们并不总能分清哪些结果是精心挑选的,哪些是未经修饰的原始输出。我们特别感兴趣的是,GPT-3能给SemEval 2021 MeasEval任务(识别科学文献中的测量值及其相关属性)带来哪些好处。我们此前已尝试用多轮问答来解决这一任务,这次想看看能否利用GPT-3的少样本学习能力,更容易地开发出比以前工作性能更好的解决方案。遗憾的是,我们在这方面没有取得成功。本文讨论了我们使用的方法、遇到的挑战和观察到的结果。我们遇到的一些问题仅仅源于当前的技术水平,例如提示和回答的长度限制约束了能够提供的训练信号量;另一些问题则更为根本:我们尚未见到擅长保留事实信息的生成模型,而且提示改动的影响难以预测,因此很难可靠地提升性能。 摘要:In the summer of 2020 OpenAI released its GPT-3 autoregressive language model to much fanfare. While the model has shown promise on tasks in several areas, it has not always been clear when the results were cherry-picked or when they were the unvarnished output. We were particularly interested in what benefits GPT-3 could bring to the SemEval 2021 MeasEval task - identifying measurements and their associated attributes in scientific literature. We had already experimented with multi-turn questions answering as a solution to this task. We wanted to see if we could use GPT-3's few-shot learning capabilities to more easily develop a solution that would have better performance than our prior work. Unfortunately, we have not been successful in that effort. This paper discusses the approach we used, challenges we encountered, and results we observed. Some of the problems we encountered were simply due to the state of the art. For example, the limits on the size of the prompt and answer limited the amount of the training signal that could be offered. Others are more fundamental. We are unaware of generative models that excel in retaining factual information. Also, the impact of changes in the prompts is unpredictable, making it hard to reliably improve performance.
【2】 A new system for evaluating brand importance: A use case from the fashion industry 标题:一种新的品牌重要性评估系统:来自时尚业的用例
作者:A. Fronzetti Colladon,F. Grippa,L. Segneri 备注:None 链接:https://arxiv.org/abs/2106.14657 摘要:如今,品牌经理和营销专家可以利用大量的数据来揭示消费者认知的模式和趋势,监测品牌与期望话题之间的积极或消极关联。在本研究中,我们应用语义品牌评分(SBS)指标来评估品牌在时装业中的重要性。为此,我们使用SBS商业智能应用程序(sbsb-BI)来测量和可视化文本数据,SBS-BI依赖于文本挖掘和社会网络分析的方法和工具。我们收集并分析了2021年3月5日至3月12日期间约20.6万条涉及时尚品牌Fendi、Gucci和Prada的推文,通过对流行性、多样性和连通性三个SBS维度的分析发现,Gucci占据了话语的主导地位,具有较高的SBS价值。本文以这一案例为例,通过对大量文本数据的分析,提出了一个新的品牌重要性和形象评价体系。 摘要:Today brand managers and marketing specialists can leverage huge amount of data to reveal patterns and trends in consumer perceptions, monitoring positive or negative associations of brands with respect to desired topics. In this study, we apply the Semantic Brand Score (SBS) indicator to assess brand importance in the fashion industry. To this purpose, we measure and visualize text data using the SBS Business Intelligence App (SBS BI), which relies on methods and tools of text mining and social network analysis. We collected and analyzed about 206,000 tweets that mentioned the fashion brands Fendi, Gucci and Prada, during the period from March 5 to March 12, 2021. From the analysis of the three SBS dimensions - prevalence, diversity and connectivity - we found that Gucci dominated the discourse, with high values of SBS. We use this case study as an example to present a new system for evaluating brand importance and image, through the analysis of (big) textual data.
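语义品牌评分(SBS)的三个维度(流行性、多样性、连通性)可以在词共现网络上用文本挖掘与社会网络分析工具近似计算,下面给出一个基于networkx的示意草稿(共现窗口、边权与标准化方式均为假设,并非SBS BI应用的真实实现):

import networkx as nx

def semantic_brand_score(docs, brand, window=5):
    # docs: 已分词并小写化的推文列表(每条为词串列表);brand: 品牌词,如 "gucci"
    G = nx.Graph()
    freq = 0
    for tokens in docs:
        freq += tokens.count(brand)
        for i in range(len(tokens)):                         # 在滑动窗口内建立词共现边
            for j in range(i + 1, min(i + window, len(tokens))):
                a, b = tokens[i], tokens[j]
                if a != b:
                    w = G[a][b]["weight"] if G.has_edge(a, b) else 0
                    G.add_edge(a, b, weight=w + 1)
    for _, _, d in G.edges(data=True):                       # 介数中心性把权重视为距离,故取共现次数的倒数
        d["dist"] = 1.0 / d["weight"]
    prevalence = freq                                                           # 流行性:品牌提及频次
    diversity = nx.degree_centrality(G).get(brand, 0.0)                         # 多样性:度中心性
    connectivity = nx.betweenness_centrality(G, weight="dist").get(brand, 0.0)  # 连通性:加权介数中心性
    return prevalence, diversity, connectivity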
【3】 Timestamping Documents and Beliefs 标题:为文档和信念添加时间戳
作者:Swayambhu Nath Ray 机构:Department of Computational and Data Sciences, Indian Institute of Science, Bangalore, India 备注:Master's Report 链接:https://arxiv.org/abs/2106.14622 摘要:我们能获取的大部分文本信息都是随时间变化的。在一个信息动态变化的世界里,为信息加上时间戳是一项非常重要的任务。文档是很好的信息来源,可用于情感分析、评论分类等许多任务;了解文档的创建日期则有助于摘要、事件提取、面向时间的信息提取等任务。遗憾的是,对于web上的大多数文档,时间戳元数据要么错误要么缺失。因此,文档定年是一个具有挑战性的问题,需要在文档的时间结构和上下文信息之上进行推理。以往的文档定年系统主要依赖手工设计的特征,而忽略了这类文档内部结构。本文提出了NeuralDater,一种基于图卷积网络(GCN)的文档定年方法,它以有原则的方式联合利用文档的句法图结构和时间图结构。我们还指出了NeuralDater的一些局限性,并尝试以更灵活、更直观的方式利用文档中的上下文和时间信息,提出了基于注意力的文档定年系统AD3:Attentive Deep Document Dater。据我们所知,这是深度学习方法在该任务上的首次应用。通过在真实世界数据集上的大量实验,我们发现我们的模型显著优于最新的基线。 摘要:Most of the textual information available to us are temporally variable. In a world where information is dynamic, time-stamping them is a very important task. Documents are a good source of information and are used for many tasks like, sentiment analysis, classification of reviews etc. The knowledge of creation date of documents facilitates several tasks like summarization, event extraction, temporally focused information extraction etc. Unfortunately, for most of the documents on the web, the time-stamp meta-data is either erroneous or missing. Thus document dating is a challenging problem which requires inference over the temporal structure of the document alongside the contextual information of the document. Prior document dating systems have largely relied on handcrafted features while ignoring such document-internal structures. In this paper we propose NeuralDater, a Graph Convolutional Network (GCN) based document dating approach which jointly exploits syntactic and temporal graph structures of document in a principled way. We also pointed out some limitations of NeuralDater and tried to utilize both context and temporal information in documents in a more flexible and intuitive manner proposing AD3: Attentive Deep Document Dater, an attention-based document dating system. To the best of our knowledge these are the first application of deep learning methods for the task. Through extensive experiments on real-world datasets, we find that our models significantly outperforms state-of-the-art baselines by a significant margin.
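NeuralDater一类方法的核心是在句法图或时间图上做图卷积传播,其单层传播的一般形式可用下面的numpy片段示意(这是通用的GCN层写法,并非原论文的具体网络结构):

import numpy as np

def gcn_layer(A, H, W):
    # A: 邻接矩阵 [n, n](如句法依存图或时间关系图);H: 节点特征 [n, d];W: 可学习权重 [d, d_out]
    A_hat = A + np.eye(A.shape[0])                       # 加自环
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    A_norm = d_inv_sqrt @ A_hat @ d_inv_sqrt             # 对称归一化
    return np.maximum(A_norm @ H @ W, 0)                 # ReLU(Â H W)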
【4】 What's in a Scientific Name? 标题:学名里有什么?
作者:Henrique Ferraz de Arruda,Luciano da Fontoura Costa 机构:S˜ao Carlos Institute of Physics, University of S˜ao Paulo, PO Box ,-, S˜ao Carlos, SP, Brazil, ) 链接:https://arxiv.org/abs/2106.14610 摘要:在很大程度上,词语可以理解为与出现的模式或类别相对应,以表示在给定的时间和空间中特别重要或有用的概念和结构。词的特点是不完全一般也不完全具体,因为同一个词可以根据具体情况具体化或关联到几个不同的上下文。事实上,词语的实例化和关联方式代表了一个特别有趣的方面,它可以实质性地帮助更好地理解它们所处的环境。科学词汇也不例外。在目前的工作中,我们探讨了一组特别相关的词之间的联系,从这个意义上说,这些词不仅在一些领域经常使用,而且还代表了目前与科学中一些主要的长期挑战有关的概念。更具体地说,本文报告的研究考虑了“预测”、“模型”、“优化”、“复杂”、“熵”、“随机”、“确定性”、“模式”和“数据库”等词。为了补充分析,我们还得到了一个网络,表示所采用的地区之间的关系。发现了许多有趣的结果。首先也是最重要的是,有几个词在不同领域有明显不同的联系。生物学被发现与计算机科学相关,与数据库共享关联。此外,在大多数情况下,“复杂”、“模型”和“预测”这几个词都有很强的联系。 摘要:To a good extent, words can be understood as corresponding to patterns or categories that appeared in order to represent concepts and structures that are particularly important or useful in a given time and space. Words are characterized by not being completely general nor specific, in the sense that the same word can be instantiated or related to several different contexts, depending on specific situations. Indeed, the way in which words are instantiated and associated represents a particularly interesting aspect that can substantially help to better understand the context in which they are employed. Scientific words are no exception to that. In the present work, we approach the associations between a set of particularly relevant words in the sense of being not only frequently used in several areas, but also representing concepts that are currently related to some of the main standing challenges in science. More specifically, the study reported here takes into account the words "prediction", "model", "optimization", "complex", "entropy", "random", "deterministic", "pattern", and "database". In order to complement the analysis, we also obtain a network representing the relationship between the adopted areas. Many interesting results were found. First and foremost, several of the words were observed to have markedly distinct associations in different areas. Biology was found to be related to computer science, sharing associations with databases. Furthermore, for most of the cases, the words "complex", "model", and "prediction" were observed to have several strong associations.
【5】 Quantifying Social Biases in NLP: A Generalization and Empirical Comparison of Extrinsic Fairness Metrics 标题:量化自然语言中的社会偏见:外在公平度量的推广和实证比较
作者:Paula Czarnowska,Yogarshi Vyas,Kashif Shah 机构:University of Cambridge, Amazon AI 备注:Accepted for publication in Transaction of the Association for Computational Linguistics (TACL), 2021. The arXiv version is a pre-MIT Press publication version 链接:https://arxiv.org/abs/2106.14574 摘要:测量偏差是更好地理解和解决NLP/ML模型不公平的关键。这通常是通过公平性度量来实现的,公平性度量量化了一个模型在一系列人口群体中的行为差异。在这项工作中,我们揭示了NLP中使用的公平性度量之间的差异和相似性。首先,我们将广泛的现有指标统一到三个广义公平性指标之下,揭示了它们之间的联系。接下来,我们对现有的度量标准进行了广泛的实证比较,并证明可以通过我们的广义度量标准的参数选择差异来系统地解释所观察到的偏差度量差异。 摘要:Measuring bias is key for better understanding and addressing unfairness in NLP/ML models. This is often done via fairness metrics which quantify the differences in a model's behaviour across a range of demographic groups. In this work, we shed more light on the differences and similarities between the fairness metrics used in NLP. First, we unify a broad range of existing metrics under three generalized fairness metrics, revealing the connections between them. Next, we carry out an extensive empirical comparison of existing metrics and demonstrate that the observed differences in bias measurement can be systematically explained via differences in parameter choices for our generalized metrics.
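摘要中的广义公平性度量本质上是对"模型在不同人群分组上的表现差异"做不同方式的聚合,下面给出一个成对差异式度量的最小示意(分组得分与聚合方式均为假设,并非论文中三个广义度量的具体定义):

import itertools
import numpy as np

def pairwise_group_gap(scores_by_group):
    # scores_by_group: {组名: 该组样本上的模型得分数组},例如按性别或族裔分组的准确率或情感分
    means = {g: float(np.mean(s)) for g, s in scores_by_group.items()}
    gaps = [abs(means[a] - means[b]) for a, b in itertools.combinations(means, 2)]
    return float(np.mean(gaps))   # 组间平均成对差距,0 表示各组表现一致

# 用法示意
# gap = pairwise_group_gap({"group_A": [0.9, 0.8], "group_B": [0.6, 0.7]})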
【6】 RadGraph: Extracting Clinical Entities and Relations from Radiology Reports 标题:RadGraph:从放射学报告中提取临床实体和关系
作者:Saahil Jain,Ashwin Agrawal,Adriel Saporta,Steven QH Truong,Du Nguyen Duong,Tan Bui,Pierre Chambon,Yuhao Zhang,Matthew P. Lungren,Andrew Y. Ng,Curtis P. Langlotz,Pranav Rajpurkar 机构:Stanford University, VinBrain, VinUniversity 链接:https://arxiv.org/abs/2106.14463 摘要:从自由文本放射报告中提取结构化的临床信息,可以使放射报告中的信息用于各种重要的医疗保健应用。在这项工作中,我们提出了RadGraph,这是一个全文胸部X射线放射报告中实体和关系的数据集,它基于我们为结构化放射报告而设计的一种新的信息提取模式。我们发布了一个开发数据集,其中包含由经委员会认证的放射科医生对来自MIMIC-CXR数据集的500份放射报告所做的标注(14579个实体和10889个关系);以及一个测试数据集,其中包含两组独立的经委员会认证的放射科医生标注,覆盖100份放射报告,在MIMIC-CXR和CheXpert数据集之间平均分配。利用这些数据集,我们训练并测试了深度学习模型RadGraph Benchmark,它在MIMIC-CXR和CheXpert测试集上的关系提取micro F1分别达到0.82和0.73。此外,我们还发布了一个推断数据集,其中包含由RadGraph Benchmark自动生成的标注,覆盖220763份MIMIC-CXR报告(约600万个实体和400万个关系)和500份CheXpert报告(13783个实体和9908个关系),并映射到相应的胸片。我们免费提供的数据集可以促进医学自然语言处理方面的广泛研究,以及(在与胸片关联后)计算机视觉和多模态学习方面的研究。 摘要:Extracting structured clinical information from free-text radiology reports can enable the use of radiology report information for a variety of critical healthcare applications. In our work, we present RadGraph, a dataset of entities and relations in full-text chest X-ray radiology reports based on a novel information extraction schema we designed to structure radiology reports. We release a development dataset, which contains board-certified radiologist annotations for 500 radiology reports from the MIMIC-CXR dataset (14,579 entities and 10,889 relations), and a test dataset, which contains two independent sets of board-certified radiologist annotations for 100 radiology reports split equally across the MIMIC-CXR and CheXpert datasets. Using these datasets, we train and test a deep learning model, RadGraph Benchmark, that achieves a micro F1 of 0.82 and 0.73 on relation extraction on the MIMIC-CXR and CheXpert test sets respectively. Additionally, we release an inference dataset, which contains annotations automatically generated by RadGraph Benchmark across 220,763 MIMIC-CXR reports (around 6 million entities and 4 million relations) and 500 CheXpert reports (13,783 entities and 9,908 relations) with mappings to associated chest radiographs. Our freely available dataset can facilitate a wide range of research in medical natural language processing, as well as computer vision and multi-modal learning when linked to chest radiographs.
【7】 Current Landscape of the Russian Sentiment Corpora 标题:俄罗斯情感语料库的现状
作者:Evgeny Kotelnikov 机构:Vyatka State University, Russia; ITMO University 备注:12 pages, 5 tables, 2 figures. Accepted to Dialogue-2021 conference 链接:https://arxiv.org/abs/2106.14434 摘要:目前,用于情感分析的俄语语料库已有十几个,它们在文本来源、领域、规模、情感类别的数量和比例、标注方法等方面各不相同。本文考察了公开可用的俄语语料库,给出了它们的定性和定量特征,从而勾勒出情感分析语料库的现状。文中提出了一种基于标注质量的语料库排序方法,可用于训练和测试语料库的选择。基于深度神经网络模型BERT,研究了训练数据集对情感分析性能的影响。对评论语料库的实验表明,平均而言,模型质量随训练语料数量的增加而提高。基于BERT模型,首次获得了ROMIP研讨会评论语料库的质量分数。此外,本研究还提出了构建情感分析通用模型的任务。 摘要:Currently, there are more than a dozen Russian-language corpora for sentiment analysis, differing in the source of the texts, domain, size, number and ratio of sentiment classes, and annotation method. This work examines publicly available Russian-language corpora, presents their qualitative and quantitative characteristics, which make it possible to get an idea of the current landscape of the corpora for sentiment analysis. The ranking of corpora by annotation quality is proposed, which can be useful when choosing corpora for training and testing. The influence of the training dataset on the performance of sentiment analysis is investigated based on the use of the deep neural network model BERT. The experiments with review corpora allow us to conclude that on average the quality of models increases with an increase in the number of training corpora. For the first time, quality scores were obtained for the corpus of reviews of ROMIP seminars based on the BERT model. Also, the study proposes the task of the building a universal model for sentiment analysis.
【8】 Political Ideology and Polarization of Policy Positions: A Multi-dimensional Approach 标题:政治意识形态与政策立场分化:一个多维的视角
作者:Barea Sinno,Bernardo Oviedo,Katherine Atwell,Malihe Alikhani,Junyi Jessy Li 机构:Political Science, Rutgers University, Computer Science, Linguistics, The University of Texas at Austin, Computer Science, University of Pittsburgh 链接:https://arxiv.org/abs/2106.14387 摘要:分析政治意识形态和两极分化,对于增进我们对社会政治背景的理解至关重要。最近的研究在理解新闻媒体沿左右光谱的意识形态偏见(即立场)方面取得了很大进展。在这项工作中,我们采取了一种新颖的方法,研究所讨论政策本身的意识形态,将微妙共存的立场与意识形态区分开来。与政治学的理论阐述相一致,我们将意识形态视为一个多维构念,并引入了第一个历时性新闻文章数据集,其中所讨论政策的政治意识形态由受过训练的政治学家和语言学家在段落层面进行标注。我们展示了该框架能够对两极分化进行定量分析,其中两极分化被视为一种随时间变化的、多方面的意识形态距离度量。我们还给出了意识形态预测的基线模型。 摘要:Analyzing political ideology and polarization is of critical importance in advancing our understanding of the political context in society. Recent research has made great strides towards understanding the ideological bias (i.e., stance) of news media along a left-right spectrum. In this work, we take a novel approach and study the ideology of the policy under discussion teasing apart the nuanced co-existence of stance and ideology. Aligned with the theoretical accounts in political science, we treat ideology as a multi-dimensional construct, and introduce the first diachronic dataset of news articles whose political ideology under discussion is annotated by trained political scientists and linguists at the paragraph-level. We showcase that this framework enables quantitative analysis of polarization, a temporal, multifaceted measure of ideological distance. We further present baseline models for ideology prediction.
【9】 Draw Me a Flower: Grounding Formal Abstract Structures Stated in Informal Natural Language 标题:给我画一朵花:对非正式自然语言所表述的形式化抽象结构进行落地
作者:Royi Lachmy,Valentina Pyatkin,Reut Tsarfaty 机构:Computer Science Department, Bar Ilan University, Allen Institute for Artificial Intelligence 链接:https://arxiv.org/abs/2106.14321 摘要:抽象的形成和解释是人类交际的核心过程。特别是,当人们给出和执行用自然语言(NL)表述的复杂指令时,往往会自然地调用对象、循环、条件和函数等抽象结构,以高效、准确地表达意图。然而,对自然语言中所表述抽象的解释与落地(grounding),在NLP/AI领域尚未得到系统研究。为了引出自然产生的抽象,我们开发了六边形(Hexagons)指称游戏:玩家在二维六边形棋盘上描述越来越复杂的图像,其他玩家需要按照这些指令重新创建图像。利用这个游戏,我们收集了六边形数据集,其中包括164张图像和3000多条自然产生的指令,蕴含丰富多样的抽象。在由该数据集派生的"指令到执行"任务上,我们的基线模型结果证实,NL中更高层次的抽象对当前系统来说确实更具挑战性。因此,该数据集为落地语义解析(grounded semantic parsing)揭示了一个新的、具有挑战性的维度,我们建议社区将其作为未来的基准,以探索NLP应用中更复杂、更高层次的交流。 摘要:Forming and interpreting abstraction is a core process in human communication. In particular, when giving and performing complex instructions stated in natural language (NL), people may naturally evoke abstract constructs such as objects, loops, conditions and functions to convey their intentions in an efficient and precise way. Yet, interpreting and grounding abstraction stated in NL has not been systematically studied in NLP/AI. To elicit naturally-occurring abstractions in NL we develop the Hexagons referential game, where players describe increasingly complex images on a two-dimensional Hexagons board, and other players need to follow these instructions to recreate the images. Using this game we collected the Hexagons dataset, which consists of 164 images and over 3000 naturally-occurring instructions, rich with diverse abstractions. Results of our baseline models on an instruction-to-execution task derived from the Hexagons dataset confirm that higher-level abstractions in NL are indeed more challenging for current systems to process. Thus, this dataset exposes a new and challenging dimension for grounded semantic parsing, and we propose it for the community as a future benchmark to explore more sophisticated and high-level communication within NLP applications.
【10】 Analyzing Research Trends in Inorganic Materials Literature Using NLP 标题:利用自然语言处理分析无机材料文献的研究动向
作者:Fusataka Kuniyoshi,Jun Ozawa,Makoto Miwa 机构:Panasonic Corporation, Kadoma-shi, Osaka, Japan, Toyota Technological Institute, Tempaku-ku, Nagoya, Japan, National Institute of Advanced Industrial Science and Technology (AIST), Koto-ku, Tokyo, Japan 备注:Accepted to ECML-PKDD2021. Preprint 链接:https://arxiv.org/abs/2106.14157 摘要:在无机材料科学领域,通过机器阅读大量论文来提取材料的物理性质和合成过程等知识的需求越来越大。这是因为材料研究人员需要查阅大量论文,以便为材料合成提出有希望的实验条件。但是,目前只有少数系统能够提取材料名称及其性质。本研究提出了一个大规模的自然语言处理(NLP)管道,用于从材料科学文献中提取材料名称和性质,以便在材料科学中搜索和检索结果。为此,我们提出了一个用于提取材料名称和属性的标签定义,并据此构建了一个语料库,其中包含从301篇论文中提取的836个标注段落,用于训练命名实体识别(NER)模型。实验结果证明了该NER模型的实用性,其提取的micro-F1得分达到78.1%。为了证明我们方法的有效性,我们将训练好的NER模型应用于12895篇材料科学论文,对真实世界的自动标注语料库进行了全面评估。我们通过可视化NLP管道的输出来分析材料科学的趋势。例如,国别逐年分析表明,近年来,关于钙钛矿型太阳能电池材料"MoS2"的论文数量在中国迅速增加,而在美国却在减少。此外,按年份的条件分析显示,催化剂材料"PEDOT:PSS"的处理温度正向200摄氏度以下移动,处理时间超过5小时的报告数量略有增加。 摘要:In the field of inorganic materials science, there is a growing demand to extract knowledge such as physical properties and synthesis processes of materials by machine-reading a large number of papers. This is because materials researchers refer to many papers in order to come up with promising terms of experiments for material synthesis. However, there are only a few systems that can extract material names and their properties. This study proposes a large-scale natural language processing (NLP) pipeline for extracting material names and properties from materials science literature to enable the search and retrieval of results in materials science. Therefore, we propose a label definition for extracting material names and properties and accordingly build a corpus containing 836 annotated paragraphs extracted from 301 papers for training a named entity recognition (NER) model. Experimental results demonstrate the utility of this NER model; it achieves successful extraction with a micro-F1 score of 78.1%. To demonstrate the efficacy of our approach, we present a thorough evaluation on a real-world automatically annotated corpus by applying our trained NER model to 12,895 materials science papers. We analyze the trend in materials science by visualizing the outputs of the NLP pipeline. For example, the country-by-year analysis indicates that in recent years, the number of papers on "MoS2," a material used in perovskite solar cells, has been increasing rapidly in China but decreasing in the United States. Further, according to the conditions-by-year analysis, the processing temperature of the catalyst material "PEDOT:PSS" is shifting below 200 degree, and the number of reports with a processing time exceeding 5 h is increasing slightly.
【11】 Core Challenges in Embodied Vision-Language Planning 标题:具身视觉语言规划的核心挑战
作者:Jonathan Francis,Nariaki Kitamura,Felix Labelle,Xiaopeng Lu,Ingrid Navarro,Jean Oh 机构:Language Technologies Institute, Carnegie Mellon University, Forbes Ave., Pittsburgh, PA, USA, Human-Machine Collaboration, Bosch Research Pittsburgh, Smallman St., Pittsburgh, PA, USA 备注:35 pages 链接:https://arxiv.org/abs/2106.13948 摘要:多模态机器学习和人工智能(AI)领域的最新进展,催生了位于计算机视觉、自然语言处理和具身人工智能(Embodied AI)交叉点上的一系列具有挑战性的任务。尽管许多方法和先前的综述工作刻画了其中的一个或两个维度,但尚未有研究在三者的交汇处进行整体分析。此外,即便考虑了这些主题的组合,关注点也更多放在描述(例如)当前的体系结构方法上,而较少同时阐明该领域的高层次挑战和机遇。在这篇综述中,我们讨论了具身视觉语言规划(EVLP)任务,这是一系列重要的具身导航和操作问题,它们同时用到计算机视觉和自然语言。我们提出了一个统一这些任务的分类法,并对用于EVLP任务的新旧算法方法、评价指标、模拟环境以及数据集进行了深入的分析和比较。最后,我们提出了我们认为新的EVLP工作应当着力解决的核心挑战,并倡导以能够提升模型泛化能力、推动实际部署的方式来构建任务。 摘要:Recent advances in the areas of multimodal machine learning and artificial intelligence (AI) have led to the development of challenging tasks at the intersection of Computer Vision, Natural Language Processing, and Embodied AI. Whereas many approaches and previous survey pursuits have characterised one or two of these dimensions, there has not been a holistic analysis at the center of all three. Moreover, even when combinations of these topics are considered, more focus is placed on describing, e.g., current architectural methods, as opposed to also illustrating high-level challenges and opportunities for the field. In this survey paper, we discuss Embodied Vision-Language Planning (EVLP) tasks, a family of prominent embodied navigation and manipulation problems that jointly use computer vision and natural language. We propose a taxonomy to unify these tasks and provide an in-depth analysis and comparison of the new and current algorithmic approaches, metrics, simulated environments, as well as the datasets used for EVLP tasks. Finally, we present the core challenges that we believe new EVLP works should seek to address, and we advocate for task construction that enables model generalizability and furthers real-world deployment.
【12】 Persian Rhetorical Structure Theory 标题:波斯修辞结构理论
作者:Sara Shahmohammadi,Hadi Veisi,Ali Darzi 机构:MSc Graduate, University of Tehran, Assistant Professor 备注:19 Pages, 7 Figures 链接:https://arxiv.org/abs/2106.13833 摘要:近年来,人们对语篇分析和语篇解析的兴趣与日俱增,许多语篇标注语料库以及由此产生的语篇解析器相继建立。在本文中,我们提出了一个在修辞结构理论框架下构建的波斯语语篇标注语料库,以及一个基于开源语篇解析器DPLP构建的语篇解析器。我们的语料库由150篇新闻文本组成,每篇平均约400词。语料库文本依据英语RST语篇树库(RST Discourse Treebank)的标注指南,使用18种语篇关系进行标注。我们的文本级语篇解析器使用gold切分进行训练,构建在DPLP语篇解析器之上,后者采用基于大间隔的转移式(transition-based)方法来解决语篇解析问题。以F1衡量,我们的语篇解析器在跨度(S)、核心性(N)和关系(R)检测上的性能分别约为78%、64%和44%。 摘要:Over the past years, interest in discourse analysis and discourse parsing has steadily grown, and many discourse-annotated corpora and, as a result, discourse parsers have been built. In this paper, we present a discourse-annotated corpus for the Persian language built in the framework of Rhetorical Structure Theory as well as a discourse parser built upon the DPLP parser, an open-source discourse parser. Our corpus consists of 150 journalistic texts, each text having an average of around 400 words. Corpus texts were annotated using 18 discourse relations and based on the annotation guideline of the English RST Discourse Treebank corpus. Our text-level discourse parser is trained using gold segmentation and is built upon the DPLP discourse parser, which uses a large-margin transition-based approach to solve the problem of discourse parsing. The performance of our discourse parser in span (S), nuclearity (N) and relation (R) detection is around 78%, 64%, 44% respectively, in terms of F1 measure.