
Update: the H5 site now supports collapsible abstracts for a better reading experience! Visit arxivdaily.com, covering CS, Physics, Mathematics, Economics, Statistics, Finance, Biology, and Electrical Engineering, with search, favourites, and more!
q-fin (Quantitative Finance): 4 papers
cs.SD (Sound): 3 papers
eess.AS (Audio and Speech Processing): 7 papers
1. q-fin (Quantitative Finance):
【1】 The Use of Quantile Methods in Economic History
Link: https://arxiv.org/abs/2108.06055
Authors: Damian Clarke, Manuel Llorca Jaña, Daniel Pailañir
Abstract: Quantile regression and quantile treatment effect methods are powerful econometric tools for considering economic impacts of events or variables of interest beyond the mean. The use of quantile methods allows for an examination of the impacts of some independent variable over the entire distribution of continuous dependent variables. Measurement in many quantitative settings in economic history has as a key input continuous outcome variables of interest. Among many other cases, human height and demographics, economic growth, earnings and wages, and crop production are generally recorded as continuous measures, and are collected and studied by economic historians. In this paper we describe and discuss the broad utility of quantile regression for research in economic history, review recent quantitative literature in the field, and provide an illustrative example of the use of these methods based on 20,000 records of human height measured across 50-plus years in the 19th and 20th centuries. We suggest that there is considerably more room in the literature on economic history to convincingly and productively apply quantile regression methods.
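Quantile regression rests on the pinball (check) loss, whose minimiser is a conditional quantile rather than the mean. As a minimal sketch (not from the paper; the synthetic height data and grid search are illustrative assumptions), minimising the pinball loss over a constant recovers the corresponding empirical quantile:

```python
import numpy as np

def pinball_loss(y, pred, q):
    # check loss: asymmetric absolute error, minimised at the q-quantile
    u = y - pred
    return np.mean(np.maximum(q * u, (q - 1) * u))

rng = np.random.default_rng(0)
heights = rng.normal(170.0, 7.0, 10_000)   # synthetic height records (cm)

grid = np.linspace(140.0, 200.0, 2001)
results = {}
for q in (0.1, 0.5, 0.9):
    losses = [pinball_loss(heights, g, q) for g in grid]
    results[q] = grid[int(np.argmin(losses))]
    print(f"q={q}: pinball minimiser {results[q]:.2f}, "
          f"empirical quantile {np.quantile(heights, q):.2f}")
```

A full quantile regression replaces the constant with a linear predictor and minimises the same loss, which is what off-the-shelf routines such as statsmodels' QuantReg do.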
【2】 Spatial Distribution of Supply and the Role of Market Thickness: Theory and Evidence from Ride Sharing
Link: https://arxiv.org/abs/2108.05954
Authors: Soheil Ghili, Vineet Kumar
Abstract: This paper studies the effects of economies of density in transportation markets, focusing on ridesharing. Our theoretical model predicts that (i) economies of density skew the supply of drivers away from less dense regions, (ii) the skew will be more pronounced for smaller platforms, and (iii) rideshare platforms do not find this skew efficient and thus use prices and wages to mitigate (but not eliminate) it. We then develop a general empirical strategy with simple implementation and limited data requirements to test for spatial skew of supply from demand. Applying our method to ride-level, multi-platform data from New York City (NYC), we indeed find evidence for a skew of supply toward busier areas, especially for smaller platforms. We discuss the implications of our analysis for business strategy (e.g., spatial pricing) and public policy (e.g., consequences of breaking up or downsizing a rideshare platform).
【3】 Partisan affect and political outsiders
Link: https://arxiv.org/abs/2108.05943
Authors: Fernanda Herrera
Affiliation: School of Global Policy and Strategy, University of California, San Diego
Abstract: We examine the effects of introducing a political outsider to the nomination process leading to an election. To this end, we develop a sequential game in which politicians -- insiders and outsiders -- make a platform offer to a party, and parties in turn decide which offer to accept; this process forms the voting ballot. Embedded in the evaluation of a party-candidate match is partisan affect, a variable comprising the attitudes of voters towards the party. Partisan affect may bias the electorate's appraisal of a match in a positive or negative way. We characterize the conditions that lead to the nomination of an outsider and determine whether her introduction as a potential candidate has any effect on the winning policy and on the welfare of voters. We find that the victory of an outsider generally leads to policy polarization, and that partisan affect has a more significant effect on welfare than ideological extremism.
【4】 Analysis of the Indian Chemical Industry in the Post-Covid Era
Link: https://arxiv.org/abs/2108.06066
Authors: Anandlogesh R R, Breasha Gupta, Divika Agarwal, Rasika Joshi
Abstract: The story of the Chemical Industry in India is one of outperformance and promise. A consistent value creator, the chemical industry remains an attractive hub of opportunities, even in an environment of global uncertainty. This paper aims to analyze the various driving factors, the performance of the key players through fundamental analysis, and the trends that would shape the performance of the industry given the geopolitical and macroeconomic developments of the post-pandemic world.
2. cs.SD (Sound):
【1】 W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training
Link: https://arxiv.org/abs/2108.06209
Authors: Yu-An Chung, Yu Zhang, Wei Han, Chung-Cheng Chiu, James Qin, Ruoming Pang, Yonghui Wu
Affiliations: MIT Computer Science and Artificial Intelligence Laboratory; Google Brain
Abstract: Motivated by the success of masked language modeling (MLM) in pre-training natural language processing models, we propose w2v-BERT, which explores MLM for self-supervised speech representation learning. w2v-BERT is a framework that combines contrastive learning and MLM, where the former trains the model to discretize input continuous speech signals into a finite set of discriminative speech tokens, and the latter trains the model to learn contextualized speech representations by solving a masked prediction task consuming the discretized tokens. In contrast to existing MLM-based speech pre-training frameworks such as HuBERT, which relies on an iterative re-clustering and re-training process, or vq-wav2vec, which concatenates two separately trained modules, w2v-BERT can be optimized in an end-to-end fashion by solving the two self-supervised tasks (the contrastive task and MLM) simultaneously. Our experiments show that w2v-BERT achieves competitive results compared to current state-of-the-art pre-trained models on the LibriSpeech benchmarks when using the Libri-Light 60k corpus as the unsupervised data. In particular, when compared to published models such as conformer-based wav2vec 2.0 and HuBERT, our model shows a 5% to 10% relative WER reduction on the test-clean and test-other subsets. When applied to Google's Voice Search traffic dataset, w2v-BERT outperforms our internal conformer-based wav2vec 2.0 by more than 30% relatively.
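The contrastive half of the objective is, per the abstract, a wav2vec 2.0-style task: identify the true quantized token for a masked frame among distractors. A hedged numpy sketch of that loss (the dimensions, temperature, and toy vectors are assumptions for illustration, not the authors' code):

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

def contrastive_loss(context, positive, distractors, temperature=0.1):
    """-log softmax probability of the true quantized target."""
    sims = np.array([cosine(context, positive)] +
                    [cosine(context, d) for d in distractors]) / temperature
    sims -= sims.max()                      # numerical stability
    return -np.log(np.exp(sims[0]) / np.exp(sims).sum())

rng = np.random.default_rng(0)
c = rng.normal(size=16)                     # context representation
loss_easy = contrastive_loss(c, c, rng.normal(size=(10, 16)))   # target aligned with context
loss_hard = contrastive_loss(c, -c, rng.normal(size=(10, 16)))  # target opposed to context
print(loss_easy, loss_hard)
```

w2v-BERT then adds the MLM cross-entropy over the discretized tokens and optimizes both terms jointly.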
【2】 Pruning vs XNOR-Net: A Comprehensive Study on Deep Learning for Audio Classification in Microcontrollers
Link: https://arxiv.org/abs/2108.06128
Authors: Md Mohaimenuzzaman, Christoph Bergmeir, Bernd Meyer
Affiliation: Department of Data Science and AI, Monash University
Note: 10 pages
Abstract: Deep learning has celebrated resounding successes in many application areas of relevance to the Internet of Things, for example, computer vision and machine listening. To fully harness the power of deep learning for the IoT, these technologies must ultimately be brought directly to the edge. The obvious challenge is that deep learning techniques can only be implemented on strictly resource-constrained edge devices if the models are radically downsized. This task relies on different model compression techniques, such as network pruning, quantization, and the recent advancement of XNOR-Net. This paper examines the suitability of these techniques for audio classification on microcontrollers. We present an XNOR-Net for end-to-end raw audio classification and a comprehensive empirical study comparing this approach with pruning-and-quantization methods. We show that raw audio classification with XNOR yields comparable performance to regular full-precision networks for small numbers of classes while reducing memory requirements 32-fold and computation requirements 58-fold. However, as the number of classes increases significantly, performance degrades, and pruning-and-quantization-based compression techniques take over as the preferred technique, being able to satisfy the same space constraints but requiring about 8x more computation. We show that these insights are consistent between raw audio classification and image classification using standard benchmark sets. To the best of our knowledge, this is the first study applying XNOR to end-to-end audio classification and evaluating it in the context of alternative techniques. All code is publicly available on GitHub.
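For context on the XNOR-Net side of the comparison: the core trick is approximating each real-valued weight tensor W by alpha * sign(W), where the scale alpha = mean(|W|) minimises the L2 approximation error, so that dot products reduce to cheap binary operations plus one scaling. A small illustrative sketch (not the paper's code):

```python
import numpy as np

def binarize(W):
    # XNOR-Net-style binarization: W ≈ alpha * sign(W), alpha = mean(|W|)
    alpha = np.abs(W).mean()
    return alpha, np.sign(W)

rng = np.random.default_rng(0)
W = rng.normal(size=64)            # a toy filter
alpha, B = binarize(W)

# the optimal scale alpha beats a naive unit scale in L2 error
err_alpha = np.linalg.norm(W - alpha * B)
err_one = np.linalg.norm(W - np.sign(W))
print(err_alpha, err_one)
```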
【3】 Parameter Tuning of Time-Frequency Masking Algorithms for Reverberant Artifact Removal within the Cochlear Implant Stimulus
Link: https://arxiv.org/abs/2108.05929
Authors: Lidea K. Shahidi, Leslie M. Collins, Boyla O. Mainsah
Affiliation: Department of Electrical and Computer Engineering, Duke University, Durham, NC, USA
Note: 5 pages, 4 figures
Abstract: Cochlear implant users struggle to understand speech in reverberant environments. To restore speech perception, artifacts dominated by reverberant reflections can be removed from the cochlear implant stimulus. Artifacts can be identified and removed by applying a matrix of gain values, a technique referred to as time-frequency masking. Gain values are determined by an oracle algorithm that uses knowledge of the undistorted signal to minimize retention of the signal components dominated by reverberant reflections. In practice, gain values are estimated from the distorted signal, with the oracle algorithm providing the estimation objective. Different oracle techniques exist for determining gain values, and each technique must be parameterized to set the amount of signal retention. This work assesses which oracle masking strategies and parameterizations lead to the best improvements in speech intelligibility for cochlear implant users in reverberant conditions, using online speech intelligibility testing of normal-hearing individuals with vocoding.
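As an illustration of the oracle masking family the paper tunes, a common oracle is the ideal binary mask: retain a time-frequency cell when the undistorted (direct) component dominates the reverberant artifact by some threshold. The sketch below is a generic textbook construction; the -6 dB threshold and toy spectrograms are invented examples of the kind of retention parameter under study, not the paper's settings:

```python
import numpy as np

def ideal_binary_mask(direct, artifact, threshold_db=-6.0):
    # oracle: per-cell direct-to-artifact ratio, thresholded into 0/1 gains
    snr_db = 10 * np.log10((np.abs(direct) ** 2) /
                           (np.abs(artifact) ** 2 + 1e-12) + 1e-12)
    return (snr_db > threshold_db).astype(float)

rng = np.random.default_rng(0)
direct = rng.normal(size=(8, 16))           # clean component, freq x time
artifact = 0.1 * rng.normal(size=(8, 16))   # weak reverberant component
mixture = direct + artifact

mask = ideal_binary_mask(direct, artifact)
cleaned = mask * mixture                    # elementwise gain application
print(mask.mean())
```

In practice the mask would be estimated from the mixture alone, with this oracle serving as the training target.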
3. eess.AS (Audio and Speech Processing):
【1】 Enhancing audio quality for expressive Neural Text-to-Speech
Link: https://arxiv.org/abs/2108.06270
Authors: Abdelhamid Ezzerg, Adam Gabrys, Bartosz Putrycz, Daniel Korzekwa, Daniel Saez-Trigueros, David McHardy, Kamil Pokora, Jakub Lachowicz, Jaime Lorenzo-Trueba, Viacheslav Klimkov
Affiliation: Amazon Text-to-Speech Research
Note: 6 pages, 4 figures, 2 tables, SSW 2021
Abstract: Artificial speech synthesis has made a great leap in terms of naturalness, as recent Text-to-Speech (TTS) systems are capable of producing speech with quality similar to human recordings. However, not all speaking styles are easy to model: highly expressive voices remain challenging even for recent TTS architectures, since there seems to be a trade-off between the expressiveness of a generated audio and its signal quality. In this paper, we present a set of techniques that can be leveraged to enhance the signal quality of a highly expressive voice without the use of additional data. The proposed techniques include: tuning the autoregressive loop's granularity during training; using Generative Adversarial Networks in acoustic modelling; and the use of Variational Auto-Encoders in both the acoustic model and the neural vocoder. We show that, when combined, these techniques closed the gap in perceived naturalness between the baseline system and recordings by 39% in terms of MUSHRA scores for an expressive celebrity voice.
【2】 Feature learning for efficient ASR-free keyword spotting in low-resource languages
Link: https://arxiv.org/abs/2108.06174
Authors: Ewald van der Westhuizen, Herman Kamper, Raghav Menon, John Quinn, Thomas Niesler
Affiliations: Department of Electrical and Electronic Engineering, Stellenbosch University, South Africa; UN Global Pulse, Kampala, Uganda
Note: 37 pages, 14 figures, preprint accepted for publication in Computer Speech and Language
Abstract: We consider feature learning for efficient keyword spotting that can be applied in severely under-resourced settings. The objective is to support humanitarian relief programmes by the United Nations in parts of Africa in which almost no language resources are available. For rapid development in such languages, we rely on a small, easily compiled set of isolated keywords. These keyword templates are applied to a large corpus of in-domain but untranscribed speech using dynamic time warping (DTW). The resulting DTW alignment scores are used to train a convolutional neural network (CNN) which is orders of magnitude more computationally efficient and suitable for real-time application. We optimise this neural network keyword spotter by identifying robust acoustic features in this almost zero-resource setting. First, we incorporate information from well-resourced but unrelated languages using a multilingual bottleneck feature (BNF) extractor. Next, we consider features extracted from an autoencoder (AE) trained on in-domain but untranscribed data. Finally, we consider correspondence autoencoder (CAE) features which are fine-tuned on the small set of in-domain labelled data. Experiments in South African English and Luganda, a low-resource language, show that BNF and CAE features achieve a 5% relative performance improvement over baseline MFCCs. However, using BNFs as input to the CAE results in a more than 27% relative improvement over MFCCs in ROC area-under-the-curve (AUC) and more than twice as many top-10 retrievals. We show that, using these features, the CNN-DTW keyword spotter performs almost as well as the DTW keyword spotter while outperforming a baseline CNN trained only on the keyword templates. The CNN-DTW keyword spotter using BNF-derived CAE features represents an efficient approach with competitive performance suited to rapid deployment in a severely under-resourced scenario.
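The DTW alignment score at the heart of the template-matching stage follows the standard dynamic-programming recurrence. A minimal sketch on toy 1-D sequences (real systems compare frame-level acoustic feature vectors; this version is illustrative only):

```python
import numpy as np

def dtw_cost(a, b):
    # classic DTW: D[i, j] = local distance + min of the three predecessors
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])          # local frame distance
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

template = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
stretched = np.array([0.0, 0.0, 1.0, 1.0, 2.0, 2.0, 1.0, 0.0])  # same shape, slower
different = np.array([2.0, 2.0, 2.0, 2.0, 2.0])
print(dtw_cost(template, stretched), dtw_cost(template, different))
```

An exactly time-warped copy of the template aligns at zero cost, while a mismatched sequence accumulates frame distances along the best path; such scores serve as soft targets for the faster CNN.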
【3】 Multilingual training set selection for ASR in under-resourced Malian languages
Link: https://arxiv.org/abs/2108.06164
Authors: Ewald van der Westhuizen, Trideba Padhi, Thomas Niesler
Affiliation: Department of Electrical and Electronic Engineering, Stellenbosch University, Stellenbosch, South Africa
Note: 12 pages, 4 figures, accepted for presentation at SPECOM 2021
Abstract: We present the first speech recognition systems for the two severely under-resourced Malian languages Bambara and Maasina Fulfulde. These systems will be used by the United Nations as part of a monitoring system to inform and support humanitarian programmes in rural Africa. We have compiled datasets in Bambara and Maasina Fulfulde, but since these are very small, we take advantage of six similarly under-resourced datasets in other languages for multilingual training. We focus specifically on the best composition of the multilingual pool of speech data for multilingual training. We find that, although maximising the training pool by including all six additional languages provides improved speech recognition in both target languages, substantially better performance can be achieved by a more judicious choice. Our experiments show that the addition of just one language provides the best performance. For Bambara, this additional language is Maasina Fulfulde, and its introduction leads to a relative word error rate reduction of 6.7%, as opposed to the 2.4% relative reduction achieved when pooling all six additional languages. For Maasina Fulfulde, best performance was achieved when adding only Luganda, leading to a relative word error rate improvement of 9.4% as opposed to a 3.9% relative improvement when pooling all six languages. We conclude that careful selection of the out-of-language data is worthwhile for multilingual training even in highly under-resourced settings, and that the general assumption that more data is better does not always hold.
【4】 Joint Spatio-Temporal Discretisation of Nonlinear Active Cochlear Models
Link: https://arxiv.org/abs/2108.05993
Authors: T. Dang, V. Sethu, E. Ambikairajah, J. Epps, H. Li
Affiliations: Department of Computer Science and Technology, University of Cambridge, UK; School of Electrical Engineering and Telecommunications, University of New South Wales, Australia; Department of Electrical and Computer Engineering
Abstract: Biologically inspired auditory models play an important role in developing effective audio representations that can be tightly integrated into speech and audio processing systems. Current computational models of the cochlea are typically expressed in terms of systems of differential equations and do not directly lend themselves to use in computational speech processing systems. Specifically, these models are spatially discrete and temporally continuous. This paper presents a jointly discretised (spatially and temporally discrete) model of the cochlea which allows for processing at fixed time intervals suited to discrete-time speech and audio processing systems. The proposed model takes into account the active feedback mechanism in the cochlea, a core characteristic lacking in traditional speech processing front-ends, which endows it with significant dynamic range compression capability. This model is derived by jointly discretising an established semi-discretised (spatially discrete and temporally continuous) cochlear model in state-space form. We then demonstrate that the proposed jointly discretised implementation matches the semi-discrete model in terms of its characteristics, and finally present stability analyses of the proposed model.
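Temporal discretisation of a continuous-time state-space model x' = Ax at a fixed audio-rate step T amounts to computing the transition matrix A_d = exp(AT). The toy below (a single damped 100 Hz mode, not the cochlear model; the expm helper is a minimal scaling-and-squaring Taylor approximation written for self-containment) illustrates the idea together with a basic discrete-time stability check:

```python
import numpy as np

A = np.array([[0.0, 1.0],
              [-(2 * np.pi * 100.0) ** 2, -50.0]])  # damped 100 Hz oscillator
T = 1.0 / 16000.0                                   # audio-rate time step

def expm(M, squarings=8, terms=12):
    # scaling-and-squaring matrix exponential via a truncated Taylor series
    S = M / (2 ** squarings)
    E = np.eye(len(M))
    P = np.eye(len(M))
    for k in range(1, terms):
        P = P @ S / k
        E = E + P
    for _ in range(squarings):
        E = E @ E
    return E

Ad = expm(A * T)                      # discrete-time transition matrix
eig = np.abs(np.linalg.eigvals(Ad))
print(eig)                            # all < 1 for a stable (damped) system
```

The magnitudes equal exp(Re(lambda) * T) for the continuous-time eigenvalues lambda, so a damped mode always maps inside the unit circle.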
【5】 W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training
Link: https://arxiv.org/abs/2108.06209
Authors: Yu-An Chung, Yu Zhang, Wei Han, Chung-Cheng Chiu, James Qin, Ruoming Pang, Yonghui Wu
(Cross-listed; see cs.SD 【1】 above for the full abstract.)
【6】 Parameter Tuning of Time-Frequency Masking Algorithms for Reverberant Artifact Removal within the Cochlear Implant Stimulus
Link: https://arxiv.org/abs/2108.05929
Authors: Lidea K. Shahidi, Leslie M. Collins, Boyla O. Mainsah
(Cross-listed; see cs.SD 【3】 above for the full abstract.)
【7】 Improving Music Performance Assessment with Contrastive Learning
Link: https://arxiv.org/abs/2108.01711
Authors: Pavan Seshadri, Alexander Lerch
Affiliation: Center for Music Technology, Georgia Institute of Technology
Note: To appear at the 22nd International Society for Music Information Retrieval Conference, Online, 2021
Abstract: Several automatic approaches for objective music performance assessment (MPA) have been proposed in the past; however, existing systems are not yet capable of reliably predicting ratings with the same accuracy as professional judges. This study investigates contrastive learning as a potential method to improve existing MPA systems. Contrastive learning is a widely used technique in representation learning to learn a structured latent space capable of separately clustering multiple classes. It has been shown to produce state-of-the-art results for image-based classification problems. We introduce a weighted contrastive loss suitable for regression tasks, apply it to a convolutional neural network, and show that the contrastive loss results in performance gains in regression tasks for MPA. Our results show that contrastive-based methods are able to match and exceed SoTA performance for MPA regression tasks by creating better class clusters within the latent space of the neural networks.
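One plausible reading of a "weighted contrastive loss suitable for regression" is a pairwise loss whose attract/repel balance is weighted by the distance between continuous ratings. The paper's exact formulation may differ; the margin value and toy embeddings below are assumptions for illustration only:

```python
import numpy as np

def weighted_contrastive_loss(embeddings, ratings, margin=1.0):
    # pairs with similar ratings are pulled together, dissimilar ones pushed
    # apart, with the label distance |r_i - r_j| (assumed in [0, 1]) as weight
    total, pairs = 0.0, 0
    n = len(ratings)
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(embeddings[i] - embeddings[j])
            w = abs(ratings[i] - ratings[j])
            total += (1 - w) * d ** 2 + w * max(margin - d, 0.0) ** 2
            pairs += 1
    return total / pairs

ratings = np.array([0.0, 0.1, 1.0])   # two similar performances, one different
good = np.array([[0.0, 0.0], [0.05, 0.0], [2.0, 0.0]])  # clustered by rating
bad = np.array([[0.0, 0.0], [2.0, 0.0], [0.05, 0.0]])   # clusters ignore rating

loss_good = weighted_contrastive_loss(good, ratings)
loss_bad = weighted_contrastive_loss(bad, ratings)
print(loss_good, loss_bad)
```

An embedding that clusters items by rating incurs a lower loss than one that ignores the ratings, which is the structured latent space the abstract describes.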