
q-fin (Quantitative Finance): 7 papers
cs.SD (Sound): 5 papers
eess.AS (Audio and Speech Processing): 6 papers
1. q-fin (Quantitative Finance):
【1】 La mujer a través de los personajes femeninos en el cine de temática financiera -- Women through female characters in financial topics films Link: https://arxiv.org/abs/2112.04366
Authors: I. Martín-de-Santos Comments: None Abstract: Analysis and assessment of female characters in films on economic and financial themes: some thirty relevant films, from the beginning of cinema, selected from various bibliographies. A descriptive, comparative and impartial study. Quantitative techniques are applied to measure the influence of women in terms of protagonism, positive values, negative values, and compliance or non-compliance with the Bechdel test, and the results are evaluated. It is based on two previous publications: Marzal et al., The crisis of the real. Representations of the 2008 financial crisis in contemporary audiovisual media (2018), and Martín-de-Santos, Financial administration through cinema. Analytical and critical study of the most interesting films for university education (2021). Realistic film versions are contrasted with actual events today.
【2】 Aproximacion a los estudios sobre la economia en la Segunda Republica espanola hasta 1936 -- Approaches to the economics of the Spanish Second Republic prior to 1936 Link: https://arxiv.org/abs/2112.04332
Authors: I. Martin-de-Santos Comments: None Abstract: Macroeconomic data on the Spanish economy during the Second Republic are not accurate, and interpretations of historical events from the figures obtained are divergent and misleading. Hasty laws were enacted in attempts to resolve social problems arising mainly from deep economic inequalities, but they were often nothing more than declarations of good intentions. Spain suffered in the aftermath of the international economic downturn, which began to be felt at the end of the dictatorship of General Primo de Rivera. Economic policy was developed under the Constitution, but, despite the differences between the first and second bienniums, there was a tendency to maintain the guidelines of the previous stage; in general, sometimes unfairly, it aimed at least to avoid destabilization of the financial system. Nonetheless, it ultimately failed to achieve its goals, mainly because of frequent changes of government amid a social crisis of greater significance that relegated economic issues to the background.
【3】 Mathematical Model of International Trade and Global Economy Link: https://arxiv.org/abs/2112.04297
Authors: N. S. Gonchar, O. P. Dovzhyk, A. S. Zhokhin, W. H. Kozyrski, A. P. Makhort Comments: 39 pages Abstract: This work was partially supported by the Program of Fundamental Research of the Department of Physics and Astronomy of the National Academy of Sciences of Ukraine "Mathematical models of non-equilibrium processes in open systems" N 0120U100857.
【4】 Do fundamentals shape the price response? A critical assessment of linear impact models Link: https://arxiv.org/abs/2112.04245
Authors: Michele Vodret, Iacopo Mastromatteo, Bence Tóth, Michael Benzaquen Abstract: We compare the predictions of the stationary Kyle model, a microfounded multi-step linear price impact model in which market prices forecast fundamentals through information encoded in the order flow, with those of the propagator model, a purely data-driven model in which trades mechanically impact prices with a time-decaying kernel. We find that, remarkably, both models predict exactly the same price dynamics at high frequency, due to the emergence of universality at small time scales. On the other hand, the models disagree on the overall strength of the impact function by a quantity that we are able to relate to the amount of excess volatility in the market. We reveal a crossover between a high-frequency regime in which the market reacts sub-linearly to the signed order flow and a low-frequency regime in which prices respond linearly to order-flow imbalances. Overall, we reconcile results from the market-microstructure literature (sub-linearity in the price response to traded volumes) with those relating to macroeconomically relevant timescales (on which a linear relation is typically assumed).
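The propagator model referred to in the abstract can be sketched numerically. The power-law kernel shape and all parameter values below are illustrative assumptions, not the paper's calibration:

```python
import numpy as np

def propagator_price(signs, G):
    """Price path under a propagator model: each past trade of sign
    signs[s] moves the price at time t by G[t - s] (time-decaying kernel)."""
    T = len(signs)
    p = np.zeros(T)
    for t in range(T):
        # sum the decayed impact of all trades up to and including time t
        p[t] = sum(G[t - s] * signs[s] for s in range(t + 1))
    return p

# illustrative power-law decaying kernel G(l) = (l + 1)^(-1/2)
T = 5
G = (np.arange(T) + 1.0) ** -0.5
signs = np.array([1, 1, -1, 1, -1])  # toy signed order flow
prices = propagator_price(signs, G)
```

The same loop written with a slowly decaying kernel (rather than a delta) is what distinguishes mechanical, transient impact from a permanent-impact model.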
【5】 Global Financial Cycle, Commodity Terms of Trade and Financial Spreads in Emerging Markets and Developing Economies Link: https://arxiv.org/abs/2112.04218
Authors: Jorge Carrera, Gabriel Montes-Rojas, Fernando Toledo Comments: 32 pages, 14 figures, 2 tables Abstract: We study the diffusion of shocks in the global financial cycle and global liquidity conditions to emerging and developing economies. We show that classifying these economies by their external trade patterns (as net exporters or net importers of commodities) allows us to evaluate the relative importance of international monetary spillovers and their impact on the volatility of the domestic financial cycle, i.e., the coefficient of variation of financial spreads and risks. Given the relative importance of commodity trade in the economic structure of these countries, our study reveals that the sign and size of the trade balance in commodity goods are key parameters for rationalizing the impact of global financial and liquidity conditions. Hence, the sign and volume of external commodity trade will define the effect on countries' financial spreads. We implement a two-equation dynamic panel data model for 33 countries during 1999:Q1-2020:Q4 that identifies the effect of global conditions on the countries' commodity terms of trade and financial spreads, first directly, and then through a feedback mechanism by which the terms of trade have an asymmetric additional influence on spreads.
【6】 Sustainability Manifesto for Financial Products: Carbon Equivalence Principle Link: https://arxiv.org/abs/2112.04181
Authors: Chris Kenyon, Mourad Berrahoui, Andrea Macrina Comments: 12 pages, 1 table, 1 figure Abstract: Sustainability is a key concern for financial markets, and the label "Green" is an attempt to address it. Acquiring the "Green" label for a financial product carries potential benefits, hence the label's controversy and attractiveness. However, such a binary label inadequately represents carbon impact (we use carbon as a useful simplification of sustainability). Carbon impact has a range on either side of zero: both carbon emissions and carbon sequestration are possible results of financial products, and a binary label does not allow differentiation between a carbon-neutral investment and a coal power plant. Carbon impact also has timing and duration: a planted forest takes time to grow, and a coal power plant takes time to emit. Hence we propose the Carbon Equivalence Principle (CEP) for financial products: the carbon effect of a financial product shall be included as a linked term sheet compatible with existing bank systems. This can be either a single flow, i.e., a summary carbon flow, or a linked term sheet describing the carbon impacts in volume and time. The CEP means that the carbon impact of an investment follows the money. Making carbon impacts consistent with existing bank systems enables direct alignment of financial product use and sustainability, improving on non-compatible disclosure proposals.
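The linked-term-sheet idea can be sketched as a tiny data structure. The class names, fields, and the forest example below are hypothetical illustrations of the volume-and-time representation, not structures from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class CarbonFlow:
    """One dated carbon flow in tonnes CO2e (negative = sequestration)."""
    year: int
    tonnes: float

@dataclass
class CarbonTermSheet:
    """Hypothetical linked term sheet under the CEP: the product's carbon
    effect in volume and time, carried alongside its financial terms."""
    product_id: str
    flows: list = field(default_factory=list)

    def summary_flow(self):
        # collapse the full schedule into the single summary carbon flow
        return sum(f.tonnes for f in self.flows)

# illustrative example: a planted forest sequesters gradually over years
forest = CarbonTermSheet("FOREST-001",
                         [CarbonFlow(y, -100.0) for y in range(2025, 2030)])
```

The two representations the abstract mentions correspond to `flows` (impacts in volume and time) and `summary_flow()` (a single net flow).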
【7】 Generative Adversarial Network (GAN) and Enhanced Root Mean Square Error (ERMSE): Deep Learning for Stock Price Movement Prediction Link: https://arxiv.org/abs/2112.03946
Authors: Ashish Kumar, Abeer Alsadoon, P. W. C. Prasad, Salma Abdullah, Tarik A. Rashid, Duong Thu Hang Pham, Tran Quoc Vinh Nguyen Comments: 18 pages. Multimed Tools Appl, 2021 Abstract: Predicting the direction of stock price movement is significant in both financial circles and academia. Stock prices contain complex, incomplete, and fuzzy information, which makes predicting their development trend extremely difficult; predicting and analysing financial data is a nonlinear, time-dependent problem. With rapid developments in machine learning and deep learning, this task can be performed more effectively by a purposely designed network. This paper aims to improve prediction accuracy and minimize forecasting error through a deep learning architecture based on Generative Adversarial Networks. We propose a generic model consisting of a Phase-Space Reconstruction (PSR) method for reconstructing the price series and a Generative Adversarial Network (GAN) combining two neural networks, a Long Short-Term Memory (LSTM) network as the generative model and a Convolutional Neural Network (CNN) as the discriminative model, trained adversarially to forecast the stock market. The LSTM generates new instances based on historical basic-indicator information, and the CNN then estimates whether the data was predicted by the LSTM or is real. We found that the GAN performed well on the enhanced root mean square error relative to the LSTM: it was 4.35% more accurate in predicting direction and reduced processing time and RMSE by 78 seconds and 0.029, respectively. The proposed system thus concentrates on minimizing root mean square error and processing time while improving direction-prediction accuracy, and provides better results in stock-index accuracy.
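As a rough illustration of the phase-space reconstruction (PSR) step mentioned above, a standard delay-coordinate embedding looks like the following. The embedding dimension and delay are arbitrary toy choices, not the paper's settings:

```python
import numpy as np

def phase_space_reconstruct(series, dim, tau):
    """Delay-coordinate (Takens-style) embedding: map a scalar series into
    dim-dimensional vectors [x_t, x_{t+tau}, ..., x_{t+(dim-1)*tau}]."""
    n = len(series) - (dim - 1) * tau  # number of complete delay vectors
    return np.column_stack([series[i * tau : i * tau + n] for i in range(dim)])

prices = np.arange(10.0)                 # toy price series 0..9
X = phase_space_reconstruct(prices, dim=3, tau=2)
```

Each row of `X` is one reconstructed state vector; such vectors, rather than raw prices, would be fed to the generator.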
2. cs.SD (Sound):
【1】 Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval Link: https://arxiv.org/abs/2112.04446
Authors: Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Hilde Kuehne Abstract: Multi-modal learning from video data has seen increased attention recently, as it allows training semantically meaningful embeddings without human annotation, enabling tasks like zero-shot retrieval and classification. In this work, we present a multi-modal, modality-agnostic fusion transformer that learns to exchange information between multiple modalities, such as video, audio, and text, and integrate them into a joint multi-modal representation to obtain an embedding that aggregates multi-modal temporal information. We propose to train the system with a combinatorial loss on everything at once, on single modalities as well as pairs of modalities, explicitly leaving out any add-ons such as position or modality encoding. At test time, the resulting model can process and fuse any number of input modalities; moreover, the implicit properties of the transformer allow it to process inputs of different lengths. To evaluate the proposed approach, we train the model on the large-scale HowTo100M dataset and evaluate the resulting embedding space on four challenging benchmark datasets, obtaining state-of-the-art results in zero-shot video retrieval and zero-shot video action localization.
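The "everything at once" idea of scoring every single modality and every modality pair can be sketched as below. Mean fusion and a cosine score against a shared target embedding are illustrative simplifications, not the paper's exact contrastive loss:

```python
from itertools import combinations
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def combinatorial_loss(embeddings, target):
    """Toy combinatorial loss: enumerate every single modality and every
    modality pair, fuse each combination (here by averaging), and pull the
    fused embedding towards a shared target embedding."""
    names = list(embeddings)
    combos = [(n,) for n in names] + list(combinations(names, 2))
    loss = 0.0
    for combo in combos:
        fused = np.mean([embeddings[n] for n in combo], axis=0)
        loss += 1.0 - cosine(fused, target)  # 0 when perfectly aligned
    return loss / len(combos)

emb = {"video": np.array([1.0, 0.0]),
       "audio": np.array([0.0, 1.0]),
       "text":  np.array([1.0, 1.0])}
loss = combinatorial_loss(emb, target=np.array([1.0, 1.0]))
```

With three modalities this enumerates six combinations (three singles, three pairs); no position or modality encoding enters the computation.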
【2】 Training Robust Zero-Shot Voice Conversion Models with Self-supervised Features Link: https://arxiv.org/abs/2112.04424
Authors: Trung Dang, Dung Tran, Peter Chin, Kazuhito Koishida Abstract: Unsupervised zero-shot voice conversion (VC) aims to modify the speaker characteristics of an utterance to match an unseen target speaker without relying on parallel training data. Recently, self-supervised learning of speech representations has been shown to produce useful linguistic units without using transcripts, which can be passed directly to a VC model. In this paper, we show that high-quality audio samples can be achieved by using a length-resampling decoder, which enables the VC model to work in conjunction with different linguistic feature extractors and vocoders without requiring them to operate on the same sequence length. We show that our method outperforms many baselines on the VCTK dataset. Without modifying the architecture, we further demonstrate that a) using pairs of different audio segments from the same speaker, b) adding a cycle-consistency loss, and c) adding a speaker-classification loss can help learn a better speaker embedding. Our model trained on LibriTTS using these techniques achieves the best performance, producing audio samples that transfer well to the target speaker's voice while preserving linguistic content comparable to actual human utterances in terms of Character Error Rate.
【3】 Learning music audio representations via weak language supervision Link: https://arxiv.org/abs/2112.04214
Authors: Ilaria Manco, Emmanouil Benetos, Elio Quinton, Gyorgy Fazekas Comments: 5 pages, 5 figures Abstract: Audio representations for music information retrieval are typically learned via supervised learning in a task-specific fashion. Although effective at producing state-of-the-art results, this scheme lacks flexibility with respect to the range of applications a model can have and requires extensively annotated datasets. In this work, we pose the question of whether it may be possible to exploit weakly aligned text as the only supervisory signal for learning general-purpose music audio representations. To address this question, we design a multimodal architecture for music and language pre-training (MuLaP) optimised via a set of proxy tasks. Weak supervision is provided in the form of noisy natural-language descriptions conveying the overall musical content of the track. After pre-training, we transfer the model's audio backbone to a set of music audio classification and regression tasks. We demonstrate the usefulness of our approach by comparing the performance of audio representations produced by the same audio backbone under different training strategies, and show that our pre-training method consistently achieves comparable or higher scores on all tasks and datasets considered. Our experiments also confirm that MuLaP effectively leverages audio-caption pairs to learn representations that are competitive with audio-only and cross-modal self-supervised methods in the literature.
【4】 Self-Supervised Speaker Verification with Simple Siamese Network and Self-Supervised Regularization Link: https://arxiv.org/abs/2112.04459
Authors: Mufan Sang, Haoqi Li, Fang Liu, Andrew O. Arnold, Li Wan Comments: Submitted to ICASSP 2022 Abstract: Training speaker-discriminative and robust speaker verification systems without speaker labels remains challenging and worth exploring. In this study, we propose an effective self-supervised learning framework and a novel regularization strategy to facilitate self-supervised speaker representation learning. Unlike contrastive self-supervised learning methods, the proposed self-supervised regularization (SSReg) focuses exclusively on the similarity between the latent representations of positive data pairs. We also explore the effectiveness of alternative online data-augmentation strategies in both the time domain and the frequency domain. With our strong online data-augmentation strategy, the proposed SSReg shows the potential of self-supervised learning without negative pairs, and it can significantly improve the performance of self-supervised speaker representation learning with a simple Siamese network architecture. Comprehensive experiments on the VoxCeleb datasets demonstrate that our proposed self-supervised approach obtains a 23.4% relative improvement by adding the effective self-supervised regularization, and outperforms previous works.
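A positive-pair-only Siamese objective of the kind SSReg describes resembles SimSiam-style losses. A minimal numerical sketch follows; the symmetrised negative-cosine form with a stop-gradient on one branch is the generic pattern, not necessarily the paper's exact formulation:

```python
import numpy as np

def neg_cosine(p, z):
    """Negative cosine similarity; during training z would be treated as a
    constant (the stop-gradient of Siamese self-supervised losses)."""
    p = p / np.linalg.norm(p)
    z = z / np.linalg.norm(z)
    return -float(np.dot(p, z))

def ssreg_loss(p1, z1, p2, z2):
    """Symmetrised positive-pair loss: only the two augmented views of the
    SAME speaker are compared; no negative pairs are used anywhere."""
    return 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)

# two augmented views of one utterance: predictor outputs (p) and
# encoder outputs (z); values here are toy 2-d embeddings
p1, z1 = np.array([1.0, 0.0]), np.array([1.0, 0.1])
p2, z2 = np.array([0.9, 0.1]), np.array([1.0, 0.0])
loss = ssreg_loss(p1, z1, p2, z2)
```

The loss is bounded in [-1, 1] and reaches -1 when the two views' embeddings align, which is why architectural tricks like the stop-gradient are needed to avoid collapse.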
【5】 A study on native American English speech recognition by Indian listeners with varying word familiarity level Link: https://arxiv.org/abs/2112.04151
Authors: Abhayjeet Singh, Achuth Rao MV, Rakesh Vaideeswaran, Chiranjeevi Yarra, Prasanta Kumar Ghosh Comments: 6 pages, 5 figures, COCOSDA 2021 Abstract: In this study, listeners of varied Indian nativities are asked to listen to and recognize TIMIT utterances spoken by American speakers. We collect three kinds of responses from each listener as they recognize an utterance: 1. a sentence difficulty rating, 2. a speaker difficulty rating, and 3. a transcription of the utterance. From these transcriptions, word error rate (WER) is calculated and used as a metric to evaluate the similarity between the recognized and the original sentences. The sentences selected in this study are categorized into three groups (Easy, Medium and Hard) based on the frequency of occurrence of the words in them. We observe that the sentence difficulty ratings, speaker difficulty ratings and WERs increase from the easy to the hard category of sentences. We also compare human speech recognition (HSR) performance with that of three automatic speech recognition (ASR) systems under the following combinations of acoustic model (AM) and language model (LM): ASR1) AM trained with recordings from speakers of Indian origin and LM built on TIMIT text, ASR2) AM trained with recordings from native American speakers and LM built on text from the LibriSpeech corpus, and ASR3) AM trained with recordings from native American speakers and LM built on LibriSpeech and TIMIT text. We observe that HSR performance is similar to that of ASR1, whereas ASR3 achieves the best performance. Speaker-nativity-wise analysis shows that utterances from speakers of some nativities are more difficult for Indian listeners to recognize than those of a few other nativities.
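The WER metric used in this study is the standard word-level Levenshtein distance normalised by reference length; a minimal implementation:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length,
    computed with the standard Levenshtein dynamic programme over words."""
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # delete all reference words
    for j in range(len(h) + 1):
        d[0][j] = j                      # insert all hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(r)][len(h)] / len(r)

# one substitution ("your" -> "a") against a 5-word reference: WER 0.2
wer = word_error_rate("she had your dark suit", "she had a dark suit")
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is why it is an error rate rather than an accuracy.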
3. eess.AS (Audio and Speech Processing):
【1】 Self-Supervised Speaker Verification with Simple Siamese Network and Self-Supervised Regularization Link: https://arxiv.org/abs/2112.04459
Authors: Mufan Sang, Haoqi Li, Fang Liu, Andrew O. Arnold, Li Wan Comments: Submitted to ICASSP 2022 Abstract: Training speaker-discriminative and robust speaker verification systems without speaker labels remains challenging and worth exploring. In this study, we propose an effective self-supervised learning framework and a novel regularization strategy to facilitate self-supervised speaker representation learning. Unlike contrastive self-supervised learning methods, the proposed self-supervised regularization (SSReg) focuses exclusively on the similarity between the latent representations of positive data pairs. We also explore the effectiveness of alternative online data-augmentation strategies in both the time domain and the frequency domain. With our strong online data-augmentation strategy, the proposed SSReg shows the potential of self-supervised learning without negative pairs, and it can significantly improve the performance of self-supervised speaker representation learning with a simple Siamese network architecture. Comprehensive experiments on the VoxCeleb datasets demonstrate that our proposed self-supervised approach obtains a 23.4% relative improvement by adding the effective self-supervised regularization, and outperforms previous works.
【2】 A study on native American English speech recognition by Indian listeners with varying word familiarity level Link: https://arxiv.org/abs/2112.04151
Authors: Abhayjeet Singh, Achuth Rao MV, Rakesh Vaideeswaran, Chiranjeevi Yarra, Prasanta Kumar Ghosh Comments: 6 pages, 5 figures, COCOSDA 2021 Abstract: In this study, listeners of varied Indian nativities are asked to listen to and recognize TIMIT utterances spoken by American speakers. We collect three kinds of responses from each listener as they recognize an utterance: 1. a sentence difficulty rating, 2. a speaker difficulty rating, and 3. a transcription of the utterance. From these transcriptions, word error rate (WER) is calculated and used as a metric to evaluate the similarity between the recognized and the original sentences. The sentences selected in this study are categorized into three groups (Easy, Medium and Hard) based on the frequency of occurrence of the words in them. We observe that the sentence difficulty ratings, speaker difficulty ratings and WERs increase from the easy to the hard category of sentences. We also compare human speech recognition (HSR) performance with that of three automatic speech recognition (ASR) systems under the following combinations of acoustic model (AM) and language model (LM): ASR1) AM trained with recordings from speakers of Indian origin and LM built on TIMIT text, ASR2) AM trained with recordings from native American speakers and LM built on text from the LibriSpeech corpus, and ASR3) AM trained with recordings from native American speakers and LM built on LibriSpeech and TIMIT text. We observe that HSR performance is similar to that of ASR1, whereas ASR3 achieves the best performance. Speaker-nativity-wise analysis shows that utterances from speakers of some nativities are more difficult for Indian listeners to recognize than those of a few other nativities.
【3】 Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval Link: https://arxiv.org/abs/2112.04446
Authors: Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Hilde Kuehne Abstract: Multi-modal learning from video data has seen increased attention recently, as it allows training semantically meaningful embeddings without human annotation, enabling tasks like zero-shot retrieval and classification. In this work, we present a multi-modal, modality-agnostic fusion transformer that learns to exchange information between multiple modalities, such as video, audio, and text, and integrate them into a joint multi-modal representation to obtain an embedding that aggregates multi-modal temporal information. We propose to train the system with a combinatorial loss on everything at once, on single modalities as well as pairs of modalities, explicitly leaving out any add-ons such as position or modality encoding. At test time, the resulting model can process and fuse any number of input modalities; moreover, the implicit properties of the transformer allow it to process inputs of different lengths. To evaluate the proposed approach, we train the model on the large-scale HowTo100M dataset and evaluate the resulting embedding space on four challenging benchmark datasets, obtaining state-of-the-art results in zero-shot video retrieval and zero-shot video action localization.
【4】 Audio-Visual Synchronisation in the wild Link: https://arxiv.org/abs/2112.04432
Authors: Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, Andrew Zisserman Abstract: In this paper, we consider the problem of audio-visual synchronisation applied to videos `in the wild' (i.e., of general classes beyond speech). As a new task, we identify and curate a test set with high audio-visual correlation, namely VGG-Sound Sync. We compare a number of transformer-based architectural variants specifically designed to model audio and visual signals of arbitrary length while significantly reducing memory requirements during training. We further conduct an in-depth analysis of the curated dataset and define an evaluation metric for open-domain audio-visual synchronisation. We apply our method to the standard lip-reading speech benchmarks LRS2 and LRS3, with ablations on various aspects. Finally, we set the first benchmark for general audio-visual synchronisation with over 160 diverse classes in the new VGG-Sound Sync video dataset. In all cases, our proposed model outperforms the previous state of the art by a significant margin.
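For intuition about the synchronisation task itself, the lag between an audio and a visual activity signal can be estimated with a classical cross-correlation baseline. This sketch only illustrates the task on synthetic signals; the paper's transformer models are an entirely different and far more capable approach:

```python
import numpy as np

def estimate_offset(audio_feat, visual_feat):
    """Estimate the audio-visual lag (in frames) as the shift maximising
    the cross-correlation of the two (mean-centred) feature tracks."""
    corr = np.correlate(audio_feat - audio_feat.mean(),
                        visual_feat - visual_feat.mean(), mode="full")
    # index len(visual_feat)-1 of the 'full' output corresponds to lag 0
    return int(np.argmax(corr)) - (len(visual_feat) - 1)

rng = np.random.default_rng(0)
v = rng.standard_normal(200)   # synthetic visual activity track
a = np.roll(v, 3)              # audio copy delayed by 3 frames
offset = estimate_offset(a, v)
```

Such a correlation peak is sharp only when the two tracks are strongly related frame-by-frame, which is exactly why curating a high audio-visual-correlation test set matters for this task.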
【5】 Training Robust Zero-Shot Voice Conversion Models with Self-supervised Features Link: https://arxiv.org/abs/2112.04424
Authors: Trung Dang, Dung Tran, Peter Chin, Kazuhito Koishida Abstract: Unsupervised zero-shot voice conversion (VC) aims to modify the speaker characteristics of an utterance to match an unseen target speaker without relying on parallel training data. Recently, self-supervised learning of speech representations has been shown to produce useful linguistic units without using transcripts, which can be passed directly to a VC model. In this paper, we show that high-quality audio samples can be achieved by using a length-resampling decoder, which enables the VC model to work in conjunction with different linguistic feature extractors and vocoders without requiring them to operate on the same sequence length. We show that our method outperforms many baselines on the VCTK dataset. Without modifying the architecture, we further demonstrate that a) using pairs of different audio segments from the same speaker, b) adding a cycle-consistency loss, and c) adding a speaker-classification loss can help learn a better speaker embedding. Our model trained on LibriTTS using these techniques achieves the best performance, producing audio samples that transfer well to the target speaker's voice while preserving linguistic content comparable to actual human utterances in terms of Character Error Rate.
【6】 Learning music audio representations via weak language supervision Link: https://arxiv.org/abs/2112.04214
Authors: Ilaria Manco, Emmanouil Benetos, Elio Quinton, Gyorgy Fazekas Comments: 5 pages, 5 figures Abstract: Audio representations for music information retrieval are typically learned via supervised learning in a task-specific fashion. Although effective at producing state-of-the-art results, this scheme lacks flexibility with respect to the range of applications a model can have and requires extensively annotated datasets. In this work, we pose the question of whether it may be possible to exploit weakly aligned text as the only supervisory signal for learning general-purpose music audio representations. To address this question, we design a multimodal architecture for music and language pre-training (MuLaP) optimised via a set of proxy tasks. Weak supervision is provided in the form of noisy natural-language descriptions conveying the overall musical content of the track. After pre-training, we transfer the model's audio backbone to a set of music audio classification and regression tasks. We demonstrate the usefulness of our approach by comparing the performance of audio representations produced by the same audio backbone under different training strategies, and show that our pre-training method consistently achieves comparable or higher scores on all tasks and datasets considered. Our experiments also confirm that MuLaP effectively leverages audio-caption pairs to learn representations that are competitive with audio-only and cross-modal self-supervised methods in the literature.
Compiled with machine translation; for reference only.