Hello Edge: Keyword Spotting on Microcontrollers

Author: 用户6026865
Published: 2023-03-02 13:23:52

Keyword spotting (KWS) is a critical component for enabling speech-based user interactions on smart devices. It requires real-time response and high accuracy for a good user experience.

Recently, neural networks have become an attractive choice for KWS architectures because of their superior accuracy compared to traditional speech processing algorithms.

Due to its always-on nature, a KWS application has a highly constrained power budget and typically runs on tiny microcontrollers with limited memory and compute capability.

The design of a neural network architecture for KWS must consider these constraints. In this work, we perform neural network architecture evaluation and exploration for running KWS on resource-constrained microcontrollers.

We train various neural network architectures for keyword spotting published in the literature to compare their accuracy and memory/compute requirements.

We show that it is possible to optimize these neural network architectures to fit within the memory and compute constraints of microcontrollers without sacrificing accuracy. We further explore the depthwise separable convolutional neural network (DS-CNN) and compare it against other neural network architectures. DS-CNN achieves an accuracy of 95.4%, which is ~10% higher than the DNN model with a similar number of parameters.

Introduction

Deep learning algorithms have evolved to a stage where they have surpassed human accuracy in a variety of cognitive tasks, including image classification and conversational speech recognition.

Motivated by the recent breakthroughs in deep learning based speech recognition technologies, speech is increasingly becoming a more natural way to interact with consumer electronic devices, for example, Amazon Echo, Google Home and smart phones.

However, always-on speech recognition is not energy-efficient and may also cause network congestion when continuous audio streams are transmitted from billions of these devices to the cloud. Furthermore, such a cloud-based solution adds latency to the application, which hurts user experience.

There are also privacy concerns when audio is continuously transmitted to the cloud. To mitigate these concerns, the devices first detect predefined keyword(s) such as "Alexa", "Ok Google", "Hey Siri", etc., which is commonly known as keyword spotting (KWS). Detection of a keyword wakes up the device and then activates full-scale speech recognition either on device or in the cloud.

In some applications, a sequence of keywords can be used as voice commands to a smart device such as a voice-enabled light bulb. Since the KWS system is always on, it should have very low power consumption to maximize battery life.

On the other hand, the KWS system should detect the keywords with high accuracy and low latency, for the best user experience.

These conflicting system requirements have made KWS an active area of research ever since its inception over 50 years ago. Recently, with the renaissance of artificial neural networks in the form of deep learning algorithms, neural network (NN) based KWS has become very popular.

The low power consumption requirement of keyword spotting systems makes microcontrollers an obvious choice for deploying KWS in an always-on system.

Microcontrollers are low-cost, energy-efficient processors that are ubiquitous in our everyday life, with their presence in a variety of devices ranging from home appliances, automobiles and consumer electronics to wearables.

However, deployment of neural network based KWS on microcontrollers comes with the following challenges:

  • Limited memory footprint: Typical microcontroller systems have only tens to a few hundred KB of memory available. The entire neural network model, including input/output, weights and activations, has to fit within this small memory budget.
  • Limited compute resources: Since KWS is always-on, the real-time requirement limits the total number of operations per neural network inference.
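
The memory constraint above can be turned into a quick back-of-the-envelope check. The sketch below is only illustrative: the function names, the choice of counting weights plus the largest live activation buffer, and the example numbers are all assumptions, not figures from the paper.

```python
def model_footprint_bytes(num_params, max_activation_elems, bytes_per_value=1):
    """Rough footprint: all weights plus the largest live activation buffer.

    bytes_per_value=1 assumes 8-bit weights and activations (an assumption);
    use 4 for 32-bit floating point.
    """
    return (num_params + max_activation_elems) * bytes_per_value


def fits_budget(num_params, max_activation_elems, budget_kb, bytes_per_value=1):
    """Return True if the estimated footprint fits in a budget given in KB."""
    footprint = model_footprint_bytes(num_params, max_activation_elems, bytes_per_value)
    return footprint <= budget_kb * 1024


# Hypothetical model: 70,000 parameters, 8,000 activation values, 80 KB budget.
print(fits_budget(70_000, 8_000, budget_kb=80))                     # 8-bit values
print(fits_budget(70_000, 8_000, budget_kb=80, bytes_per_value=4))  # 32-bit values
```

Under these made-up numbers the same model fits the budget with 8-bit values but not with 32-bit values, which is why the numeric precision matters as much as the parameter count.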

These microcontroller resource constraints, in conjunction with the high accuracy and low latency requirements of KWS, call for a resource-constrained neural network architecture exploration to find lean neural network structures suitable for KWS, which is the primary focus of our work. The main contributions of this work are as follows:

  • We first train the popular KWS neural net models from the literature on the Google speech commands dataset and compare them in terms of accuracy, memory footprint and number of operations per inference.
  • In addition, we implement a new KWS model using depth-wise separable convolutions and point-wise convolutions, inspired by the success of the resource-efficient MobileNet [10] in computer vision. This model outperforms the other prior models in all aspects of accuracy, model size and number of operations.
  • Finally, we perform resource-constrained neural network architecture exploration and present a comprehensive comparison of different network architectures within a set of compute and memory constraints of typical microcontrollers. The code, model definitions and pretrained models are available at https://github.com/ARM-software/ML-KWS-for-MCU.

Background

Keyword Spotting (KWS) System

A typical KWS system consists of a feature extractor and a neural network based classifier as shown in Fig. 1. First, the input speech signal of length L is framed into overlapping frames of length l with a stride s, giving a total of $T = \frac{L - l}{s} + 1$ frames. From each frame, F speech features are extracted, generating a total of T × F features for the entire input speech signal of length L.
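
The framing step can be sketched in a few lines. This is a minimal illustration, not the paper's feature extractor: the function names are hypothetical, floor division is assumed for signals whose length does not divide evenly, and the 1000/40/20 values are example numbers only.

```python
def num_frames(signal_len, frame_len, stride):
    """Number of frames T = (L - l) / s + 1, dropping any partial last frame."""
    if signal_len < frame_len:
        return 0
    return (signal_len - frame_len) // stride + 1


def frame_signal(samples, frame_len, stride):
    """Split a list of samples into overlapping frames of length frame_len."""
    total = num_frames(len(samples), frame_len, stride)
    return [samples[i * stride : i * stride + frame_len] for i in range(total)]


# Example values (assumed): L = 1000, l = 40, s = 20, all in the same unit.
print(num_frames(1000, 40, 20))

# Ten samples, frames of 4 with stride 2: consecutive frames overlap by 2.
print(frame_signal(list(range(10)), frame_len=4, stride=2))
```

With F features per frame, the classifier input is then a T × F matrix, one row per frame.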

Log-mel filter bank energies (LFBE) and Mel-frequency cepstral coefficients (MFCC) are the commonly used human-engineered speech features in deep learning based speech recognition, adapted from traditional speech processing techniques.

Feature extraction using LFBE or MFCC involves translating the time-domain speech signal into a set of frequency-domain spectral coefficients, which enables dimensionality compression of the input signal.

The extracted speech feature matrix is fed into a classifier module, which generates the probabilities for the output classes. In a real-world scenario where keywords need to be identified from a continuous audio stream, a posterior handling module averages the output probabilities of each output class over a period of time, improving the overall confidence of the prediction.
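
The posterior handling step can be pictured as a moving average over the classifier outputs. The sketch below assumes a simple unweighted mean over the last few inferences; the function name and the window length are hypothetical, and the actual smoothing used in a deployed system may differ.

```python
from collections import deque


def smooth_posteriors(prob_stream, window):
    """Yield the per-class mean of the last `window` probability vectors."""
    recent = deque(maxlen=window)  # oldest entry is dropped automatically
    for probs in prob_stream:
        recent.append(probs)
        count = len(recent)
        yield [sum(p[k] for p in recent) / count for k in range(len(probs))]


# Three consecutive classifier outputs for two classes (made-up values).
stream = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
for averaged in smooth_posteriors(stream, window=2):
    print(averaged)
```

A single noisy inference then shifts the averaged score only partially, so a keyword is reported only when the classifier agrees with itself across several consecutive windows.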

Traditional speech recognition technologies for KWS use Hidden Markov Models (HMMs) and Viterbi decoding. While these approaches achieve reasonable accuracies, they are hard to train and are computationally expensive during inference.

Other techniques explored for KWS include discriminative models adopting a large-margin problem formulation or recurrent neural networks (RNN). Although these methods significantly outperform HMM based KWS in terms of accuracy, they suffer from large detection latency.

KWS models using deep neural networks (DNNs) based on fully-connected layers with rectified linear unit (ReLU) activation functions have been introduced, which outperform the HMM models with a very small detection latency.

Furthermore, low-rank approximation techniques are used to compress the DNN model weights, achieving similar accuracy with fewer hardware resources. The main drawback of DNNs is that they ignore the local temporal and spectral correlation in the input speech features.

In order to exploit these correlations, different variants of convolutional neural network (CNN) based KWS have been explored, which demonstrate higher accuracy than DNNs.

The drawback of CNNs in modeling time-varying signals (e.g. speech) is that they ignore long-term temporal dependencies. Combining the strengths of CNNs and RNNs, convolutional recurrent neural network based KWS has been investigated and demonstrates robustness to noise.

While all the prior KWS neural networks are trained with a cross entropy loss function, a max-pooling based loss function for training a KWS model with long short-term memory (LSTM) has been proposed, which achieves better accuracy than the DNNs and LSTMs trained with cross entropy loss.

Although many neural network models for KWS are presented in the literature, it is difficult to make a fair comparison between them as they are all trained and evaluated on different proprietary datasets (e.g. "TalkType" dataset, "Alexa" dataset, etc.) with different input speech features and audio duration.

Also, the primary focus of prior research has been to maximize the accuracy with a small memory footprint model, without explicit constraints of the underlying hardware, such as limits on the number of operations per inference.

In contrast, this work is more hardware-centric and targeted towards neural network architectures that maximize accuracy on microcontroller devices. The constraints on memory and compute significantly limit the neural network parameters and the number of operations.

Microcontroller Systems

A typical microcontroller system consists of a processor core, an on-chip SRAM block and an on-chip embedded flash.

Table 1 shows some commercially available microcontroller development boards with Arm Cortex-M processor cores with different compute capabilities running at different frequencies (16 MHz to 216 MHz), with a wide range of on-chip memory (SRAM: 8 KB to 320 KB; Flash: 128 KB to 1 MB).

The program binary, usually preloaded into the non-volatile flash, is loaded into the SRAM at startup and the processor runs the program with the SRAM as the main data memory. Therefore, the size of the SRAM limits the size of memory that the software can use.

Other than the memory footprint, performance (i.e., operations per second) is also a constraining factor for running neural networks on microcontrollers.

Most microcontrollers are designed for embedded applications with low cost and high energy-efficiency as the primary targets, and do not have high throughput for compute-intensive workloads such as neural networks.

Some microcontrollers have integrated DSP instructions that can be useful for running neural network workloads. For example, Cortex-M4 and Cortex-M7 have integrated SIMD and MAC instructions that can be used to accelerate low-precision computation in neural networks.

Neural Network Architectures for KWS

This section gives an overview of all the different neural network architectures explored in this work, including the deep neural network (DNN), convolutional neural network (CNN), recurrent neural network (RNN), convolutional recurrent neural network (CRNN) and depthwise separable convolutional neural network (DS-CNN).

Deep Neural Network (DNN)

The DNN is a standard feed-forward neural network made of a stack of fully-connected layers and non-linear activation layers. The input to the DNN is the flattened feature matrix, which feeds into a stack of d hidden fully-connected layers, each with n neurons. Typically, each fully-connected layer is followed by a rectified linear unit (ReLU) based activation function. At the output is a linear layer followed by a softmax layer generating the output probabilities of the k keywords, which are used for further posterior handling.
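
The structure described above can be sketched as a small forward pass in plain Python. Everything here is illustrative: the helper names, the random initialization, and the tiny sizes (T = 5, F = 4, d = 2, n = 8, k = 3) are assumptions chosen to keep the example readable, not the configurations evaluated in the paper.

```python
import math
import random


def dense(x, weights, biases):
    """Fully-connected layer: one dot product per output neuron plus bias."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    peak = max(x)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def init_dnn(input_size, d, n, k, seed=0):
    """Build d hidden layers of n neurons and a k-way output layer."""
    rng = random.Random(seed)
    sizes = [input_size] + [n] * d + [k]
    layers = []
    for fan_in, fan_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(fan_in)] for _ in range(fan_out)]
        layers.append((weights, [0.0] * fan_out))
    return layers


def dnn_forward(features, layers):
    """Flatten the T x F feature matrix, apply hidden layers, then softmax."""
    x = [value for frame in features for value in frame]
    for weights, biases in layers[:-1]:
        x = relu(dense(x, weights, biases))
    weights, biases = layers[-1]  # final linear layer, no ReLU
    return softmax(dense(x, weights, biases))


def count_params(layers):
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)


features = [[0.1 * (t + f) for f in range(4)] for t in range(5)]  # T=5, F=4
layers = init_dnn(input_size=5 * 4, d=2, n=8, k=3)
print(dnn_forward(features, layers))
print(count_params(layers))
```

Because the first layer sees the whole flattened T × F matrix, its weight count grows with T · F · n, which is usually where most of a DNN's parameters sit.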

Convolutional Neural Network (CNN)

One main drawback of DNN based KWS is that DNNs fail to efficiently model the local temporal and spectral correlation in the speech features.

CNNs exploit this correlation by treating the input time-domain and spectral-domain features as an image and performing 2-D convolution operations over it.

The convolution layers are typically followed by batch normalization, ReLU based activation functions and optional max/average pooling layers, which reduce the dimensionality of the features. During inference, the parameters of batch normalization can be folded into the weights of the convolution layers. In some cases, a linear low-rank layer, which is simply a fully-connected layer without non-linear activation, is added in between the convolution layers and dense layers for the purpose of reducing parameters and accelerating training.
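
Folding batch normalization into the convolution weights follows from the fact that both operations are affine at inference time. The sketch below shows the standard algebra for a single output channel; the function names and the numeric values are made up for illustration, and a real layer would apply the same scaling per output channel.

```python
import math


def fold_batchnorm(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Return conv weights and bias with batch normalization folded in.

    BN(y) = gamma * (y - mean) / sqrt(var + eps) + beta, with y = w . x + b,
    so w' = w * scale and b' = (b - mean) * scale + beta,
    where scale = gamma / sqrt(var + eps).
    """
    scale = gamma / math.sqrt(var + eps)
    return [w * scale for w in weights], (bias - mean) * scale + beta


def conv_output(patch, weights, bias):
    """One output value of a convolution: dot product over the patch plus bias."""
    return sum(w * x for w, x in zip(weights, patch)) + bias


def batchnorm(y, gamma, beta, mean, var, eps=1e-5):
    return gamma * (y - mean) / math.sqrt(var + eps) + beta


# Made-up values for one output channel and one input patch.
patch = [0.5, -1.0, 2.0]
weights, bias = [0.2, 0.4, -0.1], 0.3
gamma, beta, mean, var = 1.5, -0.2, 0.1, 0.04

separate = batchnorm(conv_output(patch, weights, bias), gamma, beta, mean, var)
folded_w, folded_b = fold_batchnorm(weights, bias, gamma, beta, mean, var)
print(separate, conv_output(patch, folded_w, folded_b))
```

The two printed values agree up to floating-point rounding, so the batch normalization layer costs no extra memory or operations once the model is deployed.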

Recurrent Neural Network (RNN)

RNNs have shown superior performance in many sequence modeling tasks, especially speech recognition [20], language modeling [21], translation [22], etc. RNNs not only exploit the temporal relation between the input signal, but also capture the long-term dependencies using "gating" mechanism

Convolutional Recurrent Neural Network (CRNN)

Convolution recurrent neural network is a hybrid of CNN and RNN, which takes advantages of both. It exploits the local temporal/spatial correlation using convolution layers and global temporal dependencies in the speech features using recurrent layers. As shown in Fig. 3, a CRNN model

Depthwise Separable Convolutional Neural Network (DS-CNN)

Recently, depthwise separable convolution has been proposed as an efficient alternative to the standard 3-D convolution operation [29] and has been used to achieve compact network architectures in the area of computer vision
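
To see why this is cheaper, one can count multiply-accumulate operations (MACs) for a single layer. The sketch below assumes the usual formulation, a per-channel depthwise convolution followed by a 1×1 pointwise convolution; the function names and the layer dimensions are hypothetical and are not taken from the models in this work.

```python
def standard_conv_macs(kh, kw, c_in, c_out, out_h, out_w):
    """MACs for a standard convolution: every output channel sees every input channel."""
    return kh * kw * c_in * c_out * out_h * out_w


def ds_conv_macs(kh, kw, c_in, c_out, out_h, out_w):
    """MACs for a depthwise convolution followed by a 1x1 pointwise convolution."""
    depthwise = kh * kw * c_in * out_h * out_w  # one spatial filter per input channel
    pointwise = c_in * c_out * out_h * out_w    # 1x1 convolution mixes channels
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 64 -> 64 channels, 25x5 output feature map.
standard = standard_conv_macs(3, 3, 64, 64, 25, 5)
separable = ds_conv_macs(3, 3, 64, 64, 25, 5)
print(standard, separable, round(standard / separable, 2))
```

For this example layer the separable version needs roughly an eighth of the operations, and the weight count shrinks by the same ratio, since both are governed by 1/c_out + 1/(kh · kw).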

Experiments and Results

Conclusions

Hardware-optimized neural network architecture is key to getting efficient results on memory- and compute-constrained microcontrollers.

We trained various neural network architectures for keyword spotting published in the literature on the Google speech commands dataset to compare their accuracy and memory requirements vs. operations per inference, from the perspective of deployment on microcontroller systems.

We quantized representative trained 32-bit floating-point KWS models into 8-bit fixed-point versions, demonstrating that these models can easily be quantized for deployment without any loss in accuracy, even without retraining. Furthermore, we trained a new KWS model using depthwise separable convolution layers, inspired by MobileNet.
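
As a rough illustration of what 8-bit fixed-point quantization involves, the sketch below picks a power-of-two scale per tensor and rounds each value to a signed 8-bit integer. This is one common scheme and an assumption here: this excerpt does not describe the exact procedure used, and the function names and weight values are made up.

```python
import math


def choose_frac_bits(values, total_bits=8):
    """Pick the number of fractional bits so the largest magnitude fits."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return total_bits - 1
    int_bits = math.ceil(math.log2(max_abs))  # bits needed left of the binary point
    return total_bits - 1 - int_bits          # one bit is reserved for the sign


def quantize(values, frac_bits, total_bits=8):
    """Scale by 2**frac_bits, round, and clamp to the signed integer range."""
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    scale = 2.0 ** frac_bits
    return [min(hi, max(lo, round(v * scale))) for v in values]


def dequantize(q_values, frac_bits):
    scale = 2.0 ** frac_bits
    return [q / scale for q in q_values]


weights = [0.5, -0.25, 0.9, -1.5]  # made-up floating-point weights
frac_bits = choose_frac_bits(weights)
q_weights = quantize(weights, frac_bits)
print(frac_bits, q_weights, dequantize(q_weights, frac_bits))
```

Each stored value then takes one byte instead of four, and the rounding error per weight is at most half of the quantization step 2^-frac_bits, except for values that hit the clamp.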

Based on typical microcontroller systems, we derived three sets of memory/compute constraints for the neural networks and performed resource-constrained neural network architecture exploration to find the best networks achieving maximum accuracy within these constraints.

In all three sets of memory/compute constraints, the depthwise separable CNN model (DS-CNN) achieves the best accuracies of 94.4%, 94.9% and 95.4% compared to the other model architectures within those constraints, which shows good scalability of the DS-CNN model.

The code, model definitions and pretrained models are available at https://github.com/ARM-software/ML-KWS-for-MCU.

Originally published: 2022-12-29. Shared from the SmellLikeAISpirit WeChat official account.
