
Hello Edge: Keyword Spotting on Microcontrollers

用户6026865
Published 2023-03-02 13:23:52


Keyword spotting (KWS) is a critical component for enabling speech based user interactions on smart devices. It requires real-time response and high accuracy for a good user experience.

Recently, neural networks have become an attractive choice for KWS architecture because of their superior accuracy compared to traditional speech processing algorithms.

Due to its always-on nature, a KWS application has a highly constrained power budget and typically runs on tiny microcontrollers with limited memory and compute capability.

The design of a neural network architecture for KWS must consider these constraints. In this work, we perform neural network architecture evaluation and exploration for running KWS on resource-constrained microcontrollers.

We train various neural network architectures for keyword spotting published in the literature to compare their accuracy and memory/compute requirements.

We show that it is possible to optimize these neural network architectures to fit within the memory and compute constraints of microcontrollers without sacrificing accuracy. We further explore the depthwise separable convolutional neural network (DS-CNN) and compare it against other neural network architectures. DS-CNN achieves an accuracy of 95.4%, which is ~10% higher than the DNN model with a similar number of parameters.

Introduction

Deep learning algorithms have evolved to a stage where they have surpassed human accuracies in a variety of cognitive tasks including image classification and conversational speech recognition.

Motivated by the recent breakthroughs in deep learning based speech recognition technologies, speech is increasingly becoming a more natural way to interact with consumer electronic devices, for example, Amazon Echo, Google Home and smart phones.

However, always-on speech recognition is not energy-efficient and may also cause network congestion to transmit continuous audio stream from billions of these devices to the cloud. Furthermore, such a cloud based solution adds latency to the application, which hurts user experience.

There are also privacy concerns when audio is continuously transmitted to the cloud. To mitigate these concerns, the devices first detect predefined keyword(s) such as "Alexa", "Ok Google", "Hey Siri", etc., which is commonly known as keyword spotting (KWS). Detection of keyword wakes up the device and then activates the full scale speech recognition either on device or in the cloud.

In some applications, the sequence of keywords can be used as voice commands to a smart device such as a voice-enabled light bulb. Since KWS system is always-on, it should have very low power consumption to maximize battery life.

On the other hand, the KWS system should detect the keywords with high accuracy and low latency, for best user experience.

These conflicting system requirements make KWS an active area of research ever since its inception over 50 years ago. Recently, with the renaissance of artificial neural networks in the form of deep learning algorithms, neural network (NN) based KWS has become very popular.

The low power consumption requirement of keyword spotting systems makes microcontrollers an obvious choice for deploying KWS in an always-on system.

Microcontrollers are low-cost, energy-efficient processors that are ubiquitous in our everyday life, with their presence in a variety of devices ranging from home appliances, automobiles and consumer electronics to wearables.

However, deployment of neural network based KWS on microcontrollers comes with the following challenges:

  • Limited memory footprint: Typical microcontroller systems have only tens to a few hundred KB of memory available. The entire neural network model, including input/output, weights and activations, has to fit within this small memory budget.
  • Limited compute resources: Since KWS is always-on, the real-time requirement limits the total number of operations per neural network inference.

These microcontroller resource constraints in conjunction with the high accuracy and low latency requirements of KWS call for a resource-constrained neural network architecture exploration to find lean neural network structures suitable for KWS, which is the primary focus of our work. The main contributions of this work are as follows:

  • We first train the popular KWS neural net models from the literature on Google speech commands dataset and compare them in terms of accuracy, memory footprint and number of operations per inference.
  • In addition, we implement a new KWS model using depth-wise separable convolutions and point-wise convolutions, inspired by the success of resource-efficient MobileNet [10] in computer vision. This model outperforms the other prior models in all aspects of accuracy, model size and number of operations.
  • Finally, we perform resource-constrained neural network architecture exploration and present comprehensive comparison of different network architectures within a set of compute and memory constraints of typical microcontrollers. The code, model definitions and pretrained models are available at https://github.com/ARM-software/ML-KWS-for-MCU.

Background

Keyword Spotting (KWS) System

A typical KWS system consists of a feature extractor and a neural network based classifier as shown in Fig. 1. First, the input speech signal of length L is framed into overlapping frames of length l with a stride s, giving a total of T = (L − l)/s + 1 frames. From each frame, F speech features are extracted, generating a total of T × F features for the entire input speech signal of length L.
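
The frame count T = (L − l)/s + 1 can be checked with a minimal Python sketch. The sample rate, frame length, stride and feature count below are illustrative assumptions, not values stated in this article, and the function name `num_frames` is hypothetical.

```python
def num_frames(signal_len: int, frame_len: int, stride: int) -> int:
    """Number of overlapping frames: T = (L - l) / s + 1, using integer division."""
    if frame_len <= 0 or stride <= 0:
        raise ValueError("frame_len and stride must be positive")
    if signal_len < frame_len:
        return 0  # the signal is too short to hold a single frame
    return (signal_len - frame_len) // stride + 1


# Assumed example: 1 s of audio at 16 kHz, 40 ms frames, 20 ms stride.
signal_len = 16000   # L: samples in the input signal
frame_len = 640      # l: samples per frame (40 ms at 16 kHz)
stride = 320         # s: samples between frame starts (20 ms at 16 kHz)
num_features = 10    # F: assumed number of features per frame

T = num_frames(signal_len, frame_len, stride)
print(T, T * num_features)  # 49 frames, 490 features in total
```

Integer division is used here so that a trailing partial frame is dropped; padding the signal instead would be an equally reasonable choice.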

Log-mel filter bank energies (LFBE) and Mel-frequency cepstral coefficients (MFCC) are the commonly used human-engineered speech features in deep learning based speech-recognition, that are adapted from traditional speech processing techniques.

Feature extraction using LFBE or MFCC involves translating the time-domain speech signal into a set of frequency-domain spectral coefficients, which enables dimensionality compression of the input signal.

The extracted speech feature matrix is fed into a classifier module, which generates the probabilities for the output classes. In a real-world scenario where keywords need to be identified from a continuous audio stream, a posterior handling module averages the output probabilities of each output class over a period of time, improving the overall confidence of the prediction.
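
The posterior handling step can be sketched as a sliding-window average over recent classifier outputs. This is only an illustration of the averaging idea: the class name `PosteriorSmoother`, the window length and the example probabilities are assumptions, not details from the article.

```python
from collections import deque


class PosteriorSmoother:
    """Averages per-class probabilities over the most recent `window` inferences."""

    def __init__(self, num_classes: int, window: int):
        if num_classes <= 0 or window <= 0:
            raise ValueError("num_classes and window must be positive")
        self.num_classes = num_classes
        self.history = deque(maxlen=window)  # the oldest entry is dropped automatically

    def update(self, probs):
        """Add one classifier output and return the averaged probabilities."""
        if len(probs) != self.num_classes:
            raise ValueError("unexpected number of class probabilities")
        self.history.append(list(probs))
        n = len(self.history)
        return [sum(p[c] for p in self.history) / n for c in range(self.num_classes)]


# Assumed example: two classes ("keyword", "other"), averaging over 2 inferences.
smoother = PosteriorSmoother(num_classes=2, window=2)
for frame_probs in ([0.9, 0.1], [0.1, 0.9], [0.3, 0.7]):
    averaged = smoother.update(frame_probs)
    print(averaged)
```

A detection decision would then typically compare the averaged keyword probability against a threshold, so that a single noisy inference does not trigger the device.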

Traditional speech recognition technologies for KWS use Hidden Markov Models (HMMs) and Viterbi decoding. While these approaches achieve reasonable accuracies, they are hard to train and are computationally expensive during inference.

Other techniques explored for KWS include discriminative models adopting a large-margin problem formulation or recurrent neural networks (RNN). Although these methods significantly outperform HMM based KWS in terms of accuracy, they suffer from large detection latency.

KWS models using deep neural networks (DNN) based on fully-connected layers with rectified linear unit (ReLU) activation functions have been introduced, which outperform the HMM models with a very small detection latency.

Furthermore, low-rank approximation techniques are used to compress the DNN model weights, achieving similar accuracy with fewer hardware resources. The main drawback of DNNs is that they ignore the local temporal and spectral correlation in the input speech features.

In order to exploit these correlations, different variants of convolutional neural network (CNN) based KWS have been explored, which demonstrate higher accuracy than DNNs.

The drawback of CNNs in modeling time-varying signals (e.g. speech) is that they ignore long-term temporal dependencies. Combining the strengths of CNNs and RNNs, convolutional recurrent neural network based KWS has been investigated and shown to be robust to noise.

While all the prior KWS neural networks are trained with a cross entropy loss function, a max-pooling based loss function for training a KWS model with long short-term memory (LSTM) has been proposed, which achieves better accuracy than the DNNs and LSTMs trained with cross entropy loss.

Although many neural network models for KWS are presented in literature, it is difficult to make a fair comparison between them as they are all trained and evaluated on different proprietary datasets (e.g. "TalkType" dataset, "Alexa" dataset, etc.) with different input speech features and audio duration.

Also, the primary focus of prior research has been to maximize the accuracy with a small memory footprint model, without explicit constraints of underlying hardware, such as limits on number of operations per inference.

In contrast, this work is more hardware-centric and targeted towards neural network architectures that maximize accuracy on microcontroller devices. The constraints on memory and compute significantly limit the neural network parameters and the number of operations.

Microcontroller Systems

A typical microcontroller system consists of a processor core, an on-chip SRAM block and an on-chip embedded flash.

Table 1 shows some commercially available microcontroller development boards with Arm Cortex-M processor cores with different compute capabilities running at different frequencies (16 MHz to 216 MHz), consisting of a wide range of on-chip memory (SRAM: 8 KB to 320 KB; Flash: 128 KB to 1 MB).

The program binary, usually preloaded into the non-volatile flash, is loaded into the SRAM at startup and the processor runs the program with the SRAM as the main data memory. Therefore, the size of the SRAM limits the size of memory that the software can use.

Other than the memory footprint, performance (i.e., operations per second) is also a constraining factor for running neural networks on microcontrollers.

Most microcontrollers are designed for embedded applications with low cost and high energy-efficiency as the primary targets, and do not have high throughput for compute-intensive workloads such as neural networks.

Some microcontrollers have integrated DSP instructions that can be useful for running neural network workloads. For example, Cortex-M4 and Cortex-M7 have integrated SIMD and MAC instructions that can be used to accelerate low-precision computation in neural networks.

Neural Network Architectures for KWS

This section gives an overview of all the different neural network architectures explored in this work including the deep neural network (DNN), convolutional neural network (CNN), recurrent neural network (RNN), convolutional recurrent neural network (CRNN) and depthwise separable convolutional neural network (DS-CNN).

Deep Neural Network (DNN)

The DNN is a standard feed-forward neural network made of a stack of fully-connected layers and non-linear activation layers. The input to the DNN is the flattened feature matrix, which feeds into a stack of d hidden fully-connected layers each with n neurons. Typically, each fully-connected layer is followed by a rectified linear unit (ReLU) based activation function. At the output is a linear layer followed by a softmax layer generating the output probabilities of the k keywords, which are used for further posterior handling.
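
To relate this structure to the memory and compute limits discussed earlier, the following sketch counts the parameters and multiply-accumulate operations of such a DNN. The layer sizes in the example (490 inputs, d = 3, n = 128, k = 12) and the one-byte-per-weight storage are assumptions chosen for illustration, and `dnn_cost` is a hypothetical helper name.

```python
def dnn_cost(input_size: int, hidden_layers: int, hidden_units: int, num_outputs: int):
    """Return (parameters, multiply-accumulates per inference) for a fully-connected DNN."""
    sizes = [input_size] + [hidden_units] * hidden_layers + [num_outputs]
    # Each fully-connected layer has (inputs x outputs) weights and one bias per output.
    weights = sum(a * b for a, b in zip(sizes, sizes[1:]))
    biases = sum(sizes[1:])
    # One multiply-accumulate per weight for a single inference.
    return weights + biases, weights


# Assumed example: 490 input features, d = 3 hidden layers of n = 128 neurons, k = 12 outputs.
params, macs = dnn_cost(input_size=490, hidden_layers=3, hidden_units=128, num_outputs=12)
print(params, macs)                 # 97420 parameters, 97024 MACs
print(round(params / 1024, 1))      # approximate KB if every parameter takes one byte
```

Because almost every parameter is used exactly once per inference, a DNN's operation count is roughly equal to its weight count, so its cost is dominated by memory rather than compute.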

Convolutional Neural Network (CNN)

One main drawback of DNN based KWS models is that they fail to efficiently model the local temporal and spectral correlation in the speech features.

CNNs exploit this correlation by treating the input time-domain and spectral-domain features as an image and performing 2-D convolution operations over it.

The convolution layers are typically followed by batch normalization, ReLU based activation functions and optional max/average pooling layers, which reduce the dimensionality of the features. During inference, the parameters of batch normalization can be folded into the weights of the convolution layers. In some cases, a linear low-rank layer, which is simply a fully-connected layer without non-linear activation, is added in between the convolution layers and dense layers for the purpose of reducing parameters and accelerating training.
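
Folding batch normalization into the preceding layer relies on both operations being linear at inference time. A minimal sketch, assuming the usual batch-norm form y = gamma · (x − mean) / sqrt(var + eps) + beta applied per output channel; the function names and the toy numbers are made up for illustration.

```python
import math


def fold_batch_norm(weights, biases, gamma, beta, mean, var, eps=1e-5):
    """Fold per-channel batch-norm parameters into layer weights and biases.

    weights: one flat list of weights per output channel.
    All other arguments: one value per output channel.
    """
    folded_w, folded_b = [], []
    for w_c, b_c, g, bt, m, v in zip(weights, biases, gamma, beta, mean, var):
        scale = g / math.sqrt(v + eps)
        folded_w.append([w * scale for w in w_c])   # scale every weight of the channel
        folded_b.append((b_c - m) * scale + bt)     # shift and scale the bias
    return folded_w, folded_b


def channel_output(w_c, b_c, patch):
    """Output of one channel at one position: dot(weights, patch) + bias."""
    return sum(w * x for w, x in zip(w_c, patch)) + b_c


# Toy single-channel check with made-up numbers.
weights, biases = [[0.5, -1.0, 2.0]], [0.1]
gamma, beta, mean, var = [1.5], [0.2], [0.3], [4.0]
patch = [1.0, 2.0, 3.0]
eps = 1e-5

# Reference: convolution output followed by batch normalization.
y = channel_output(weights[0], biases[0], patch)
bn_y = gamma[0] * (y - mean[0]) / math.sqrt(var[0] + eps) + beta[0]

# Folded: a single convolution with adjusted weights and bias.
fw, fb = fold_batch_norm(weights, biases, gamma, beta, mean, var, eps)
folded_y = channel_output(fw[0], fb[0], patch)

print(abs(bn_y - folded_y) < 1e-9)
```

After folding, the batch-norm layer needs no storage or computation of its own at inference time, which matters under a tight memory and operation budget.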

Recurrent Neural Network (RNN)

RNNs have shown superior performance in many sequence modeling tasks, especially speech recognition [20], language modeling [21], translation [22], etc. RNNs not only exploit the temporal relation within the input signal, but also capture the long-term dependencies using a "gating" mechanism.

Convolutional Recurrent Neural Network (CRNN)

Convolutional recurrent neural network is a hybrid of CNN and RNN, which takes advantage of both. It exploits the local temporal/spatial correlation using convolution layers and global temporal dependencies in the speech features using recurrent layers. As shown in Fig. 3, a CRNN model

Depthwise Separable Convolutional Neural Network (DS-CNN)

Recently, depthwise separable convolution has been proposed as an efficient alternative to the standard 3-D convolution operation [29] and has been used to achieve compact network architectures in the area of computer vision.
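
The efficiency gain can be illustrated by counting weights and multiply-accumulates for a standard convolution versus a depthwise convolution followed by a 1×1 point-wise convolution. The kernel size, channel counts and output size below are assumed values for illustration, biases are ignored, and both function names are hypothetical.

```python
def standard_conv_cost(k_h, k_w, c_in, c_out, out_h, out_w):
    """(weights, MACs) of a standard convolution, ignoring biases."""
    weights = k_h * k_w * c_in * c_out
    return weights, weights * out_h * out_w


def ds_conv_cost(k_h, k_w, c_in, c_out, out_h, out_w):
    """(weights, MACs) of a depthwise conv followed by a 1x1 point-wise conv."""
    depthwise = k_h * k_w * c_in   # one 2-D filter per input channel
    pointwise = c_in * c_out       # 1x1 convolution that mixes channels
    weights = depthwise + pointwise
    return weights, weights * out_h * out_w


# Assumed layer: 3x3 kernel, 64 input and 64 output channels, 25x5 output map.
std_w, std_macs = standard_conv_cost(3, 3, 64, 64, 25, 5)
ds_w, ds_macs = ds_conv_cost(3, 3, 64, 64, 25, 5)

print(std_w, ds_w)   # 36864 vs 4672 weights
print(std_macs, ds_macs)
```

For this assumed layer the separable version needs roughly one eighth of the weights and operations, which is why the structure suits devices with a small memory and compute budget.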

Experiments and Results

Conclusions

A hardware-optimized neural network architecture is key to getting efficient results on memory- and compute-constrained microcontrollers.

We trained various neural network architectures for keyword spotting published in the literature on the Google speech commands dataset to compare their accuracy and memory requirements vs. operations per inference, from the perspective of deployment on microcontroller systems.

We quantized representative trained 32-bit floating-point KWS models into 8-bit fixed-point versions, demonstrating that these models can easily be quantized for deployment without any loss in accuracy, even without retraining. Furthermore, we trained a new KWS model using depthwise separable convolution layers, inspired by MobileNet.
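
The article does not describe the quantization procedure at this point, so the following is only a sketch of one common way to map floating-point values to 8-bit fixed point: symmetric scaling by a power of two chosen from the largest magnitude. The scheme, the function names and the sample weights are all assumptions.

```python
import math


def quantize_8bit(values):
    """Quantize floats to signed 8-bit integers with a power-of-two scale.

    Returns (quantized integers, number of fractional bits).
    """
    max_abs = max(abs(v) for v in values)
    # Integer bits needed to cover the largest magnitude; the rest of the
    # 7 non-sign bits are used for the fractional part.
    int_bits = max(0, math.ceil(math.log2(max_abs))) if max_abs > 0 else 0
    frac_bits = 7 - int_bits
    scale = 2 ** frac_bits
    # Round to the nearest step and clip to the signed 8-bit range.
    quantized = [max(-128, min(127, round(v * scale))) for v in values]
    return quantized, frac_bits


def dequantize(quantized, frac_bits):
    """Map 8-bit integers back to floats to inspect the quantization error."""
    return [q / 2 ** frac_bits for q in quantized]


# Assumed sample weights.
weights = [0.5, -0.25, 0.9]
q, frac_bits = quantize_8bit(weights)
print(q, frac_bits)               # [64, -32, 115] with 7 fractional bits
print(dequantize(q, frac_bits))
```

With a power-of-two scale, rescaling between layers reduces to bit shifts, which is cheap on processors without floating-point hardware.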

Based on typical microcontroller systems, we derived three sets of memory/compute constraints for the neural networks and performed resource-constrained neural network architecture exploration to find the best networks achieving maximum accuracy within these constraints.

In all three sets of memory/compute constraints, the depthwise separable CNN model (DS-CNN) achieves the best accuracies of 94.4%, 94.9% and 95.4% compared to the other model architectures within those constraints, which shows good scalability of the DS-CNN model.

The code, model definitions and pretrained models are available at https://github.com/ARM-software/ML-KWS-for-MCU.

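Originally published 2022-12-29. This article is shared from the SmellLikeAISpirit WeChat official account as part of the Tencent Cloud self-media syndication program. For infringement concerns, contact cloudcommunity@tencent.com.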
