A high-quality browser-side streaming text-to-speech (TTS) implementation comes down to three things: keeping a stable, authenticated streaming connection to the synthesis service, decoding the returned audio into AudioBuffer objects, and stitching those buffers together seamlessly for playback in Web Audio. This article lays out a general-purpose front-end streaming TTS approach that does not depend on any particular project, focusing on transferable engineering practices and code patterns.
Many cloud vendors protect their TTS WebSocket endpoints with an HTTP-style signature (e.g., HMAC-SHA256). The common procedure is:
1) Assemble the signature source string from fixed fields such as host, date, and request-line;
2) Compute HMAC-SHA256 over it with the apiSecret, then Base64-encode the result;
3) Append authorization, date, and host to the WS URL as query parameters.
Example (for illustration only):
import CryptoJS from 'crypto-js';

// Build a signed WebSocket URL from the vendor's host/path and the API credentials.
function buildWsUrl({ baseUrl, host, path, apiKey, apiSecret }) {
  // RFC 1123 GMT date string; toUTCString() replaces the deprecated toGMTString().
  const date = new Date().toUTCString();
  const algorithm = 'hmac-sha256';
  const headers = 'host date request-line';
  // Signature source: the fixed fields joined with newlines, in the required order.
  const signatureOrigin = `host: ${host}\ndate: ${date}\nGET ${path} HTTP/1.1`;
  const signatureSha = CryptoJS.HmacSHA256(signatureOrigin, apiSecret);
  const signature = CryptoJS.enc.Base64.stringify(signatureSha);
  const authOrigin = `api_key="${apiKey}", algorithm="${algorithm}", headers="${headers}", signature="${signature}"`;
  // The whole authorization string is itself Base64-encoded and passed as a query parameter.
  const authorization = btoa(authOrigin);
  const url = `${baseUrl}?authorization=${authorization}&date=${encodeURIComponent(date)}&host=${host}`;
  return url.replace('https://', 'wss://');
}
Key points:
- Generate the date with toUTCString() (the standard replacement for the deprecated toGMTString()) or another RFC-compliant date string, to avoid timezone ambiguity.
TTS services usually require the text to be UTF-8 encoded and then Base64 encoded. To improve stability and audio quality:
- Normalize the text, convert it to UTF-8 with TextEncoder, then Base64-encode it.
function encodeUtf8Base64(text) {
  // Trim and collapse whitespace for more stable synthesis.
  const clean = text.trim().replace(/\s+/g, ' ');
  // Encode as UTF-8 bytes, then map the bytes to a binary string for btoa().
  const utf8 = new TextEncoder().encode(clean);
  let binary = '';
  for (const byte of utf8) binary += String.fromCharCode(byte);
  return btoa(binary);
}
For "speak while typing / streaming chat narration" scenarios, chunks that are too small cause frequent connections or too many requests, while chunks that are too large delay the first audible output. A common strategy: flush on a length threshold that grows step by step (a small first chunk for fast first audio, larger chunks afterwards), plus a timer-based flush so trailing text still gets spoken. Pseudocode:
// Flush thresholds grow after each flush: small first chunk for fast first audio,
// larger chunks afterwards to reduce request volume.
const thresholds = [10, 50, 100];
let idx = 0, trigger = thresholds[0];
let buffer = '';
let flushTimer = null;

function onTextChunk(chunk) {
  buffer += chunk;
  if (buffer.length >= trigger) {
    enqueueTts(buffer);                          // hand the segment to the TTS queue
    buffer = '';
    idx = Math.min(idx + 1, thresholds.length - 1);
    trigger = thresholds[idx];                   // raise the threshold for the next flush
  } else {
    scheduleFlush();                             // make sure a short tail still gets flushed
  }
}

function scheduleFlush(delay = 400) {
  if (flushTimer) clearTimeout(flushTimer);
  flushTimer = setTimeout(() => {
    if (buffer) { enqueueTts(buffer); buffer = ''; }
  }, delay);
}
If the server supports multiple concurrent channels, the front end can cap the number of concurrent syntheses (e.g., 2) to balance throughput and latency: use an acquire/release pair to enforce the maximum concurrency. Sketch:
const MAX_CONCURRENCY = 2;
const active = new Set();     // tokens for in-flight syntheses
const waiters = [];           // resolvers waiting for a free slot

function acquire() {
  return new Promise(resolve => {
    if (active.size < MAX_CONCURRENCY) {
      const id = Symbol(); active.add(id); resolve(id);
    } else {
      waiters.push(resolve);  // queue until release() frees a slot
    }
  });
}

function release(id) {
  active.delete(id);
  if (waiters.length) {
    const next = waiters.shift();
    const nid = Symbol(); active.add(nid); next(nid);
  }
}
Robustness points:
- Treat code != 0 as a failure: release the channel early and report the error;
- status === 2 means the server has finished the current synthesis: finalize and resolve.
Exponential backoff retry sketch:
async function withRetries(task, { retries = 3 } = {}) {
  let attempt = 0;
  while (true) {
    try { return await task(); }
    catch (e) {
      if (attempt++ >= retries) throw e;          // give up after the last allowed attempt
      const delay = 2 ** (attempt - 1) * 1000;    // 1s, 2s, 4s, ...
      await new Promise(r => setTimeout(r, delay));
    }
  }
}
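Combining the two, a minimal sketch of a rate-limited, retried synthesis call (synthesizeSegment is a placeholder for the actual WebSocket synthesis function):
async function synthesizeWithLimit(text) {
  const id = await acquire();                                  // wait for a free channel
  try {
    return await withRetries(() => synthesizeSegment(text));   // retry transient failures with backoff
  } finally {
    release(id);                                               // always return the slot
  }
}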
Cloud services often return Base64-encoded mono PCM16 data (16 kHz). For Web Audio it has to be converted to Float32 samples and written into an AudioBuffer:
async function decodePcm16ToBuffer(base64Audio, audioCtx, sampleRate = 16000) {
  // Base64 -> raw bytes.
  const bin = atob(base64Audio);
  const bytes = new Uint8Array(bin.length);
  for (let i = 0; i < bin.length; i++) bytes[i] = bin.charCodeAt(i);
  // Reinterpret the bytes as 16-bit PCM samples (drop a trailing odd byte so the view stays valid).
  const pcm = new Int16Array(bytes.buffer, 0, Math.floor(bytes.length / 2));
  // Normalize to [-1, 1) floats as required by Web Audio.
  const floats = new Float32Array(pcm.length);
  for (let i = 0; i < pcm.length; i++) floats[i] = pcm[i] / 32768;
  const buffer = audioCtx.createBuffer(1, floats.length, sampleRate);
  buffer.getChannelData(0).set(floats);
  return buffer;
}
The key to seamless concatenation is laying the segments out back-to-back on the audio timeline:
function scheduleBuffers(buffers, audioCtx) {
  let t = audioCtx.currentTime;    // schedule the first segment immediately
  const sources = [];
  for (const buf of buffers) {
    const src = audioCtx.createBufferSource();
    src.buffer = buf;
    src.connect(audioCtx.destination);
    src.start(t);                  // each segment starts exactly where the previous one ends
    t += buf.duration;
    sources.push(src);
  }
  return sources;
}
Note: an AudioBufferSourceNode is single-use; once started it cannot be start()ed again. To implement "pause, then resume from the middle", record the playback offset and schedule a new Source from that offset.
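A minimal sketch of offset-based pause/resume for a single buffer (the wrapper name and structure are illustrative, not from any particular library):
function createResumablePlayer(audioCtx, buffer) {
  let source = null;
  let startedAt = 0;   // audioCtx.currentTime when playback last (re)started
  let offset = 0;      // seconds of the buffer already played

  function play() {
    source = audioCtx.createBufferSource();      // sources are one-shot, so build a new one
    source.buffer = buffer;
    source.connect(audioCtx.destination);
    source.start(0, offset);                     // start now, skipping what was already played
    startedAt = audioCtx.currentTime;
  }

  function pause() {
    if (!source) return;
    offset += audioCtx.currentTime - startedAt;  // accumulate elapsed playback time
    source.stop();
    source = null;
  }

  return { play, pause, resume: play };
}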
A good user experience needs reliable controls:
- play: create or resume the AudioContext and start draining the queues;
- pause: stop the current Source and record how long each segment has played (or standardize on a "re-stitch from the start, but with an offset" strategy);
- resume: continue from the recorded offset with a new Source;
- stop: stop everything, clear the queues and buffers, and reset thresholds and timers.
Simplified state diagram:
stopped → playing → paused → (resume) → playing → (stop) → stopped
↘──────────── (stop) ───────────↗
Practical tips:
- stop and release the current Source when switching or stopping playback instead of reusing it;
- handle automatic recovery from AudioContext.state === 'suspended' (browsers only allow sound after a user gesture), e.g. with the snippet below.
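A small pattern for the latter, resuming on the first user gesture (assuming the audioCtx instance from the example further down; the event choice is illustrative):
document.addEventListener('pointerdown', () => {
  if (audioCtx.state === 'suspended') audioCtx.resume();   // sound is only allowed after a user gesture
}, { once: true });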
Two independent but cooperating pipelines work well:
- ttsQueue → synthesize → audioQueue;
- audioQueue → decode → schedule → play.
Loop conditions:
- when ttsQueue is non-empty and nothing is currently being synthesized, take one segment and synthesize it;
- when audioQueue is non-empty and nothing is currently playing, take the next chunk, decode it, and play it;
- drive both loops with lightweight setTimeout polling (avoiding synchronous recursion that could block), as sketched below.
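A minimal sketch of the two pump loops (queue and helper names follow the earlier sections; the flags and polling interval are illustrative):
let synthesizing = false;
let speaking = false;

function pumpTts() {
  if (!synthesizing && ttsQueue.length) {
    synthesizing = true;
    synthesize(ttsQueue.shift())                   // WebSocket synthesis, see above
      .then(chunks => audioQueue.push(chunks))
      .catch(err => console.error('tts failed', err))
      .finally(() => { synthesizing = false; });
  }
  setTimeout(pumpTts, 100);                        // lightweight polling, no synchronous recursion
}

async function pumpAudio() {
  if (!speaking && audioQueue.length) {
    speaking = true;
    const chunks = audioQueue.shift();
    const buffers = [];
    for (const c of chunks) buffers.push(await decodePcm16ToBuffer(c, audioCtx));
    const sources = scheduleBuffers(buffers, audioCtx);
    if (sources.length) {
      sources.at(-1).onended = () => { speaking = false; };
    } else {
      speaking = false;
    }
  }
  setTimeout(pumpAudio, 100);
}

pumpTts(); pumpAudio();                            // start both loops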
Common tunable parameters:
- vcn (the voice);
- speed/volume/pitch;
- aue (e.g., raw vs mp3) and auf (sample rate and bit depth).
In engineering terms, it pays to abstract these as "business config + runtime overrides" so that voice and style can be switched quickly.
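For instance, a minimal sketch of that split (parameter names follow the list above; the concrete values are placeholders and must match your provider's documentation):
const defaultTtsConfig = {
  vcn: 'placeholder-voice',        // voice name: placeholder, use your provider's value
  speed: 50,                       // placeholder ranges; check the provider's docs
  volume: 50,
  pitch: 50,
  aue: 'raw',                      // raw PCM vs 'mp3'
  auf: 'placeholder-sample-rate',  // sample-rate / bit-depth descriptor, provider-specific
};

// Runtime overrides win over the business defaults.
function resolveTtsConfig(overrides = {}) {
  return { ...defaultTtsConfig, ...overrides };
}

// e.g. resolveTtsConfig({ vcn: 'another-voice', speed: 60 })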
To shorten MTTR (mean time to recovery), instrument and log the key points along the chain, including AudioContext state changes.
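For example, a minimal sketch of instrumenting AudioContext state transitions and WebSocket lifecycle events (the log format and labels are illustrative):
function instrumentAudioContext(ctx) {
  // Fires when the context moves between 'running', 'suspended', and 'closed'.
  ctx.onstatechange = () => console.log('[tts] AudioContext state:', ctx.state);
}

function instrumentSocket(ws, label) {
  ws.addEventListener('open',  ()  => console.log(`[tts] ${label} open`));
  ws.addEventListener('error', e   => console.error(`[tts] ${label} error`, e));
  ws.addEventListener('close', e   => console.log(`[tts] ${label} close`, e.code, e.reason));
  return ws;
}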
Putting the pieces together, a simplified end-to-end skeleton:
// 1) Initialization
const audioCtx = new (window.AudioContext || window.webkitAudioContext)();
const ttsQueue = [], audioQueue = [];
let playing = false, paused = false;

// 2) Push text in (supports streaming input)
function pushText(chunk) { /* apply the chunking and flush strategy described above */ }

// 3) Synthesis
async function synthesize(text) {
  const url = buildWsUrl({ /* baseUrl, host, path, apiKey, apiSecret */ });
  const ws = new WebSocket(url);
  return new Promise((resolve, reject) => {
    const audioChunks = [];
    ws.onopen = () => ws.send(JSON.stringify({ /* vendor parameters + encodeUtf8Base64(text) */ }));
    ws.onmessage = e => {
      const msg = JSON.parse(e.data);
      if (msg.code !== 0) return reject(new Error('synthesis failed'));   // code != 0 means failure
      if (msg.data?.audio) audioChunks.push(msg.data.audio);              // accumulate Base64 PCM chunks
      if (msg.data?.status === 2) resolve(audioChunks);                   // status === 2: synthesis finished
    };
    ws.onerror = reject;
  });
}

// 4) Playback
async function playNext() {
  if (playing || paused) return;
  if (!audioQueue.length) return;
  playing = true;
  const base64Chunks = audioQueue.shift();
  const buffers = [];
  for (const c of base64Chunks) buffers.push(await decodePcm16ToBuffer(c, audioCtx));
  const sources = scheduleBuffers(buffers, audioCtx);
  sources.at(-1).onended = () => { playing = false; playNext(); };        // chain to the next queued chunk
}

// 5) Controls
function pause() { /* stop the current sources and record the offset; resume continues with new sources */ }
function resume() { /* rebuild sources from the recorded offset and continue */ }
function stop() { /* clear state, queues, and timers */ }
Achieving a good time to first audio and overall smoothness requires co-designing the text chunking strategy, concurrency control, audio stitching, and the playback state machine. This methodology applies to the vast majority of WebSocket-based cloud TTS services and delivers a "fast first response, stutter-free, controllable" speech playback experience in the browser.