# Speech AI Technology

Speech is the most natural way for humans to interact with machines. Three core capabilities make it work: ASR (speech recognition), TTS (speech synthesis), and realtime conversation.
## The Speech AI Landscape

```mermaid
graph TB
    A[Speech AI] --> B[Speech Recognition ASR]
    A --> C[Speech Synthesis TTS]
    A --> D[Realtime Conversation]
    A --> E[Voice Cloning]
    B --> B1[Whisper]
    B --> B2[Gemini]
    B --> B3[Tencent ASR]
    C --> C1[OpenAI TTS]
    C --> C2[ElevenLabs]
    C --> C3[Edge TTS]
    D --> D1[GPT-4o Realtime]
    D --> D2[Gemini Live]
    E --> E1[ElevenLabs]
    E --> E2[XTTS]
    style A fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
```
## Whisper Speech Recognition

```python
"""
Whisper ASR — speech to text
"""
from dataclasses import dataclass


@dataclass
class TranscriptionResult:
    """A transcription result"""
    text: str
    language: str
    duration_sec: float
    segments: list[dict]


class WhisperASR:
    """Whisper speech recognition"""

    MODELS = {
        "whisper-1 (API)": {
            "provider": "OpenAI",
            "cost": "$0.006/min",
            "languages": "99+ languages",
            "pros": "simple, accurate",
            "cons": "requires network access",
        },
        "whisper-large-v3 (local)": {
            "provider": "open source",
            "cost": "free",
            "languages": "99+ languages",
            "pros": "offline, customizable",
            "cons": "needs a GPU (4GB+)",
        },
        "faster-whisper": {
            "provider": "open source",
            "cost": "free",
            "languages": "99+ languages",
            "pros": "4x faster, 2x less memory",
            "cons": "requires CTranslate2",
        },
    }

    @staticmethod
    def api_usage() -> str:
        """OpenAI Whisper API usage"""
        return """
from openai import OpenAI

client = OpenAI()

# Basic transcription
with open("audio.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
        language="zh",  # specifying the language improves accuracy
        response_format="verbose_json",
        timestamp_granularities=["segment"],
    )
print(transcript.text)

# Timestamped output
for segment in transcript.segments:
    start = segment["start"]
    end = segment["end"]
    text = segment["text"]
    print(f"[{start:.1f}s - {end:.1f}s] {text}")
"""

    @staticmethod
    def local_usage() -> str:
        """Local Whisper usage"""
        return """
# pip install faster-whisper
from faster_whisper import WhisperModel

model = WhisperModel(
    "large-v3",
    device="cuda",           # or "cpu"
    compute_type="float16",  # use float16 on GPU
)
segments, info = model.transcribe(
    "audio.mp3",
    language="zh",
    beam_size=5,
    vad_filter=True,  # filter out silence automatically
)
print(f"Language: {info.language}, probability: {info.language_probability:.0%}")
for segment in segments:
    print(f"[{segment.start:.1f}s - {segment.end:.1f}s] {segment.text}")
"""


# Compare the options
whisper = WhisperASR()
print("=== Whisper options ===")
for name, info in whisper.MODELS.items():
    print(f"\n{name}:")
    for k, v in info.items():
        print(f"  {k}: {v}")
```
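The timestamped segments shown above map directly onto subtitle formats. As a minimal sketch, here is how `(start, end, text)` tuples could be turned into an SRT file; the `srt_timestamp` and `segments_to_srt` helpers are illustrative, not part of any Whisper library:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Convert (start, end, text) segments into SRT subtitle blocks."""
    blocks = []
    for i, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n"
        )
    return "\n".join(blocks)


# Example with hand-written segments (real ones would come from transcription)
srt = segments_to_srt([(0.0, 2.5, "Hello there."), (2.5, 5.0, "This is Whisper.")])
print(srt)
```

With faster-whisper you would feed in `(seg.start, seg.end, seg.text)` from the `segments` iterator.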
## TTS Speech Synthesis

```python
"""
TTS (text-to-speech) options
"""


class TTSManager:
    """TTS manager"""

    SOLUTIONS = {
        "OpenAI TTS": {
            "quality": "⭐⭐⭐⭐",
            "Chinese": "good",
            "cost": "$15/1M characters",
            "latency": "~1s",
            "voices": ["alloy", "echo", "fable", "onyx", "nova", "shimmer"],
            "usage": """
from openai import OpenAI

client = OpenAI()
response = client.audio.speech.create(
    model="tts-1",  # tts-1 is fast, tts-1-hd is higher quality
    voice="nova",
    input="Hello, this is a TTS test.",
    speed=1.0,  # 0.25 - 4.0
)
response.stream_to_file("output.mp3")
""",
        },
        "Edge TTS": {
            "quality": "⭐⭐⭐⭐",
            "Chinese": "excellent",
            "cost": "free",
            "latency": "~0.5s",
            "usage": """
# pip install edge-tts
import edge_tts
import asyncio

async def speak(text, voice="zh-CN-XiaoxiaoNeural"):
    communicate = edge_tts.Communicate(text, voice)
    await communicate.save("output.mp3")

asyncio.run(speak("你好,这是免费的 TTS 方案。"))
# Chinese voices: zh-CN-XiaoxiaoNeural, zh-CN-YunxiNeural
# English voices: en-US-JennyNeural, en-US-GuyNeural
""",
        },
        "ElevenLabs": {
            "quality": "⭐⭐⭐⭐⭐",
            "Chinese": "good",
            "cost": "$0.30/1K characters",
            "latency": "~1.5s",
            "highlights": "voice cloning, emotion control",
        },
    }

    @classmethod
    def recommend(cls, scenario: str) -> str:
        """Recommend a TTS option for a scenario"""
        recs = {
            "lowest cost": "Edge TTS",
            "best quality": "ElevenLabs",
            "general purpose": "OpenAI TTS",
            "Chinese content": "Edge TTS",
            "voice cloning": "ElevenLabs",
        }
        return recs.get(scenario, "OpenAI TTS")


tts = TTSManager()
print("=== TTS options ===")
for name, info in tts.SOLUTIONS.items():
    print(f"\n{name}:")
    print(f"  quality: {info['quality']}, Chinese: {info['Chinese']}, cost: {info['cost']}")
```
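The per-character prices quoted above diverge sharply at scale. A quick cost sketch using the rates from the comparison table (prices change over time; treat the numbers as illustrative, and note Edge TTS is free):

```python
# Price per character in USD, taken from the comparison above (illustrative rates)
PRICE_PER_CHAR = {
    "OpenAI TTS": 15 / 1_000_000,  # $15 per 1M characters
    "ElevenLabs": 0.30 / 1_000,    # $0.30 per 1K characters
    "Edge TTS": 0.0,               # free
}


def tts_cost(provider: str, chars: int) -> float:
    """Estimated cost in USD for synthesizing `chars` characters."""
    return PRICE_PER_CHAR[provider] * chars


# Synthesizing a 100,000-character audiobook chapter:
for provider in PRICE_PER_CHAR:
    print(f"{provider}: ${tts_cost(provider, 100_000):.2f}")
```

At that volume the gap is $1.50 (OpenAI) versus $30.00 (ElevenLabs), which is why the recommendation matrix reserves ElevenLabs for quality-critical or voice-cloning use cases.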
## Realtime Voice Conversation

```python
"""
Realtime voice conversation architectures
"""


class RealtimeVoiceChat:
    """Realtime voice chat"""

    # Architecture comparison
    ARCHITECTURES = {
        "Pipeline": {
            "flow": "ASR → LLM → TTS",
            "latency": "2-5s",
            "cost": "low",
            "pros": "swappable components, mature approach",
            "cons": "high latency, cannot capture tone of voice",
            "best for": "customer support, smart assistants",
        },
        "Native multimodal": {
            "flow": "audio → multimodal model → audio",
            "latency": "0.3-1s",
            "cost": "medium",
            "pros": "low latency, understands tone and emotion",
            "cons": "newer approach, vendor lock-in",
            "examples": "GPT-4o Realtime API, Gemini Live",
            "best for": "realtime conversation, mock interviews",
        },
    }

    @staticmethod
    def pipeline_architecture() -> dict:
        """Pipeline architecture"""
        return {
            "ASR": {
                "choice": "Whisper / faster-whisper",
                "latency": "~0.5s streaming",
            },
            "LLM": {
                "choice": "GPT-4o-mini (streaming)",
                "latency": "TTFT ~0.3s",
            },
            "TTS": {
                "choice": "Edge TTS / OpenAI TTS",
                "latency": "~0.5s",
            },
            "total latency": "~1.5-3s",
        }

    @staticmethod
    def realtime_api_example() -> str:
        """GPT-4o Realtime API example structure"""
        return """
# GPT-4o Realtime API (WebSocket)
# Audio in, audio out — no separate ASR/TTS needed
import websockets
import json

async def realtime_chat():
    url = "wss://api.openai.com/v1/realtime"
    headers = {
        "Authorization": f"Bearer {OPENAI_API_KEY}",  # OPENAI_API_KEY defined elsewhere
        "OpenAI-Beta": "realtime=v1",
    }
    async with websockets.connect(url, extra_headers=headers) as ws:
        # Configure the session
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "model": "gpt-4o-realtime-preview",
                "voice": "alloy",
                "instructions": "You are a Chinese-speaking voice assistant",
            }
        }))
        # Send audio
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": "<base64_audio_data>",
        }))
        # Receive the response (text + audio)
        async for message in ws:
            data = json.loads(message)
            if data["type"] == "response.audio.delta":
                # play the audio chunk
                pass
            elif data["type"] == "response.text.delta":
                print(data["delta"], end="")
"""


# Usage
voice = RealtimeVoiceChat()
print("=== Voice conversation architectures ===")
for name, info in voice.ARCHITECTURES.items():
    print(f"\n{name}:")
    print(f"  flow: {info['flow']}")
    print(f"  latency: {info['latency']}")
    print(f"  best for: {info['best for']}")
```
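The `input_audio_buffer.append` event in the example above carries base64-encoded audio; by default the Realtime API works with 16-bit PCM at 24 kHz, mono. A small helper to turn raw PCM bytes into append-ready event payloads — the 100 ms chunk size is an arbitrary choice for this sketch:

```python
import base64
import json


def pcm_to_append_events(pcm_bytes: bytes, chunk_size: int = 4800) -> list[str]:
    """Split raw PCM16 audio into base64-encoded input_audio_buffer.append events.

    4800 bytes = 100 ms of 16-bit mono audio at 24 kHz (24000 samples/s * 2 bytes * 0.1 s).
    """
    events = []
    for i in range(0, len(pcm_bytes), chunk_size):
        chunk = pcm_bytes[i:i + chunk_size]
        events.append(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        }))
    return events


# One second of silence at 24 kHz, 16-bit mono
silence = b"\x00\x00" * 24_000
events = pcm_to_append_events(silence)
print(f"{len(events)} events")  # prints "10 events"
```

Each string in `events` could be sent over the WebSocket with `await ws.send(event)` in the `realtime_chat` sketch above.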
## Chapter Summary

| Task | Recommended option | Cost |
|---|---|---|
| Speech to text | Whisper API / faster-whisper | $0.006/min / free |
| Text to speech (Chinese) | Edge TTS | free |
| Text to speech (English) | OpenAI TTS | $15/1M chars |
| High-quality TTS | ElevenLabs | $0.30/1K chars |
| Realtime conversation | GPT-4o Realtime API | ~$0.06/min |
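To make the table concrete, here is a rough per-session cost comparison between the pipeline and the Realtime API, using the rates above. The 3,000-character speech volume is an assumed figure, and LLM token cost is deliberately omitted, so treat this as a lower bound for the pipeline:

```python
def pipeline_cost(minutes: float, tts_chars: int) -> float:
    """ASR (Whisper API, $0.006/min) + TTS (OpenAI, $15/1M chars); LLM tokens omitted."""
    return 0.006 * minutes + 15 / 1_000_000 * tts_chars


def realtime_cost(minutes: float) -> float:
    """GPT-4o Realtime at the ~$0.06/min ballpark from the table."""
    return 0.06 * minutes


# A 10-minute conversation where the assistant speaks ~3,000 characters
print(f"Pipeline: ${pipeline_cost(10, 3_000):.3f}")
print(f"Realtime: ${realtime_cost(10):.2f}")
```

Under these assumptions the pipeline comes out several times cheaper; the Realtime API's premium buys sub-second latency and tone awareness.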
Next chapter: intelligent document processing.