# Multimodal Agents

A multimodal agent can process text, image, audio, and video inputs and carry out cross-modal tasks. Vision in particular lets an agent understand what is on screen, operate GUIs, and read documents.
## Layers of Multimodal Capability
```mermaid
graph TB
    A[Multimodal Agent] --> B[Perception layer]
    A --> C[Understanding layer]
    A --> D[Action layer]
    B --> B1[Vision: OCR / object detection]
    B --> B2[Hearing: speech recognition]
    B --> B3[Documents: PDF / table parsing]
    C --> C1[Scene understanding]
    C --> C2[Multimodal reasoning]
    C --> C3[Spatial grounding]
    D --> D1[GUI operations]
    D --> D2[Document generation]
    D --> D3[Voice responses]
    style A fill:#e8eaf6,stroke:#3f51b5,stroke-width:3px
    style C fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    style D fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
```
## Vision Agent Architecture
"""
多模态 Agent 系统
"""
from dataclasses import dataclass, field
from enum import Enum
from abc import ABC, abstractmethod
class ModalityType(Enum):
TEXT = "text"
IMAGE = "image"
AUDIO = "audio"
VIDEO = "video"
DOCUMENT = "document"
@dataclass
class MultimodalInput:
"""多模态输入"""
modality: ModalityType
content: str | bytes # 文本或二进制数据
metadata: dict = field(default_factory=dict)
@dataclass
class ScreenState:
"""屏幕状态"""
screenshot: bytes
elements: list[dict] = field(default_factory=list)
focused_element: str | None = None
url: str = ""
@dataclass
class GUIAction:
"""GUI 操作"""
action_type: str # click, type, scroll, screenshot
target: str # 元素描述或坐标
value: str = "" # 输入内容
class VisionTool(ABC):
"""视觉工具基类"""
@abstractmethod
def process(self, image_data: bytes) -> dict:
pass
class ScreenAnalyzer(VisionTool):
"""屏幕分析器"""
def process(self, image_data: bytes) -> dict:
"""分析屏幕截图,识别 UI 元素"""
# 实际实现会调用视觉模型
return {
"elements": [
{"type": "button", "text": "Submit", "bbox": [100, 200, 200, 240]},
{"type": "input", "text": "", "bbox": [50, 150, 300, 180]},
{"type": "text", "text": "Welcome", "bbox": [50, 50, 200, 80]},
],
"layout": "form_page",
}
class DocumentAnalyzer(VisionTool):
"""文档分析器"""
def process(self, image_data: bytes) -> dict:
"""分析文档图片"""
return {
"text_blocks": [],
"tables": [],
"figures": [],
"document_type": "report",
}
class MultimodalAgent:
"""多模态 Agent"""
def __init__(self, vision_llm, tools: dict[str, VisionTool]):
self.llm = vision_llm
self.tools = tools
self.action_history: list[GUIAction] = []
def perceive(self, inputs: list[MultimodalInput]) -> dict:
"""感知:处理多模态输入"""
perception = {}
for inp in inputs:
if inp.modality == ModalityType.IMAGE:
analyzer = self.tools.get("screen_analyzer")
if analyzer:
perception["screen"] = analyzer.process(inp.content)
elif inp.modality == ModalityType.DOCUMENT:
analyzer = self.tools.get("doc_analyzer")
if analyzer:
perception["document"] = analyzer.process(inp.content)
elif inp.modality == ModalityType.TEXT:
perception["text"] = inp.content
return perception
def plan_gui_actions(self, task: str, screen: ScreenState) -> list[GUIAction]:
"""规划 GUI 操作序列"""
# 分析屏幕
screen_info = self.tools["screen_analyzer"].process(screen.screenshot)
prompt = (
f"任务:{task}\n"
f"屏幕元素:{screen_info['elements']}\n"
f"历史操作:{[a.action_type for a in self.action_history[-5:]]}\n"
f"输出操作序列(JSON 数组)。"
)
result = self.llm.generate(prompt, images=[screen.screenshot])
# 解析动作
actions = self._parse_actions(result)
return actions
def execute_action(self, action: GUIAction, executor) -> ScreenState:
"""执行单个 GUI 操作"""
self.action_history.append(action)
if action.action_type == "click":
executor.click(action.target)
elif action.action_type == "type":
executor.type_text(action.target, action.value)
elif action.action_type == "scroll":
executor.scroll(action.target)
# 获取新屏幕状态
return executor.get_screen_state()
def _parse_actions(self, llm_output: str) -> list[GUIAction]:
"""解析 LLM 输出为操作列表"""
# 简化实现
return [GUIAction(action_type="click", target="submit_button")]
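`_parse_actions` above is only a stub. A more realistic parser has to tolerate model output that wraps the JSON array in prose or a code fence. The sketch below is a minimal, self-contained illustration (it re-declares a matching `GUIAction` dataclass so it runs on its own; `parse_actions` is a hypothetical helper, not part of any specific framework):

```python
import json
import re
from dataclasses import dataclass


@dataclass
class GUIAction:
    action_type: str
    target: str
    value: str = ""


def parse_actions(llm_output: str) -> list[GUIAction]:
    """Extract a JSON array of GUI actions from raw LLM output."""
    # Models often wrap JSON in prose or ```json fences; grab the
    # outermost bracketed span and try to decode it.
    match = re.search(r"\[.*\]", llm_output, re.DOTALL)
    if not match:
        return []
    try:
        raw = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    return [
        GUIAction(
            action_type=item.get("action_type", ""),
            target=item.get("target", ""),
            value=item.get("value", ""),
        )
        for item in raw
        if isinstance(item, dict)
    ]


output = 'Plan:\n```json\n[{"action_type": "type", "target": "#q", "value": "hello"}]\n```'
actions = parse_actions(output)
print(actions[0].action_type)  # prints: type
```

Returning an empty list on malformed output (rather than raising) lets the agent loop re-prompt the model instead of crashing mid-task.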
## Typical Application Scenarios
```mermaid
graph LR
    A[Multimodal agent applications] --> B[Web automation]
    A --> C[Document processing]
    A --> D[GUI testing]
    A --> E[RPA 2.0]
    B --> B1[Form filling<br/>data scraping]
    C --> C1[Invoice recognition<br/>contract review]
    D --> D1[UI regression tests<br/>screenshot diffing]
    E --> E1[Cross-application operations<br/>process automation]
    style A fill:#e8eaf6,stroke:#3f51b5,stroke-width:3px
```
## Multimodal Agent Comparison
| Project | Modalities | GUI control | Open source | Notes |
|---|---|---|---|---|
| Claude Computer Use | Vision + text | ✅ | ❌ | Native screen understanding |
| GPT-4 Vision | Vision + text | ❌ | ❌ | Strong image understanding |
| CogAgent | Vision + text | ✅ | ✅ | Focused on GUI agents |
| WebVoyager | Vision + text | ✅ | ✅ | Web navigation |
| SeeClick | Vision | ✅ | ✅ | Visual grounding for clicks |
## Challenges and Limitations
| Challenge | Current state | Mitigation |
|---|---|---|
| Visual hallucination | UI elements get misread | Verify with repeated screenshots |
| Coordinate drift | Clicks land off-target | Set-of-Mark annotation |
| Long-workflow stability | Multi-step runs fail easily | Checkpoints + rollback |
| Safety risk | The agent controls a real system | Sandboxing + permission controls |
| Latency | Visual reasoning is slow | Mix text-only and vision steps |
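The "checkpoints + rollback" mitigation can be sketched as a driver loop that verifies each step and retries from the last known-good state. This is a minimal illustration under assumed interfaces: `run_step` and `verify` are hypothetical callables standing in for GUI execution and screen verification, not part of any real library:

```python
def run_with_checkpoints(steps, run_step, verify, max_retries=2):
    """Execute steps in order; on failed verification, retry from a checkpoint.

    steps:     list of opaque step descriptions
    run_step:  callable(step, prev_state) -> new state (e.g. a screenshot)
    verify:    callable(step, state) -> bool, did the step succeed?
    """
    state = None
    completed = []
    for step in steps:
        checkpoint = state  # last known-good state
        for _ in range(max_retries + 1):
            state = run_step(step, checkpoint)
            if verify(step, state):
                completed.append(step)
                break
            state = checkpoint  # roll back before retrying
        else:
            raise RuntimeError(f"step failed after retries: {step!r}")
    return completed


# Demo: a flaky "pay" step that fails on its first attempt.
attempts = {"pay": 0}

def run_step(step, prev_state):
    if step == "pay":
        attempts["pay"] += 1
        return "ok" if attempts["pay"] > 1 else "error"
    return "ok"

def verify(step, state):
    return state == "ok"

print(run_with_checkpoints(["login", "pay"], run_step, verify))
# prints: ['login', 'pay']
```

In a real agent, `verify` would typically re-screenshot the UI and ask the vision model whether the expected state was reached, which is exactly the "repeated screenshots" mitigation from the table.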
## Chapter Summary
| Topic | Key points |
|---|---|
| Perceive-understand-act | Three-layer architecture: visual input → multimodal reasoning → GUI execution |
| GUI operation | Screen analysis + action planning + execution + state verification |
| Core challenges | Visual hallucination, coordinate drift, long-workflow stability |
| Applications | Web automation, document processing, GUI testing, RPA |
Next chapter: Agent Technology Trends