# Multimodal Agents

A multimodal agent can process text, image, audio, and video inputs and carry out cross-modal tasks. Vision in particular lets an agent understand what is on screen, operate GUIs, and read documents.
## Layers of Multimodal Capability
```mermaid
graph TB
    A[Multimodal Agent] --> B[Perception layer]
    A --> C[Understanding layer]
    A --> D[Action layer]
    B --> B1[Vision: OCR / object detection]
    B --> B2[Hearing: speech recognition]
    B --> B3[Documents: PDF / table parsing]
    C --> C1[Scene understanding]
    C --> C2[Multimodal reasoning]
    C --> C3[Spatial grounding]
    D --> D1[GUI operations]
    D --> D2[Document generation]
    D --> D3[Voice responses]
    style A fill:#e8eaf6,stroke:#3f51b5,stroke-width:3px
    style C fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    style D fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
```
## Vision Agent Architecture
"""
多模态 Agent 系统
"""
from dataclasses import dataclass, field
from enum import Enum
from abc import ABC, abstractmethod
class ModalityType(Enum):
TEXT = "text"
IMAGE = "image"
AUDIO = "audio"
VIDEO = "video"
DOCUMENT = "document"
@dataclass
class MultimodalInput:
"""多模态输入"""
modality: ModalityType
content: str | bytes # 文本或二进制数据
metadata: dict = field(default_factory=dict)
@dataclass
class ScreenState:
"""屏幕状态"""
screenshot: bytes
elements: list[dict] = field(default_factory=list)
focused_element: str | None = None
url: str = ""
@dataclass
class GUIAction:
"""GUI 操作"""
action_type: str # click, type, scroll, screenshot
target: str # 元素描述或坐标
value: str = "" # 输入内容
class VisionTool(ABC):
"""视觉工具基类"""
@abstractmethod
def process(self, image_data: bytes) -> dict:
pass
class ScreenAnalyzer(VisionTool):
"""屏幕分析器"""
def process(self, image_data: bytes) -> dict:
"""分析屏幕截图,识别 UI 元素"""
# 实际实现会调用视觉模型
return {
"elements": [
{"type": "button", "text": "Submit", "bbox": [100, 200, 200, 240]},
{"type": "input", "text": "", "bbox": [50, 150, 300, 180]},
{"type": "text", "text": "Welcome", "bbox": [50, 50, 200, 80]},
],
"layout": "form_page",
}
class DocumentAnalyzer(VisionTool):
"""文档分析器"""
def process(self, image_data: bytes) -> dict:
"""分析文档图片"""
return {
"text_blocks": [],
"tables": [],
"figures": [],
"document_type": "report",
}
class MultimodalAgent:
"""多模态 Agent"""
def __init__(self, vision_llm, tools: dict[str, VisionTool]):
self.llm = vision_llm
self.tools = tools
self.action_history: list[GUIAction] = []
def perceive(self, inputs: list[MultimodalInput]) -> dict:
"""感知:处理多模态输入"""
perception = {}
for inp in inputs:
if inp.modality == ModalityType.IMAGE:
analyzer = self.tools.get("screen_analyzer")
if analyzer:
perception["screen"] = analyzer.process(inp.content)
elif inp.modality == ModalityType.DOCUMENT:
analyzer = self.tools.get("doc_analyzer")
if analyzer:
perception["document"] = analyzer.process(inp.content)
elif inp.modality == ModalityType.TEXT:
perception["text"] = inp.content
return perception
def plan_gui_actions(self, task: str, screen: ScreenState) -> list[GUIAction]:
"""规划 GUI 操作序列"""
# 分析屏幕
screen_info = self.tools["screen_analyzer"].process(screen.screenshot)
prompt = (
f"任务:{task}\n"
f"屏幕元素:{screen_info['elements']}\n"
f"历史操作:{[a.action_type for a in self.action_history[-5:]]}\n"
f"输出操作序列(JSON 数组)。"
)
result = self.llm.generate(prompt, images=[screen.screenshot])
# 解析动作
actions = self._parse_actions(result)
return actions
def execute_action(self, action: GUIAction, executor) -> ScreenState:
"""执行单个 GUI 操作"""
self.action_history.append(action)
if action.action_type == "click":
executor.click(action.target)
elif action.action_type == "type":
executor.type_text(action.target, action.value)
elif action.action_type == "scroll":
executor.scroll(action.target)
# 获取新屏幕状态
return executor.get_screen_state()
def _parse_actions(self, llm_output: str) -> list[GUIAction]:
"""解析 LLM 输出为操作列表"""
# 简化实现
return [GUIAction(action_type="click", target="submit_button")]
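`_parse_actions` above is only a stub. A more realistic parser has to tolerate model output that wraps the JSON array in prose or a code fence. The sketch below is a minimal, self-contained illustration (it re-declares a matching `GUIAction` dataclass so it runs on its own; `parse_actions` is a hypothetical helper, not part of any specific framework):

```python
import json
import re
from dataclasses import dataclass


@dataclass
class GUIAction:
    action_type: str
    target: str
    value: str = ""


def parse_actions(llm_output: str) -> list[GUIAction]:
    """Extract a JSON array of GUI actions from raw LLM output."""
    # Models often wrap JSON in prose or ```json fences; grab the
    # outermost bracketed span and try to decode it.
    match = re.search(r"\[.*\]", llm_output, re.DOTALL)
    if not match:
        return []
    try:
        raw = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    return [
        GUIAction(
            action_type=item.get("action_type", ""),
            target=item.get("target", ""),
            value=item.get("value", ""),
        )
        for item in raw
        if isinstance(item, dict)
    ]


output = 'Plan:\n```json\n[{"action_type": "type", "target": "#q", "value": "hello"}]\n```'
actions = parse_actions(output)
print(actions[0].action_type)  # prints: type
```

Returning an empty list on malformed output (rather than raising) lets the agent loop re-prompt the model instead of crashing mid-task.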
## Typical Application Scenarios
```mermaid
graph LR
    A[Multimodal agent applications] --> B[Web automation]
    A --> C[Document processing]
    A --> D[GUI testing]
    A --> E[RPA 2.0]
    B --> B1[Form filling<br/>data scraping]
    C --> C1[Invoice recognition<br/>contract review]
    D --> D1[UI regression tests<br/>screenshot diffing]
    E --> E1[Cross-application operations<br/>process automation]
    style A fill:#e8eaf6,stroke:#3f51b5,stroke-width:3px
```
## Multimodal Agent Comparison
| Project | Modalities | GUI control | Open source | Notes |
|---|---|---|---|---|
| Claude Computer Use | Vision + text | ✅ | ❌ | Native screen understanding |
| GPT-4 Vision | Vision + text | ❌ | ❌ | Strong image understanding |
| CogAgent | Vision + text | ✅ | ✅ | Focused on GUI agents |
| WebVoyager | Vision + text | ✅ | ✅ | Web navigation |
| SeeClick | Vision | ✅ | ✅ | Visual grounding for clicks |
## Challenges and Limitations
| Challenge | Current state | Mitigation |
|---|---|---|
| Visual hallucination | UI elements get misread | Verify with repeated screenshots |
| Coordinate drift | Clicks land off-target | Set-of-Mark annotation |
| Long-workflow stability | Multi-step runs fail easily | Checkpoints + rollback |
| Safety risk | The agent controls a real system | Sandboxing + permission controls |
| Latency | Visual reasoning is slow | Mix text-only and vision steps |
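The "checkpoints + rollback" mitigation can be sketched as a driver loop that verifies each step and retries from the last known-good state. This is a minimal illustration under assumed interfaces: `run_step` and `verify` are hypothetical callables standing in for GUI execution and screen verification, not part of any real library:

```python
def run_with_checkpoints(steps, run_step, verify, max_retries=2):
    """Execute steps in order; on failed verification, retry from a checkpoint.

    steps:     list of opaque step descriptions
    run_step:  callable(step, prev_state) -> new state (e.g. a screenshot)
    verify:    callable(step, state) -> bool, did the step succeed?
    """
    state = None
    completed = []
    for step in steps:
        checkpoint = state  # last known-good state
        for _ in range(max_retries + 1):
            state = run_step(step, checkpoint)
            if verify(step, state):
                completed.append(step)
                break
            state = checkpoint  # roll back before retrying
        else:
            raise RuntimeError(f"step failed after retries: {step!r}")
    return completed


# Demo: a flaky "pay" step that fails on its first attempt.
attempts = {"pay": 0}

def run_step(step, prev_state):
    if step == "pay":
        attempts["pay"] += 1
        return "ok" if attempts["pay"] > 1 else "error"
    return "ok"

def verify(step, state):
    return state == "ok"

print(run_with_checkpoints(["login", "pay"], run_step, verify))
# prints: ['login', 'pay']
```

In a real agent, `verify` would typically re-screenshot the UI and ask the vision model whether the expected state was reached, which is exactly the "repeated screenshots" mitigation from the table.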
## Chapter Summary
| Topic | Key points |
|---|---|
| Perceive-understand-act | Three-layer architecture: visual input → multimodal reasoning → GUI execution |
| GUI operation | Screen analysis + action planning + execution + state verification |
| Core challenges | Visual hallucination, coordinate drift, long-workflow stability |
| Applications | Web automation, document processing, GUI testing, RPA |
Next chapter: Agent Technology Trends