nanobot/docs/superpowers/specs/2026-04-26-multi-agent-mailbox-design.md
chengyongru 6daf91c620 feat(mailbox): add multi-agent mailbox channel plugin
File-system-based mailbox channel for inter-agent communication.
Zero modifications to existing code — pure channel plugin.

- MailboxManager: atomic file I/O, registry CRUD, message send/poll
- MailboxConfig: Pydantic config with camelCase aliasing
- MailboxChannel: polls inbox, routes messages via bus, callback routing
- Anti-loop protection: TTL decrement + circular trace detection
- allowFrom access control inherited from BaseChannel
- 29 unit/integration tests + standalone smoke test script

Design: docs/superpowers/specs/2026-04-26-multi-agent-mailbox-design.md
Plan: docs/superpowers/plans/2026-04-26-multi-agent-mailbox.md
2026-04-27 01:14:12 +08:00


# Multi-Agent Mailbox Communication Design
## Overview
Design a simple, zero-side-effect mechanism for multiple independent nanobot agent instances to communicate with each other. The mechanism uses a file-system-based mailbox system implemented as a standard nanobot channel plugin, requiring no modifications to existing code.
## Goals
- Multiple independent agent processes can discover and communicate with each other
- Peer-to-peer (fully connected) topology
- Asynchronous event-driven messaging
- "Boss experience": agents auto-delegate, auto-report progress, user only interacts through their normal channel (e.g., Feishu)
- Zero side effects: only new files added, no modifications to existing code
- Incorporate concepts from Google's A2A protocol (Agent Cards, Task lifecycle, Message Parts)
## Architecture
### Storage: File-System Mailbox
```
~/.nanobot/mailboxes/
├── _registry.json              # Agent Cards (discovery)
├── researcher/
│   ├── inbox/                  # Pending messages
│   │   └── 1745659200_coder.msg.json
│   └── processed/              # Archived messages
│       └── 1745659000_coder.msg.json
└── coder/
    ├── inbox/
    └── processed/
```
Global path `~/.nanobot/mailboxes/` is used because different agents may have different working directories.
### Message File Naming
`{unix_timestamp}_{from_agent_id}.msg.json`
Atomic writes: write to a `.tmp` file first, then `os.rename()` it to the final name, so a reader never sees a half-written message.
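
The write-then-rename pattern can be sketched as below. This is a minimal illustration, not the plugin's actual code: the helper name `atomic_write_json` is hypothetical, and it uses `os.replace()` rather than `os.rename()` as an assumption, because `os.replace()` also overwrites an existing target on Windows (which matters for `_registry.json`).

```python
import json
import os
import tempfile
from pathlib import Path


def atomic_write_json(path: Path, payload: dict) -> None:
    """Write payload to path so that readers never observe a partial file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the destination directory: a rename is only
    # atomic when source and target are on the same filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(payload, fh, ensure_ascii=False)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_name, path)
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Because the temporary name ends in `.tmp`, a poller that only matches `*.msg.json` never picks up an in-flight file.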
### Agent Discovery (A2A Agent Card)
Each agent registers itself in `_registry.json` on startup and updates `last_heartbeat` on every poll cycle.
The registry contains two types of data:
- **Agent identity** (`description`, `capabilities`): configured by the user in the `channels.mailbox` config section. Optional; if not set, other agents can only see `agent_id` and `status`.
- **Runtime state** (`status`, `current_tasks`, `last_heartbeat`): computed automatically by the mailbox channel.
```json
{
  "researcher": {
    "agent_id": "researcher",
    "description": "Agent responsible for information retrieval and analysis",
    "capabilities": ["web_search", "code_analysis", "summarization"],
    "status": "idle",
    "allow_from": ["coder", "writer"],
    "max_concurrent_tasks": 3,
    "current_tasks": ["msg_1745659200_coder"],
    "registered_at": "2026-04-26T10:00:00Z",
    "last_heartbeat": "2026-04-26T10:05:00Z"
  },
  "coder": {
    "agent_id": "coder",
    "description": "Agent responsible for writing and modifying code",
    "capabilities": ["code_write", "test_run", "review"],
    "status": "idle",
    "allow_from": ["*"],
    "max_concurrent_tasks": 3,
    "current_tasks": [],
    "registered_at": "2026-04-26T10:00:30Z",
    "last_heartbeat": "2026-04-26T10:05:30Z"
  }
}
```
### Agent States
| State | Accept New Tasks? | Description |
|-------|-------------------|-------------|
| `idle` | Yes | Available, accepts immediately |
| `busy` | If quota allows | Queues the task while the `current_tasks` count is below `max_concurrent_tasks`; otherwise rejects it |
| `offline` | Messages queue | Messages stay in inbox, processed when agent comes online |
State transitions:
- Startup → `idle`
- Receive task + accept → `busy`
- All tasks completed → `idle`
- Heartbeat timeout → `offline`
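
One way to read these transitions is as a pure function of a registry entry. The sketch below is illustrative only: the function name `effective_status` is hypothetical, and the 30-second timeout is an assumption, since the spec does not state a value.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the spec gives no timeout value; 30 seconds is illustrative.
HEARTBEAT_TIMEOUT = timedelta(seconds=30)


def effective_status(card: dict, now: datetime) -> str:
    """Derive idle / busy / offline from a registry entry (Agent Card)."""
    # Accept the trailing "Z" used in the registry example on any Python 3.
    last = datetime.fromisoformat(card["last_heartbeat"].replace("Z", "+00:00"))
    if now - last > HEARTBEAT_TIMEOUT:
        return "offline"
    # Any in-flight task means busy; an empty list means idle.
    return "busy" if card.get("current_tasks") else "idle"
```

Computing `offline` at read time avoids needing a live process to mark a crashed agent, which is one possible reading of "Heartbeat timeout → `offline`".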
## Communication Protocol
### Message Types
1. **message**: Instant notification, Q&A, chat
2. **task**: Work request requiring execution and result return
3. **task_update**: Status update for a previously sent task
### Task Lifecycle (A2A Task Concept)
```
pending → accepted → working → completed
       \→ rejected
                            \→ failed
```
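
The lifecycle can be expressed as a transition table. This sketch assumes that `rejected` branches off `pending` and `failed` branches off `working`; the names `ALLOWED_TRANSITIONS` and `advance` are hypothetical.

```python
# Assumed reading of the lifecycle diagram: which states may follow which.
ALLOWED_TRANSITIONS = {
    "pending": {"accepted", "rejected"},
    "accepted": {"working"},
    "working": {"completed", "failed"},
    # Terminal states have no successors.
    "completed": set(),
    "rejected": set(),
    "failed": set(),
}


def advance(current: str, new: str) -> str:
    """Return the new state, or raise if the transition is not allowed."""
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal task transition: {current} -> {new}")
    return new
```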
### Message Format
```json
{
  "id": "msg_1745659200_researcher",
  "from": "researcher",
  "to": "coder",
  "timestamp": "2026-04-26T10:00:00Z",
  "type": "task | message | task_update",
  "ttl": 3,
  "trace": ["researcher"],
  "task": {
    "id": "original_task_id",
    "state": "pending | accepted | working | completed | failed | rejected",
    "deadline": "2026-04-26T11:00:00Z"
  },
  "content": {
    "parts": [
      {"type": "text", "text": "..."},
      {"type": "data", "data": {}},
      {"type": "file", "path": "/path/to/file"}
    ]
  },
  "callback": {
    "session_id": "feishu:user_123",
    "channel": "feishu"
  },
  "reply_to": "replied_message_id | null",
  "metadata": {}
}
```
Fields:
- `ttl`: Time-to-live, counted in hops. Decremented on each relay; defaults to 3. At 0, the agent must handle the message itself and cannot delegate it.
- `trace`: List of `agent_id`s the message has passed through. Prevents circular routing.
- `task`: Present for `task` and `task_update` types. Contains the lifecycle state and an optional deadline.
- `content.parts`: Structured content (A2A Message Parts concept). Supports `text`, `data`, and `file` types.
- `callback`: Original session info from the initiating channel. Carried through the task lifecycle so results route back to the correct user conversation.
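
As a rough illustration of the envelope, the sketch below builds a plain `message`. The helper `new_message` is hypothetical; it assumes the `id` follows the `msg_{unix_timestamp}_{from}` shape seen in the example, and that `trace` starts empty because `send_message` appends the sender at send time.

```python
from datetime import datetime, timezone
from typing import Optional


def new_message(from_id: str, to_id: str, text: str,
                callback: Optional[dict] = None,
                now: Optional[datetime] = None) -> dict:
    """Build a plain `message` envelope with a single text part."""
    now = now or datetime.now(timezone.utc)  # expected to be UTC
    return {
        "id": f"msg_{int(now.timestamp())}_{from_id}",
        "from": from_id,
        "to": to_id,
        "timestamp": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "type": "message",
        "ttl": 3,          # default hop budget from the spec
        "trace": [],       # assumption: filled in when the message is sent
        "content": {"parts": [{"type": "text", "text": text}]},
        "callback": callback,
        "reply_to": None,
        "metadata": {},
    }
```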
### Task Acceptance Criteria
All conditions must be met for an agent to accept a task:
1. Sender is in `allow_from` list (`"*"` = accept all)
2. `current_tasks` count < `max_concurrent_tasks`
3. `deadline` has not expired (if present)
4. LLM judges it has the capability to complete the task
Decision outcomes:
- Accept → reply `task_update {state: "accepted"}`, add to `current_tasks`
- Reject → reply `task_update {state: "rejected"}` with reason
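
The four criteria could be checked in order, cheapest first, along these lines. This is a sketch under assumptions: the function name `decide`, the reason strings, and the `can_do` callable (standing in for the LLM's capability judgment) are all hypothetical.

```python
from datetime import datetime, timezone
from typing import Callable, Optional, Tuple


def decide(card: dict, msg: dict, now: datetime,
           can_do: Callable[[dict], bool]) -> Tuple[str, Optional[str]]:
    """Return ("accepted", None) or ("rejected", reason)."""
    # 1. Sender must be allowed ("*" accepts everyone).
    allow = card.get("allow_from", [])
    if "*" not in allow and msg["from"] not in allow:
        return "rejected", "unauthorized"
    # 2. There must be a free task slot.
    if len(card.get("current_tasks", [])) >= card.get("max_concurrent_tasks", 3):
        return "rejected", "at capacity"
    # 3. The deadline, if present, must still be in the future.
    deadline = (msg.get("task") or {}).get("deadline")
    if deadline and datetime.fromisoformat(deadline.replace("Z", "+00:00")) <= now:
        return "rejected", "deadline expired"
    # 4. Capability check, delegated to the caller (the LLM in the spec).
    if not can_do(msg):
        return "rejected", "not capable"
    return "accepted", None
```

Running the LLM judgment last keeps the expensive call off the path for messages that would be rejected anyway.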
### Anti-Loop Mechanism
Two fields prevent infinite agent-to-agent conversations:
1. **TTL**: Decremented on each relay. At 0, no further delegation is allowed. Default = 3 (max 3 hops: A→B→C→D).
2. **Trace**: Each relay appends the sender's `agent_id`. Forwarding to any agent already in the trace is rejected.
| Scenario | Protection |
|----------|-----------|
| A↔B mutual ping | Trace: B sees A in trace, rejects |
| A→B→C→A cycle | Trace: C sees A in trace, rejects |
| A→B→C→D→... infinite chain | TTL: reaches 0, stops |
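
The three scenarios in the table can be replayed with a small stand-alone function. The name `relay` is hypothetical; it applies the same two checks as `MailboxChannel.send_message`, assuming a missing `ttl` means the default of 3.

```python
DEFAULT_TTL = 3


def relay(msg: dict, sender: str, target: str) -> bool:
    """Apply the anti-loop rules; mutate msg and return True if the hop is allowed."""
    trace = msg.setdefault("trace", [])
    if target in trace:
        return False  # circular route: target has already handled this message
    ttl = msg.get("ttl", DEFAULT_TTL)
    if ttl <= 0:
        return False  # hop budget exhausted
    msg["ttl"] = ttl - 1
    trace.append(sender)
    return True


# Infinite chain: three hops succeed, the fourth is blocked by TTL.
chain = {}
hops = [relay(chain, a, b) for a, b in [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]]
```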
### User Experience: End-to-End Flow
The user never interacts with mailbox directly. They communicate through their normal channel (Feishu, WeChat, etc.). The mailbox is invisible infrastructure.
```
User (Feishu)                    Agent A (researcher)           Agent B (coder)
│                                │                              │
│ "Write me a sort function"     │                              │
│ ─────────────────────────────→ │                              │
│                                │ LLM decides to delegate      │
│                                │                              │
│ "I've asked coder to do it;    │                              │
│  I'll notify you when done"    │                              │
│ ←───────────────────────────── │                              │
│                                │                              │
│          ...time passes...     │                              │
│                                │                              │
│                                │ task {                       │
│                                │   callback: {                │
│                                │     session_id: "feishu:user_123",
│                                │     channel: "feishu"        │
│                                │   }                          │
│                                │ }                            │
│                                │ ───────────────────────────→ │
│                                │                              │ B processes...
│                                │                              │
│                                │ task_update {completed}      │
│                                │ ←─────────────────────────── │
│                                │                              │
│  MailboxChannel polls task_update                             │
│  Routes to session "feishu:user_123"                          │
│  LLM sees result in original conversation context             │
│                                │                              │
│ "Sort function is done:        │                              │
│  sort_by_mtime()..."           │                              │
│ ←───────────────────────────── │                              │
```
The `callback` field carries the original channel session info through the entire task lifecycle. When the task_update arrives at Agent A's mailbox, the MailboxChannel restores the original `session_id` and `channel`, so the AgentLoop processes it in the correct conversation context and the LLM naturally responds to the user via Feishu.
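
On the receiving side, this implies the reply must echo `callback` back unchanged. A possible sketch follows; `build_task_update` is a hypothetical helper, and giving the reply a fresh `ttl` and empty `trace` is an assumption (reusing the task's trace would make the reply to the original sender look like a circular route).

```python
def build_task_update(original: dict, state: str, text: str) -> dict:
    """Build a task_update reply that routes back to the initiating session."""
    task_id = (original.get("task") or {}).get("id", original["id"])
    return {
        "from": original["to"],
        "to": original["from"],
        "type": "task_update",
        # Assumption: a reply is a new message, so it gets a fresh hop budget.
        "ttl": 3,
        "trace": [],
        "task": {"id": task_id, "state": state},
        "content": {"parts": [{"type": "text", "text": text}]},
        # Echo the callback untouched so the sender can restore the session.
        "callback": original.get("callback"),
        "reply_to": original["id"],
    }
```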
### Error Scenarios
| Scenario | Handling |
|----------|----------|
| Target agent offline | Messages queue in inbox; processed when agent comes online |
| Heartbeat timeout | Registry marks agent `offline`; visible to other agents on next registry read |
| Agent crash during task | No `completed`/`failed` sent; sender can use `deadline` to detect timeout |
| Task rejected | Sender receives `rejected` with reason; decides next action |
| Task deadline expired | Receiver checks deadline on processing; rejects if expired |
| allow_from mismatch | Receiver discards message; optionally replies `rejected` with "unauthorized" |
| Registry concurrent write | Atomic writes (.tmp rename); each agent only writes its own entry |
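
For the crash scenario, the sender has to notice the silence itself. A minimal sketch, assuming the sender keeps a hypothetical `pending` map of task id to ISO-8601 deadline for tasks it has delegated:

```python
from datetime import datetime, timezone


def overdue_tasks(pending: dict, now: datetime) -> list:
    """Return ids of delegated tasks whose deadline has passed without a result."""
    overdue = []
    for task_id, deadline in pending.items():
        due = datetime.fromisoformat(deadline.replace("Z", "+00:00"))
        if now > due:
            overdue.append(task_id)
    return overdue
```

What the sender does with an overdue task (retry, re-delegate, report to the user) is left open by the spec.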
## Implementation
### New Components
| Component | Type | Modifies Existing Code |
|-----------|------|----------------------|
| `MailboxManager` | New file (`nanobot/channels/mailbox_manager.py`) | No |
| `MailboxChannel` | New file (`nanobot/channels/mailbox.py`) | No |
| `MailboxConfig` | Pydantic model inside mailbox.py | No |
| Config section `channels.mailbox` | User config.json | No (additive) |
### MailboxManager
Low-level file operations with no nanobot dependencies:
```python
from pathlib import Path


class MailboxManager:
    def __init__(self, mailboxes_root: Path): ...

    def register(self, agent_id: str, card: dict) -> None:
        """Atomic write to registry (.tmp → rename)"""

    def heartbeat(self, agent_id: str) -> None:
        """Update last_heartbeat in registry"""

    def update_status(self, agent_id: str, status: str, current_tasks: list[str] | None = None) -> None:
        """Update agent status and current task list"""

    def send(self, from_id: str, to_id: str, msg: dict) -> None:
        """Atomic write to to_id/inbox/{timestamp}_{from_id}.msg.json"""

    def poll(self, agent_id: str) -> list[dict]:
        """Scan inbox/, return new messages sorted by timestamp"""

    def mark_processed(self, agent_id: str, filename: str) -> None:
        """Move from inbox/ to processed/"""

    def list_online_agents(self) -> list[dict]:
        """Read all online agents from registry"""

    def get_agent(self, agent_id: str) -> dict | None:
        """Read single agent info from registry"""
```
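
To make the interface concrete, here is one possible shape for the `send` / `poll` / `mark_processed` trio, written as free functions over a root directory. It is a sketch, not the shipped implementation; attaching the file name under a `filename` key is inferred from how `_poll_loop` calls `mark_processed`.

```python
import json
import os
import time
from pathlib import Path


def send(root: Path, from_id: str, to_id: str, msg: dict) -> Path:
    """Atomically drop msg into to_id's inbox."""
    inbox = root / to_id / "inbox"
    inbox.mkdir(parents=True, exist_ok=True)
    # Note: with second-resolution names, two messages from the same sender
    # in the same second would collide; a real implementation may need more.
    final = inbox / f"{int(time.time())}_{from_id}.msg.json"
    tmp = final.with_name(final.name + ".tmp")
    tmp.write_text(json.dumps(msg, ensure_ascii=False), encoding="utf-8")
    os.replace(tmp, final)
    return final


def poll(root: Path, agent_id: str) -> list:
    """Return inbox messages, oldest first, each tagged with its file name."""
    inbox = root / agent_id / "inbox"
    if not inbox.is_dir():
        return []
    # Only *.msg.json matches, so in-flight *.tmp files are skipped.
    files = sorted(inbox.glob("*.msg.json"),
                   key=lambda p: int(p.name.split("_", 1)[0]))
    messages = []
    for path in files:
        msg = json.loads(path.read_text(encoding="utf-8"))
        msg["filename"] = path.name
        messages.append(msg)
    return messages


def mark_processed(root: Path, agent_id: str, filename: str) -> None:
    """Move a handled message from inbox/ to processed/."""
    processed = root / agent_id / "processed"
    processed.mkdir(parents=True, exist_ok=True)
    os.replace(root / agent_id / "inbox" / filename, processed / filename)
```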
### MailboxChannel
Follows existing channel pattern (like telegram.py, feishu.py):
```python
import asyncio
from datetime import datetime, timezone
from pathlib import Path
from typing import Any

# BaseChannel, MessageBus, InboundMessage and logger come from the existing
# nanobot codebase; MailboxManager is defined in mailbox_manager.py.


class MailboxChannel(BaseChannel):
    def __init__(self, config: Any, bus: MessageBus):
        if isinstance(config, dict):
            config = MailboxConfig.model_validate(config)
        super().__init__(config, bus)
        self.config: MailboxConfig = config
        self.manager = MailboxManager(Path(self.config.mailboxes_root).expanduser())
        self._running = False
        self._poll_task: asyncio.Task | None = None

    def _build_card(self) -> dict:
        """Build Agent Card from mailbox config. Identity fields are user-configured."""
        now = datetime.now(timezone.utc).isoformat()
        return {
            "agent_id": self.config.agent_id,
            "description": self.config.description or "",
            "capabilities": self.config.capabilities or [],
            "status": "idle",
            "allow_from": self.config.allow_from,
            "max_concurrent_tasks": self.config.max_concurrent_tasks,
            "current_tasks": [],
            "registered_at": now,
            "last_heartbeat": now,
        }

    async def start(self) -> None:
        """Register agent + start poll loop"""
        self.manager.register(self.config.agent_id, self._build_card())
        self._running = True
        # Keep a reference so the task is not garbage-collected while running.
        self._poll_task = asyncio.create_task(self._poll_loop())

    async def _poll_loop(self) -> None:
        """Periodically scan inbox, inject new messages into bus"""
        while self._running:
            messages = self.manager.poll(self.config.agent_id)
            for msg in messages:
                inbound = self._to_inbound(msg)
                await self.bus.publish_inbound(inbound)
                self.manager.mark_processed(self.config.agent_id, msg["filename"])
            self.manager.heartbeat(self.config.agent_id)
            await asyncio.sleep(self.config.poll_interval)

    async def send_message(self, to_id: str, msg: dict) -> None:
        """Send message to another agent's mailbox"""
        # Anti-loop checks
        if to_id in msg.get("trace", []):
            logger.warning(f"Rejecting circular route: {to_id} already in trace")
            return
        # Use one default for both the check and the decrement, so a message
        # without an explicit ttl gets the documented default of 3.
        ttl = msg.get("ttl", 3)
        if ttl <= 0:
            logger.warning("TTL exhausted, cannot forward")
            return
        msg["ttl"] = ttl - 1
        msg.setdefault("trace", []).append(self.config.agent_id)
        self.manager.send(self.config.agent_id, to_id, msg)

    def _to_inbound(self, msg: dict) -> InboundMessage:
        """Convert mailbox JSON to standard InboundMessage"""
        # `callback` may be missing or null, so fall back to an empty dict.
        callback = msg.get("callback") or {}
        return InboundMessage(
            channel=callback.get("channel", "mailbox"),
            sender=msg["from"],
            content=self._extract_text(msg),
            session_id=callback.get("session_id") or f"mailbox:{msg['from']}",
            metadata={
                "mailbox_type": msg["type"],
                "mailbox_task": msg.get("task"),
                "mailbox_parts": msg.get("content", {}).get("parts"),
                "mailbox_ttl": msg.get("ttl"),
                "mailbox_trace": msg.get("trace"),
                "reply_to": msg.get("reply_to"),
            },
        )

    async def stop(self) -> None:
        self._running = False
        self.manager.update_status(self.config.agent_id, "offline")

    @classmethod
    def default_config(cls) -> dict[str, Any]:
        return MailboxConfig().model_dump(by_alias=True)
```
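
`_to_inbound` calls `self._extract_text(msg)`, which this spec does not define. One plausible reading, flattening `content.parts` into a single string for the LLM, is sketched below as a free function; the rendering chosen for `data` and `file` parts is an assumption.

```python
import json


def extract_text(msg: dict) -> str:
    """Flatten content.parts into one newline-separated string."""
    parts = (msg.get("content") or {}).get("parts") or []
    lines = []
    for part in parts:
        kind = part.get("type")
        if kind == "text":
            lines.append(part.get("text", ""))
        elif kind == "data":
            # Assumption: structured data is shown to the LLM as compact JSON.
            lines.append(json.dumps(part.get("data", {}), ensure_ascii=False))
        elif kind == "file":
            # Assumption: files are referenced by path, not inlined.
            lines.append(f"[file: {part.get('path', '')}]")
    return "\n".join(lines)
```

The full parts list still travels in `metadata["mailbox_parts"]`, so nothing is lost by this flattening.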
### MailboxConfig
All settings are self-contained within the mailbox channel config.
```python
from pydantic import Field

# Base is nanobot's existing Pydantic config base class; it is assumed to
# supply the camelCase aliases used in config.json.


class MailboxConfig(Base):
    enabled: bool = False
    agent_id: str = ""
    description: str = ""  # optional, for agent discovery
    capabilities: list[str] = []  # optional, for agent discovery
    allow_from: list[str] = Field(default_factory=lambda: ["*"])
    max_concurrent_tasks: int = 3
    poll_interval: float = 5.0
    mailboxes_root: str = "~/.nanobot/mailboxes"
```
### Configuration
In `~/.nanobot/config.json`:
```json
{
  "channels": {
    "feishu": {
      "enabled": true,
      "appId": "...",
      "appSecret": "..."
    },
    "mailbox": {
      "enabled": true,
      "agentId": "coder",
      "description": "Agent responsible for writing and modifying code",
      "capabilities": ["code_write", "test_run", "review"],
      "allowFrom": ["researcher"],
      "maxConcurrentTasks": 3,
      "pollInterval": 5,
      "mailboxesRoot": "~/.nanobot/mailboxes"
    }
  }
}
```
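
The camelCase keys above correspond one-to-one to the snake_case fields of `MailboxConfig`. In the plugin this mapping is presumably handled by Pydantic aliasing; the stdlib-only sketch below (with hypothetical helper names) just illustrates the correspondence.

```python
import re


def camel_to_snake(name: str) -> str:
    """Convert a camelCase config key to its snake_case field name."""
    # Insert "_" before every uppercase letter except at the start.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def normalize_keys(section: dict) -> dict:
    """Rename all top-level keys of a config section to snake_case."""
    return {camel_to_snake(key): value for key, value in section.items()}
```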
Two agents running independently:
```json
// Agent A config: researcher
{
  "channels": {
    "mailbox": {
      "enabled": true,
      "agentId": "researcher",
      "description": "Agent responsible for information retrieval and analysis",
      "capabilities": ["web_search", "code_analysis", "summarization"],
      "allowFrom": ["*"]
    }
  }
}
// Agent B config: coder
{
  "channels": {
    "mailbox": {
      "enabled": true,
      "agentId": "coder",
      "description": "Agent responsible for writing and modifying code",
      "capabilities": ["code_write", "test_run", "review"],
      "allowFrom": ["researcher"]
    }
  }
}
```
## Design Principles
- **Zero side effects**: Only new files, no modifications to existing code
- **Channel plugin pattern**: MailboxChannel follows the same interface as all other channels
- **Bus integration**: Mailbox messages become standard `InboundMessage` objects; AgentLoop is unaware of mailbox
- **Callback routing**: Original channel session is preserved through the task lifecycle for seamless user experience
- **Anti-loop by default**: TTL + trace prevents runaway agent conversations without configuration
- **Best-effort deadlines**: Optional `deadline` field for task timeout, not a blocking mechanism