File-system-based mailbox channel for inter-agent communication. Zero modifications to existing code — pure channel plugin. - MailboxManager: atomic file I/O, registry CRUD, message send/poll - MailboxConfig: Pydantic config with camelCase aliasing - MailboxChannel: polls inbox, routes messages via bus, callback routing - Anti-loop protection: TTL decrement + circular trace detection - allowFrom access control inherited from BaseChannel - 29 unit/integration tests + standalone smoke test script Design: docs/superpowers/specs/2026-04-26-multi-agent-mailbox-design.md Plan: docs/superpowers/plans/2026-04-26-multi-agent-mailbox.md
16 KiB
Multi-Agent Mailbox Communication Design
Overview
Design a simple, zero-side-effect mechanism for multiple independent nanobot agent instances to communicate with each other. The mechanism uses a file-system-based mailbox system implemented as a standard nanobot channel plugin, requiring no modifications to existing code.
Goals
- Multiple independent agent processes can discover and communicate with each other
- Peer-to-peer (fully connected) topology
- Asynchronous event-driven messaging
- "Boss experience": agents auto-delegate, auto-report progress, user only interacts through their normal channel (e.g., Feishu)
- Zero side effects: only new files added, no modifications to existing code
- Incorporate concepts from Google's A2A protocol (Agent Cards, Task lifecycle, Message Parts)
Architecture
Storage: File-System Mailbox
~/.nanobot/mailboxes/
├── _registry.json # Agent Cards (discovery)
├── researcher/
│ ├── inbox/ # Pending messages
│ │ └── 1745659200_coder.msg.json
│ └── processed/ # Archived messages
│ └── 1745659000_coder.msg.json
└── coder/
├── inbox/
└── processed/
Global path ~/.nanobot/mailboxes/ is used because different agents may have different working directories.
Message File Naming
{unix_timestamp}_{from_agent_id}.msg.json
Atomic writes: write to .tmp file first, then os.rename() to prevent reading half-written messages.
Agent Discovery (A2A Agent Card)
Each agent registers itself in _registry.json on startup and updates last_heartbeat on every poll cycle.
The registry contains two types of data:
- Agent identity (
description,capabilities) — configured by the user inchannels.mailboxconfig section. Optional; if not set, other agents can only seeagent_idandstatus. - Runtime state (
status,current_tasks,last_heartbeat) — computed automatically by the mailbox channel.
{
"researcher": {
"agent_id": "researcher",
"description": "负责信息检索和分析的 agent",
"capabilities": ["web_search", "code_analysis", "summarization"],
"status": "idle",
"allow_from": ["coder", "writer"],
"max_concurrent_tasks": 3,
"current_tasks": ["msg_1745659200_coder"],
"registered_at": "2026-04-26T10:00:00Z",
"last_heartbeat": "2026-04-26T10:05:00Z"
},
"coder": {
"agent_id": "coder",
"description": "负责代码编写和修改的 agent",
"capabilities": ["code_write", "test_run", "review"],
"status": "idle",
"allow_from": ["*"],
"max_concurrent_tasks": 3,
"current_tasks": [],
"registered_at": "2026-04-26T10:00:30Z",
"last_heartbeat": "2026-04-26T10:05:30Z"
}
}
Agent States
| State | Accept New Tasks? | Description |
|---|---|---|
idle |
Yes | Available, accepts immediately |
busy |
If quota allows | current_tasks < max_concurrent_tasks then queue, otherwise reject |
offline |
Messages queue | Messages stay in inbox, processed when agent comes online |
State transitions:
- Startup →
idle - Receive task + accept →
busy - All tasks completed →
idle - Heartbeat timeout →
offline
Communication Protocol
Message Types
- message — Instant notification, Q&A, chat
- task — Work request requiring execution and result return
- task_update — Status update for a previously sent task
Task Lifecycle (A2A Task Concept)
pending → accepted → working → completed
\→ rejected
\→ failed
Message Format
{
"id": "msg_1745659200_researcher",
"from": "researcher",
"to": "coder",
"timestamp": "2026-04-26T10:00:00Z",
"type": "task | message | task_update",
"ttl": 3,
"trace": ["researcher"],
"task": {
"id": "original_task_id",
"state": "pending | accepted | working | completed | failed | rejected",
"deadline": "2026-04-26T11:00:00Z"
},
"content": {
"parts": [
{"type": "text", "text": "..."},
{"type": "data", "data": {}},
{"type": "file", "path": "/path/to/file"}
]
},
"callback": {
"session_id": "feishu:user_123",
"channel": "feishu"
},
"reply_to": "replied_message_id | null",
"metadata": {}
}
Fields:
ttl— Time-to-live hop count. Decremented on each relay. Default 3. At 0, agent must handle itself, cannot delegate.trace— List of agent_ids this message has passed through. Prevents circular routing.task— Present fortaskandtask_updatetypes. Contains lifecycle state and optional deadline.content.parts— Structured content (A2A Message Parts concept). Supports text, data, and file types.callback— Original session info from the initiating channel. Carried through the task lifecycle so results route back to the correct user conversation.
Task Acceptance Criteria
All conditions must be met for an agent to accept a task:
- Sender is in
allow_fromlist ("*"= accept all) current_taskscount <max_concurrent_tasksdeadlinehas not expired (if present)- LLM judges it has the capability to complete the task
Decision outcomes:
- Accept → reply
task_update {state: "accepted"}, add tocurrent_tasks - Reject → reply
task_update {state: "rejected"}with reason
Anti-Loop Mechanism
Two fields prevent infinite agent-to-agent conversations:
- TTL: Decremented on each relay. At 0, no further delegation allowed. Default = 3 (max 3 hops: A→B→C→D).
- Trace: Append agent_id on each relay. Reject forwarding to any agent already in trace.
| Scenario | Protection |
|---|---|
| A↔B mutual ping | Trace: B sees A in trace, rejects |
| A→B→C→A cycle | Trace: C sees A in trace, rejects |
| A→B→C→D→... infinite chain | TTL: reaches 0, stops |
User Experience: End-to-End Flow
The user never interacts with mailbox directly. They communicate through their normal channel (Feishu, WeChat, etc.). The mailbox is invisible infrastructure.
User (Feishu) Agent A (researcher) Agent B (coder)
│ │ │
│ "帮我写排序函数" │ │
│ ──────────────────→ │ │
│ │ LLM decides to delegate │
│ │ │
│ "我让 coder 去处理, │ │
│ 完成后通知你" │ │
│ ←────────────────── │ │
│ │ │
│ ...time passes...│ │
│ │ │
│ │ task { │
│ │ callback: { │
│ │ session_id: "feishu:user_123",
│ │ channel: "feishu" │
│ │ } │
│ │ } │
│ │ ──────────────────────→ │
│ │ │ B processes...
│ │ │
│ │ task_update {completed} │
│ │ ←────────────────────── │
│ │ │
│ MailboxChannel polls task_update │
│ Routes to session "feishu:user_123" │
│ LLM sees result in original conversation context │
│ │ │
│ "排序函数已完成: │ │
│ sort_by_mtime()..." │ │
│ ←────────────────── │ │
The callback field carries the original channel session info through the entire task lifecycle. When the task_update arrives at Agent A's mailbox, the MailboxChannel restores the original session_id and channel, so the AgentLoop processes it in the correct conversation context and the LLM naturally responds to the user via Feishu.
Error Scenarios
| Scenario | Handling |
|---|---|
| Target agent offline | Messages queue in inbox; processed when agent comes online |
| Heartbeat timeout | Registry marks agent offline; visible to other agents on next registry read |
| Agent crash during task | No completed/failed sent; sender can use deadline to detect timeout |
| Task rejected | Sender receives rejected with reason; decides next action |
| Task deadline expired | Receiver checks deadline on processing; rejects if expired |
| allow_from mismatch | Receiver discards message; optionally replies rejected with "unauthorized" |
| Registry concurrent write | Atomic writes (.tmp → rename); each agent only writes its own entry |
Implementation
New Components
| Component | Type | Modifies Existing Code |
|---|---|---|
MailboxManager |
New file (nanobot/channels/mailbox_manager.py) |
No |
MailboxChannel |
New file (nanobot/channels/mailbox.py) |
No |
MailboxConfig |
Pydantic model inside mailbox.py | No |
Config section channels.mailbox |
User config.json | No (additive) |
MailboxManager
Low-level file operations with no nanobot dependencies:
class MailboxManager:
def __init__(self, mailboxes_root: Path): ...
def register(self, agent_id: str, card: dict) -> None:
"""Atomic write to registry (.tmp → rename)"""
def heartbeat(self, agent_id: str) -> None:
"""Update last_heartbeat in registry"""
def update_status(self, agent_id: str, status: str, current_tasks: list[str] | None = None) -> None:
"""Update agent status and current task list"""
def send(self, from_id: str, to_id: str, msg: dict) -> None:
"""Atomic write to to_id/inbox/{timestamp}_{from_id}.msg.json"""
def poll(self, agent_id: str) -> list[dict]:
"""Scan inbox/, return new messages sorted by timestamp"""
def mark_processed(self, agent_id: str, filename: str) -> None:
"""Move from inbox/ to processed/"""
def list_online_agents(self) -> list[dict]:
"""Read all online agents from registry"""
def get_agent(self, agent_id: str) -> dict | None:
"""Read single agent info from registry"""
MailboxChannel
Follows existing channel pattern (like telegram.py, feishu.py):
class MailboxChannel(BaseChannel):
def __init__(self, config: Any, bus: MessageBus):
if isinstance(config, dict):
config = MailboxConfig.model_validate(config)
super().__init__(config, bus)
self.config: MailboxConfig = config
self.manager = MailboxManager(Path(self.config.mailboxes_root).expanduser())
self._running = False
def _build_card(self) -> dict:
"""Build Agent Card from mailbox config. Identity fields are user-configured."""
return {
"agent_id": self.config.agent_id,
"description": self.config.description or "",
"capabilities": self.config.capabilities or [],
"status": "idle",
"allow_from": self.config.allow_from,
"max_concurrent_tasks": self.config.max_concurrent_tasks,
"current_tasks": [],
"registered_at": datetime.now(timezone.utc).isoformat(),
"last_heartbeat": datetime.now(timezone.utc).isoformat(),
}
async def start(self) -> None:
"""Register agent + start poll loop"""
self.manager.register(self.config.agent_id, self._build_card())
self._running = True
asyncio.create_task(self._poll_loop())
async def _poll_loop(self) -> None:
"""Periodically scan inbox, inject new messages into bus"""
while self._running:
messages = self.manager.poll(self.config.agent_id)
for msg in messages:
inbound = self._to_inbound(msg)
await self.bus.publish_inbound(inbound)
self.manager.mark_processed(self.config.agent_id, msg["filename"])
self.manager.heartbeat(self.config.agent_id)
await asyncio.sleep(self.config.poll_interval)
async def send_message(self, to_id: str, msg: dict) -> None:
"""Send message to another agent's mailbox"""
# Anti-loop checks
if to_id in msg.get("trace", []):
logger.warning(f"Rejecting circular route: {to_id} already in trace")
return
if msg.get("ttl", 0) <= 0:
logger.warning("TTL exhausted, cannot forward")
return
msg["ttl"] = msg.get("ttl", 3) - 1
msg.setdefault("trace", []).append(self.config.agent_id)
self.manager.send(self.config.agent_id, to_id, msg)
def _to_inbound(self, msg: dict) -> InboundMessage:
"""Convert mailbox JSON to standard InboundMessage"""
callback = msg.get("callback", {})
return InboundMessage(
channel=callback.get("channel", "mailbox"),
sender=msg["from"],
content=self._extract_text(msg),
session_id=callback.get("session_id") or f"mailbox:{msg['from']}",
metadata={
"mailbox_type": msg["type"],
"mailbox_task": msg.get("task"),
"mailbox_parts": msg.get("content", {}).get("parts"),
"mailbox_ttl": msg.get("ttl"),
"mailbox_trace": msg.get("trace"),
"reply_to": msg.get("reply_to"),
},
)
async def stop(self) -> None:
self._running = False
self.manager.update_status(self.config.agent_id, "offline")
@classmethod
def default_config(cls) -> dict[str, Any]:
return MailboxConfig().model_dump(by_alias=True)
MailboxConfig
All settings are self-contained within the mailbox channel config.
class MailboxConfig(Base):
enabled: bool = False
agent_id: str = ""
description: str = "" # optional, for agent discovery
capabilities: list[str] = [] # optional, for agent discovery
allow_from: list[str] = Field(default_factory=lambda: ["*"])
max_concurrent_tasks: int = 3
poll_interval: float = 5.0
mailboxes_root: str = "~/.nanobot/mailboxes"
Configuration
In ~/.nanobot/config.json:
{
"channels": {
"feishu": {
"enabled": true,
"appId": "...",
"appSecret": "..."
},
"mailbox": {
"enabled": true,
"agentId": "coder",
"description": "负责代码编写和修改的 agent",
"capabilities": ["code_write", "test_run", "review"],
"allowFrom": ["researcher"],
"maxConcurrentTasks": 3,
"pollInterval": 5,
"mailboxesRoot": "~/.nanobot/mailboxes"
}
}
}
Two agents running independently:
// Agent A config — researcher
{
"channels": {
"mailbox": {
"enabled": true,
"agentId": "researcher",
"description": "负责信息检索和分析的 agent",
"capabilities": ["web_search", "code_analysis", "summarization"],
"allowFrom": ["*"]
}
}
}
// Agent B config — coder
{
"channels": {
"mailbox": {
"enabled": true,
"agentId": "coder",
"description": "负责代码编写和修改的 agent",
"capabilities": ["code_write", "test_run", "review"],
"allowFrom": ["researcher"]
}
}
}
Design Principles
- Zero side effects: Only new files, no modifications to existing code
- Channel plugin pattern: MailboxChannel follows the same interface as all other channels
- Bus integration: Mailbox messages become standard
InboundMessageobjects; AgentLoop is unaware of mailbox - Callback routing: Original channel session is preserved through the task lifecycle for seamless user experience
- Anti-loop by default: TTL + trace prevents runaway agent conversations without configuration
- Best-effort deadlines: Optional
deadlinefield for task timeout, not a blocking mechanism