nanobot/docs/superpowers/specs/2026-04-26-multi-agent-mailbox-design.md
chengyongru 6daf91c620 feat(mailbox): add multi-agent mailbox channel plugin
File-system-based mailbox channel for inter-agent communication.
Zero modifications to existing code — pure channel plugin.

- MailboxManager: atomic file I/O, registry CRUD, message send/poll
- MailboxConfig: Pydantic config with camelCase aliasing
- MailboxChannel: polls inbox, routes messages via bus, callback routing
- Anti-loop protection: TTL decrement + circular trace detection
- allowFrom access control inherited from BaseChannel
- 29 unit/integration tests + standalone smoke test script

Design: docs/superpowers/specs/2026-04-26-multi-agent-mailbox-design.md
Plan: docs/superpowers/plans/2026-04-26-multi-agent-mailbox.md
2026-04-27 01:14:12 +08:00

16 KiB

Multi-Agent Mailbox Communication Design

Overview

Design a simple, zero-side-effect mechanism for multiple independent nanobot agent instances to communicate with each other. The mechanism uses a file-system-based mailbox system implemented as a standard nanobot channel plugin, requiring no modifications to existing code.

Goals

  • Multiple independent agent processes can discover and communicate with each other
  • Peer-to-peer (fully connected) topology
  • Asynchronous event-driven messaging
  • "Boss experience": agents auto-delegate, auto-report progress, user only interacts through their normal channel (e.g., Feishu)
  • Zero side effects: only new files added, no modifications to existing code
  • Incorporate concepts from Google's A2A protocol (Agent Cards, Task lifecycle, Message Parts)

Architecture

Storage: File-System Mailbox

~/.nanobot/mailboxes/
├── _registry.json                  # Agent Cards (discovery)
├── researcher/
│   ├── inbox/                      # Pending messages
│   │   └── 1745659200_coder.msg.json
│   └── processed/                  # Archived messages
│       └── 1745659000_coder.msg.json
└── coder/
    ├── inbox/
    └── processed/

Global path ~/.nanobot/mailboxes/ is used because different agents may have different working directories.

Message File Naming

{unix_timestamp}_{from_agent_id}.msg.json

Atomic writes: write to .tmp file first, then os.rename() to prevent reading half-written messages.

Agent Discovery (A2A Agent Card)

Each agent registers itself in _registry.json on startup and updates last_heartbeat on every poll cycle.

The registry contains two types of data:

  • Agent identity (description, capabilities) — configured by the user in channels.mailbox config section. Optional; if not set, other agents can only see agent_id and status.
  • Runtime state (status, current_tasks, last_heartbeat) — computed automatically by the mailbox channel.
{
  "researcher": {
    "agent_id": "researcher",
    "description": "负责信息检索和分析的 agent",
    "capabilities": ["web_search", "code_analysis", "summarization"],
    "status": "idle",
    "allow_from": ["coder", "writer"],
    "max_concurrent_tasks": 3,
    "current_tasks": ["msg_1745659200_coder"],
    "registered_at": "2026-04-26T10:00:00Z",
    "last_heartbeat": "2026-04-26T10:05:00Z"
  },
  "coder": {
    "agent_id": "coder",
    "description": "负责代码编写和修改的 agent",
    "capabilities": ["code_write", "test_run", "review"],
    "status": "idle",
    "allow_from": ["*"],
    "max_concurrent_tasks": 3,
    "current_tasks": [],
    "registered_at": "2026-04-26T10:00:30Z",
    "last_heartbeat": "2026-04-26T10:05:30Z"
  }
}

Agent States

State Accept New Tasks? Description
idle Yes Available, accepts immediately
busy If quota allows current_tasks < max_concurrent_tasks then queue, otherwise reject
offline Messages queue Messages stay in inbox, processed when agent comes online

State transitions:

  • Startup → idle
  • Receive task + accept → busy
  • All tasks completed → idle
  • Heartbeat timeout → offline

Communication Protocol

Message Types

  1. message — Instant notification, Q&A, chat
  2. task — Work request requiring execution and result return
  3. task_update — Status update for a previously sent task

Task Lifecycle (A2A Task Concept)

pending → accepted → working → completed
                  \→ rejected
                  \→ failed

Message Format

{
  "id": "msg_1745659200_researcher",
  "from": "researcher",
  "to": "coder",
  "timestamp": "2026-04-26T10:00:00Z",
  "type": "task | message | task_update",
  "ttl": 3,
  "trace": ["researcher"],
  "task": {
    "id": "original_task_id",
    "state": "pending | accepted | working | completed | failed | rejected",
    "deadline": "2026-04-26T11:00:00Z"
  },
  "content": {
    "parts": [
      {"type": "text", "text": "..."},
      {"type": "data", "data": {}},
      {"type": "file", "path": "/path/to/file"}
    ]
  },
  "callback": {
    "session_id": "feishu:user_123",
    "channel": "feishu"
  },
  "reply_to": "replied_message_id | null",
  "metadata": {}
}

Fields:

  • ttl — Time-to-live hop count. Decremented on each relay. Default 3. At 0, agent must handle itself, cannot delegate.
  • trace — List of agent_ids this message has passed through. Prevents circular routing.
  • task — Present for task and task_update types. Contains lifecycle state and optional deadline.
  • content.parts — Structured content (A2A Message Parts concept). Supports text, data, and file types.
  • callback — Original session info from the initiating channel. Carried through the task lifecycle so results route back to the correct user conversation.

Task Acceptance Criteria

All conditions must be met for an agent to accept a task:

  1. Sender is in allow_from list ("*" = accept all)
  2. current_tasks count < max_concurrent_tasks
  3. deadline has not expired (if present)
  4. LLM judges it has the capability to complete the task

Decision outcomes:

  • Accept → reply task_update {state: "accepted"}, add to current_tasks
  • Reject → reply task_update {state: "rejected"} with reason

Anti-Loop Mechanism

Two fields prevent infinite agent-to-agent conversations:

  1. TTL: Decremented on each relay. At 0, no further delegation allowed. Default = 3 (max 3 hops: A→B→C→D).
  2. Trace: Append agent_id on each relay. Reject forwarding to any agent already in trace.
Scenario Protection
A↔B mutual ping Trace: B sees A in trace, rejects
A→B→C→A cycle Trace: C sees A in trace, rejects
A→B→C→D→... infinite chain TTL: reaches 0, stops

User Experience: End-to-End Flow

The user never interacts with mailbox directly. They communicate through their normal channel (Feishu, WeChat, etc.). The mailbox is invisible infrastructure.

User (Feishu)           Agent A (researcher)       Agent B (coder)
│                       │                          │
│ "帮我写排序函数"        │                          │
│ ──────────────────→   │                          │
│                       │ LLM decides to delegate  │
│                       │                          │
│ "我让 coder 去处理,    │                          │
│  完成后通知你"          │                          │
│ ←──────────────────   │                          │
│                       │                          │
│       ...time passes...│                          │
│                       │                          │
│                       │ task {                    │
│                       │   callback: {             │
│                       │     session_id: "feishu:user_123",
│                       │     channel: "feishu"     │
│                       │   }                       │
│                       │ }                         │
│                       │ ──────────────────────→   │
│                       │                          │ B processes...
│                       │                          │
│                       │ task_update {completed}   │
│                       │ ←──────────────────────   │
│                       │                          │
│ MailboxChannel polls task_update                  │
│ Routes to session "feishu:user_123"               │
│ LLM sees result in original conversation context  │
│                       │                          │
│ "排序函数已完成:        │                          │
│  sort_by_mtime()..."  │                          │
│ ←──────────────────   │                          │

The callback field carries the original channel session info through the entire task lifecycle. When the task_update arrives at Agent A's mailbox, the MailboxChannel restores the original session_id and channel, so the AgentLoop processes it in the correct conversation context and the LLM naturally responds to the user via Feishu.

Error Scenarios

Scenario Handling
Target agent offline Messages queue in inbox; processed when agent comes online
Heartbeat timeout Registry marks agent offline; visible to other agents on next registry read
Agent crash during task No completed/failed sent; sender can use deadline to detect timeout
Task rejected Sender receives rejected with reason; decides next action
Task deadline expired Receiver checks deadline on processing; rejects if expired
allow_from mismatch Receiver discards message; optionally replies rejected with "unauthorized"
Registry concurrent write Atomic writes (.tmp → rename); each agent only writes its own entry

Implementation

New Components

Component Type Modifies Existing Code
MailboxManager New file (nanobot/channels/mailbox_manager.py) No
MailboxChannel New file (nanobot/channels/mailbox.py) No
MailboxConfig Pydantic model inside mailbox.py No
Config section channels.mailbox User config.json No (additive)

MailboxManager

Low-level file operations with no nanobot dependencies:

class MailboxManager:
    def __init__(self, mailboxes_root: Path): ...

    def register(self, agent_id: str, card: dict) -> None:
        """Atomic write to registry (.tmp → rename)"""

    def heartbeat(self, agent_id: str) -> None:
        """Update last_heartbeat in registry"""

    def update_status(self, agent_id: str, status: str, current_tasks: list[str] | None = None) -> None:
        """Update agent status and current task list"""

    def send(self, from_id: str, to_id: str, msg: dict) -> None:
        """Atomic write to to_id/inbox/{timestamp}_{from_id}.msg.json"""

    def poll(self, agent_id: str) -> list[dict]:
        """Scan inbox/, return new messages sorted by timestamp"""

    def mark_processed(self, agent_id: str, filename: str) -> None:
        """Move from inbox/ to processed/"""

    def list_online_agents(self) -> list[dict]:
        """Read all online agents from registry"""

    def get_agent(self, agent_id: str) -> dict | None:
        """Read single agent info from registry"""

MailboxChannel

Follows existing channel pattern (like telegram.py, feishu.py):

class MailboxChannel(BaseChannel):
    def __init__(self, config: Any, bus: MessageBus):
        if isinstance(config, dict):
            config = MailboxConfig.model_validate(config)
        super().__init__(config, bus)
        self.config: MailboxConfig = config
        self.manager = MailboxManager(Path(self.config.mailboxes_root).expanduser())
        self._running = False

    def _build_card(self) -> dict:
        """Build Agent Card from mailbox config. Identity fields are user-configured."""
        return {
            "agent_id": self.config.agent_id,
            "description": self.config.description or "",
            "capabilities": self.config.capabilities or [],
            "status": "idle",
            "allow_from": self.config.allow_from,
            "max_concurrent_tasks": self.config.max_concurrent_tasks,
            "current_tasks": [],
            "registered_at": datetime.now(timezone.utc).isoformat(),
            "last_heartbeat": datetime.now(timezone.utc).isoformat(),
        }

    async def start(self) -> None:
        """Register agent + start poll loop"""
        self.manager.register(self.config.agent_id, self._build_card())
        self._running = True
        asyncio.create_task(self._poll_loop())

    async def _poll_loop(self) -> None:
        """Periodically scan inbox, inject new messages into bus"""
        while self._running:
            messages = self.manager.poll(self.config.agent_id)
            for msg in messages:
                inbound = self._to_inbound(msg)
                await self.bus.publish_inbound(inbound)
                self.manager.mark_processed(self.config.agent_id, msg["filename"])
            self.manager.heartbeat(self.config.agent_id)
            await asyncio.sleep(self.config.poll_interval)

    async def send_message(self, to_id: str, msg: dict) -> None:
        """Send message to another agent's mailbox"""
        # Anti-loop checks
        if to_id in msg.get("trace", []):
            logger.warning(f"Rejecting circular route: {to_id} already in trace")
            return
        if msg.get("ttl", 0) <= 0:
            logger.warning("TTL exhausted, cannot forward")
            return
        msg["ttl"] = msg.get("ttl", 3) - 1
        msg.setdefault("trace", []).append(self.config.agent_id)
        self.manager.send(self.config.agent_id, to_id, msg)

    def _to_inbound(self, msg: dict) -> InboundMessage:
        """Convert mailbox JSON to standard InboundMessage"""
        callback = msg.get("callback", {})
        return InboundMessage(
            channel=callback.get("channel", "mailbox"),
            sender=msg["from"],
            content=self._extract_text(msg),
            session_id=callback.get("session_id") or f"mailbox:{msg['from']}",
            metadata={
                "mailbox_type": msg["type"],
                "mailbox_task": msg.get("task"),
                "mailbox_parts": msg.get("content", {}).get("parts"),
                "mailbox_ttl": msg.get("ttl"),
                "mailbox_trace": msg.get("trace"),
                "reply_to": msg.get("reply_to"),
            },
        )

    async def stop(self) -> None:
        self._running = False
        self.manager.update_status(self.config.agent_id, "offline")

    @classmethod
    def default_config(cls) -> dict[str, Any]:
        return MailboxConfig().model_dump(by_alias=True)

MailboxConfig

All settings are self-contained within the mailbox channel config.

class MailboxConfig(Base):
    enabled: bool = False
    agent_id: str = ""
    description: str = ""           # optional, for agent discovery
    capabilities: list[str] = []    # optional, for agent discovery
    allow_from: list[str] = Field(default_factory=lambda: ["*"])
    max_concurrent_tasks: int = 3
    poll_interval: float = 5.0
    mailboxes_root: str = "~/.nanobot/mailboxes"

Configuration

In ~/.nanobot/config.json:

{
  "channels": {
    "feishu": {
      "enabled": true,
      "appId": "...",
      "appSecret": "..."
    },
    "mailbox": {
      "enabled": true,
      "agentId": "coder",
      "description": "负责代码编写和修改的 agent",
      "capabilities": ["code_write", "test_run", "review"],
      "allowFrom": ["researcher"],
      "maxConcurrentTasks": 3,
      "pollInterval": 5,
      "mailboxesRoot": "~/.nanobot/mailboxes"
    }
  }
}

Two agents running independently:

// Agent A config — researcher
{
  "channels": {
    "mailbox": {
      "enabled": true,
      "agentId": "researcher",
      "description": "负责信息检索和分析的 agent",
      "capabilities": ["web_search", "code_analysis", "summarization"],
      "allowFrom": ["*"]
    }
  }
}

// Agent B config — coder
{
  "channels": {
    "mailbox": {
      "enabled": true,
      "agentId": "coder",
      "description": "负责代码编写和修改的 agent",
      "capabilities": ["code_write", "test_run", "review"],
      "allowFrom": ["researcher"]
    }
  }
}

Design Principles

  • Zero side effects: Only new files, no modifications to existing code
  • Channel plugin pattern: MailboxChannel follows the same interface as all other channels
  • Bus integration: Mailbox messages become standard InboundMessage objects; AgentLoop is unaware of mailbox
  • Callback routing: Original channel session is preserved through the task lifecycle for seamless user experience
  • Anti-loop by default: TTL + trace prevents runaway agent conversations without configuration
  • Best-effort deadlines: Optional deadline field for task timeout, not a blocking mechanism