mirror of https://github.com/HKUDS/nanobot.git synced 2026-05-20 08:32:25 +00:00

chengyongru 6daf91c620 feat(mailbox): add multi-agent mailbox channel plugin

File-system-based mailbox channel for inter-agent communication.
Zero modifications to existing code — pure channel plugin.

- MailboxManager: atomic file I/O, registry CRUD, message send/poll
- MailboxConfig: Pydantic config with camelCase aliasing
- MailboxChannel: polls inbox, routes messages via bus, callback routing
- Anti-loop protection: TTL decrement + circular trace detection
- allowFrom access control inherited from BaseChannel
- 29 unit/integration tests + standalone smoke test script

Design: docs/superpowers/specs/2026-04-26-multi-agent-mailbox-design.md
Plan: docs/superpowers/plans/2026-04-26-multi-agent-mailbox.md

2026-04-27 01:14:12 +08:00

16 KiB

Raw Blame History

Multi-Agent Mailbox Communication Design

Overview

Design a simple, zero-side-effect mechanism for multiple independent nanobot agent instances to communicate with each other. The mechanism uses a file-system-based mailbox system implemented as a standard nanobot channel plugin, requiring no modifications to existing code.

Goals

Multiple independent agent processes can discover and communicate with each other
Peer-to-peer (fully connected) topology
Asynchronous event-driven messaging
"Boss experience": agents auto-delegate, auto-report progress, user only interacts through their normal channel (e.g., Feishu)
Zero side effects: only new files added, no modifications to existing code
Incorporate concepts from Google's A2A protocol (Agent Cards, Task lifecycle, Message Parts)

Architecture

Storage: File-System Mailbox

~/.nanobot/mailboxes/
├── _registry.json                  # Agent Cards (discovery)
├── researcher/
│   ├── inbox/                      # Pending messages
│   │   └── 1745659200_coder.msg.json
│   └── processed/                  # Archived messages
│       └── 1745659000_coder.msg.json
└── coder/
    ├── inbox/
    └── processed/

Global path ~/.nanobot/mailboxes/ is used because different agents may have different working directories.

Message File Naming

{unix_timestamp}_{from_agent_id}.msg.json

Atomic writes: write to .tmp file first, then os.rename() to prevent reading half-written messages.

Agent Discovery (A2A Agent Card)

Each agent registers itself in _registry.json on startup and updates last_heartbeat on every poll cycle.

The registry contains two types of data:

Agent identity (description, capabilities) — configured by the user in channels.mailbox config section. Optional; if not set, other agents can only see agent_id and status.
Runtime state (status, current_tasks, last_heartbeat) — computed automatically by the mailbox channel.

{
  "researcher": {
    "agent_id": "researcher",
    "description": "负责信息检索和分析的 agent",
    "capabilities": ["web_search", "code_analysis", "summarization"],
    "status": "idle",
    "allow_from": ["coder", "writer"],
    "max_concurrent_tasks": 3,
    "current_tasks": ["msg_1745659200_coder"],
    "registered_at": "2026-04-26T10:00:00Z",
    "last_heartbeat": "2026-04-26T10:05:00Z"
  },
  "coder": {
    "agent_id": "coder",
    "description": "负责代码编写和修改的 agent",
    "capabilities": ["code_write", "test_run", "review"],
    "status": "idle",
    "allow_from": ["*"],
    "max_concurrent_tasks": 3,
    "current_tasks": [],
    "registered_at": "2026-04-26T10:00:30Z",
    "last_heartbeat": "2026-04-26T10:05:30Z"
  }
}

Agent States

State	Accept New Tasks?	Description
`idle`	Yes	Available, accepts immediately
`busy`	If quota allows	`current_tasks` < `max_concurrent_tasks` then queue, otherwise reject
`offline`	Messages queue	Messages stay in inbox, processed when agent comes online

State transitions:

Startup → idle
Receive task + accept → busy
All tasks completed → idle
Heartbeat timeout → offline

Communication Protocol

Message Types

message — Instant notification, Q&A, chat
task — Work request requiring execution and result return
task_update — Status update for a previously sent task

Task Lifecycle (A2A Task Concept)

pending → accepted → working → completed
                  \→ rejected
                  \→ failed

Message Format

{
  "id": "msg_1745659200_researcher",
  "from": "researcher",
  "to": "coder",
  "timestamp": "2026-04-26T10:00:00Z",
  "type": "task | message | task_update",
  "ttl": 3,
  "trace": ["researcher"],
  "task": {
    "id": "original_task_id",
    "state": "pending | accepted | working | completed | failed | rejected",
    "deadline": "2026-04-26T11:00:00Z"
  },
  "content": {
    "parts": [
      {"type": "text", "text": "..."},
      {"type": "data", "data": {}},
      {"type": "file", "path": "/path/to/file"}
    ]
  },
  "callback": {
    "session_id": "feishu:user_123",
    "channel": "feishu"
  },
  "reply_to": "replied_message_id | null",
  "metadata": {}
}

Fields:

ttl — Time-to-live hop count. Decremented on each relay. Default 3. At 0, agent must handle itself, cannot delegate.
trace — List of agent_ids this message has passed through. Prevents circular routing.
task — Present for task and task_update types. Contains lifecycle state and optional deadline.
content.parts — Structured content (A2A Message Parts concept). Supports text, data, and file types.
callback — Original session info from the initiating channel. Carried through the task lifecycle so results route back to the correct user conversation.

Task Acceptance Criteria

All conditions must be met for an agent to accept a task:

Sender is in allow_from list ("*" = accept all)
current_tasks count < max_concurrent_tasks
deadline has not expired (if present)
LLM judges it has the capability to complete the task

Decision outcomes:

Accept → reply task_update {state: "accepted"}, add to current_tasks
Reject → reply task_update {state: "rejected"} with reason

Anti-Loop Mechanism

Two fields prevent infinite agent-to-agent conversations:

TTL: Decremented on each relay. At 0, no further delegation allowed. Default = 3 (max 3 hops: A→B→C→D).
Trace: Append agent_id on each relay. Reject forwarding to any agent already in trace.

Scenario	Protection
A↔B mutual ping	Trace: B sees A in trace, rejects
A→B→C→A cycle	Trace: C sees A in trace, rejects
A→B→C→D→... infinite chain	TTL: reaches 0, stops

User Experience: End-to-End Flow

The user never interacts with mailbox directly. They communicate through their normal channel (Feishu, WeChat, etc.). The mailbox is invisible infrastructure.

User (Feishu)           Agent A (researcher)       Agent B (coder)
│                       │                          │
│ "帮我写排序函数"        │                          │
│ ──────────────────→   │                          │
│                       │ LLM decides to delegate  │
│                       │                          │
│ "我让 coder 去处理，    │                          │
│  完成后通知你"          │                          │
│ ←──────────────────   │                          │
│                       │                          │
│       ...time passes...│                          │
│                       │                          │
│                       │ task {                    │
│                       │   callback: {             │
│                       │     session_id: "feishu:user_123",
│                       │     channel: "feishu"     │
│                       │   }                       │
│                       │ }                         │
│                       │ ──────────────────────→   │
│                       │                          │ B processes...
│                       │                          │
│                       │ task_update {completed}   │
│                       │ ←──────────────────────   │
│                       │                          │
│ MailboxChannel polls task_update                  │
│ Routes to session "feishu:user_123"               │
│ LLM sees result in original conversation context  │
│                       │                          │
│ "排序函数已完成：        │                          │
│  sort_by_mtime()..."  │                          │
│ ←──────────────────   │                          │

The callback field carries the original channel session info through the entire task lifecycle. When the task_update arrives at Agent A's mailbox, the MailboxChannel restores the original session_id and channel, so the AgentLoop processes it in the correct conversation context and the LLM naturally responds to the user via Feishu.

Error Scenarios

Scenario	Handling
Target agent offline	Messages queue in inbox; processed when agent comes online
Heartbeat timeout	Registry marks agent `offline`; visible to other agents on next registry read
Agent crash during task	No `completed`/`failed` sent; sender can use `deadline` to detect timeout
Task rejected	Sender receives `rejected` with reason; decides next action
Task deadline expired	Receiver checks deadline on processing; rejects if expired
allow_from mismatch	Receiver discards message; optionally replies `rejected` with "unauthorized"
Registry concurrent write	Atomic writes (.tmp → rename); each agent only writes its own entry

Implementation

New Components

Component	Type	Modifies Existing Code
`MailboxManager`	New file (`nanobot/channels/mailbox_manager.py`)	No
`MailboxChannel`	New file (`nanobot/channels/mailbox.py`)	No
`MailboxConfig`	Pydantic model inside mailbox.py	No
Config section `channels.mailbox`	User config.json	No (additive)

MailboxManager

Low-level file operations with no nanobot dependencies:

class MailboxManager:
    def __init__(self, mailboxes_root: Path): ...

    def register(self, agent_id: str, card: dict) -> None:
        """Atomic write to registry (.tmp → rename)"""

    def heartbeat(self, agent_id: str) -> None:
        """Update last_heartbeat in registry"""

    def update_status(self, agent_id: str, status: str, current_tasks: list[str] | None = None) -> None:
        """Update agent status and current task list"""

    def send(self, from_id: str, to_id: str, msg: dict) -> None:
        """Atomic write to to_id/inbox/{timestamp}_{from_id}.msg.json"""

    def poll(self, agent_id: str) -> list[dict]:
        """Scan inbox/, return new messages sorted by timestamp"""

    def mark_processed(self, agent_id: str, filename: str) -> None:
        """Move from inbox/ to processed/"""

    def list_online_agents(self) -> list[dict]:
        """Read all online agents from registry"""

    def get_agent(self, agent_id: str) -> dict | None:
        """Read single agent info from registry"""

MailboxChannel

Follows existing channel pattern (like telegram.py, feishu.py):

class MailboxChannel(BaseChannel):
    def __init__(self, config: Any, bus: MessageBus):
        if isinstance(config, dict):
            config = MailboxConfig.model_validate(config)
        super().__init__(config, bus)
        self.config: MailboxConfig = config
        self.manager = MailboxManager(Path(self.config.mailboxes_root).expanduser())
        self._running = False

    def _build_card(self) -> dict:
        """Build Agent Card from mailbox config. Identity fields are user-configured."""
        return {
            "agent_id": self.config.agent_id,
            "description": self.config.description or "",
            "capabilities": self.config.capabilities or [],
            "status": "idle",
            "allow_from": self.config.allow_from,
            "max_concurrent_tasks": self.config.max_concurrent_tasks,
            "current_tasks": [],
            "registered_at": datetime.now(timezone.utc).isoformat(),
            "last_heartbeat": datetime.now(timezone.utc).isoformat(),
        }

    async def start(self) -> None:
        """Register agent + start poll loop"""
        self.manager.register(self.config.agent_id, self._build_card())
        self._running = True
        asyncio.create_task(self._poll_loop())

    async def _poll_loop(self) -> None:
        """Periodically scan inbox, inject new messages into bus"""
        while self._running:
            messages = self.manager.poll(self.config.agent_id)
            for msg in messages:
                inbound = self._to_inbound(msg)
                await self.bus.publish_inbound(inbound)
                self.manager.mark_processed(self.config.agent_id, msg["filename"])
            self.manager.heartbeat(self.config.agent_id)
            await asyncio.sleep(self.config.poll_interval)

    async def send_message(self, to_id: str, msg: dict) -> None:
        """Send message to another agent's mailbox"""
        # Anti-loop checks
        if to_id in msg.get("trace", []):
            logger.warning(f"Rejecting circular route: {to_id} already in trace")
            return
        if msg.get("ttl", 0) <= 0:
            logger.warning("TTL exhausted, cannot forward")
            return
        msg["ttl"] = msg.get("ttl", 3) - 1
        msg.setdefault("trace", []).append(self.config.agent_id)
        self.manager.send(self.config.agent_id, to_id, msg)

    def _to_inbound(self, msg: dict) -> InboundMessage:
        """Convert mailbox JSON to standard InboundMessage"""
        callback = msg.get("callback", {})
        return InboundMessage(
            channel=callback.get("channel", "mailbox"),
            sender=msg["from"],
            content=self._extract_text(msg),
            session_id=callback.get("session_id") or f"mailbox:{msg['from']}",
            metadata={
                "mailbox_type": msg["type"],
                "mailbox_task": msg.get("task"),
                "mailbox_parts": msg.get("content", {}).get("parts"),
                "mailbox_ttl": msg.get("ttl"),
                "mailbox_trace": msg.get("trace"),
                "reply_to": msg.get("reply_to"),
            },
        )

    async def stop(self) -> None:
        self._running = False
        self.manager.update_status(self.config.agent_id, "offline")

    @classmethod
    def default_config(cls) -> dict[str, Any]:
        return MailboxConfig().model_dump(by_alias=True)

MailboxConfig

All settings are self-contained within the mailbox channel config.

class MailboxConfig(Base):
    enabled: bool = False
    agent_id: str = ""
    description: str = ""           # optional, for agent discovery
    capabilities: list[str] = []    # optional, for agent discovery
    allow_from: list[str] = Field(default_factory=lambda: ["*"])
    max_concurrent_tasks: int = 3
    poll_interval: float = 5.0
    mailboxes_root: str = "~/.nanobot/mailboxes"

Configuration

In ~/.nanobot/config.json:

{
  "channels": {
    "feishu": {
      "enabled": true,
      "appId": "...",
      "appSecret": "..."
    },
    "mailbox": {
      "enabled": true,
      "agentId": "coder",
      "description": "负责代码编写和修改的 agent",
      "capabilities": ["code_write", "test_run", "review"],
      "allowFrom": ["researcher"],
      "maxConcurrentTasks": 3,
      "pollInterval": 5,
      "mailboxesRoot": "~/.nanobot/mailboxes"
    }
  }
}

Two agents running independently:

// Agent A config — researcher
{
  "channels": {
    "mailbox": {
      "enabled": true,
      "agentId": "researcher",
      "description": "负责信息检索和分析的 agent",
      "capabilities": ["web_search", "code_analysis", "summarization"],
      "allowFrom": ["*"]
    }
  }
}

// Agent B config — coder
{
  "channels": {
    "mailbox": {
      "enabled": true,
      "agentId": "coder",
      "description": "负责代码编写和修改的 agent",
      "capabilities": ["code_write", "test_run", "review"],
      "allowFrom": ["researcher"]
    }
  }
}

Design Principles

Zero side effects: Only new files, no modifications to existing code
Channel plugin pattern: MailboxChannel follows the same interface as all other channels
Bus integration: Mailbox messages become standard InboundMessage objects; AgentLoop is unaware of mailbox
Callback routing: Original channel session is preserved through the task lifecycle for seamless user experience
Anti-loop by default: TTL + trace prevents runaway agent conversations without configuration
Best-effort deadlines: Optional deadline field for task timeout, not a blocking mechanism

16 KiB Raw Blame History