The catch-all except Exception in QQ send() was swallowing
aiohttp.ClientError and OSError that _send_media correctly
re-raises. Add explicit catch for network errors before the
generic handler.
Audited all channel implementations for overly broad exception handling
that causes retry amplification or silent message loss during network
errors. This is the same class of bug as #3050 (Telegram _send_text).
Fixes by channel:
Telegram (send_delta):
- _stream_end path used except Exception for HTML edit fallback
- Network errors (TimedOut, NetworkError) triggered redundant plain
text edit, doubling connection demand during pool exhaustion
- Changed to except BadRequest, matching the _send_text fix
Discord:
- send() caught all exceptions without re-raising
- ChannelManager._send_with_retry() saw successful return, never retried
- Messages silently dropped on any send failure
- Added raise after error logging
DingTalk:
- _send_batch_message() returned False on all exceptions including
network errors — no retry, fallback text sent unnecessarily
- _read_media_bytes() and _upload_media() swallowed transport errors,
causing _send_media_ref() to cascade through doomed fallback attempts
- Added except httpx.TransportError handlers that re-raise immediately
WeChat:
- Media send failure triggered text fallback even for network errors
- During network issues: 3×(media + text) = 6 API calls per message
- Added specific catches: TimeoutException/TransportError re-raise,
5xx HTTPStatusError re-raises, 4xx falls back to text
QQ:
- _send_media() returned False on all exceptions
- Network errors triggered fallback text instead of retry
- Added except (aiohttp.ClientError, OSError) that re-raises
Tests: 331 passed (283 existing + 48 new across 5 channel test files)
Fixes: #3054
Related: #3050, #3053
Previously _send_text() caught all exceptions (except Exception) when
sending HTML-formatted messages, falling back to plain text even for
network errors like TimedOut and NetworkError. This caused connection
demand to double during pool exhaustion scenarios (3 retries × 2
fallback attempts = 6 calls per message instead of 3).
Now only catches BadRequest (HTML parse errors), letting network errors
propagate immediately to the retry layer where they belong.
Fixes: HKUDS/nanobot#3050
Add a focused regression test for the successful no-image retry path so the original message history stays stripped after fallback and the repeated retry loop cannot silently return.
Made-with: Cursor
When a non-transient LLM error occurs with image content, the retry
mechanism strips images from a copy but never updates the original
conversation history. Subsequent iterations rebuild context from the
unmodified history, causing the same error-retry cycle to repeat
every iteration until max_iterations is reached.
Add _strip_image_content_inplace() that mutates the original message
content lists in-place after a successful no-image retry, so callers
sharing those references (e.g. the runner's conversation history)
also see the stripped version.
Point Dream skill creation at a readable builtin skill-creator template, keep skill writes rooted at the workspace, and document the new skill discovery behavior in README.
Made-with: Cursor
Instead of a separate skill discovery system, extend Dream's two-phase
pipeline to also detect reusable behavioral patterns from conversation
history and generate SKILL.md files.
Phase 1 gains a [SKILL] output type for pattern detection.
Phase 2 gains write_file (scoped to skills/) and read access to builtin
skills, enabling it to check for duplicates and follow skill-creator's
format conventions before creating new skills.
Inspired by PR #3039 by @wanghesong2019.
Co-authored-by: wanghesong2019 <wanghesong2019@users.noreply.github.com>
Keep the new exec guard focused on writes to history.jsonl and .dream_cursor while still allowing read-only copy operations out of those files.
Made-with: Cursor
Explain the new agents.defaults.disabledSkills option so users can discover and configure skill exclusion from the main agent and subagents.
Made-with: Cursor
Add 'domain' field to FeishuConfig (Literal['feishu', 'lark'], default 'feishu').
Pass domain to lark.Client.builder() and lark.ws.Client to support Lark global
(open.larksuite.com) in addition to Feishu China (open.feishu.cn).
Existing configs default to 'feishu' for backward compatibility.
Also add documentation for domain field in README.md and add tests for
domain config.
The plain reply() uses cmd="reply" which does not support "text" msgtype
and causes WeCom API to return errcode=40008 (invalid message type).
Unify both progress and final text messages to use reply_stream()
(cmd="aibot_respond_msg"), differentiating via finish flag.
Fixes#2999
- Use asyncio.to_thread for file I/O to avoid blocking event loop
- Add 200MB upload size limit with early rejection
- Fix file handle leak by using context manager
- Use memoryview for upload chunking to reduce peak memory
- Add inbound download size check to prevent OOM
- Use asyncio.to_thread for write_bytes in download path
- Extract inline media_type detection to _guess_wecom_media_type()
- Use asyncio.to_thread for file I/O to avoid blocking event loop
- Add 200MB upload size limit with early rejection
- Fix file handle leak by using context manager
- Free raw bytes early after chunking to reduce memory pressure
- Add file attachments to media_paths (was text-only, inconsistent with image)
- Use robust _sanitize_filename() instead of os.path.basename() for path safety
- Remove re-raise in send() for consistency with QQ channel
- Fix truncated media_id logging for short IDs
QQ channel improvements (on top of nightly):
- Add top-level try/except in _on_message and send() for resilience
- Use defensive getattr() for attachment attributes (botpy version compat)
- Skip file_name for image uploads to avoid QQ rendering as file attachment
- Extract only file_info from upload response to avoid extra fields
- Handle protocol-relative URLs (//...) in attachment downloads
WeCom channel improvements:
- Add _upload_media_ws() for WebSocket 3-step media upload protocol
- Send media files (image/video/voice/file) via WeCom rich media API
- Support progress messages (plain reply) vs final response (streaming)
- Support proactive send when no frame available (cron push)
- Pass media_paths to message bus for downstream processing
* feat(agent): add mid-turn message injection for responsive follow-ups
Allow user messages sent during an active agent turn to be injected
into the running LLM context instead of being queued behind a
per-session lock. Inspired by Claude Code's mid-turn queue drain
mechanism (query.ts:1547-1643).
Key design decisions:
- Messages are injected as natural user messages between iterations,
no tool cancellation or special system prompt needed
- Two drain checkpoints: after tool execution and after final LLM
response ("last-mile" to prevent dropping late arrivals)
- Bounded by MAX_INJECTION_CYCLES (5) to prevent consuming the
iteration budget on rapid follow-ups
- had_injections flag bypasses _sent_in_turn suppression so follow-up
responses are always delivered
Closes#1609
* fix(agent): harden mid-turn injection with streaming fix, bounded queue, and message safety
- Fix streaming protocol violation: Checkpoint 2 now checks for injections
BEFORE calling on_stream_end, passing resuming=True when injections found
so streaming channels (Feishu) don't prematurely finalize the card
- Bound pending queue to maxsize=20 with QueueFull handling
- Add warning log when injection batch exceeds _MAX_INJECTIONS_PER_TURN
- Re-publish leftover queue messages to bus in _dispatch finally block to
prevent silent message loss on early exit (max_iterations, tool_error, cancel)
- Fix PEP 8 blank line before dataclass and logger.info indentation
- Add 12 new tests covering drain, checkpoints, cycle cap, queue routing,
cleanup, and leftover re-publish
- Add explicit error logging for missing file_key and message_id
- Add logging for download failures
- Change audio extension from .opus to .ogg for better Whisper compatibility
- Feishu voice messages are opus in OGG container; .ogg is more widely recognized
Rename the auto compact module to autocompact.py for a cleaner path while keeping the AutoCompact type and behavior unchanged. Update the agent loop import to match.
Prefer the more user-friendly idleCompactAfterMinutes name for auto compact while keeping sessionTtlMinutes as a backward-compatible alias. Update tests and README to document the retained recent-context behavior and the new preferred key.
Keep a legal recent suffix in idle auto-compacted sessions so resumed chats preserve their freshest live context while older messages are summarized. Recover persisted summaries even when retained messages remain, and document the new behavior.
Make Consolidator.archive() return the summary string directly instead
of writing to history.jsonl then reading back via get_last_history_entry().
This eliminates a race condition where concurrent _archive calls for
different sessions could read each other's summaries from the shared
history file (cross-user context leak in multi-user deployments).
Also removes Consolidator.get_last_history_entry() — no longer needed.
history.jsonl may contain non-UTF-8 bytes (e.g. from email channel
binary content), causing auto compact to fail when reading the last
entry for summary generation. Catch UnicodeDecodeError alongside
FileNotFoundError and JSONDecodeError.
When a user is idle for longer than a configured TTL, nanobot **proactively** compresses the session context into a summary. This reduces token cost and first-token latency when the user returns — instead of re-processing a long stale context with an expired KV cache, the model receives a compact summary and fresh input.
When on_job callbacks call list_jobs() (which triggers _load_store),
the in-memory state is reloaded from disk, discarding the next_run_at_ms
updates that _on_timer is actively computing. This causes jobs to
re-trigger indefinitely on the next tick.
Add an _executing flag around the job execution loop. While set,
_load_store returns the cached store instead of reloading from disk.
Includes regression test.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each MCP server now connects in its own asyncio.Task to isolate anyio
cancel scopes and prevent 'exit cancel scope in different task' errors
when multiple servers (especially mixed transport types) are configured.
Changes:
- connect_mcp_servers() returns dict[str, AsyncExitStack] instead of None
- Each server runs in separate task via asyncio.gather()
- AgentLoop uses _mcp_stacks dict to track per-server stacks
- Tests updated to handle new API