provider.chat() had no retry logic — a transient 429 rate limit,
502 gateway error, or network timeout would permanently fail the
entire message. For a system running cron jobs and heartbeats 24/7,
even a brief provider blip causes lost tasks.
Adds _chat_with_retry() that:
- Retries up to 3 times with 1s/2s/4s exponential backoff
- Only retries transient errors (429, 5xx, timeout, connection)
- Returns immediately on permanent errors (400, 401, etc.)
- Falls through to the final attempt if all retries exhaust
Some LLM providers (Minimax, Dashscope) strictly reject consecutive
messages with the same role. build_messages() was emitting two separate
user messages back-to-back: the runtime context and the actual user
content.
Merge them into a single user message, handling both plain text and
multimodal (image) content. Update _save_turn() to strip the runtime
context prefix from the merged message when persisting to session
history.
Fixes#1414Fixes#1344
When an LLM returns content: null on a plain assistant message (no
tool_calls), the null gets saved to session history and causes
permanent 400 errors on every subsequent request.
- Sanitize None content on plain assistant messages to "(empty)" in
_sanitize_empty_content(), matching the existing empty-string handling
- Skip persisting error responses (finish_reason="error") to the
message history in _run_agent_loop(), preventing poison loops
Closes#1303
Some models (e.g., Kimi K2.5 via OpenRouter) return tool call arguments
as a list instead of a dict. This caused an AttributeError when trying
to call .values() on the list.
The fix checks if arguments is a list and extracts the first element
before accessing .values().
Made-with: Cursor
A refactoring in commit 132807a introduced a regression where the final
response was silently discarded whenever the message tool was used,
regardless of the target. This restored the original logic from PR #832
that only suppresses the final reply when the message tool sends to the
same (channel, chat_id) as the original message.
Changes:
- message.py: Replace _sent_in_turn: bool with _turn_sends: list[tuple]
to track actual send targets, add get_turn_sends() method
- loop.py: Check if (msg.channel, msg.chat_id) is in sent_targets before
suppressing final reply. Also move the "Response to" log after the
suppress check to avoid misleading logs.
- Add unit tests for the suppress logic
This ensures:
- Email sent via message tool → Feishu still gets confirmation
- Message tool sends to same Feishu chat → No duplicate (suppressed)
- cancel_by_session: use asyncio.gather for parallel cancellation
instead of sequential await per task
- _dispatch: register in _active_tasks before acquiring lock so /stop
can find queued tasks (synced from #1179)
- SubagentManager tracks _session_tasks: session_key -> {task_id, ...}
- cancel_by_session() cancels all subagents for a session
- SpawnTool passes session_key through to SubagentManager
- /stop response reports subagent cancellation count
- Cleanup callback removes from both _running_tasks and _session_tasks
Builds on #1179
- Add commands.py with CommandDef registry, parse_command(), get_help_text()
- Refactor run() to dispatch messages as asyncio tasks (non-blocking)
- /stop is an 'immediate' command: handled inline, cancels active task
- Global processing lock serializes message handling (safe for shared state)
- _pending_tasks set prevents GC of dispatched tasks before lock acquisition
- _dispatch() registers/clears active tasks, catches CancelledError gracefully
- /help now auto-generated from COMMANDS registry
Closes#849