The non-streaming parse path unconditionally promoted the `reasoning`
response field to `content` when content was empty. This was intended
for StepFun (whose API returns the actual answer in `reasoning`), but
it applied to every OpenAI-compatible provider — causing internal
thinking chains from models like Xiaomi MIMO to be leaked as formal
replies.
Add `reasoning_as_content: bool` to ProviderSpec (default False) and
set it only for StepFun. The fallback now requires this flag rather
than running globally.
Fixes#3443
Parse the endpoint host before disabling keepalive so public hostnames that merely contain private-network substrings keep the default connection pool behavior.
Made-with: Cursor
Local model servers (Ollama, llama.cpp, vLLM) often close idle HTTP
connections before the client-side keepalive timer expires. When two
LLM calls happen seconds apart — for example the heartbeat _decide()
phase followed immediately by process_direct() — the second call grabs
a now-dead pooled connection, causing a transient APIConnectionError
on every first attempt.
The fix detects local endpoints via:
- ProviderSpec.is_local (Ollama, LM Studio, vLLM, OVMS)
- Private-network URL patterns (localhost, 127.x, 192.168.x, 10.x,
172.16-31.x, host.docker.internal, [::1])
For these endpoints, the AsyncOpenAI client is created with a custom
httpx.AsyncClient that sets keepalive_expiry=0, forcing a fresh TCP
connection for each request. This is cheap on LAN (sub-5ms connect)
and eliminates the stale-connection retry tax entirely.
Cloud providers (OpenAI, Anthropic, OpenRouter, etc.) keep the default
5-second keepalive, which is fine for high-frequency API usage.
The private-network heuristic also covers the common case where users
configure provider='openai' but point apiBase at a LAN IP running
llama.cpp — the spec says is_local=False, but the URL clearly is.
The existing test only verified the adaptive path. Add two more cases:
- enabled thinking (high): temperature must also be omitted
- no thinking (None): temperature must still be omitted
Made-with: Cursor
Two issues with DeepSeek V4 thinking mode support:
1. Missing thinking parameter injection.
DeepSeek V4 requires `extra_body: {"thinking": {"type": "enabled/disabled"}}`
— identical to VolcEngine/BytePlus. The code had this for volcengine,
byteplus, dashscope, minimax, and kimi but not DeepSeek. This means
`reasoning_effort=minimal` (thinking off) silently has no effect.
Root cause: the thinking-style→wire-format mapping was an if/elif chain
on provider *names*. DeepSeek was forgotten.
Fix: make the mapping declarative via `ProviderSpec.thinking_style`:
- "thinking_type" → {"thinking": {"type": "..."}} (DeepSeek, Volc, BytePlus)
- "enable_thinking" → {"enable_thinking": bool} (DashScope)
- "reasoning_split" → {"reasoning_split": bool} (MiniMax)
`_build_kwargs` now does a single dict lookup. Adding a new provider
with an existing wire format requires zero changes to the function.
2. Legacy session messages crash thinking-mode requests.
When a session was started without thinking mode (or with a different
model), assistant messages lack reasoning_content. DeepSeek V4 in
thinking mode rejects these with 400:
"The reasoning_content in the thinking mode must be passed back to the API."
This affects ALL assistant messages, not just those with tool_calls
(despite the docs only mentioning the tool_calls case).
Fix: `_build_kwargs` backfills `reasoning_content: ""` on every
assistant message missing it, but only when thinking mode is active.
This is semantically neutral — the model treats empty reasoning_content
as "no thinking happened on that turn". The backfill only touches the
in-memory request copy; session files on disk are untouched.
Tests: +5 (3 thinking toggle, 2 backfill). Full suite: 2377 passed.
Made-with: Cursor
Adds a focused regression test so the fix for tool_result image
handling cannot silently revert. Two cases:
- list content with an image_url + text block -> image_url is
translated to a native Anthropic image block, sibling text passes
through unchanged
- plain string content passes through untouched (the new list branch
must not alter the string path)
These cover the exact symptom surface (silent image drop with a
"Non-transient LLM error with image content" warning) and the only
two content shapes tool results actually take today.
Made-with: Cursor
Locks in the four behaviors introduced by the fix so they can't silently
revert:
- _should_use_responses_api accepts github_copilot on its non-OpenAI base
- _build_responses_body strips the 'github_copilot/' routing prefix
- /responses failures on github_copilot do not fall back to /chat/completions
Made-with: Cursor
DashScope rejects the OpenAI-style value "minimal" with
`'reasoning_effort.effort' must be one of: 'none', 'minimum', 'low',
'medium', 'high', 'xhigh'`, but nanobot was passing the string through
verbatim. Users who tried the documented "minimal" to disable thinking
got a 400; users who tried the DashScope-native "minimum" to work
around it got `enable_thinking=True` because the internal comparison
was a hard string match on "minimal".
Introduce a semantic/wire split in `_build_kwargs`:
- `semantic_effort` is the internal canonical form (OpenAI vocabulary).
"minimum" on the way in is normalized to "minimal" here so both
spellings share one meaning.
- `wire_effort` is what we actually serialize. For DashScope with
semantic_effort == "minimal" we translate to "minimum" on the way
out; other providers are unchanged.
- `thinking_enabled` and the Kimi thinking branch now compare on
`semantic_effort`, so either user spelling correctly disables
provider-side thinking.
Tests:
- Strengthen `test_dashscope_thinking_disabled_for_minimal` to assert
the wire value is "minimum" in addition to the extra_body signal;
the original version only checked extra_body and let the
invalid-value bug slip through.
- Add `test_dashscope_thinking_disabled_for_minimum_alias` so a user
who read the DashScope docs and configured "minimum" still gets
thinking off.
- Add `test_non_dashscope_minimal_not_retranslated` to pin down that
the DashScope-specific translation does not leak to OpenAI et al.
ZhiPu API returns code 1302 with Chinese text "速率限制" instead of
standard HTTP 429 + "rate limit", causing the retry engine to treat
it as non-transient and fail immediately.
Extend `_merge_consecutive` so the three invariants from
`LLMProvider._enforce_role_alternation` all hold for Anthropic:
1. collapse consecutive same-role turns (unchanged)
2. no trailing assistant — Anthropic rejects prefill (unchanged)
3. no leading assistant — Anthropic requires the first turn be user
4. non-empty messages array — recover the last stripped assistant as a
user turn when every turn got stripped, so callers don't hit a
secondary "messages array empty" 400
Anthropic-specific wrinkle: `tool_use` blocks live inside `content` (not
a separate `tool_calls` field) and are illegal inside user turns, so
both recovery paths skip any message carrying them rather than silently
producing a malformed request.
Adds 4 unit tests covering the new branches, including the tool_use
opt-outs, and updates the existing `test_single_assistant_stripped` to
reflect the new rerouting contract.
Made-with: Cursor
Anthropic does not support assistant-message prefill and returns a 400
error when the conversation ends with an assistant turn. This commonly
happens when heartbeat/system messages accumulate trailing assistant
replies in the session history.
The _merge_consecutive method already handles same-role merging but did
not strip trailing assistant messages. The base provider's
_enforce_role_alternation (used by OpenAI-compat) does strip them, but
AnthropicProvider uses its own _merge_consecutive instead.
Add a trailing-assistant stripping loop to _merge_consecutive, matching
the behavior already present in _enforce_role_alternation.
Includes 7 new tests covering merge + strip behavior.
When the Responses API fails repeatedly (3 consecutive compatibility
errors), skip it and fall back directly to Chat Completions. Unlike a
permanent disable, the circuit re-probes after 5 minutes so recovery
is automatic when the API comes back. Success resets the counter.
Keyed per (model, reasoning_effort) so a failure with one model does
not affect others.
Two small follow-ups to the guard:
1. Fix the should_execute_tools docstring so it matches the actual code.
The previous version said "Only execute when finish_reason explicitly
signals tool intent" but the code also accepts finish_reason == "stop".
Explain why (some compliant providers emit "stop" with legitimate tool
calls — openai_compat_provider.py already mirrors this at lines ~633 /
~678 where ("tool_calls", "stop") are both treated as the terminal
tool-call state). Without this, a strict "tool_calls"-only guard would
regress 15 existing runner tests that construct LLMResponse with
tool_calls but no explicit finish_reason (default = "stop").
2. Add tests/providers/test_llm_response.py. This locks the three cases:
- no tool calls -> never executes
- tool calls + "tool_calls"/stop -> executes
- tool calls + refusal / content_filter / error / length / ... -> blocked
These are exactly the boundary cases the #3220 fix is about; without a
test here a future refactor could silently revert the guard.
Body + tests only, no behavior change beyond the existing PR's intent.
Made-with: Cursor
- Extract synthetic user message string to module-level constant
- Tighten comments in _snip_history recovery branch
- Strengthen no-user edge case test to verify safety net interaction
When _snip_history truncates the message history and the only user message
ends up outside the kept window, providers like GLM reject the resulting
system→assistant sequence with error 1214 ("messages 参数非法").
Two-layer fix:
1. _snip_history now walks backwards through non_system messages to recover
the nearest user message when none exists in the kept window.
2. _enforce_role_alternation inserts a synthetic user message
"(conversation continued)" when the first non-system message is a bare
assistant (no tool_calls), serving as a safety net for any edge cases
that slip through.
Co-authored-by: darlingbud <darlingbud@users.noreply.github.com>
Document why MiniMax thinking mode uses a separate Anthropic-compatible provider and list the matching base URLs. Add a small registry test so the new provider stays wired to the expected backend and API key.
Made-with: Cursor
- Inject `thinking={"type": "enabled|disabled"}` via extra_body for
Kimi thinking-capable models (kimi-k2.5, k2.6-code-preview).
- Add _is_kimi_thinking_model helper to handle both bare slugs and
OpenRouter-style prefixed names (e.g. moonshotai/kimi-k2.5).
- reasoning_effort="minimal" maps to disabled; any other value enables it.
- Add tests for enabled/disabled states and OpenRouter prefix handling.
Lock the new interaction-channel retry termination hints so both exhausted standard retries and persistent identical-error stops keep emitting the final progress message.
Made-with: Cursor
Lock the strict-provider sanitization path so assistant tool calls without function.arguments are normalized to {} instead of being forwarded as missing values.
Made-with: Cursor
Ensure assistant tool-call function.arguments is always emitted as valid JSON text so strict OpenAI-compatible backends (including Alibaba code models) do not reject requests. Add regressions for dict and malformed-string argument payloads in message sanitization.
Made-with: Cursor
When a subagent result is injected with current_role="assistant",
_enforce_role_alternation drops the trailing assistant message, leaving
only the system prompt. Providers like Zhipu/GLM reject such requests
with error 1214 ("messages parameter invalid"). Now the last popped
assistant message is recovered as a user message when no user/tool
messages remain.
Add a focused regression test for the successful no-image retry path so the original message history stays stripped after fallback and the repeated retry loop cannot silently return.
Made-with: Cursor
When a non-transient LLM error occurs with image content, the retry
mechanism strips images from a copy but never updates the original
conversation history. Subsequent iterations rebuild context from the
unmodified history, causing the same error-retry cycle to repeat
every iteration until max_iterations is reached.
Add _strip_image_content_inplace() that mutates the original message
content lists in-place after a successful no-image retry, so callers
sharing those references (e.g. the runner's conversation history)
also see the stripped version.
Keep tool-call assistant messages valid across provider sanitization and avoid trailing user-only history after model errors. This prevents follow-up requests from sending broken tool chains back to the gateway.
Resolved conflict in azure_openai_provider.py by keeping main's
Responses API implementation (role alternation not needed for the
Responses API input format).
Made-with: Cursor
- Fix assertion in streaming dict fallback test (trailing space in data
not reflected in expected value).
- Add two regression tests proving that models with reasoning_content
(e.g. DeepSeek-R1) and standard models (no reasoning fields) are
completely unaffected by the reasoning fallback.
Made-with: Cursor
Add comprehensive tests for the StepFun Plan API compatibility fix:
- _parse dict branch: content and reasoning_content fallback to reasoning
- _parse SDK object branch: same fallback for pydantic response objects
- _parse_chunks dict branch: reasoning field handled in streaming mode
- _parse_chunks SDK branch: reasoning fallback for SDK delta objects
- Precedence tests: reasoning_content field takes priority over reasoning
Refs: fix(provider): support StepFun Plan API reasoning field fallback
Merge the three retry-after header parsers (base, OpenAI, Anthropic)
into a single _extract_retry_after_from_headers on LLMProvider that
handles retry-after-ms, case-insensitive lookup, and HTTP date.
Remove the per-provider _parse_retry_after_headers duplicates and
their now-unused email.utils / time imports. Add test for retry-after-ms.
Made-with: Cursor
- Add byteplus and byteplus_coding_plan to thinking param providers
- Only send extra_body when reasoning_effort is explicitly set
- Use setdefault().update() to avoid clobbering existing extra_body
- Add 7 regression tests for thinking params
Made-with: Cursor
The test test_openai_compat_strips_message_level_reasoning_fields was
added in fbedf7a and incorrectly asserted that reasoning_content and
extra_content should be stripped from messages. This contradicts the
intent of b5302b6 which explicitly added these fields to _ALLOWED_MSG_KEYS
to preserve them through sanitization.
Rename the test and fix assertions to match the original design intent:
reasoning_content and extra_content at message level should be preserved,
and extra_content inside tool_calls should also be preserved.
Signed-off-by: Lingao Meng <menglingao@xiaomi.com>