Parse the endpoint host before disabling keepalive so public hostnames that merely contain private-network substrings keep the default connection pool behavior.
Made-with: Cursor
Local model servers (Ollama, llama.cpp, vLLM) often close idle HTTP
connections before the client-side keepalive timer expires. When two
LLM calls happen seconds apart — for example the heartbeat _decide()
phase followed immediately by process_direct() — the second call grabs
a now-dead pooled connection, causing a transient APIConnectionError
on every first attempt.
The fix detects local endpoints via:
- ProviderSpec.is_local (Ollama, LM Studio, vLLM, OVMS)
- Private-network URL patterns (localhost, 127.x, 192.168.x, 10.x,
172.16-31.x, host.docker.internal, [::1])
For these endpoints, the AsyncOpenAI client is created with a custom
httpx.AsyncClient that sets keepalive_expiry=0, forcing a fresh TCP
connection for each request. This is cheap on LAN (sub-5ms connect)
and eliminates the stale-connection retry tax entirely.
Cloud providers (OpenAI, Anthropic, OpenRouter, etc.) keep the default
5-second keepalive, which is fine for high-frequency API usage.
The private-network heuristic also covers the common case where users
configure provider='openai' but point apiBase at a LAN IP running
llama.cpp — the spec says is_local=False, but the URL clearly is.