feat: configurable context budget for tool-loop iterations (#2317)

* feat: add contextBudgetTokens config field for tool-loop trimming

* feat: implement _trim_history_for_budget for tool-loop cost reduction

* feat: thread contextBudgetTokens into AgentLoop constructor

* feat: wire context budget trimming into agent loop

* refactor: move trim_history_for_budget to helpers and add docs

- Extract trim_history_for_budget() as a pure function in helpers.py
- AgentLoop._trim_history_for_budget becomes a thin wrapper
- Add docs/CONTEXT_BUDGET.md with usage guide and trade-off notes
- Replace wrapper tests with direct helper unit tests

---------

Co-authored-by: chengyongru <chengyongru.ai@gmail.com>
Jesse 2026-03-23 06:13:03 -04:00 committed by GitHub
parent 0182ce2852
commit 528b3cfe5a
6 changed files with 319 additions and 5 deletions

docs/CONTEXT_BUDGET.md (new file)

@@ -0,0 +1,62 @@
# Context Budget (`context_budget_tokens`)
Caps how many tokens of old session history are sent to the LLM during tool-loop iterations 2+. Reduces cost and first-token latency by trimming history between turns.
## How It Works
During multi-turn tool-use sessions, each iteration re-sends the full conversation history. `context_budget_tokens` limits how many old tokens are included:
- **Iteration 1** — always receives full context (no trimming)
- **Iteration 2+** — old history is trimmed to fit within the budget; current turn is never trimmed
- **Memory consolidation** — runs before/after the loop and always sees the full canonical history; trimming only affects the LLM's view
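The gating rules above can be sketched as follows (illustrative helper, not the actual API):

```python
def should_trim(iteration: int, budget_tokens: int, old_history_len: int) -> bool:
    """Sketch of when trimming applies, per the rules above.

    Iteration 1 always gets full context, a zero budget disables the
    feature entirely, and there must be old history to trim.
    """
    return budget_tokens > 0 and iteration >= 2 and old_history_len > 0
```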
## Configuration
```json
{
"agents": {
"defaults": {
"context_budget_tokens": 1000
}
}
}
```
| Value | Behavior |
|---|---|
| `0` (default) | No trimming — full history sent every iteration |
| `4000` | Conservative — barely trims in practice; good for multi-step tasks |
| `1000` | Aggressive — significant savings; works well for typical linear tasks |
| `< 500` | Clamped to a `500` minimum when positive (12 message pairs at typical token density) |
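The clamping rule in the last row can be expressed as a one-liner (a sketch of the documented behavior, not the actual implementation):

```python
def effective_budget(configured: int) -> int:
    """0 disables trimming; any positive value is floored at 500 tokens."""
    return max(configured, 500) if configured > 0 else 0
```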
## Trade-offs
**Cost & latency** — Trimming reduces the tokens sent on each iteration, which saves money and lowers time to first token (TTFT). This is nanobot's primary sweet spot.
**Context loss** — Older context is not visible to the LLM in later iterations. For tasks that genuinely require 20+ iterations of history to stay coherent, consider `0` or `4000`.
**Tool-result truncation** — Large results from a previous turn (e.g., reading a 10,000-line file in Round 1, then editing in Round 2) can be trimmed. The agent can re-read the file via its tools — this is a 1-tool-call recovery cost, not a failure.
**Prefix caching** — Some providers (e.g., DeepSeek) use implicit prefix-based caching. Aggressive trimming breaks prefix matching and can reduce cache hit rates. For these providers, `0` or a high value may be more cost-effective overall.
## When to Use
| Use case | Recommended value |
|---|---|
| Simple read → process → act chains | `1000` |
| Multi-step reasoning with tool chains | `4000` |
| Complex debugging / long task traces | `0` |
| Providers with implicit prefix caching | `0` or `4000` |
| Long file operations across turns | `0` or re-read via tools |
## Example
```
Turn 1: User asks to read a.py (10k lines)
Turn 2: User asks to edit line 100
```
With `context_budget_tokens=500`, the file-content result from Turn 1 may be trimmed before Turn 2. The agent will re-read the file to perform the edit — a 1-call recovery. This is normal behavior for the feature; it is not a bug.
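This behavior can be demonstrated with a simplified, self-contained sketch of the trimming pass (4-chars-per-token estimate; the real helper also repairs orphaned tool results, which this sketch omits):

```python
def trim(messages: list[dict], turn_start: int, budget: int) -> list[dict]:
    """Drop oldest history messages until the remainder fits the budget."""
    system, old, current = messages[:1], messages[1:turn_start], messages[turn_start:]
    counts = [max(1, len(m["content"]) // 4) for m in old]  # rough token estimate
    while old and sum(counts) > budget:
        old, counts = old[1:], counts[1:]  # trim from the oldest end
    return system + old + current

msgs = [
    {"role": "system", "content": "You are a bot."},
    {"role": "user", "content": "x" * 4000},       # Turn 1: ~1000 tokens
    {"role": "assistant", "content": "y" * 4000},  # Turn 1 result: ~1000 tokens
    {"role": "user", "content": "edit line 100"},  # Turn 2 (current turn)
]
out = trim(msgs, turn_start=3, budget=500)
# Both oversized Turn-1 messages are trimmed; system + current turn survive.
```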


@@ -27,7 +27,7 @@ from nanobot.agent.tools.shell import ExecTool
 from nanobot.agent.tools.spawn import SpawnTool
 from nanobot.agent.tools.web import WebFetchTool, WebSearchTool
 from nanobot.bus.events import InboundMessage, OutboundMessage
-from nanobot.utils.helpers import build_status_content
+from nanobot.utils.helpers import build_status_content, trim_history_for_budget
 from nanobot.bus.queue import MessageBus
 from nanobot.providers.base import LLMProvider
 from nanobot.session.manager import Session, SessionManager
@@ -59,6 +59,7 @@ class AgentLoop:
         model: str | None = None,
         max_iterations: int = 40,
         context_window_tokens: int = 65_536,
+        context_budget_tokens: int = 0,
         web_search_config: WebSearchConfig | None = None,
         web_proxy: str | None = None,
         exec_config: ExecToolConfig | None = None,
@@ -78,6 +79,7 @@ class AgentLoop:
         self.model = model or provider.get_default_model()
         self.max_iterations = max_iterations
         self.context_window_tokens = context_window_tokens
+        self.context_budget_tokens = max(context_budget_tokens, 500) if context_budget_tokens > 0 else 0
         self.web_search_config = web_search_config or WebSearchConfig()
         self.web_proxy = web_proxy
         self.exec_config = exec_config or ExecToolConfig()
@@ -232,6 +234,22 @@ class AgentLoop:
             metadata={"render_as": "text"},
         )

+    def _trim_history_for_budget(
+        self,
+        messages: list[dict],
+        turn_start_index: int,
+        iteration: int,
+    ) -> list[dict]:
+        """Thin wrapper: delegates to the trim_history_for_budget() helper."""
+        return trim_history_for_budget(
+            messages,
+            turn_start_index,
+            iteration,
+            self.context_budget_tokens,
+            Session._find_legal_start,
+        )
+
     async def _run_agent_loop(
         self,
         initial_messages: list[dict],
@@ -250,6 +268,7 @@ class AgentLoop:
         iteration = 0
         final_content = None
         tools_used: list[str] = []
+        turn_start_index = len(initial_messages) - 1

         # Wrap on_stream with stateful think-tag filter so downstream
         # consumers (CLI, channels) never see <think> blocks.
@@ -271,20 +290,23 @@ class AgentLoop:
             tool_defs = self.tools.get_definitions()

+            send_messages = self._trim_history_for_budget(
+                messages, turn_start_index, iteration,
+            )
+
             if on_stream:
                 response = await self.provider.chat_stream_with_retry(
-                    messages=messages,
+                    messages=send_messages,
                     tools=tool_defs,
                     model=self.model,
                     on_content_delta=_filtered_stream,
                 )
             else:
                 response = await self.provider.chat_with_retry(
-                    messages=messages,
+                    messages=send_messages,
                     tools=tool_defs,
                     model=self.model,
                 )

             usage = response.usage or {}
             self._last_usage = {
                 "prompt_tokens": int(usage.get("prompt_tokens", 0) or 0),


@@ -527,6 +527,7 @@ def gateway(
         model=config.agents.defaults.model,
         max_iterations=config.agents.defaults.max_tool_iterations,
         context_window_tokens=config.agents.defaults.context_window_tokens,
+        context_budget_tokens=config.agents.defaults.context_budget_tokens,
         web_search_config=config.tools.web.search,
         web_proxy=config.tools.web.proxy or None,
         exec_config=config.tools.exec,
@@ -722,6 +723,7 @@ def agent(
         model=config.agents.defaults.model,
         max_iterations=config.agents.defaults.max_tool_iterations,
         context_window_tokens=config.agents.defaults.context_window_tokens,
+        context_budget_tokens=config.agents.defaults.context_budget_tokens,
         web_search_config=config.tools.web.search,
         web_proxy=config.tools.web.proxy or None,
         exec_config=config.tools.exec,


@@ -39,6 +39,7 @@ class AgentDefaults(Base):
     context_window_tokens: int = 65_536
     temperature: float = 0.1
     max_tool_iterations: int = 40
+    context_budget_tokens: int = 0  # Max old-history tokens during tool iterations (0 = no trim)
     reasoning_effort: str | None = None  # low / medium / high — enables LLM thinking mode


@@ -6,9 +6,10 @@ import re
 import time
 from datetime import datetime
 from pathlib import Path
-from typing import Any
+from typing import Any, Callable

 import tiktoken
+from loguru import logger


 def strip_think(text: str) -> str:
@@ -201,6 +202,58 @@ def estimate_message_tokens(message: dict[str, Any]) -> int:
     return max(4, len(payload) // 4 + 4)


+def trim_history_for_budget(
+    messages: list[dict[str, Any]],
+    turn_start_index: int,
+    iteration: int,
+    context_budget_tokens: int,
+    find_legal_start: Callable[[list[dict[str, Any]]], int],
+) -> list[dict[str, Any]]:
+    """Trim old session history to fit within context_budget_tokens.
+
+    Returns the original list unchanged when no trimming is needed.
+    Only trims on iteration >= 2 when context_budget_tokens > 0.
+    Current-turn messages (from turn_start_index onward) are never trimmed.
+    """
+    if context_budget_tokens <= 0 or iteration <= 1:
+        return messages
+    if turn_start_index <= 1:
+        return messages  # no old history to trim
+
+    system = messages[:1]
+    old_history = messages[1:turn_start_index]
+    current_turn = messages[turn_start_index:]
+
+    # Pre-compute token counts to avoid double-estimation
+    token_counts = [estimate_message_tokens(m) for m in old_history]
+    total = sum(token_counts)
+    if total <= context_budget_tokens:
+        return messages  # fits, no trim needed
+
+    # Find cut index (O(n) scan, then single slice)
+    cut = 0
+    removed_tokens = 0
+    while cut < len(old_history) and total > context_budget_tokens:
+        removed_tokens += token_counts[cut]
+        total -= token_counts[cut]
+        cut += 1
+    old_history = old_history[cut:]
+
+    # Fix orphaned tool results after trimming
+    legal_start = find_legal_start(old_history)
+    if legal_start > 0:
+        old_history = old_history[legal_start:]
+
+    removed_count = turn_start_index - 1 - len(old_history)
+    if removed_count > 0:
+        logger.debug(
+            "Context budget: trimmed {} history messages ({} tokens) for iteration {}",
+            removed_count, removed_tokens, iteration,
+        )
+    return system + old_history + current_turn
+
+
 def estimate_prompt_tokens_chain(
     provider: Any,
     model: str | None,


@@ -0,0 +1,174 @@
"""Direct unit tests for trim_history_for_budget() helper."""
import pytest
from nanobot.session.manager import Session
from nanobot.utils.helpers import estimate_message_tokens, trim_history_for_budget
def _msg(role: str, content: str, **kw) -> dict:
return {"role": role, "content": content, **kw}
def _system(content: str = "You are a bot.") -> dict:
return _msg("system", content)
def _user(content: str) -> dict:
return _msg("user", content)
def _assistant(content: str | None = None, tool_calls: list | None = None) -> dict:
m = {"role": "assistant", "content": content}
if tool_calls:
m["tool_calls"] = tool_calls
return m
def _tool_call(tc_id: str, name: str = "exec", args: str = "{}") -> dict:
return {"id": tc_id, "type": "function", "function": {"name": name, "arguments": args}}
def _tool_result(tc_id: str, content: str = "ok") -> dict:
return {"role": "tool", "tool_call_id": tc_id, "name": "exec", "content": content}
# --- Early-exit cases ---
def test_budget_zero_returns_same_list():
msgs = [_system(), _user("old1"), _assistant("old reply"), _user("current")]
result = trim_history_for_budget(msgs, turn_start_index=3, iteration=2, context_budget_tokens=0, find_legal_start=Session._find_legal_start)
assert result is msgs
def test_iteration_one_never_trims():
msgs = [_system(), _user("old1"), _assistant("old reply"), _user("current")]
result = trim_history_for_budget(msgs, turn_start_index=3, iteration=1, context_budget_tokens=1000, find_legal_start=Session._find_legal_start)
assert result is msgs
def test_turn_start_at_one_returns_same():
"""turn_start_index=1 means no old history before the current turn."""
msgs = [_system(), _user("current")]
result = trim_history_for_budget(msgs, turn_start_index=1, iteration=2, context_budget_tokens=0, find_legal_start=Session._find_legal_start)
assert result is msgs
def test_history_under_budget_returns_unchanged():
msgs = [_system(), _user("short msg"), _assistant("short reply"), _user("current")]
result = trim_history_for_budget(msgs, turn_start_index=3, iteration=2, context_budget_tokens=50000, find_legal_start=Session._find_legal_start)
assert result is msgs
# --- Trimming cases ---
def test_trim_removes_oldest_messages():
old_msgs = []
for i in range(40):
old_msgs.append(_user(f"old message number {i} padding extra text here"))
old_msgs.append(_assistant(f"reply to message {i} with more padding"))
current_user = _user("current task")
current_tc = _assistant(None, [_tool_call("tc1")])
current_result = _tool_result("tc1", "done")
msgs = [_system()] + old_msgs + [current_user, current_tc, current_result]
turn_start = 1 + len(old_msgs)
result = trim_history_for_budget(msgs, turn_start, iteration=2, context_budget_tokens=500, find_legal_start=Session._find_legal_start)
# System and current turn preserved
assert result[0] == msgs[0]
assert result[-3:] == [current_user, current_tc, current_result]
# Old history trimmed
trimmed_history = result[1:-3]
assert len(trimmed_history) < len(old_msgs)
# Token budget respected
trimmed_tokens = sum(estimate_message_tokens(m) for m in trimmed_history)
assert trimmed_tokens <= 500
def test_trim_preserves_tool_call_boundary():
"""Trimming must not leave orphaned tool results."""
old = [
_user("padding " * 200),
_assistant(None, [_tool_call("old_tc1")]),
_tool_result("old_tc1", "short result"),
_user("recent msg"),
_assistant("recent reply"),
]
current = _user("current")
msgs = [_system()] + old + [current]
turn_start = 1 + len(old)
result = trim_history_for_budget(msgs, turn_start, iteration=2, context_budget_tokens=500, find_legal_start=Session._find_legal_start)
# Check no orphaned tool results
trimmed_history = result[1:-1]
declared_ids = set()
for m in trimmed_history:
if m.get("role") == "assistant" and m.get("tool_calls"):
for tc in m["tool_calls"]:
declared_ids.add(tc["id"])
for m in trimmed_history:
if m.get("role") == "tool":
tc_id = m.get("tool_call_id")
assert tc_id in declared_ids, f"Orphan tool result: {tc_id}"
def test_extreme_trim_keeps_system_and_current_turn():
"""When budget is tiny, only system and current turn remain."""
old = [_user("x" * 2000), _assistant("y" * 2000)]
current = _user("current")
msgs = [_system()] + old + [current]
result = trim_history_for_budget(msgs, turn_start_index=3, iteration=2, context_budget_tokens=500, find_legal_start=Session._find_legal_start)
assert result[0] == msgs[0] # system
assert result[-1] == current # current turn
assert len(result) <= len(msgs)
def test_original_messages_not_mutated():
old = [_user("x" * 2000), _assistant("y" * 2000)]
current = _user("current")
msgs = [_system()] + old + [current]
original_len = len(msgs)
_ = trim_history_for_budget(msgs, turn_start_index=3, iteration=2, context_budget_tokens=500, find_legal_start=Session._find_legal_start)
assert len(msgs) == original_len
def test_current_turn_never_trimmed():
"""All messages at or after turn_start_index must be preserved verbatim."""
old = [_user("old"), _assistant("reply")]
current_turn = [
_user("current user message"),
_assistant(None, [_tool_call("tc1")]),
_tool_result("tc1", "result"),
]
msgs = [_system()] + old + current_turn
turn_start = 1 + len(old)
result = trim_history_for_budget(msgs, turn_start, iteration=2, context_budget_tokens=1, find_legal_start=Session._find_legal_start)
assert result[-len(current_turn):] == current_turn
def test_iteration_two_first_trim():
"""iteration=2 is the first iteration where trimming kicks in."""
old = [_user("x" * 2000), _assistant("y" * 2000)]
msgs = [_system()] + old + [_user("current")]
turn_start = 3
# iteration=1: no trim
r1 = trim_history_for_budget(msgs, turn_start, iteration=1, context_budget_tokens=0, find_legal_start=Session._find_legal_start)
assert r1 is msgs
# iteration=2 with budget=0: no trim (budget is 0)
r2 = trim_history_for_budget(msgs, turn_start, iteration=2, context_budget_tokens=0, find_legal_start=Session._find_legal_start)
assert r2 is msgs
# iteration=2 with positive budget: trim occurs (2000-char msgs ~= 500+ tokens each)
r3 = trim_history_for_budget(msgs, turn_start, iteration=2, context_budget_tokens=500, find_legal_start=Session._find_legal_start)
assert r3 is not msgs