nanobot

mirror/nanobot

mirror of https://github.com/HKUDS/nanobot.git synced 2026-04-30 06:45:55 +00:00

Commit Graph

Author	SHA1	Message	Date
hlg	8e7d8bef6a	fix(utils): handle malformed think tags and channel markers in strip_think	2026-04-20 17:04:48 +08:00
04cb	e392c27f7e	fix(utils): anchor unclosed think-tag regex to string start (#3004)	2026-04-11 13:46:15 +08:00
chengyongru	e0c6e6f180	test: add regression tests for <thought> tag stripping	2026-04-10 12:10:23 +08:00

hlg

8e7d8bef6a

fix(utils): handle malformed think tags and channel markers in strip_think

Some models / Ollama renderers occasionally emit tokenizer-level template
leaks that the existing regexes miss:

  1. Malformed opening tags with no closing `>`, running straight into
     user-facing content — e.g. `<think广场照明灯目前…` (observed with
     Gemma 4 via Ollama). The earlier `<think>[\s\S]*?</think>` and
     `^\s*<think>[\s\S]*$` patterns both require `>`, so these leak into
     rendered messages.
  2. Harmony-style channel markers like `<channel|>` / `<|channel|>` at
     the start of a response.
  3. Orphan `</think>` / `</thought>` closing tags left behind when only
     the opener was consumed upstream.
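
To illustrate case 1, the following minimal check runs the two earlier patterns quoted above against the leaked example. The variable names are illustrative and not taken from the repository; only the two regexes and the sample string come from this commit message.

```python
import re

# The two earlier patterns, as quoted in the commit message.
CLOSED_BLOCK = re.compile(r"<think>[\s\S]*?</think>")
UNCLOSED_AT_START = re.compile(r"^\s*<think>[\s\S]*$")

# Malformed opener with no closing '>', running straight into content.
leaked = "<think广场照明灯目前…"

# Neither pattern can match, because both require the literal '>'.
closed_hit = CLOSED_BLOCK.search(leaked)
unclosed_hit = UNCLOSED_AT_START.search(leaked)

# A well-formed block is still matched by the first pattern.
well_formed_hit = CLOSED_BLOCK.search("<think>plan</think>answer")
```

Both `closed_hit` and `unclosed_hit` are `None`, so the malformed text passes through untouched, while `well_formed_hit` is a match object.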

Handles each case conservatively:

  - Malformed `<think` / `<thought` only match when the next char is NOT
    a tag-name continuation (`[A-Za-z0-9_\-:>/]`). Explicit ASCII class
    instead of `\w` because Python's Unicode `\w` matches CJK and would
    defeat the primary fix.
  - Orphan closing tags and channel markers are stripped **only at the
    start or end of the text**. `strip_think` is also applied before
    persisting history (memory.py), so mid-text stripping would silently
    rewrite transcripts where the tokens themselves are discussed.

Preserves: `<thinker>`, `<think-foo>`, `<think_foo>`, `<think1>`,
`<think:foo>`, `<thought/>`, literal `` `</think>` `` / `` `<channel|>` ``
inside prose or code blocks.

Adds 16 new regression tests covering both the leak cases and the
preserved-prose cases.

2026-04-20 17:04:48 +08:00

04cb

e392c27f7e

fix(utils): anchor unclosed think-tag regex to string start (#3004)

2026-04-11 13:46:15 +08:00

chengyongru

e0c6e6f180

test: add regression tests for <thought> tag stripping

2026-04-10 12:10:23 +08:00

3 Commits