6 Commits

Author SHA1 Message Date
NanoBot
c20ecc52d7 feat(transcription): add Xiaomi MiMo ASR provider (mimo-v2.5-asr)
Add support for Xiaomi MiMo ASR as a third transcription backend alongside
Groq and OpenAI Whisper. Xiaomi ASR uses the /v1/chat/completions endpoint
with base64-encoded audio input, rather than the standard Whisper multipart
upload format.

Co-Authored-By:连 <lian@tangping.homes>
2026-06-09 04:29:09 +08:00
Ilia Breitburg
0eb3010e40 feat(transcription): configurable STT model + OpenRouter provider
Add a `transcriptionModel` channel setting and an OpenRouter transcription
backend so voice messages can be transcribed through OpenRouter's
speech-to-text endpoint (e.g. nvidia/parakeet-tdt-0.6b-v3, openai/whisper-1),
alongside the existing Groq/OpenAI Whisper providers.

- schema: add channels.transcriptionModel (None = provider default)
- providers/transcription: extract a shared POST/retry skeleton; add a
  JSON+base64 OpenRouterTranscriptionProvider; make the STT model a
  constructor param on all providers instead of hardcoding it
- channels: route transcriptionProvider="openrouter" and thread the model
  through the manager to each channel
- docs + tests

Only dedicated STT models work on OpenRouter's transcription endpoint;
chat LLMs (e.g. google/gemini-3.5-flash) are rejected there.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 04:01:37 +08:00
Xubin Ren
9c81280300
feat(transcription): add shared voice input support (#4232)
* feat(webui): add voice transcription input

* feat(webui): render ANSI output in code blocks

* refactor(webui): isolate voice recorder logic

* refactor(transcription): keep websocket ingress thin

* refactor(transcription): resolve channel audio settings on demand

* style(webui): neutralize voice waveform color

* feat(webui): add voice input tooltip

* feat(webui): add voice input keyboard shortcut

* fix(webui): distinguish voice shortcut platforms

* fix(webui): place voice button after model selector

* refactor(webui): share voice hold recording helpers

* fix(desktop): allow microphone voice input

* fix(webui): stabilize token usage month labels

* feat(webui): show voice input on settings overview

* fix(webui): label voice capability as recognition

* fix(webui): align capability overview status

* refactor(webui): isolate transcription socket handling

* fix(webui): soften silent voice waveform

* refactor(audio): clarify transcription service location

* docs(transcription): clarify audio and provider boundaries

* fix(exec): reduce session output polling flake
2026-06-09 01:08:49 +08:00
04cb
ef2ef4f789 fix(transcription): normalize chat-style apiBase to audio endpoint (#3637) 2026-05-23 17:32:59 +08:00
chengyongru
3437ff273f fix(transcription): address review nits on PR #3253
- Correct api_key type hint to str | None in _post_transcription_with_retry
- Remove unreachable final return ""
- Fix test_openai_missing_api_key_short_circuits to actually test
  missing-key path (use audio_file fixture so file exists)
- Fix PermissionError patch for Windows (patch class method instead
  of instance attribute)
2026-05-06 15:52:29 +08:00
mohamed-elkholy95
7ebf611be8 fix(transcription): retry Whisper calls and guard malformed responses
A single transient failure between the agent and an OpenAI/Groq Whisper
endpoint currently vanishes as `return ""` in transcribe(). The voice
message arrives as the empty string and there is no way to tell real
silence apart from a failed upload. A malformed but successful response
body is even worse: the JSON-decode error escapes the helper unhandled.

Add a shared `_post_transcription_with_retry` used by both providers.

Retry behaviour:
  - exponential backoff 1s -> 2s -> 4s, up to 3 retries (4 attempts)
  - retryable HTTP statuses: 408, 429, 500, 502, 503, 504
  - retryable exceptions: TimeoutException, ConnectError, ReadError,
    WriteError, RemoteProtocolError

Non-transient failures short-circuit to "" on the first attempt --
retrying a misconfigured key or a broken upload only burns rate-limit
quota. Branches that short-circuit:
  - missing API key, missing audio file
  - file-read errors (PermissionError, OSError) on the audio path,
    preserving the nightly contract for direct provider callers
  - HTTP auth/4xx body issues via raise_for_status()
  - response.json() parse failures
  - non-dict JSON payloads

Sharing one helper means OpenAI and Groq cannot drift apart silently.

Thread `language` through the helper. The multipart files dict is rebuilt
inside the per-attempt loop, so when a caller sets self.language the
`language` field is sent on every attempt -- not just the first.

Tests cover:
  - every advertised retryable status and exception, parameterized
  - language present on attempts 1 and 2 of a 503->200 sequence
  - language absent when unset; present when set (both providers)
  - malformed JSON body and non-dict JSON body short-circuit to ""
  - PermissionError on file read short-circuits with no HTTP attempt
  - max-attempts give-up, exponential-backoff schedule, auth no-retry,
    missing-key / missing-file short-circuit

Test stub fix: the _StubResponse in tests/channels/test_channel_plugins.py
declared no status_code, which the new helper reads for retry classification.
Set status_code = 200 so the stub advertises the successful response that
those tests already simulate. Also moved the two transcription-provider
imports to the top of that file (previously placed mid-file) so the file
is ruff-clean (E402).
2026-05-06 15:52:25 +08:00