4 Commits

Author SHA1 Message Date
Xubin Ren
c937c07178 fix: two bugs in document extraction pipeline
Bug 1: _drain_pending did not call extract_documents on follow-up
messages arriving mid-turn. Documents attached to queued messages were
silently dropped because _build_user_content only handles images.
Fix: call extract_documents before _build_user_content in _drain_pending.

Bug 2: extract_documents read the entire file into memory (up to 50 MB)
just to check 16 bytes of magic header for MIME detection.
Fix: read only the first 16 bytes via open()+read(16) instead of
Path.read_bytes().

Added regression tests for both bugs.

Made-with: Cursor
2026-04-14 13:15:04 +00:00
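
The Bug 2 fix above can be sketched as follows. This is a minimal illustration, not the code from the commit: the helper name `sniff_mime` and the signature table `_MAGIC` are hypothetical, and the real extract_documents may recognise other formats. It shows only the technique the message describes, which is reading a fixed-size header with open()+read(16) rather than loading the whole file with Path.read_bytes().

```python
from pathlib import Path

# Hypothetical signature table, for illustration only.
_MAGIC = (
    (b"%PDF-", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),  # DOCX, XLSX and PPTX are ZIP containers
)


def sniff_mime(path, header_size=16):
    """Guess a MIME type from the first few bytes of a file.

    Only header_size bytes are read, so memory use does not grow with
    the size of the file.
    """
    with open(Path(path), "rb") as fh:
        header = fh.read(header_size)  # at most header_size bytes
    for magic, mime in _MAGIC:
        if header.startswith(magic):
            return mime
    return None  # unknown or empty file
```

Note that a ZIP signature alone cannot tell DOCX, XLSX and PPTX apart; a caller would still need the file extension or the archive contents for that.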
Xubin Ren
47f5795708 refactor: move document extraction from ContextBuilder to API layer
ContextBuilder._build_user_content now only handles images (its original
responsibility).  Document text extraction (PDF, DOCX, XLSX, PPTX) is
performed by the new _extract_documents() helper in server.py, called
before process_direct().  This keeps the core context builder free of
format-specific dependencies and makes the API boundary the single place
where uploaded files are pre-processed.

Tests updated to reflect the new responsibility boundary.

Made-with: Cursor
2026-04-14 13:00:59 +00:00
Xubin Ren
2502fc616b Merge origin/main into feat/api-file-upload
Keep the API file upload branch current with main, enforce the documented JSON base64 per-file limit, and avoid leaking document extraction error strings into user prompts.

Made-with: Cursor
2026-04-14 12:29:43 +00:00
dengjingren
a068df5a79 feat(api): support file uploads via JSON base64 and multipart/form-data
2026-04-08 15:58:52 +08:00