Kaloyan Tenchov 8a2a5eecdd fix(signal): emit textStyle offsets in UTF-16 code units
Signal's BodyRange (via signal-cli's textStyle) interprets start/length as
UTF-16 code units, but the Phase-3 assembly used Python's len(), which counts
code points. A single non-BMP character (e.g. an emoji) earlier in a message
shifted every subsequent styled span left by one unit, dropping the last
letter of bold/italic words.

Track a running UTF-16 offset in the assembly loop and add regression tests
covering emojis, supplementary CJK, ZWJ sequences, and a multi-section
message that mirrors the reported failure.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 22:57:49 +08:00
..