mirror of
https://github.com/HKUDS/nanobot.git
synced 2026-05-06 17:55:59 +00:00
Two related bugs that together caused scheduled jobs to disappear after a container restart:

1. `_save_store()` used `Path.write_text(...)`, which truncates the destination in place. A SIGKILL or shutdown mid-write left `jobs.json` either truncated or corrupt.
2. `_load_jobs()` caught any parse error, logged at WARNING, and returned an empty list. `start()` then called `_save_store()` immediately, overwriting the corrupt-but-recoverable file with an empty job array. Every scheduled job was silently lost with only a single warning line in the log.

Reproduction in production: container restart at 18:08, after which a job that had fired correctly for two consecutive days never fired again. `jobs.json` on disk was missing the job entirely.

Fix:

- `_save_store()` now writes via temp file + `os.replace` + `fsync` (matches the session manager pattern from 512bf59, "fix(session): fsync sessions on graceful shutdown to prevent data loss"). An interrupted write cannot corrupt the live file.
- `_load_jobs()` now moves a corrupt store aside as `jobs.json.corrupt-<ts>` and returns `None` instead of `[]`.
- `start()` aborts with a `RuntimeError` when the on-disk store is corrupt, instead of starting empty and overwriting.
- `_load_store()` falls back to the previous in-memory snapshot when a hot reload encounters a corrupt file, so a transient corruption after start does not drop live jobs.

Tests cover the atomic-write path, the corrupt-file preservation, the start-time refusal, the in-memory fallback, and a basic save/load round trip across two service instances. Existing 79 cron tests and full suite (2553 tests) still pass.
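
For readers unfamiliar with the pattern, the write and load behaviour described in the fix can be sketched roughly as below. This is an illustrative sketch, not the code from this commit: the function names (`save_store_atomic`, `load_jobs`, `start`), the `{"jobs": [...]}` file layout, and the integer Unix timestamp used for the `.corrupt-<ts>` suffix are all assumptions made for the example.

```python
import json
import os
import tempfile
import time
from pathlib import Path
from typing import Optional


def save_store_atomic(path: Path, jobs: list) -> None:
    """Write to a temp file beside the target, fsync it, then rename over the target."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file must live in the same directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump({"jobs": jobs}, f)  # assumed layout, for illustration only
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before the rename
        os.replace(tmp_name, path)  # readers see either the old file or the new one
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise


def load_jobs(path: Path) -> Optional[list]:
    """Return the job list, [] if no store exists yet, or None if the store is corrupt."""
    if not path.exists():
        return []
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
        return list(data["jobs"])
    except (ValueError, KeyError, TypeError):
        # Keep the damaged file for manual recovery instead of overwriting it.
        aside = path.with_name(f"{path.name}.corrupt-{int(time.time())}")
        os.replace(path, aside)
        return None


def start(path: Path) -> list:
    """Refuse to start on a corrupt store rather than starting empty."""
    jobs = load_jobs(path)
    if jobs is None:
        raise RuntimeError(f"job store {path} is corrupt; it was moved aside for recovery")
    return jobs
```

Returning `None` for "corrupt" and `[]` for "no jobs yet" lets the caller tell the two cases apart, which is what allows `start()` to abort instead of saving an empty array over recoverable data. The sketch does not fsync the containing directory, so whether the rename itself survives a power loss depends on the filesystem.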