hussein1362 75c2506c07 fix(cron): atomic write for jobs.json + don't silently overwrite corrupt store
Two related bugs that together caused scheduled jobs to disappear after
a container restart:

1. `_save_store()` used `Path.write_text(...)`, which truncates the
   destination in place.  A SIGKILL or shutdown mid-write left
   `jobs.json` truncated or otherwise corrupt.

2. `_load_jobs()` caught any parse error, logged at WARNING, and
   returned an empty list.  `start()` then called `_save_store()`
   immediately, overwriting the corrupt-but-recoverable file with an
   empty job array.  Every scheduled job was silently lost with only a
   single warning line in the log.
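
The sequence above can be sketched as follows.  This is a minimal
illustration, not the project's code: `load_jobs_old`, `save_store_old`,
`demo_data_loss`, and the `{"jobs": [...]}` file layout are all
hypothetical stand-ins for the pre-fix behaviour.

```python
import json
import tempfile
from pathlib import Path


def load_jobs_old(path: Path) -> list:
    """Hypothetical pre-fix loader: any parse error yields an empty list."""
    try:
        return json.loads(path.read_text())["jobs"]
    except (OSError, ValueError, KeyError):
        # Corruption becomes indistinguishable from "no jobs configured".
        return []


def save_store_old(path: Path, jobs: list) -> None:
    """Hypothetical pre-fix writer: truncates the live file in place."""
    path.write_text(json.dumps({"jobs": jobs}))


def demo_data_loss() -> str:
    with tempfile.TemporaryDirectory() as tmp:
        store = Path(tmp) / "jobs.json"
        # Simulate a write cut off mid-way: valid prefix, missing tail.
        store.write_text('{"jobs": [{"id": "daily-report", "cron": "0 18 * *')
        jobs = load_jobs_old(store)   # parse fails, returns []
        save_store_old(store, jobs)   # startup save overwrites the file
        return store.read_text()
```

After `demo_data_loss()` the store holds only an empty job array, and
the truncated-but-inspectable original is gone.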

Observed in production: container restart at 18:08, after which a
job that had fired correctly for two consecutive days never fired
again.  jobs.json on disk was missing the job entirely.

Fix:
- `_save_store()` now writes via temp file + `os.replace` + `fsync`
  (matches the session manager pattern from 512bf59,
  "fix(session): fsync sessions on graceful shutdown to prevent data
  loss").  An interrupted write cannot corrupt the live file.
- `_load_jobs()` now moves a corrupt store aside as
  `jobs.json.corrupt-<ts>` and returns `None` instead of `[]`.
- `start()` aborts with a `RuntimeError` when the on-disk store is
  corrupt, instead of starting empty and overwriting.
- `_load_store()` falls back to the previous in-memory snapshot when
  a hot reload encounters a corrupt file, so a transient corruption
  after start does not drop live jobs.
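
A rough sketch of the four fix points, under stated assumptions: the
function names mirror the ones above without the leading underscore,
but the bodies, the `{"jobs": [...]}` layout, the integer timestamp
suffix, and the "missing file means no jobs" rule are illustrative
guesses rather than the actual implementation.

```python
import json
import os
import tempfile
import time
from pathlib import Path
from typing import Optional


def save_store(path: Path, jobs: list) -> None:
    """Write to a temp file beside the target, fsync it, then rename over it."""
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump({"jobs": jobs}, f)
            f.flush()
            os.fsync(f.fileno())       # data is on disk before the rename
        os.replace(tmp_name, path)     # atomic swap on the same filesystem
    except BaseException:
        # The live file is untouched; just clean up the temp file.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise


def load_jobs(path: Path) -> Optional[list]:
    """Return the jobs, [] if no store exists (assumed), None if corrupt."""
    if not path.exists():
        return []
    try:
        return json.loads(path.read_text(encoding="utf-8"))["jobs"]
    except (ValueError, KeyError, TypeError):
        # Move the bad file aside so it can be inspected or recovered later.
        aside = path.with_name(f"{path.name}.corrupt-{int(time.time())}")
        os.replace(path, aside)
        return None


def start(path: Path) -> list:
    """Refuse to start empty when the on-disk store is corrupt."""
    jobs = load_jobs(path)
    if jobs is None:
        raise RuntimeError(f"cron store {path} is corrupt; refusing to start")
    return jobs


def reload_store(path: Path, snapshot: list) -> list:
    """Hot reload (hypothetical helper): keep the snapshot if the file is corrupt."""
    jobs = load_jobs(path)
    return snapshot if jobs is None else jobs
```

The temp file is created in the same directory as the target because
`os.replace` is only atomic within a single filesystem.  A stricter
variant might also fsync the parent directory after the rename.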

Tests cover the atomic-write path, the corrupt-file preservation,
the start-time refusal, the in-memory fallback, and a basic save/load
round trip across two service instances.  The existing 79 cron tests and
the full suite (2553 tests) still pass.
2026-05-04 00:16:39 +08:00