Two related bugs that together caused scheduled jobs to disappear after
a container restart:
1. `_save_store()` used `Path.write_text(...)`, which truncates the
destination in place. A SIGKILL or shutdown mid-write left
`jobs.json` either truncated or corrupt.
2. `_load_jobs()` caught any parse error, logged at WARNING, and
returned an empty list. `start()` then called `_save_store()`
immediately, overwriting the corrupt-but-recoverable file with an
empty job array. Every scheduled job was silently lost with only a
single warning line in the log.
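A minimal sketch of how the two bugs compound (function names mirror the
ones above; the bodies are illustrative, not the actual service code): a
corrupt store parses as nothing, the error is swallowed, and the next save
replaces the recoverable file with an empty array.

```python
import json
from pathlib import Path

def _load_jobs(path: Path):
    try:
        return json.loads(path.read_text())
    except (json.JSONDecodeError, OSError):
        # Bug 2: the corrupt-but-recoverable file is still on disk here,
        # but the caller only ever sees an empty list.
        return []

def _save_store(path: Path, jobs) -> None:
    # Bug 1: write_text() truncates the destination in place, so a
    # SIGKILL mid-write leaves a truncated or empty file behind.
    path.write_text(json.dumps(jobs))

def start(path: Path):
    jobs = _load_jobs(path)
    _save_store(path, jobs)  # overwrites the corrupt file with []
    return jobs
```

Starting against a truncated `jobs.json` returns `[]` and rewrites the
file as `[]`, destroying the only remaining copy of the jobs.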
Reproduction in production: a container restart at 18:08, after which a
job that had fired correctly on two consecutive days never fired again;
`jobs.json` on disk was missing the job entirely.
Fix:
- `_save_store()` now writes via temp file + `os.replace` + `fsync`
(matches the session manager pattern from 512bf59,
"fix(session): fsync sessions on graceful shutdown to prevent data
loss"). An interrupted write cannot corrupt the live file.
- `_load_jobs()` now moves a corrupt store aside as
`jobs.json.corrupt-<ts>` and returns `None` instead of `[]`.
- `start()` aborts with a `RuntimeError` when the on-disk store is
corrupt, instead of starting empty and overwriting.
- `_load_store()` falls back to the previous in-memory snapshot when
a hot reload encounters a corrupt file, so a transient corruption
after start does not drop live jobs.
Tests cover the atomic-write path, the corrupt-file preservation,
the start-time refusal, the in-memory fallback, and a basic save/load
round trip across two service instances. All 79 existing cron tests and
the full suite (2553 tests) still pass.
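The round-trip case might look like this (a self-contained sketch, where
`Store` is a hypothetical minimal stand-in for the real service class):

```python
import json
import os
from pathlib import Path

class Store:
    """Minimal stand-in: atomic save, plain load."""
    def __init__(self, path: Path):
        self.path = path

    def save(self, jobs) -> None:
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps(jobs))
        os.replace(tmp, self.path)  # atomic swap into place

    def load(self):
        return json.loads(self.path.read_text())

def test_round_trip(tmp_path):
    jobs = [{"id": "daily-report", "cron": "0 18 * * *"}]
    Store(tmp_path / "jobs.json").save(jobs)
    # A second instance pointed at the same path sees the same jobs,
    # i.e. state survives a restart.
    assert Store(tmp_path / "jobs.json").load() == jobs
```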