PostgreSQL index corruption silently broke the matrix.org homeserver. State groups were corrupted, active data was deleted, and restoring consistency took a week of forensic debugging and reindexing. The root cause? Unclear. Hardware, maybe. But not Postgres or Synapse. The team’s fix involved disabling cleanup jobs, restoring from backup, rebuilding a 4TB index, and surgically re-inserting lost rows. This was a full-blown SRE horror story with just enough logs and luck to recover.
Key takeaway: Even mature, boring tech like Postgres can fail catastrophically — you need operational headroom, paranoia, and a backup that actually works.