Here’s what happened: Process A grabbed the lock from Redis, started processing a withdrawal, then Java decided it needed to run garbage collection. The entire process froze for 15 seconds while GC ran. Your lock had a 10-second TTL, so Redis expired it. Process B immediately grabbed the now-available lock and started its own withdrawal. Then Process A woke up from its GC-induced coma, completely unaware it lost the lock, and finished processing the withdrawal. Both processes just withdrew money from the same account.
This isn’t a theoretical edge case. In production systems running on large heaps (32GB+), stop-the-world GC pauses of 10-30 seconds happen regularly. Your process doesn’t crash, it doesn’t log an error, it just freezes. Network connections stay alive. When it wakes up, it continues exactly where it left off, blissfully unaware that the world moved on without it.
https://systemdr.substack.com/p/distributed-lock-failure-how-long
https://github.com/sysdr/sdir/tree/main/paxos
https://sdcourse.substack.com/p/hands-on-distributed-systems-with