Kernel Bug

Spent a whole night tracing a strange timer bug: on some machines, a Mace timer did not fire at the correct time. Specifically, on those machines the timer goes off one second earlier than requested. E.g., if you ask it to fire after 2 seconds, it actually fires after 1 second. Strangely, it doesn’t happen on every machine.

Initially I thought it was a problem in the Mace code, so I spent the night digging into the timer code. In the end, there was one function to blame: pthread_cond_timedwait(). It does not behave correctly on the affected machines.

According to this StackOverflow post, the misbehavior is triggered by a Linux kernel bug related to leap-second handling. The problem goes away after rebooting the machine, and indeed the problematic machine hadn’t been rebooted for more than a year. Well, this is so unexpected.

