Math Is Hard
This piece chronicles the hunt for a decades-old floating-point exception bug on the VAX architecture that caused OpenBSD processes to spin forever. It shows how hardware, operating system, and signal handling interact, and how a subtle architectural change in early VAX models led to a protracted, obscure problem. It is historical computing detective work for anyone who enjoys low-level systems and the enduring challenges of complex software.
The Lowdown
Miod Vallat recounts a challenging and long-standing bug in the OpenBSD kernel running on VAX machines, highlighting a fundamental difference in how processor exceptions are handled between userland and kernel space. The problem, which manifested as an infinite loop when an application ignored SIGFPE (floating-point exception), ultimately traced back to a subtle, undocumented architectural change in the VAX processor itself.
- The Perl Problem: In 2002, OpenBSD developers encountered a build failure with Perl 5.8 on VAX, as its `miniperl` component would get stuck in an infinite loop. This occurred when the program ignored SIGFPE caused by a divide-by-zero, a common pattern for specific arithmetic handling.
- VAX Exception Model: The VAX architecture distinguishes between "traps" (non-recoverable, PC points to next instruction) and "faults" (potentially recoverable, PC points to faulting instruction). SIGFPE could be either, depending on the specific arithmetic error.
- The Infinite Loop: If a floating-point exception was a "fault" (e.g., floating overflow, divide-by-zero) and SIGFPE was ignored by the program, the kernel would repeatedly re-execute the faulting instruction, leading to an endless loop because the Program Counter (PC) was not advanced.
- The Solution: The kernel needed to manually advance the program counter past the faulting instruction. This was complicated by VAX's variable-length instructions, which required the kernel to disassemble the instruction to determine its length. The code for this non-trivial task was ultimately adapted from the kernel debugger.
- Historical Context: The surprising longevity of this bug, first reported in 2002 for an architecture from 1977, was attributed to a historical change in VAX hardware. Early VAX-11/780s treated these floating-point conditions as "traps" (PC advanced automatically). Later VAX models, and upgraded 11/780s, changed them to "faults," but this architectural shift and its implications for operating systems like BSD (which allowed ignoring SIGFPE in some cases) were not widely understood or documented outside of obscure notes in later VAX reference manuals.
- Refined Fixes: The initial fix involved the `skip_opcode` function to advance the PC. A subsequent correction by another developer was needed years later for a subtle race condition, where `skip_opcode` needed to be called before `trapsignal()` to avoid conflicting PC manipulations.
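The trap/fault distinction and the fix described in the list above can be illustrated with a toy model. This is a minimal sketch in Python, not the OpenBSD kernel code: the instruction names, the `INSTRUCTION_LENGTH` table, and the `run` function are all hypothetical, and the encoding is invented for illustration rather than taken from the real VAX instruction format. It assumes the program ignores SIGFPE, so the only thing that matters after an exception is where the PC ends up.

```python
# Toy variable-length encoding (hypothetical, NOT real VAX opcodes):
# instruction name -> total length in bytes.
INSTRUCTION_LENGTH = {"DIV": 3, "NOP": 1, "HALT": 1}

# A tiny program laid out by address. The DIV at address 0 divides by zero.
PROGRAM = {0: "DIV", 3: "NOP", 4: "HALT"}
RAISES_FPE = {0}  # addresses whose instruction raises an arithmetic exception


def run(program, exception_kind, kernel_skips, max_steps=50):
    """Execute the toy program with SIGFPE ignored.

    exception_kind: "trap"  -> hardware leaves PC at the next instruction
                    "fault" -> hardware leaves PC at the faulting instruction
    kernel_skips:   if True, the kernel steps over a faulting instruction
                    by decoding its length (the role the text gives to
                    skip_opcode).
    Returns (halted, steps_executed).
    """
    pc = 0
    for step in range(1, max_steps + 1):
        name = program[pc]
        if name == "HALT":
            return True, step
        next_pc = pc + INSTRUCTION_LENGTH[name]
        if pc in RAISES_FPE:
            if exception_kind == "trap" or kernel_skips:
                # Either the hardware or the kernel moved past the instruction.
                pc = next_pc
            # Otherwise: fault with no skip. The signal is ignored and the
            # PC is unchanged, so the same instruction runs again.
        else:
            pc = next_pc
    return False, max_steps  # gave up: the program never reached HALT


print(run(PROGRAM, "trap", kernel_skips=False))
print(run(PROGRAM, "fault", kernel_skips=False))
print(run(PROGRAM, "fault", kernel_skips=True))
```

In this model the "trap" case and the "fault with skip" case both reach `HALT`, while the "fault without skip" case exhausts its step budget on the same instruction. The step budget stands in for the infinite loop; a real process would simply never make progress.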
This deep dive into a niche hardware-software interaction perfectly illustrates how historical design choices and their downstream consequences can create perplexing, long-lived bugs in the deepest layers of operating systems, requiring significant detective work to uncover and resolve decades later.