HN
Today

Math Is Hard

This article traces a decades-old floating-point exception bug on the VAX architecture that caused OpenBSD processes to spin forever. It shows how hardware, the operating system, and signal handling interact, and how a quiet architectural change in early VAX models produced a long-lived, obscure problem. It is a good read for anyone interested in low-level systems and computing history.

Score: 5 | Comments: 0 | Highest Rank: #11 | On Front Page: 14h
First Seen: Apr 25, 11:00 PM | Last Seen: Apr 26, 12:00 PM
Rank Over Time: 19, 14, 12, 12, 11, 14, 19, 25, 24, 26, 22, 24, 25, 26

The Lowdown

Miod Vallat recounts a long-standing bug in the OpenBSD kernel on VAX machines, highlighting how differently processor exceptions are handled in userland and in kernel space. A process that ignored SIGFPE (the floating-point exception signal) could get stuck in an infinite loop, and the cause ultimately traced back to a subtle, poorly documented architectural change in the VAX processor itself.

  • The Perl Problem: In 2002, OpenBSD developers hit a build failure with Perl 5.8 on VAX: its miniperl component got stuck in an infinite loop. The loop occurred when the program ignored the SIGFPE raised by a divide-by-zero, a common pattern in code that does its own arithmetic error handling.
  • VAX Exception Model: The VAX architecture distinguishes "traps" (non-recoverable; the PC points to the next instruction) from "faults" (potentially recoverable; the PC points to the faulting instruction). An arithmetic exception that raises SIGFPE can be either, depending on the specific error.
  • The Infinite Loop: If a floating-point exception was a "fault" (e.g., floating overflow or divide-by-zero) and the program ignored SIGFPE, the kernel returned to userland without advancing the Program Counter (PC). The same instruction then ran again and faulted again, producing an endless loop.
  • The Solution: The kernel had to advance the program counter past the faulting instruction itself. Because VAX instructions are variable-length, the kernel must decode the instruction to determine its length. That non-trivial task was ultimately handled by adapting code from the kernel debugger.
  • Historical Context: The bug was first reported in 2002 on an architecture dating from 1977. Its surprising longevity was attributed to a historical change in VAX hardware. Early VAX-11/780s treated these floating-point conditions as "traps" (PC advanced automatically). Later VAX models, and upgraded 11/780s, changed them to "faults". This shift, and its implications for operating systems like BSD (which allowed ignoring SIGFPE in some cases), was not widely understood, and it was documented only in obscure notes in later VAX reference manuals.
  • Refined Fixes: The initial fix used the skip_opcode function to advance the PC. Years later, another developer corrected a subtle race condition: skip_opcode had to be called before trapsignal() to avoid conflicting PC manipulations.

This look at a niche hardware-software interaction shows how historical design choices and their downstream consequences can create perplexing, long-lived bugs in the deepest layers of an operating system, bugs that take significant detective work to uncover and resolve decades later.