HN
Today

80386 Memory Pipeline

This deep dive dissects the Intel 80386's memory pipeline, revealing how a combination of ingenious microarchitectural optimizations let it handle complex virtual memory efficiently. The author, who is building an FPGA 386 core, explains how techniques like segment descriptor caching, parallel address calculation, and 'early start' coaxed surprising performance out of a 1980s CPU. It's a fascinating journey into retro hardware design, offering insights for both historical appreciation and modern FPGA implementation challenges.

6
Score
0
Comments
#8
Highest Rank
10h
on Front Page
First Seen
Apr 18, 3:00 PM
Last Seen
Apr 19, 12:00 AM
Rank Over Time

The Lowdown

The author, currently building an FPGA 386 core that boots DOS and runs applications like Norton Commander and games like Doom at 75 MHz, steps back to analyze one of the 80386's critical performance subsystems: its memory pipeline. The post delves into how the 386 managed 32-bit protected-mode virtual memory efficiently despite its inherent complexity, completing the common-case address path in roughly 1.5 clocks. It explores the microarchitecture of the memory-access pipeline, efficient address translation, microcode's role, and the RTL timing involved.

  • Microcode for Memory Accesses: The post starts by examining microcode patterns, specifically RD (read) and WR (write) commands alongside DLY (delay), which mark points where the microcode waits for memory operations. The core question is how the hardware makes these operations cheap enough for the whole machine to run efficiently; the answer is a dedicated address path that usually adds only about 1.5 extra cycles.
  • Efficient Segmentation: Segmentation, mandatory in both protected and real modes, could be a performance bottleneck if implemented naively. The 80386 employs two key optimizations:
    • Cached Segment State: When a selector is loaded into a segment register, the processor also loads the descriptor's base, limit, and attributes into an invisible "descriptor cache," avoiding a memory lookup of the descriptor on every access. This design also unifies real-mode and protected-mode segmentation logic, and it enables the "unreal mode" trick, since real-mode segment loads update the cached base but leave the cached limit untouched.
    • Parallel Relocation and Limit Checking: To form a linear address, the segment base is added to the effective address, while the limit check is performed in parallel. This avoids serial dependencies, with further optimizations for effective address calculation (e.g., cheap shift for scaling, optimizing for two-term additions).
  • Early Start Optimization: This ingenious optimization allows the address path to begin processing the next instruction's memory access in the last cycle of the previous instruction, overlapping with its writeback. This effectively hides much of the 1.5-to-2-cycle address-generation latency, improving overall performance by about 9%. However, this complexity also introduced corner cases, such as the famous "POPAD bug."
  • Paging Fast Path: Paging is also optimized. On a Translation Lookaside Buffer (TLB) hit, translation remains fast within the overlapped memory pipeline. A dedicated hardware page walker efficiently handles TLB misses without large microcode routines.
  • Bus Interface and Caching: The 80386 uses a non-multiplexed bus, allowing for address pipelining where the address for the next cycle can be presented while the current bus cycle finishes. While system DRAM was often slower than this ideal, the 386 was designed with external caching in mind, exemplified by the Intel 82385 companion cache controller, which provided significant performance boosts (30-40%).
  • FPGA Implementation Considerations: The author discusses mapping these historical techniques to an FPGA 386 core. Challenges include adapting latch-based 386 designs to modern edge-triggered flip-flops, simplifying the 386's two clock phases to a single FPGA clock, and designing caches (L1 instruction and data caches) to achieve 1-cycle hit latency despite synchronous block RAMs.
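The segmentation fast path described above can be sketched as a toy software model. The Python below is illustrative only (the class and function names are invented, not taken from the post or the FPGA core, and expand-down segments and attribute checks are ignored): it shows how a cached descriptor makes relocation (base + EA) and the limit check independent computations that hardware can run in the same cycle.

```python
from dataclasses import dataclass

@dataclass
class DescriptorCache:
    """Invisible per-segment-register state loaded on selector load."""
    base: int    # linear base of the segment
    limit: int   # highest valid offset within the segment

class SegmentFault(Exception):
    pass

def effective_address(base_reg: int, index_reg: int, scale: int, disp: int) -> int:
    """EA = base + index*scale + disp; scaling by 1/2/4/8 is a cheap shift."""
    return (base_reg + index_reg * scale + disp) & 0xFFFFFFFF

def linear_address(seg: DescriptorCache, ea: int) -> int:
    """Relocation and limit check each depend only on the EA and cached
    descriptor state, so hardware can evaluate them in parallel."""
    if ea > seg.limit:                    # limit check: EA vs. cached limit
        raise SegmentFault(f"offset {ea:#x} exceeds limit {seg.limit:#x}")
    return (seg.base + ea) & 0xFFFFFFFF   # relocation: EA plus cached base

# Usage: a data segment based at 0x10000 with a 64 KiB limit.
ds = DescriptorCache(base=0x10000, limit=0xFFFF)
ea = effective_address(base_reg=0x1000, index_reg=4, scale=4, disp=0x20)
print(hex(ea), hex(linear_address(ds, ea)))  # 0x1030 0x11030
```

Because neither computation waits on the other, a naive "check limit, then add base" serialization is avoided, which is the point of the parallel datapath.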
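A rough model of the early-start overlap: assuming (hypothetically) a 2-cycle address-generation latency of which one cycle can hide under the previous instruction's writeback, the saving accrues on every instruction after the first. The cycle counts below are invented for illustration, not measured from the 386.

```python
def total_cycles(exec_cycles, addr_cycles=2, early_start=True):
    """Sum cycles for a run of memory instructions; with early start,
    one address-generation cycle of each instruction after the first
    overlaps the previous instruction's final (writeback) cycle."""
    total = 0
    for i, c in enumerate(exec_cycles):
        overlap = 1 if (early_start and i > 0) else 0
        total += c + addr_cycles - overlap
    return total

workload = [4, 4, 5, 3]   # illustrative execute cycles per instruction
print(total_cycles(workload, early_start=False))  # 24
print(total_cycles(workload, early_start=True))   # 21
```

Even this toy model shows why overlapping fetch of the next address with writeback of the current instruction yields a single-digit-percent speedup on typical sequences, consistent with the ~9% figure cited in the post.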
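The paging fast path can be sketched the same way. This is a deliberate simplification (real PDEs/PTEs carry present/permission bits and physical addresses, not dict keys, and all names here are illustrative): a TLB hit resolves without any memory traffic, while a miss triggers exactly two dependent reads, the fixed sequence a hardware page walker performs without microcode.

```python
PAGE_SHIFT = 12
PAGE_MASK = 0xFFF

class TLB:
    def __init__(self):
        self.entries = {}   # virtual page number -> physical frame number

    def lookup(self, vpn):
        return self.entries.get(vpn)

def page_walk(memory, cr3, vpn):
    """Two-level walk: page-directory entry, then page-table entry."""
    dir_index = (vpn >> 10) & 0x3FF
    table_index = vpn & 0x3FF
    pde = memory[cr3 + dir_index]     # locates the page table
    pte = memory[pde + table_index]   # locates the final frame
    return pte

def translate(tlb, memory, cr3, linear):
    vpn = linear >> PAGE_SHIFT
    pfn = tlb.lookup(vpn)
    if pfn is None:                   # TLB miss: walker fills the entry
        pfn = page_walk(memory, cr3, vpn)
        tlb.entries[vpn] = pfn
    return (pfn << PAGE_SHIFT) | (linear & PAGE_MASK)

# Usage: linear 0x00401234 -> directory index 1, table index 1.
memory = {0x1000 + 1: 0x2000, 0x2000 + 1: 0x55}
tlb = TLB()
print(hex(translate(tlb, memory, 0x1000, 0x00401234)))  # 0x55234
```

After the first translation the TLB holds the mapping, so repeated accesses to the page take the hit path and stay inside the overlapped memory pipeline.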

The 80386's memory pipeline is presented as a sophisticated blend of latency-hiding techniques across microcode, segmentation, TLB, bus interface, prefetch, and caching. This robust design made virtual memory practical and solidified the 386's role as a foundation for serious PC operating systems.