RE#: how we built the fastest regex engine in F#

RE#, a new regex engine written in F#, claims benchmark-leading performance and adds boolean operators that mainstream engines lack, namely intersection and complement, all with linear-time guarantees. This technical deep dive explains how it uses 'forgotten' 1964 automata theory, Brzozowski derivatives, to achieve reliable O(N) matching. Hacker News readers appreciate its approach to common regex pitfalls: it addresses ReDoS vulnerabilities and gives regex semantics a mathematically sound interpretation.

Score: 20
Comments: 4
Highest Rank: #4
Time on Front Page: 10h
First Seen: Mar 4, 11:00 AM
Last Seen: Mar 4, 8:00 PM
Rank Over Time: 4, 5, 6, 9, 11, 16, 16, 20, 26, 29

The Lowdown

The article introduces RE#, a novel regex engine built in F# that claims to be the fastest in the world across industry benchmarks. Beyond speed, RE# distinguishes itself by supporting boolean operators (union, intersection, complement) and context-aware lookarounds while maintaining a guaranteed linear-time complexity for searches. The author details the engineering journey and theoretical underpinnings that allowed them to develop an engine that not only outperforms existing solutions but also offers a more semantically sound and practical approach to regular expressions.

  • Historical Context & Problem Statement: Most modern regex engines derive from either Thompson's NFA construction (linear-time but limited in features) or backtracking (feature-rich but vulnerable to exponential-time ReDoS attacks and tied to order-dependent | semantics). Crucially, neither family commonly supports intersection (&) or complement (~).
  • Brzozowski Derivatives: RE#'s core innovation relies on Brzozowski derivatives, a concept from 1964 that was largely forgotten. This elegant mathematical tool allows the engine to compute the "remaining" regex after consuming a character, naturally extending to boolean operators without special machinery.
  • Benefits of Boolean Operators: These operators enable developers to compose complex regexes from smaller, more readable, and maintainable fragments, greatly simplifying tasks like validating structured data or searching for text with multiple properties (e.g., "contains 'a' AND does not contain 'b'").
  • Minterm Compression: To optimize performance, RE# employs minterm compression, which partitions character sets into equivalence classes. This drastically reduces the number of transitions a DFA needs to consider, especially vital for Unicode characters, leading to significant speed gains.
  • Beyond Finite Automata: RE# moves beyond the traditional "finite" state machine concept, allowing an effectively unbounded set of states while matching is still guaranteed to terminate. This approach, which encodes context information directly into states, is key to implementing linear-time lookarounds.
  • Leftmost-Longest Semantics: Unlike many backtracking engines, RE# adheres to POSIX's leftmost-longest matching semantics. This ensures that boolean algebra identities hold true for regexes, eliminating ambiguities caused by alternation order and providing a more predictable and mathematically consistent behavior.
  • Efficient All-Match Finding: The llmatch algorithm efficiently finds all leftmost-longest matches using two linear scans (right-to-left for potential starts, left-to-right for confirmation), reporting matches retroactively once all context is known, without back-and-forth movement.
  • F# Advantages: The choice of F# allowed seamless integration with .NET's high-performance infrastructure (e.g., SIMD, SearchValues<T>), while its functional features like algebraic data types and pattern matching facilitated clear expression of complex algorithms, even with low-level performance optimizations.
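
To make the Brzozowski derivatives bullet concrete, here is a minimal Python sketch of derivative-based matching that includes intersection and complement. It is not RE#'s code: the class names (Empty, Eps, Chr, AnyChar, Cat, Alt, And, Not, Star) and the functions nullable, deriv, and matches are hypothetical names chosen for illustration, and the sketch omits the simplification and state caching a production engine would need.

```python
from dataclasses import dataclass


class Re:
    """Base class for regex syntax-tree nodes (illustrative names only)."""


@dataclass(frozen=True)
class Empty(Re):      # matches no string at all
    pass

@dataclass(frozen=True)
class Eps(Re):        # matches only the empty string
    pass

@dataclass(frozen=True)
class Chr(Re):        # matches one specific character
    c: str

@dataclass(frozen=True)
class AnyChar(Re):    # matches any single character
    pass

@dataclass(frozen=True)
class Cat(Re):        # concatenation: a followed by b
    a: Re
    b: Re

@dataclass(frozen=True)
class Alt(Re):        # union: a | b
    a: Re
    b: Re

@dataclass(frozen=True)
class And(Re):        # intersection: a & b
    a: Re
    b: Re

@dataclass(frozen=True)
class Not(Re):        # complement: ~r
    r: Re

@dataclass(frozen=True)
class Star(Re):       # Kleene star: r*
    r: Re


def nullable(r: Re) -> bool:
    """Return True if r accepts the empty string."""
    if isinstance(r, (Eps, Star)):
        return True
    if isinstance(r, (Empty, Chr, AnyChar)):
        return False
    if isinstance(r, (Cat, And)):
        return nullable(r.a) and nullable(r.b)
    if isinstance(r, Alt):
        return nullable(r.a) or nullable(r.b)
    if isinstance(r, Not):
        return not nullable(r.r)
    raise TypeError(f"unknown node: {r!r}")


def deriv(r: Re, ch: str) -> Re:
    """Return the regex that must match the rest of the input after ch."""
    if isinstance(r, (Empty, Eps)):
        return Empty()
    if isinstance(r, Chr):
        return Eps() if r.c == ch else Empty()
    if isinstance(r, AnyChar):
        return Eps()
    if isinstance(r, Cat):
        left = Cat(deriv(r.a, ch), r.b)
        # If a can match "", ch may also be consumed by b.
        return Alt(left, deriv(r.b, ch)) if nullable(r.a) else left
    if isinstance(r, Alt):
        return Alt(deriv(r.a, ch), deriv(r.b, ch))
    if isinstance(r, And):
        # Boolean operators need no special machinery:
        # the derivative simply distributes over them.
        return And(deriv(r.a, ch), deriv(r.b, ch))
    if isinstance(r, Not):
        return Not(deriv(r.r, ch))
    if isinstance(r, Star):
        return Cat(deriv(r.r, ch), r)
    raise TypeError(f"unknown node: {r!r}")


def matches(r: Re, text: str) -> bool:
    """Whole-string match: derive once per character, then test nullability."""
    for ch in text:
        r = deriv(r, ch)
    return nullable(r)


def contains(c: str) -> Re:
    """Regex for 'any text, then c, then any text'."""
    return Cat(Star(AnyChar()), Cat(Chr(c), Star(AnyChar())))


# "contains 'a' AND does not contain 'b'"
a_without_b = And(contains("a"), Not(contains("b")))
```

With these definitions, matches(a_without_b, "cat") accepts while "cab" and "dog" are rejected. Because nothing is simplified, the derived terms grow with every character, so this sketch is only suitable for short inputs.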
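
The minterm compression bullet can be illustrated with a small Python sketch that groups characters by how they behave under every character class a pattern uses. This is an assumption-laden toy, not RE#'s implementation: the function minterms and the predicates is_digit and is_hex are hypothetical, and it enumerates a 128-character alphabet directly, whereas an engine handling full Unicode would need a more compact set representation.

```python
from collections import defaultdict


def minterms(predicates, alphabet):
    """Partition alphabet into classes of characters that no predicate
    can tell apart. Each class stands for one transition label."""
    groups = defaultdict(list)
    for ch in alphabet:
        # The signature records which predicates accept this character.
        signature = tuple(bool(p(ch)) for p in predicates)
        groups[signature].append(ch)
    return list(groups.values())


# Hypothetical character classes, as a pattern like [0-9] and [0-9a-fA-F]
# would introduce.
def is_digit(ch):
    return ch in "0123456789"


def is_hex(ch):
    return ch in "0123456789abcdefABCDEF"


ascii_alphabet = [chr(i) for i in range(128)]
classes = minterms([is_digit, is_hex], ascii_alphabet)

# Lookup table: character -> index of its equivalence class.
class_of = {ch: i for i, group in enumerate(classes) for ch in group}
```

Here 128 characters collapse into three classes (digits, hex letters, everything else), so an automaton built over this pattern needs at most three outgoing transitions per state instead of one per character.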

RE# represents a significant advancement by resurrecting powerful, forgotten theoretical concepts and combining them with modern engineering to create a regex engine that is fast, reliable, and semantically consistent. By supporting boolean operators and principled lookarounds with linear-time guarantees, it aims to solve long-standing problems in regex usage, offering a more intuitive and robust tool for developers. The engine is open-source and available as a NuGet package, with a web app for demonstration.