| HN Mirror

No, the indexer used a custom HTML parser that was ancient (IIRC, when I was there it wasn't even fully HTML5-compliant - it doesn't have to be, because if you're tokenizing pages for indexing the only really important parts are text, anchors, meta tags, formatting, and headers, and as long as you basically get those right it's not important that you have the exact parse tree that a browser would.) It was very tightly optimized for speed, eg. it was SAX-style, with callbacks instead of heap allocations; the data that it returned to the caller was indexes into the original source data rather than case-normalizing and copying buffers; and the liberties it took with the HTML spec were largely so that it could minimize branching and state during the parse.

Headless Chrome existed within the indexing system while I was there but it was turned on for only a very small section of the index for CPU reasons. Their public statements since have indicated it's used for everything now.

Gumbo grew out of a templating language that I was working on with a larger team within Search Infrastructure and Maps. AFAIU the templating language is still being used, though I think Gumbo isn't being used for it anymore (its syntax diverged from pure HTML5 at some point). It was really intended for tools like refactoring & static analysis: that's why it's slower than the indexing HTML parser; why it was done in C rather than C++ (easier binding to scripting languages); why the API is intended to be round-trippable, with access to the original text that data structures refer to; and why it put a premium on generating accurate parses over speed.