| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by robertelder 3556 days ago
	One of the moments where I really started to feel like I was starting to 'see the matrix' was when I was working on a regex engine to try to make my compiler faster (it didn't, but that's another story). The asymptotically fast way to approach regex processing actually involves writing a parser to process the regex, so in order to write a fast compiler, you need to write another fast compiler to process the regexes that will process the actual programs that you write. But, if your regexes get complex, you should really write a parser to parse the regexes that will parse the actual program. This is where you realize that it's parsers all the way down. When you think more about regexes this way, you realize that a regex is just a tiny description of a virtual machine (or emulator) that can process the simplest of instructions (check for 'a', accept '0-9', etc.). Each step in the regex is just a piece of bytecode that can execute, and if you turn a regex on its side you can visualize it as just a simple assembly program.

5 comments

wyager 3556 days ago

I mean, the whole point of regexes (not to be confused with PCREs) is that any given regex is isomorphic to some canonical finite state machine. It is, specifically speaking, a tiny description of an FSM over the alphabet of ASCII characters (or whatever charset you're using).

Interestingly, regexes/FSMs are (IIRC) the most powerful class of machines for which equivalence is decidable. So if you give me any two regexes, I can tell you if they match on all the same strings, but this is not true for any more powerful grammar.

halter73 3556 days ago

Isn't it possible to decide the equivalence of deterministic pushdown automata? Wouldn't DPDAs be considered more powerful than FSMs due to the addition of a stack?

Wikipedia [1] shows there's a paper called "The equivalence problem for deterministic pushdown automata is decidable" that won the Gödel Prize is 2002. I haven't read the paper nor do I currently have access to it though.

[1] https://en.wikipedia.org/wiki/Deterministic_pushdown_automat...

kefka 3556 days ago

Here you go!

http://link.springer.com.sci-hub.cc/chapter/10.1007/3-540-63...

It was initially published in 1997.

wyager 3556 days ago

Huh! Well there you go.

adrianratnapala 3556 days ago

Presumably that means "this is not true for any more powerful class of grammars".

Any particular pair grammars from that class. I mean if you showed me a yacc grammar for Pascal and another one for Java, I am sure I could prove they were ineqivelent.

More tricky, if you gave someone clever a hand written Pascal parser and that same yacc grammar she could probably prove equivilency with some effort. But there is no guarantee that this is always possible.

wyager 3556 days ago

Yes, that is what I meant.

> I mean if you showed me a yacc grammar for Pascal and another one for Java,

This is why I said "any", and not "all". In most practical cases, you can tell if two grammars are different, but it's easy to construct cases where you can't.

hvidgaard 3556 days ago

While it is true for Regular Languages (FSM) strict RegExs, it's hardly true for any contemporary implementation of RegExs. They have back references and other constructs that make them more powerful than a Regular Language, and thereby make it impossible to decide equivalence in the general case.

wyager 3556 days ago

Hence "not to be confused with PCREs", which most pseudo-regex implementations are based off of.

Hydraulix989 3556 days ago

I'll ask the "dumb question" then: What is a PCRE? How is it different than a "true" regex?

robewald 3556 days ago

PCREs are Perl Compatible Regular Expressions, they have backreferences and other constructs. Almost all languages that have Regular Expressions have similar constructs. "Computer Science" Regular Expressions are more limited and implement a "regular language" as defined in https://en.wikipedia.org/wiki/Chomsky_hierarchy. This gives some guarantees, which go out the window with PCREs.

hvidgaard 3556 days ago

Did you edit the comment? I cannot recall reading that at all, but I may have been in a hurry.

wyager 3556 days ago

Nope, didn't change it!

signa11 3556 days ago

> When you think more about regexes this way, you realize that a regex is just a tiny description of a virtual machine (or emulator) that can process the simplest of instructions (check for 'a', accept '0-9', etc.). Each step in the regex is just a piece of bytecode that can execute, and if you turn a regex on its side you can visualize it as just a simple assembly program.

which is _exactly_ what bpf does :)

posterboy 3556 days ago

A regular expression is by definition a regular grammar, the least mighty language in the chomsky hierarchy and realizable by finite state automaton.

You can get something similar to a virtual equivalent of a turing complete language, due to the fact that the tape of the turing machine is, contrary to the hypothetical theory, practically limit by recourses. CPUs are finite state automatons, yet we use them to solve problems of sufficiently low complexity in NP space.

I am not an expert, please correct me if I am wrong.

pmiller2 3556 days ago

Real computers are much closer to linear bounded automata than full Turing machines.

https://en.m.wikipedia.org/wiki/Linear_bounded_automaton

chrisseaton 3556 days ago

I don't get it - what do regexps have to do with compilers and how do they make compilers faster?

ender7 3556 days ago

The first step in compilation is lexing -- converting a character stream to a stream of semantic "tokens", where a token might be "a number literal" or "the 'while' keyword" or a single character token like "{". This process is usually done via regexs.

vidarh 3556 days ago

It is rare for lexers to use actual regex engines, in my experience, as it's impractical if you want to do proper error reporting unless you write your own regex engine in which case people tend to opt for hand-writing lexer rules.

It's not unheard of, and I have the impression it's getting more common, but still only a small proportion of the compilers I've seen over the years explicitly use regexps.

rocket69 3556 days ago

So, what parses the regex in the lex, flex, Jison, etc...

vidarh 3556 days ago

Most tools like that reduces any regexes to dfa's or similar, but most production compilers I've seen don't use tools like that because they're a pain to do proper error reporting for.

rocket69 3556 days ago

Yea, we're not talking about compilers, we're talking about lexers. Regex doesn't even make a little bit of sense in a compiler world. What lexer are you working with?

chrisseaton 3556 days ago

I didn't think of lexer rules as regular expressions but I suppose they are of course.

But I can't imagine a lexer would ever be the performance bottleneck in a compiler.

adrianratnapala 3556 days ago

According to textbooks, lexers are the bottleneck because they are the only part that has to literally look at every byte of the source (I am including hashing as part of the lexing process).

I am not sure if this is true any more. It probably depends on the language.

Even as CPUs grow faster, code bases grow bigger, so the number-of-bytes argument is still important. On the other hand, heavy optimisations and difficult languages like C++ will shift the bottlenecks to later stages.

readittwice 3556 days ago

I think I read that too, but I am not sure that this is actually true. Please correct me if I am wrong, according to the Rust developers code generation seems to be the bottleneck. That's why they are working on incremental compilation to improve compilation times. They detect changes in the input files by comparing the AST to the old version. So parsing is always necessary but later stages can be cached. That seems to confirm that parsing is not the bottleneck, at least not for Rust.

OTOH that probably also depends on your use case, JS for example needs to get parsed at every page load. Some JS-VMs only parse the source code lazily, at first the parser only detects function boundaries. Only if this function is really executed, the function gets parsed completely.

dan00 3556 days ago

> I think I read that too, but I am not sure that this is actually true. Please correct me if I am wrong, according to the Rust developers code generation seems to be the bottleneck.

You're both right. The problem of C++ is how '#include' works, that it just includes the content of all files and therefore there's more overhead on the lexing side.

Rust's "equivalent" of '#include' is 'use', which doesn't have this problem, because it doesn't concatenates files.

barrkel 3556 days ago

You can make codegen faster by doing less work and producing less sophisticated code. You can't not look at every byte. It's the limiting factor in fastest possible compilation, not what takes the most time in most production compilers.

Gibbon1 3556 days ago

I suspect that modern languages try and fix some of this by making the syntax unambiguous. Means you only need to tokenize each file exactly once. Compare with older languages where if you change a header/module/etc you need to reparse the whole shebang.

Possible with rust that moves a bunch of the grant work into the code generator. AKA where as in C/C++ by the time you're generating code all your type information is set in stone, possible in rust a bunch of stuff isn't resolved yet.

adrianratnapala 3556 days ago

Rust uses LLVM, a heavyweight backend with fearsomely thorough optimisation.

It might be that for non-optimised output, the parser becomes the bottleneck again. Or it might be that LLVM just imposes overhead even for the unoptimised case, that pays for itself in the optimised case.

titzer 3556 days ago

Depending on how source programs are split up into files, parsing can be easily parallelized (think one thread per file), while other compilation tasks are harder due to interdependencies. E.g. semantic analysis requires building type representations, global namespaces, etc. Code generation is (usually) parallelizable as well, but there are a couple very serial steps in the middle, too.

My experience is that parsers for source languages can reach into the 1-10mb/s range, and depending on how complex the IRs and transformations are after that, code generation is usually around 0.5mb-5mb/s. The stuff in the middle (dealing with IRs) is harder to measure in terms of bytes.

ab129c871xy 3556 days ago

Perhaps in an ancient one-pass C compiler. Modern compilers build many trees, usually one for each pass. Later stages do expensive operations on sets (even without optimization one has to do basic register allocation), so I'd say the lexing stage is entirely negligible.

jbangert 3556 days ago

C++ compilers (at least on most current code bases that don't need modules) still need to parse and lex headers though -- so actual executable code is pretty rare. Most compilations are also debug builds which don't optimize(but creating debug information can also be slow; the really slow step is usually linking)

signa11 3556 days ago

> But I can't imagine a lexer would ever be the performance bottleneck in a compiler.

one rather trivial way to observe the effects of avoiding building a source file is to use ccache. ccache avoids the recompilation (even if you do a make clean, or some such), and it is not uncommon to observe speed ups of factor of 5 or so.

however, once you have crossed that barrier, you hit the linking wall. which is where you would end up spending a large portion of time. gold (https://en.wikipedia.org/wiki/Gold_(linker)) optimizes that i.e. supports incremental linking, but unfortunately, i haven't had any experience in large code-bases where it is being used.

pertymcpert 3556 days ago

No chance, modern compilers spend most of their time in the optimization stages. Lexing is peanuts.

signa11 3556 days ago

Try it. You _might_ be surprised :)

kjax 3556 days ago

Lexer generators typically allow describing rules with regular expressions, but through Thompson's construction, subset construction, and DFA minimization, you can end up with a blazing fast/optimized DFA that processes your input at maximum speed.

paulddraper 3556 days ago

I think they're usually done with a specific FSM, rather than general regular expressions.

titzer 3556 days ago

Not all parsers do a lexing step. Such ones are called "scannerless parsers" and can be much, much faster than parsers with a separate lexing step.

duaneb 3556 days ago

A lot of optimizations on arbitrary byte code often look for patterns in byte streams (or in reified assembly) in similar ways to regexes.

teraflop 3556 days ago

For many years, an important stage of the GHC Haskell compiler consisted of a giant Perl script full of regexes.

http://code.haskell.org/ghc-scp/ghc/docs/comm/the-beast/mang...

Lanzaa 3556 days ago

If you spent much time looking at regexs you might enjoy this set of articles about improving regex engine performance:

https://swtch.com/~rsc/regexp/regexp1.html