Hacker News new | ask | show | jobs
by lolptdr 4055 days ago
"Parsers work at the grammatical level, lexers work at the word level."

Is this correct to say?

5 comments

Yes. For a given definition of "word", anyway, but "word" is fiddly enough we can assume they mean something sensible.

A stricter definition would be "lexers work at the lexical level" for NLP (in which case it's nearly tautologically obvious), or "lexers work at the token level" for programming languages.

(Creds: lexing / wordbreaking is the topic of my masters research in linguistics. Tries are a good starting point for that, but it turns out that they are a special case of simple morphological analyzer, and you need more complicated ones to do a good job on natural languages.)

Yes, as long as you include punctuation in your definition of "word". I like to think of it like this:

A lexer takes a sequence of characters and produces a sequence of tokens. List in, list out. It just chunks runs of characters together.

A parser takes in a sequence of tokens and produces a tree. It produces a nested data structure from a flat input.

Right, a lexer is a mapping function. A parser is a fold.
A lexer is a fold as well - there are usually fewer elements in the output list than in the input list. A mapping function implies a 1:1 correspondence, because the recursion is abstracted out into map() and the map function can't accumulate any state during the computation. (At least in traditional FP technology; the Mapper in MapReduce isn't actually a true map function, it's more like an unfold.)
Ah true I am incorrect on the lexer as a map. Had I thought on that half a second more I would have saved myself some embarrassment. :)
Lexers work at the lexeme level, which are generally understood to be words, assuming that a word is a non-decomposable unit of language.
Similar to the atom / compound forms described by McCarthy.
I've always understood it to be this:

A lexer is for a regular language (in the formal sense), whereas a parser is for a context-sensitive (or context-free) language.

Or, for a less formal description. A lexer's transformation is one that can be expressed as a function of <<input symbol or end of file>, position in file, state> -> <<some finite number of output symbols (including zero)>, <state or done>>, where state is finite. A lexer can work without backtracking or arbitrary memory usage. Whereas a parser potentially requires either arbitrary (but finite) backtracking or arbitrary (but finite) memory usage.

The boundary gets a little fuzzy, though. For instance, constants, at least for any language that allows arbitrary-sized integers.

I guess only if you consider punctuation to be "words". A more accurate description would be "lexers work at the atom level", IOW breaking the input into atoms - the smallest non-breakable lexical units that are used in a grammar.