| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by lolptdr 4055 days ago
	"Parsers work at the grammatical level, lexers work at the word level." Is this correct to say?

5 comments

gliese1337 4055 days ago

Yes. For a given definition of "word", anyway, but "word" is fiddly enough we can assume they mean something sensible.

A stricter definition would be "lexers work at the lexical level" for NLP (in which case it's nearly tautologically obvious), or "lexers work at the token level" for programming languages.

(Creds: lexing / wordbreaking is the topic of my masters research in linguistics. Tries are a good starting point for that, but it turns out that they are a special case of simple morphological analyzer, and you need more complicated ones to do a good job on natural languages.)

link

munificent 4055 days ago

Yes, as long as you include punctuation in your definition of "word". I like to think of it like this:

A lexer takes a sequence of characters and produces a sequence of tokens. List in, list out. It just chunks runs of characters together.

A parser takes in a sequence of tokens and produces a tree. It produces a nested data structure from a flat input.

link

ch 4055 days ago

Right, a lexer is a mapping function. A parser is a fold.

link

nostrademons 4055 days ago

A lexer is a fold as well - there are usually fewer elements in the output list than in the input list. A mapping function implies a 1:1 correspondence, because the recursion is abstracted out into map() and the map function can't accumulate any state during the computation. (At least in traditional FP technology; the Mapper in MapReduce isn't actually a true map function, it's more like an unfold.)

link

ch 4055 days ago

Ah true I am incorrect on the lexer as a map. Had I thought on that half a second more I would have saved myself some embarrassment. :)

link

jahewson 4055 days ago

Lexers work at the lexeme level, which are generally understood to be words, assuming that a word is a non-decomposable unit of language.

link

agumonkey 4055 days ago

Similar to the atom / compound forms described by McCarthy.

link

TheLoneWolfling 4055 days ago

I've always understood it to be this:

A lexer is for a regular language (in the formal sense), whereas a parser is for a context-sensitive (or context-free) language.

Or, for a less formal description. A lexer's transformation is one that can be expressed as a function of <<input symbol or end of file>, position in file, state> -> <<some finite number of output symbols (including zero)>, <state or done>>, where state is finite. A lexer can work without backtracking or arbitrary memory usage. Whereas a parser potentially requires either arbitrary (but finite) backtracking or arbitrary (but finite) memory usage.

The boundary gets a little fuzzy, though. For instance, constants, at least for any language that allows arbitrary-sized integers.

link

eliben 4055 days ago

I guess only if you consider punctuation to be "words". A more accurate description would be "lexers work at the atom level", IOW breaking the input into atoms - the smallest non-breakable lexical units that are used in a grammar.

link