| HN Mirror


by zerokernel 3115 days ago

> Byte-by-byte parsing is a valid way to do parsing but not the only way. Byte-by-byte parsers tend to be slow and -- arguably, more importantly -- overly complex and rigid. It is, for example, usually very hard to do "random access" with a byte-by-byte parser, because allowing out-of-order parsing tends to blow the code complexity through the roof.

I have to agree here, from past experience. If the format in question has any chance of being performance-sensitive, don't use FSM-based encodings [1]. It is inordinately difficult to optimize parsing for these encodings even if you only have to handle tiny subsets, and it still won't be fast. A format like msgpack, which prides itself on being very fast, may be fast compared to JSON and other ways of expressing essentially arbitrary structures, but it is DEAD SLOW compared to any direct encoding (be it a dedicated encoding you developed in literally a few hours or something like capnproto).

[1] Obviously, considering an encoding more complex than FSM means that you're an idiot and your application will almost certainly have security vulnerabilities related to the format in the future.

1 comment

bitwize 3115 days ago

kentonv introduced the term 'parsing' into the discussion, not me. Originally I wasn't talking about parsing as such, just being explicit about the byte-offset, length, and ordering of any piece of data you fetch or store by doing (ptr[n] << 24) | (ptr[n+1] << 16) | (ptr[n+2] << 8) | ptr[n+3], or the corresponding write operation, if you're working with a chunk of data that came from, or is destined for, a file or the network. And if for whatever reason you want or need to work with structs, don't try to alias them onto the disk or network-bound bits. FSMs don't even come into it. It's just a matter of being a little more careful than mmap()ing into a C struct and hoping for the best.
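
To make the shift-and-or approach above concrete, here is a minimal sketch in C. The names load_be32, store_be32, struct record, RECORD_WIRE_SIZE, record_decode and record_encode are hypothetical, invented for illustration, and the two-field big-endian layout is an assumption, not any real format. The uint32_t casts are an addition to the expression in the comment: without them each byte is promoted to int, and shifting a byte with its top bit set left by 24 overflows a signed int.

```c
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit big-endian value from p[0..3]. Each byte is widened to
   uint32_t before shifting, so no shift ever reaches the sign bit of int. */
uint32_t load_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* The corresponding write: store v into p[0..3], most significant byte first. */
void store_be32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)((v >> 24) & 0xFF);
    p[1] = (unsigned char)((v >> 16) & 0xFF);
    p[2] = (unsigned char)((v >> 8) & 0xFF);
    p[3] = (unsigned char)(v & 0xFF);
}

/* Hypothetical in-memory record. Its layout and padding are the compiler's
   business and are never assumed to match the bytes on disk or on the wire. */
struct record {
    uint32_t id;
    uint32_t length;
};

/* Assumed wire layout: id at byte offset 0, length at byte offset 4. */
enum { RECORD_WIRE_SIZE = 8 };

/* Decode field by field instead of aliasing the struct onto the buffer.
   Returns 0 on success, -1 if the buffer is too short. */
int record_decode(struct record *out, const unsigned char *buf, size_t len)
{
    if (len < RECORD_WIRE_SIZE)
        return -1;
    out->id = load_be32(buf + 0);
    out->length = load_be32(buf + 4);
    return 0;
}

/* Encode field by field into the assumed wire layout.
   Returns 0 on success, -1 if the buffer is too short. */
int record_encode(const struct record *in, unsigned char *buf, size_t len)
{
    if (len < RECORD_WIRE_SIZE)
        return -1;
    store_be32(buf + 0, in->id);
    store_be32(buf + 4, in->length);
    return 0;
}
```

Because every offset, width and byte order is spelled out, this behaves the same on little-endian and big-endian hosts and does not depend on alignment or struct padding, which is the extra care the comment asks for over mmap()ing into a C struct.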