Hacker News new | ask | show | jobs
by _ar7 3293 days ago
Could you elaborate on the zero allocation part? I made a CSV parser based on yours [0] before the 1.0.0-beta rewrite, and at the very least I have to allocate as much memory to hold a single field in the CSV (and then reuse that memory).

[0]: https://github.com/AriaFallah/csv-parser

2 comments

It's an iterator, so the parser will parse fields from the already-allocated raw CSV string and yield them to you as it reads them as slices borrowed from the string. It's up to you to store them in whatever structured format you want.
The `csv` crate uses allocation, but `csv-core` does not. I think it would help if you looked at the `read_field` API in `csv-core`: https://docs.rs/csv-core/0.1.2/csv_core/struct.Reader.html#m...

The key is that the parser can start and stop at any point in time. So if you give it a fixed size stack allocation and it isn't big enough to hold the data, then `read_field` will return `OutputFull`.[1] At that point, it's up to the caller to figure out what to do. If caller literally cannot dynamically allocate memory at all, then the caller might choose to return an error. Otherwise, the caller might choose to allocate more space and then ask the CSV parser to pick back up where it left off.

With that said, while the `read_field` API is relatively simple, it's actually a bit slower than one would hope because it has to return to the caller every time a single field is read. So there's some overhead there. For that reason, there is also a `read_record` API, which is more complicated to use but is faster and preserves the zero-allocation property. It's faster because it reads an entire record at once.

The higher level `csv` library uses the read_record API[3], and it will expand its allocation when necessary.

The allocation bit isn't terribly important for performance, so long as you can amortize it (which is covered my blog post towards the end). The real trick to making it fast is the DFA. The DFA provides a way to do a very small constant amount of work per byte, which means that the configurability of the CSV parser has zero overhead because it's baked into the DFA. Compare that with the traditional NFA parser[4], and you can see how each state might need to step through multiple conditional checks.

[1] - https://docs.rs/csv-core/0.1.2/csv_core/enum.ReadFieldResult...

[2] - https://docs.rs/csv-core/0.1.2/csv_core/struct.Reader.html#m...

[3] - https://github.com/BurntSushi/rust-csv/blob/34e957e63bb92853...

[4] - https://github.com/BurntSushi/rust-csv/blob/master/csv-core/...