by masklinn
3542 days ago
Good CSV parsers reach 200MB/s, and a formatted datetime is under 40 bytes, so assuming your input is nothing but dates, that is ~5 million dates per second. At 19µs/date you parse ~50,000 dates per second, so the date parsing is roughly two orders of magnitude (about 100x) behind the data source. Even allowing for other data, e.g. a timestamp followed by some other fields, at 19µs/datetime the parsing can easily end up bottlenecking your entire pipeline if the data source spews (which is common in contexts like HFT, aggregated logs and the like).
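For anyone who wants to check the arithmetic, here is a quick sketch in Python. The only inputs are the figures quoted above; treating "200MB/s" as decimal megabytes and taking the full 40 bytes per date are my assumptions, and the variable names are made up for illustration.

```python
# Back-of-the-envelope check of the figures quoted in the comment.
CSV_BYTES_PER_SEC = 200 * 10**6   # "200MB/s", assumed to mean decimal megabytes
BYTES_PER_DATE = 40               # upper bound on one formatted datetime
PARSE_SECONDS_PER_DATE = 19e-6    # 19µs spent parsing each date

# How many dates the CSV reader can hand over per second.
dates_supplied_per_sec = CSV_BYTES_PER_SEC / BYTES_PER_DATE

# How many dates the date parser can consume per second.
dates_parsed_per_sec = 1 / PARSE_SECONDS_PER_DATE

# How far the parser lags behind the reader.
shortfall = dates_supplied_per_sec / dates_parsed_per_sec

print(f"supplied: {dates_supplied_per_sec:,.0f} dates/s")
print(f"parsed:   {dates_parsed_per_sec:,.0f} dates/s")
print(f"the reader outpaces the date parser by about {shortfall:.0f}x")
```

Shorter dates only widen the gap, since 40 bytes is the generous end of the estimate.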
+1
This is why a little ELT goes a long way.
>Good CSV parsers reach 200MB/s
By good (and open source) we're talking about libcsv, rust-csv, and rust quick-csv[1]. If you're doing your own custom parsing, you can write your own numeric parsers that drop support for nan, inf, -inf, etc. and for scientific notation, which will claw back a lot of the time. If you also know the exact width of the date field, you can shave plenty of time off parsing datetimes too. But at that point, maybe write the data to disk as protobuf, msgpack, avro, or whatever.
[1] https://bitbucket.org/ewanhiggs/csv-game
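To illustrate the fixed-width point, here is a minimal sketch in Python. It assumes every field is exactly `YYYY-MM-DD HH:MM:SS` (19 characters); that layout and the name `parse_fixed_width` are hypothetical, not taken from any of the parsers mentioned. The idea is just to read each component by position instead of interpreting a format string on every call.

```python
from datetime import datetime


def parse_fixed_width(field: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM:SS' field by slicing at known offsets.

    Assumes the layout is fixed. Separator characters are not checked,
    so this only suits input whose format you already trust.
    """
    if len(field) != 19:
        raise ValueError(f"expected 19 characters, got {len(field)}")
    return datetime(
        int(field[0:4]),    # year
        int(field[5:7]),    # month
        int(field[8:10]),   # day
        int(field[11:13]),  # hour
        int(field[14:16]),  # minute
        int(field[17:19]),  # second
    )


sample = "2016-03-01 12:34:56"
# Same result as the general-purpose parser for this layout.
assert parse_fixed_width(sample) == datetime.strptime(sample, "%Y-%m-%d %H:%M:%S")
```

How much this saves depends on the runtime and the input, so measure it against your own data before relying on it. The same positional trick carries over to C or Rust, where you can also skip the intermediate substrings.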