by af3d
1249 days ago
Relying on undefined behaviour can't really be considered much of a solution. Any changes to one of those third-party libraries could possibly break your application without warning. I would suggest inserting a sanitization routine right there into the stack to parse and transform the data file accordingly. For the sake of posterity, emitting logs of every "questionable" entry along the way wouldn't be a bad idea either.
The best way to do exactly what you're saying is to just use R and do:
```
data.table::fread('my file.txt') |> arrow::write_parquet('new_file.parquet')
```
That will do the exact same thing: sanitize the file, parse and transform the data correctly, log questionable lines, and output a binary file that other systems can use later.
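If you want those questionable lines kept on disk instead of scrolling past in the console, something like the following might work. It's only a rough sketch: it assumes `fread` reports the lines it has trouble with as ordinary R warnings, and the helper name `convert_with_log` and the log path are made up for illustration.

```r
library(data.table)
library(arrow)

# Hypothetical helper: convert one delimited file to Parquet and keep
# any warnings raised while reading it in a plain-text log file.
convert_with_log <- function(in_path, out_path, log_path) {
  seen <- character()

  dt <- withCallingHandlers(
    fread(in_path),
    warning = function(w) {
      # Record the warning text, then let the read continue.
      seen <<- c(seen, conditionMessage(w))
      invokeRestart("muffleWarning")
    }
  )

  # Only write a log when there was something to report.
  if (length(seen) > 0) {
    writeLines(seen, log_path)
  }

  write_parquet(dt, out_path)
  invisible(seen)
}
```

You'd call it as `convert_with_log('my file.txt', 'new_file.parquet', 'my_file_warnings.log')`. It still leans entirely on the library parser; the only addition is that whatever the parser complains about gets saved next to the output.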
When you're working with thousands of files and hundreds of millions of lines every day, your client will be rightfully pissed if their data is off by $100,000 and your only resolution is to wait 2 weeks for someone in IT on their end upstream to _maybe_ fix the file, hopefully without introducing a new error...
Writing my own delimited file parser instead of building on a huge amount of community effort sounds like the worst case of not-invented-here syndrome ever. What stinks is how willing most of those projects are to fail silently.