by karbarcca
2074 days ago
The biggest difference in these benchmarks comes down to how multiple threads are leveraged (disclaimer: primary CSV.jl author here). In CSV.jl, it was relatively straightforward to add multithreaded parsing support: we chunk up the file, find row starts, then spawn a threaded task to process each chunk. In data.table (fread), they're constrained by R having a global string intern store, which fundamentally restricts the multithreading capabilities by requiring strings to be interned sequentially. In pandas, there are similar nontrivialities in jumping between the C/Cython source and the Python object world that add a lot of complexity (and probably explain why no one has made the effort to do multithreaded parsing). It's interesting to note, however, that the Apache Arrow project (the pyarrow package in Python) has a multithreaded CSV parser integrated with the project. It provides similar performance to CSV.jl because it was built from the ground up to process chunks on multiple threads into "arrow tables". Unfortunately, it's tied very specifically to the Arrow project, so pandas doesn't benefit from its work! One of my favorite things about Julia is that CSV.jl automatically integrates with other "data table formats", e.g. ODBC.jl, SQLite.jl, MySQL.jl, Arrow.jl, DataFrames.jl, etc. They can all leverage CSV.jl as "their CSV parser" because it's seamlessly integrated via well-established "table" APIs.
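
For anyone curious what "chunk up the file, find row starts, then spawn a task per chunk" looks like in outline, here's a rough Python sketch. To be clear, this is not CSV.jl's actual code: the function names (`find_row_starts`, `parse_chunk`, `parse_csv_chunked`) are made up for illustration, and it assumes no quoted field contains a newline, which a real parser has to handle. Python's GIL also means these threads won't buy you the speedup you'd see in Julia; the point is only the shape of the approach.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor


def find_row_starts(data: bytes, n_chunks: int) -> list[int]:
    """Pick up to n_chunks byte offsets, each sitting at the start of a row.

    Assumption (illustrative only): every b"\n" ends a row, i.e. no quoted
    field contains an embedded newline.
    """
    size = len(data)
    starts = [0]
    for i in range(1, n_chunks):
        guess = i * size // n_chunks        # evenly spaced guess
        newline = data.find(b"\n", guess)   # scan forward to the end of that row
        if newline == -1:
            break
        start = newline + 1                 # the next row begins after the newline
        if starts[-1] < start < size:       # skip duplicates and empty tail chunks
            starts.append(start)
    return starts


def parse_chunk(chunk: bytes) -> list[list[str]]:
    """Parse one chunk of complete rows with the standard csv module."""
    text = chunk.decode("utf-8")
    return list(csv.reader(io.StringIO(text, newline="")))


def parse_csv_chunked(data: bytes, n_chunks: int = 4) -> list[list[str]]:
    """Split data at row starts, parse each chunk in its own task, then concatenate."""
    starts = find_row_starts(data, n_chunks)
    bounds = list(zip(starts, starts[1:] + [len(data)]))
    with ThreadPoolExecutor(max_workers=len(bounds)) as pool:
        # pool.map preserves chunk order, so rows come back in file order
        parts = pool.map(lambda b: parse_chunk(data[b[0]:b[1]]), bounds)
        return [row for part in parts for row in part]
```

Calling `parse_csv_chunked(open("some_file.csv", "rb").read())` (hypothetical file name) should give the same rows as a single sequential `csv.reader` pass, as long as the no-embedded-newlines assumption holds.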