| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by cbetti 2126 days ago
	In my view our collective interest in CSV as a medium for data distribution has resulted in far too much information loss, and consequently, time wasted on input sanitization, validity checking, and unresolvable conversations about the intent of data values like ”1.12345E+11” and "".

4 comments

paulgb 2126 days ago

Having spent way too much time wrangling vendor-provided CSVs, I 100% agree. I'd love for there to be a common, well-understood format for typed tabular data that supports multiple tables and enforces foreign keys between them. Ideally with a concept of "patching" to enable incremental updates.

Probably the closest thing I'm aware of is handing around a sqlite file, but I'm a little uneasy using a format that's meant to be a database as a transfer format. Dolt looks promising here too. Are there other ways?

Fiahil 2126 days ago

At work, we use Parquet (https://parquet.apache.org/) for almost everything related to a dataframe. We don't really care about performance gains (although, it's nice to have), but we really like to have a schema.

Note, we use mostly Python, some R, and a various range of ML or Optimisation tools, depending on the project.

xyzzy_plugh 2126 days ago

Parquet is kind of a royal pain in the ass compared to CSV/JSON/plaintext mostly because it uses a ton of Thrift encodings, resulting in mostly terrible/broken implementations anywhere outside of the Java/JVM ecosystem. If you're running Apache <Whatever> then sure, it'll probably be fine, but I'd recommend avoiding it if you start having to go down the rabbit hole of implementing support for things in your language du jour.

Fiahil 2126 days ago

The Rust and python impl are fine. But I get it, Parquet may not be perfect or optimal or whatever. It works as a simple, typed, columnar format.

We had to pick a single file format recommendation for sending 100GB+ tables on FTP servers or dropbox, scanning terabytes of useless stuff only to grap an key-value pair, and properly reading integer and UTF-8 columns. Turns out, Parquet is practical. Enough for users to start using it instead of CSV. It could be Avro, but it's just not as easy.

xyzzy_plugh 2126 days ago

> But I get it, Parquet may not be perfect or optimal or whatever.

I actually think Parquet is pretty great in practice, I just have some issues with the sheer volume of abstractions necessary to implement it. I just wish it was anything other than Thrift.

I would probably choose Parquet over anything else, though.

The_rationalist 2126 days ago

For exhaustivity sake, apache arrow could also do the work but Parquet seems more appropriate here https://stackoverflow.com/questions/56472727/difference-betw...

jeremyjh 2126 days ago

Does Parquet provide a way to define foreign keys between different tables in a single dataset?

xyzzy_plugh 2126 days ago

Parquet files don't reference anything outside of the file, usually. A group of parquet files in a folder is usually considered a table, where the schema is union of the schema of the files.

ForHackernews 2126 days ago

Avro is a strong contender for this use case: https://avro.apache.org/docs/current/

https://medium.com/ssense-tech/csv-vs-parquet-vs-avro-choosi...

It's well-supported in Pandas and Kafka, has good schema support, and reasonably small compressed file sizes.

itroot 2126 days ago

https://www.sqlite.org/appfileformat.html - it is OK to use sqlite that way! =)

mildbyte 2126 days ago

The Splitgraph core code on GitHub [0], around which we've built the DDN, is all about managing "data images" which are basically snapshots of PostgreSQL schemata. You can build them with a format similar to Dockerfiles as well as do a "checkout" into a local instance of Splitgraph (which you can connect to with any PG client) -- this enables change tracking and delta compression too.

Behind the scenes, we store them as cstore_fdw [2] files which is a columnar storage format that helps with analytical queries.

[0] https://github.com/splitgraph/splitgraph/

[1] https://splitgraph.com/docs/concepts/images

[2] https://github.com/citusdata/cstore_fdw

pletnes 2126 days ago

Seconding parquet and sqlite, but hdf5 and netcdf4 certainly deserve a mention, especially for multidimensional, scientific, and generally large datasets.

ris 2126 days ago

See "CSV on the Web": https://www.w3.org/TR/tabular-data-primer/

rrrrrrrrrrrryan 2126 days ago

Both JSON and XML are self-documenting, and most modern databases support directly importing and exporting them. Though the tools to accomplish this could be better, these formats are far better suited as a "medium for data distribution" than CSV files are.

That said, a simple compressed .sql file of INSERT statements can often go a long way.

The only reason CSV is widely used is because normal people think of data as spreadsheets, and asking them to fire up a database and shred JSON data into it is ridiculous when they just want to whip up a line graph or answer a simple question (e.g. "What was value X on a this particular date?").

hodgesrm 2126 days ago

The thing that attracted me to SplitGraph from the very start is that they are proposing to make the PostgreSQL wire protocol and SQL dialect a general interface to remote data. The interface is not only well known but backed by permissively licensed, open source libraries. Plus there are hundreds of tools that already connect to PostgreSQL.

This idea makes such sense it's a little surprising nobody did it before.

johnthescott 2121 days ago

run-away queries become a problem when direct sql is exposed to public.

emersion 2126 days ago

CSV is just an example, data vendors can choose e.g. JSON if they want more a sane and well-defined format.

still_grokking 2126 days ago

Regarding "sane & well-defined":

http://seriot.ch/parsing_json.php

pletnes 2126 days ago

Json has many design flaws, e.g. support for large integers, floating point infinity and NaN, for instance.

shawnz 2126 days ago

It is not a design flaw to make a reasonable choice about what data types you support. Especially given the ones captured in JSON are overwhelmingly the most commonly used and necessary types.

In cases where that's not enough, you could roll your own types by putting the values in plain strings and it would still be strictly more expressive than CSV