by robochat 1769 days ago
I know that there's an xkcd comic about this, but I think the best solution would be to create a new text-based format with a new extension, e.g. "bsv" (better separated values). This new format would either be much stricter than csv or have a compulsory header that defines the file's format (text encoding, separator character, etc.).

If it is stricter, it would have exactly one field separator, and not a comma, since some locales use commas as the decimal separator (I'm looking at you, France), but something like '|'. It would insist that dates be ISO 8601. It could define how fields are escaped and quoted, although I would prefer that quoting be kept to a minimum. The format should also allow comments, e.g. lines starting with #, so that people can annotate their datasets inside the same file.
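To make the strict variant concrete, here is a rough sketch using Python's standard csv module. The function name `read_strict_bsv`, the sample data, and the choice to treat the first non-comment line as column names are all my own assumptions for illustration, not part of any existing spec.

```python
import csv
from datetime import date


def read_strict_bsv(text, date_columns=()):
    """Parse a hypothetical strict 'bsv' document: '|' separated,
    '#' comment lines, ISO 8601 dates in the named columns."""
    # Drop blank lines and comment lines before handing over to csv.
    # Caveat: a quoted field containing a newline followed by '#'
    # would be mangled by this simple filter.
    lines = [
        ln for ln in text.splitlines()
        if ln.strip() and not ln.lstrip().startswith("#")
    ]
    rows = []
    for row in csv.DictReader(lines, delimiter="|"):
        for col in date_columns:
            # date.fromisoformat raises ValueError for non-ISO dates,
            # which is how the "insist on ISO 8601" rule is enforced here.
            row[col] = date.fromisoformat(row[col])
        rows.append(row)
    return rows


sample = """# Daily readings from a made-up sensor
day|temp_c
2021-03-01|3,5
2021-03-02|4,1
"""

rows = read_strict_bsv(sample, date_columns=["day"])
```

Because the separator is '|', a French-style value like 3,5 passes through without any quoting, and a date such as 01/03/2021 is rejected instead of being silently guessed.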

Alternatively or in addition, it could have some header lines:

1) A header that defines the encoding, separator, decimal separator, quote character, escape character, line ending character, date format ...

2) A header that defines each column's name

3) A header that defines each column's data type and formatting

4) A header that defines each column's unit like m/s or kg - ok, this is a bit of a stretch but it would be great to have.

or some variation of the above.
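As a sketch of what the header variant could look like, assume a first line of the form `#bsv key=value ...` followed by three rows for names, types and units. That layout, the key names, the type names (`str`, `int`, `float`) and the function `parse_bsv` are all invented here for illustration. The declared encoding is only recorded, since the text is already decoded by the time it reaches this function.

```python
import csv


def parse_bsv(text):
    """Parse a hypothetical header-based 'bsv' document."""
    lines = text.splitlines()

    # Header 1: format declaration, e.g. "#bsv separator=; decimal=,"
    tag, *pairs = lines[0].split()
    if tag != "#bsv":
        raise ValueError("missing #bsv format header")
    fmt = dict(p.split("=", 1) for p in pairs)
    sep = fmt.get("separator", "|")
    dec = fmt.get("decimal", ".")

    # Headers 2-4: column names, data types, units; the rest is data.
    names, types, units, *data = csv.reader(lines[1:], delimiter=sep)

    converters = {
        "str": str,
        "int": int,
        # Normalise the declared decimal separator before converting.
        "float": lambda s: float(s.replace(dec, ".")),
    }
    rows = [
        {n: converters[t](v) for n, t, v in zip(names, types, row)}
        for row in data
    ]
    return {"format": fmt, "units": dict(zip(names, units)), "rows": rows}


sample = """#bsv encoding=utf-8 separator=; decimal=,
name;speed;mass
str;float;float
;m/s;kg
swallow;11,2;0,02
coconut;0,0;1,45
"""

result = parse_bsv(sample)
```

After the first line, everything is ordinary delimited text, so an existing csv parser does the heavy lifting and the metadata rows just tell it how to interpret each column.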

Fundamentally, this bsv format would still be csv and most programs would still be able to read it with the parsers that already exist or be quickly adapted to read it. It could still be easily edited by hand but the metadata would be present.

I suspect that this is just a pipe dream because people would find hundreds of ways to break it, but TOML took off and that didn't exist so long ago.

1 comment

I've played around with this (https://jtree.treenotation.org/designer/#standard%20iris)

I don't think you can make a breakthrough through syntax alone. I think you've got to integrate some type of live semantic schema, something like Schema.org. If I used "bsv" and didn't just get a slightly better parsing experience but also got data augmentation for free, or suggested data transformations/visualizations, et cetera, then I could see a community building.

I think perhaps a GPT-N will be able to write its own Schema.org thing, using all the world's content, and then a BSV format could come out of that.