Hacker News new | ask | show | jobs
by ryncewynd 818 days ago
Are there unicode characters specifically for delimiters?

If Excel had a standardised "Save as USV" option it would solve so many issues for me.

I get so many broken CSVs from third-parties

6 comments

ASCII has characters for unit, record, group and file separator. And a some days ago there was a story here about using the unicode printable representation of these for an editor friendly format.

https://news.ycombinator.com/item?id=39679378

There are characters from ASCII for delimiting records, which are underused because they cause confusion about whether they should be represented as a state change like a backspace character, or as a glyph. See also: "nobody can agree on the line ending character sequence".

The USV proposal uses additional codepoints introduced in Unicode for the representation of the record delimiters, so they will always look and edit like character glyphs, and nobody is using them for some other purpose. The standardized look of these glyphs is unappealing, and they aren't easy to type, but it's fixable with a font and some editing functions.

Most of the issue hinges on Excel support.

> nobody is using them for some other purpose.

There's a lot of tooling which uses them for their intended purpose, which is to represent the C0 control characters in a font, so they can be printed when they appear in a document. Your editor is probably one of those.

Which is why I consider USV a terrible idea. If I see ␇ in a file, I don't want to be constantly wondering if it's "mention ␇" or "use ␇ to represent \x07". That's why the control pictures block exists: to provide pictographs of invisible control characters. Not to make a cutesy hack "look! it's a picture of a control character pretending to be what it isn't!!" format.

I agree about USV, it creates confusion where none needs to exist. For personal use, though, it is not that bad to receive a USV: it should be postmarked ".usv" and in any case if you suspect shenanigans you can `grep` for the offending (literally!) unicode characters and `tr` them into proper ASCII separators. Now, if there is nesting in the USV, I give up.

I share the lament: the whole table issue was solved before it became a problem. POSIX divides ASCII into portable and non-portable characters; only portable characters are allowed in the fields and separators are non-portable. If you need nesting, use a portable encoding of the inner table. This scheme repeats indefinitely without escaping hell or exceptions, preventing tons of errors and headache.

Visibility is such a bizarre complaint. Text editors already handle control characters: they handle tabs, they handle newlines, it is not a tremendous, earth-shattering feature request to make them handle separators gracefully.

I don't underatand why this is a question up for debate. You need eye tracking, so that there is a beep when you read the relevant part.
Hell, there's ASCII characters specifically for delimiters. 0x1C to 0x1F are respectively defined as file, group, record, and unit separators. Unicode naturally inherits them all.
Except nobody uses them. Another previous discussion: https://news.ycombinator.com/item?id=33935140
My significantly bigger beef would be all of the auto-formatting Excel does to mangle data. Excel loves to turn entries into dates.

Human genes had to be renamed so as to avoid this Excel features.

excel now has a prompt so you can tell it to not convert stuff automatically
Gasp. Big news. I do not recall ever seeing this, so I wonder if $JOB is running some hilariously outdated version for compatibility with a load bearing VBA script.
Automatic Data Conversion toggle was only added in the past ~year: https://insider.microsoft365.com/en-us/blog/control-data-con...
Yes, this recent discussion has lots of good info and links in the comments: https://news.ycombinator.com/item?id=39679378
> Are there unicode characters specifically for delimiters?

We could use the HL7 pipe ‘|’ and all enjoy that hell.

God, please no.

For those unfamiliar with the atrocity that is HL7v2, the format is essentially CSV, but with the record separator set to a lone CR, and the field separator usually set to |. Usually, because the format lets the consumer of the format redefine it, for whatever reasons. (The first use of the field separator it determines whatever character it will be. Thankfully, the first use is in a fixed spot, so it's determinable, but still. Oh, but we don't know the character encoding until like the 18th field in … and it doesn't necessarily have to be an ASCII superset. So I have no idea what an HL7v2 message in a non-ASCII superset even looks like, or how a parser is even supposed to reasonably parse such a thing. I presume attempt a decoding in all possible decodings, and then see which one matches the embedded character set, and pray nobody can create a polyglot?)

There's also further separators, delimiting within a field.

It also has its own escape sequences, to deal with the above.

… and it is what carries an unfortunate amount of medical data, and is generally how providers interoperate, despite the existence of more civilized standards like FHIR

I've unfortunately had to bless my brain with much more of this standard this week, for some reason.

Did I mention that subcomponents (if you look at it like a CSV, cells are further subdivided into components & subcomponents, so subcomponents are sort of where we hit "cell text", if you want to keep going with that broken analogy) — contain escape sequences, so that you can have things like the field separator. Normal stuff, so far. The escape sequences also include highlighting, binary blobs, and a subset of roff.

My worst nightmare was a semicolon-delimited file. Where one of the columns had hand-typed street names - without quotes.. so "WELLS" was often "WE;;S".

Since it was the only column like that, the # of columns to the left of the annoying column and the # on the right would always stay the same. So it was pretty easy to clean.

It’s been years since I last worked with HL7. Isn’t there also ^ and ~ to deal with?

Hell indeed.

Isn't that what HL7 stands for? Hell Layer 7 as in the seventh circle of hell.
Yes, there is. Multi-dimensional CSV?
There is the white space bs too. Sometimes it matters, sometimes it doesn’t. What type of white space is it?

Seriously rough.