Hacker News
by cocok 677 days ago
I stopped at 1.1 Notation. Full of arbitrary-looking decisions on which characters can be used for what.

It's 2024, and we still don't have a string notation that uses different characters for the opening and closing delimiters. If I start parsing from an arbitrary offset in the code, I can't say whether a double quote I read is the beginning of a string or the end of one. I have to either resort to heuristics or parse from the beginning of the file (at least once, and then cache offsets known to be outside a string). Something like "() would be nice. Still the familiar double quote, but the grouping is defined in a grammatically superior way.
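To make the complaint concrete, here is a minimal sketch of why same-character delimiters force a scan from the start. It assumes a toy language (not any real grammar) where strings are delimited by `"`, a backslash escapes the next character, and there are no comments or other literal forms. The name `inside_string` is made up for illustration.

```python
def inside_string(source: str, offset: int) -> bool:
    """Return True if `offset` falls inside a double-quoted string.

    Toy assumptions: '"' opens and closes strings, a backslash inside a
    string escapes the next character, nothing else is special.
    """
    in_string = False
    i = 0
    # The only way to know the state at `offset` is to replay every
    # quote that comes before it, starting from the top of the file.
    while i < offset:
        ch = source[i]
        if in_string and ch == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if ch == '"':
            in_string = not in_string
        i += 1
    return in_string


source = 'a = "x" + "y"'
print(inside_string(source, 5))   # offset of x
print(inside_string(source, 8))   # offset of +
print(inside_string(source, 11))  # offset of y
```

The quote at offset 6 and the quote at offset 10 are the same character, so a reader who lands between them has no local way to tell that `+` is code rather than string content.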

Also, still no identifiers that can start with a digit. Most of the mainstream languages have such complex grammars, probably requiring hand-coded parsers, but I can't have a "52cards" identifier. Is this really that hard compared to everything else?

Now, I'm self-taught and all. Maybe I'm missing something and the professors are right.

4 comments

Using parens is still problematic because they have semantic meaning in most coding languages, meaning you'll still have to backtrack to the opening paren to decide whether the later paren is closing an expression or a string.

> Is this really that hard compared to everything else?

In the languages I've seen that don't allow identifiers to start with a digit, it's because doing so makes other expressions ambiguous.

E.g. in Python (et al), things that start with 0x are treated as a hex literal. 0x9 would be ambiguous because it could either be an identifier named 0x9 or a literal for 9 in hex.

It also makes integer literals ambiguous, because 54 would be both a valid identifier and a valid literal.

You could disambiguate that with more rules (identifiers can include digits but can't start with 0x, identifiers must include at least one non-numeric character), but the gain for doing so is so low it feels a little quixotic.
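As a rough sketch of what those extra rules could look like, here is a token classifier for a hypothetical language. The regexes and the `classify` function are assumptions made up for this example, not taken from any real lexer.

```python
import re

# Hypothetical token rules, written only to illustrate the two extra rules.
HEX_LITERAL = re.compile(r"0[xX][0-9a-fA-F]+")
DEC_LITERAL = re.compile(r"[0-9]+")
IDENTIFIER = re.compile(r"[A-Za-z0-9_]+")


def classify(token: str) -> str:
    """Classify one whitespace-delimited token under the toy rules."""
    if HEX_LITERAL.fullmatch(token):
        return "hex literal"
    if DEC_LITERAL.fullmatch(token):
        # All digits: always a number, never an identifier.
        return "decimal literal"
    if token[:2].lower() == "0x":
        # Starts with 0x but is not valid hex: reject rather than guess.
        return "invalid"
    if IDENTIFIER.fullmatch(token):
        # Reaching here means at least one non-digit character is present.
        return "identifier"
    return "invalid"


for tok in ["52cards", "54", "0x9", "0xcards", "cards52"]:
    print(tok, "->", classify(tok))
```

Note that this toy ignores floats: something like `1e5` would be classified as an identifier here, so a real language would need yet another rule for it.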

Forth almost has you covered on both points: `52cards`, `1+`, and `0` would all first be looked up in the dictionary, and only if they had not been defined would an integer conversion be attempted, and `s" ` is in principle distinct from the trailing `"`.
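A loose model of that lookup order, as a sketch only: the `interpret` function and the sample dictionary below are invented for illustration, and Python's `int()` is not identical to Forth's number conversion.

```python
def interpret(token: str, dictionary: dict, base: int = 10):
    """Resolve a token the way described above: dictionary first, number second."""
    if token in dictionary:
        # A defined word wins, even if it looks like a number.
        return ("word", dictionary[token])
    try:
        return ("number", int(token, base))
    except ValueError:
        raise LookupError(f"undefined word: {token}") from None


# Hypothetical dictionary contents.
words = {"52cards": "deck", "1+": "increment", "0": "zero-constant"}

print(interpret("52cards", words))
print(interpret("0", words))
print(interpret("52", words))
```

With this ordering, `52cards` never reaches the integer conversion, and `52` only becomes a number because nobody defined a word with that name.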

[Unicode does have various quotation-mark pairs, but note that which one is the opener can depend on the natural language: e.g. « French » vs »magazine German«]

There is a classic English-language solution in the so-called ⁶⁶round quotes⁹⁹, like the ones the accursed SmartQuotes MS Word feature inflicts. There is more than one pair, and they look visually indistinct in some situations, so I used superscript numbers above rather than “this”.

I like the «guillemets» (French) solution, and also the similar ‹chevrons›. As with all similar holy wars, there are issues with familiarity, input methods, etc., and I don't think it will be solved through mere elegance.

Honorable mention to the classic ``unix quotes'' that are still seen in typesetting software.

Elixir has a ~s"" syntax that you might like.

Still, you're making a big deal of something very minor IMHO.

> If I start parsing from an arbitrary offset in the code, ...

Why would you ever do that? What's the point?

There are many other examples, but in C and C++, if you don't start parsing at the beginning, you're definitely going to get many things wrong. What if you start parsing in the middle of an identifier? How can you possibly expect to get something useful from that?

> Why would you ever do that? What's the point?

Syntax highlighting code visible on the screen

> Syntax highlighting code visible on the screen

That... does not work in most programming languages. Especially so for showing indent levels correctly. Generally speaking, the language server (or whatever) is parsing the entire file.

> Why would you ever do that? What's the point?

Performance, mostly. If you have to re-parse part of the code many times a second, in a text editor. For [pseudo]structural editing or syntax highlighting, for example.

Cache the position of quotes/apostrophes, etc.
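Something along these lines, as a sketch under toy assumptions (strings delimited by `"`, backslash escapes inside strings, no comments). The function names are made up, and a real editor would also have to update the cache on every edit, which is not shown here.

```python
from bisect import bisect_left


def quote_offsets(source: str) -> list:
    """Scan once from the start and record the offset of every unescaped '"'."""
    offsets = []
    in_string = False
    i = 0
    while i < len(source):
        ch = source[i]
        if in_string and ch == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if ch == '"':
            offsets.append(i)
            in_string = not in_string
        i += 1
    return offsets


def inside_string_cached(offsets: list, offset: int) -> bool:
    """Answer from the cache: an odd number of quotes before `offset` means inside."""
    return bisect_left(offsets, offset) % 2 == 1


source = 'a = "x" + "y"'
cache = quote_offsets(source)
print(cache)
print(inside_string_cached(cache, 8))   # offset of +
print(inside_string_cached(cache, 11))  # offset of y
```

After the one full scan, each query is a binary search over the cached offsets instead of a re-parse from the top of the file.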

Optimizing for your edge case would require everyone writing and reading code to conform to this extra thing, which seems completely unnecessary. Machines are pretty fast.

https://www.youtube.com/watch?app=desktop&v=ZI198eFghJk Modernizing Compiler Design for Carbon Toolchain - Chandler Carruth - CppNow 2023