Sometimes Java is just weird (vsadt.com)
12 points by jere_jones 5694 days ago
8 comments

I understand the need for localization and all, but 46 THOUSAND characters? Jeez.

There are that many Han characters alone, so I’m not sure what the surprise is. It’s not like you have to hard-code them in your grammar.

If anything, I’d hope that new languages in 2010 allow any of the roughly 100,000 non-control non-whitespace [edit: non-punctuation] Unicode characters. For a lot of the code I see, ASCII is at least as constraining as, say, fixnums would be.

Actually, it is possible it does. I only ran the test from 0x0000 to 0xFFFF. Maybe I should revisit.
Moreover, you actually already have trivial code for that in $JAVAISH_IMPLEMENTATION's Character.isUnicodeIdentifier{Start,Part}.
I used Character.isJavaIdentifierStart(int) and Character.isJavaIdentifierPart(int) to write a file with the ranges that I cut and pasted into my C# code. And thank goodness, too! I'm sure I would've made a typo or missed characters if I had to type all that myself.
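For anyone who wants to reproduce that, here is a rough sketch of how the range dump might look. The class name IdentifierRanges and the helper partRanges are made up for illustration, not the author's actual code. The total depends on the Unicode version your JDK ships with, so it may not match the article's figure, and isJavaIdentifierPart also accepts some ignorable control characters.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not the article author's code.
class IdentifierRanges {
    /** Collects contiguous [start, end] runs of code points accepted by isJavaIdentifierPart. */
    static List<int[]> partRanges(int from, int to) {
        List<int[]> ranges = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a run"
        for (int cp = from; cp <= to; cp++) {
            if (Character.isJavaIdentifierPart(cp)) {
                if (start < 0) start = cp;          // open a new run
            } else if (start >= 0) {
                ranges.add(new int[] {start, cp - 1}); // close the current run
                start = -1;
            }
        }
        if (start >= 0) ranges.add(new int[] {start, to}); // run reaches the upper bound
        return ranges;
    }

    public static void main(String[] args) {
        // Same span as the original test; use Character.MAX_CODE_POINT to go past the BMP.
        List<int[]> ranges = partRanges(0x0000, 0xFFFF);
        int total = 0;
        for (int[] r : ranges) {
            total += r[1] - r[0] + 1;
            System.out.printf("0x%04X-0x%04X%n", r[0], r[1]);
        }
        System.out.println(total + " identifier-part characters in 0x0000-0xFFFF");
    }
}
```

Swapping in isJavaIdentifierStart gives the (smaller) set of characters allowed in the first position.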
"For now, I'm not going to support this in..."

Often, this is really just another way of saying, "I didn't think of this before, and I don't want to start thinking about it just now."

The floating point notation would be understandable for those who need to be intimately familiar with floats at the bit level. There are a few people who have to do this.

Perhaps I am nitpicking here - but it's "weird" not "wierd".

Also, I am not sure rewording the title added any value. The title of the article IMO was just fine - See http://ycombinator.com/newsguidelines.html

Thank you for the correction. I guess I was typing too fast. :-)
None of this seems particularly weird to me (except maybe the fact that the hexadecimals use binary exponents, but I can imagine why this would be way more useful than a decimal exponent)

Allowing Unicode identifier names is a feature I've seen many people ask for, and doesn't seem like that big a deal. It must be frustrating for non-English speakers to use languages that don't support it. Of course some characters can be in an identifier but not start it; this is true of most languages. You can't start variable names with digits in C.

I see no issue with allowing numbers in different bases. And of course the fractional part would also be in that base. It would be weird if the digits left of the point were in hex and the digits right of it were decimal.
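To make the hex float point concrete, here is a small sketch using Java's hexadecimal floating-point literals. The class and constant names are just for illustration. The mantissa digits (including those after the point) are hexadecimal, while the exponent after `p` is a decimal power of two.

```java
// Illustrative only; HexFloatDemo and its constants are made-up names.
class HexFloatDemo {
    static final double A = 0x1.8p1;  // (1 + 8/16) * 2^1
    static final double B = 0xA.8p0;  // (10 + 8/16) * 2^0
    static final double C = 0x1p-1;   // 1 * 2^-1

    public static void main(String[] args) {
        System.out.println(A); // 3.0
        System.out.println(B); // 10.5
        System.out.println(C); // 0.5
        // Going the other way: the exact bits of a double as a hex literal.
        System.out.println(Double.toHexString(3.0)); // 0x1.8p1
    }
}
```

Because every hex digit maps to exactly four bits, the literal spells out the stored value exactly, which is presumably why a binary exponent is more useful here than a decimal one.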

"For now, I'm not going to support this in the parser as it appears to be a fairly dusty corner of the language. Maybe at some later date."

And that is why this particular parser won't pass the Java TCKs.

I have never heard of "Java TCKs" but after a little research, I understand them better. Just out of curiosity, what would passing the Java TCKs do for me or my users?
Certification will lead to the recognition of your product as an enterprise-ready technology, allowing interoperability and intercompatibilitization with all Oracle Java products and libraries. In today's complex and increasingly specialized economy, a promise of compatibility is not sufficient. The Oracle Java Compatibility Process™ provides you with the credentials you need to succeed in acquiring market share.
I'm curious, are you building your own parser from scratch? There are several existing libraries out there which can work with Java source code either at the bytecode or syntax-tree level.

Depending on your purpose, it might be a whole lot easier to re-use an existing implementation and focus on whatever you're planning to use this parser for...

I'm using Irony as the parser. But I still have to translate the grammar into C# code.

I was unable to find an existing Java parser in C#. Do you know of one?

The only weird thing to me is this:

> # There are 46,908 different valid characters that you can use in an identifier

And yet among all those characters I can't have a question mark at the end of a boolean variable or function name...

Anyway, weirdness is no excuse for not supporting a feature in a parser!

That must be because of the ternary operator:

    condition ? result-if-true : result-if-false
although you're probably right that a smarter parser should be able to distinguish between both cases. But then the next guy comes along and complains that he can't use a colon as a valid character in his variable names...
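
If you want to see which of these characters Java actually accepts, a quick check along these lines should do. IdentifierCharCheck, canStart and canContinue are hypothetical names; the two Character methods are the standard ones mentioned elsewhere in the thread.

```java
// Illustrative sketch; the class and helper names are made up.
class IdentifierCharCheck {
    static boolean canStart(char c) {
        return Character.isJavaIdentifierStart(c);
    }

    static boolean canContinue(char c) {
        return Character.isJavaIdentifierPart(c);
    }

    public static void main(String[] args) {
        // '?' and ':' are operator characters; '$' and '_' are allowed anywhere;
        // a digit may appear in an identifier but may not begin one.
        for (char c : new char[] {'?', ':', '$', '_', '9'}) {
            System.out.println("'" + c + "' start=" + canStart(c) + " part=" + canContinue(c));
        }
    }
}
```

Neither `?` nor `:` is accepted in either position, so the ternary operator never has to compete with identifier characters.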
Bear in mind that this parser's job is only for syntax highlighting and, later on, refactoring. Hmmm... maybe highlighting literals would be useful. I'll have to think on that.
Maybe you could borrow parsing code from Apache's Harmony? That's under a very liberal license.