I understand the need for localization and all, but 46 THOUSAND characters? Jeez.
There are that many Han characters alone, so I’m not sure what the surprise is. It’s not like you have to hard-code them in your grammar.
If anything, I’d hope that new languages in 2010 allow any of the roughly 100,000 non-control non-whitespace [edit: non-punctuation] Unicode characters. For a lot of the code I see, ASCII is at least as constraining as, say, fixnums would be.
I used Character.isJavaIdentifierStart(int) and Character.isJavaIdentifierPart(int) to write a file with the ranges that I cut and pasted into my C# code. And thank goodness, too! I'm sure I would've made a typo or missed characters if I had to type all that myself.
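For anyone wanting to do the same, here is a rough sketch of how such a range file could be generated. The class name `IdentifierRanges` and the output format are my own assumptions, not what the poster above actually wrote; only `Character.isJavaIdentifierStart(int)` and `Character.isJavaIdentifierPart(int)` come from the comment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical helper: collapses the code points accepted by a predicate
// into inclusive [start, end] ranges that can be pasted into another lexer.
class IdentifierRanges {

    /** Returns maximal runs of consecutive code points accepted by the predicate. */
    static List<int[]> ranges(IntPredicate accepts) {
        List<int[]> out = new ArrayList<>();
        int runStart = -1; // -1 means "not currently inside a run"
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (accepts.test(cp)) {
                if (runStart < 0) {
                    runStart = cp;
                }
            } else if (runStart >= 0) {
                out.add(new int[] {runStart, cp - 1});
                runStart = -1;
            }
        }
        if (runStart >= 0) {
            out.add(new int[] {runStart, Character.MAX_CODE_POINT});
        }
        return out;
    }

    public static void main(String[] args) {
        // Print the identifier-start ranges; swap in isJavaIdentifierPart for the other table.
        for (int[] r : ranges(Character::isJavaIdentifierStart)) {
            System.out.printf("0x%04X, 0x%04X,%n", r[0], r[1]);
        }
    }
}
```

The exact ranges depend on the Unicode version your JDK ships with, so the output is worth regenerating rather than copying from someone else's run.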
Often, this is really just another way of saying, "I didn't think of this before, and I don't want to start thinking about it just now."
The floating point notation would be understandable for those who need to be intimately familiar with floats at the bit level. There are a few people who have to do this.
None of this seems particularly weird to me (except maybe the fact that hexadecimal floating-point literals use binary exponents, but I can imagine why that would be far more useful than a decimal exponent).
Allowing Unicode identifier names is a feature I've seen many people ask for, and it doesn't seem like that big a deal. It must be frustrating for non-English speakers to use languages that don't support it. Of course some characters can appear in an identifier but not start it; this is true of most languages. You can't start a variable name with a digit in C.
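To make the start/part distinction concrete, here is a minimal sketch of an identifier check built on the two `Character` methods mentioned earlier in the thread. The class and method names are made up for illustration, and it deliberately ignores reserved words, which a real Java lexer would also have to reject.

```java
// Hypothetical helper: checks the character-level rule only, not keywords.
class IdentifierCheck {

    /** True if s starts with an identifier-start code point and continues with identifier-part code points. */
    static boolean isIdentifier(String s) {
        if (s.isEmpty()) {
            return false;
        }
        int first = s.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) {
            return false;
        }
        // Step by code point, not by char, so characters outside the BMP are handled.
        for (int i = Character.charCount(first); i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isIdentifier("x1"));   // true: a digit may follow a letter
        System.out.println(isIdentifier("1x"));   // false: a digit may not start an identifier
        System.out.println(isIdentifier("変数")); // true: Han characters count as letters
    }
}
```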
I see no issue with allowing numbers in different bases. And of course the fractional part would also be in that base. It would be weird if the digits left of the point were in hex and the digits right of it were in decimal.
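Here is a small sketch of how I read a hex floating-point literal: both sides of the point are hex digits, and the `p` exponent is a power of two. The helper names `hexFraction` and `hexFloat` are invented for this example and assume the input contains only valid hex digits.

```java
// Hypothetical helper that rebuilds the value of a hex float from its parts.
class HexFloatDemo {

    /** Value of the hex digits after the point, e.g. "8" is 8/16 = 0.5. Assumes valid hex digits. */
    static double hexFraction(String digits) {
        double value = 0.0;
        double weight = 1.0 / 16.0; // first digit after the point is worth 1/16
        for (char c : digits.toCharArray()) {
            value += Character.digit(c, 16) * weight;
            weight /= 16.0;
        }
        return value;
    }

    /** (integerPart + hex fraction) scaled by 2^binaryExponent. */
    static double hexFloat(long integerPart, String fractionDigits, int binaryExponent) {
        return Math.scalb(integerPart + hexFraction(fractionDigits), binaryExponent);
    }

    public static void main(String[] args) {
        // 0x1.8p1 is (1 + 8/16) * 2^1 = 3.0
        System.out.println(hexFloat(1, "8", 1) == 0x1.8p1); // true
        // 0xA.8p0 is (10 + 8/16) * 2^0 = 10.5
        System.out.println(hexFloat(0xA, "8", 0));          // 10.5
        // The binary exponent lets the literal spell out the stored bits exactly.
        System.out.println(Double.toHexString(0.1));        // 0x1.999999999999ap-4
    }
}
```

The last line shows why a binary exponent is handy: the hex form of 0.1 is an exact description of the double, which no short decimal literal can be.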
I have never heard of "Java TCKs" but after a little research, I understand them better. Just out of curiosity, what would passing the Java TCKs do for me or my users?
Certification will lead to the recognition of your product as an enterprise-ready technology, allowing interoperability and intercompatibilitization with all Oracle Java products and libraries. In today's complex and increasingly specialized economy, a promise of compatibility is not sufficient. The Oracle Java Compatibility Process™ provides you with the credentials you need to succeed in acquiring market share.
I'm curious, are you building your own parser from scratch? There are several existing libraries out there which can work with Java source code either at the bytecode or syntax-tree level.
Depending on your purpose, it might be a whole lot easier to re-use an existing implementation and focus on whatever you're planning to use this parser for...
although you're probably right that a smarter parser should be able to distinguish between both cases. But then the next guy comes along and complains that he can't use a colon as a valid character in his variable names...
Bear in mind that this parser's job is only for syntax highlighting and, later on, refactoring. Hmmm... maybe highlighting literals would be useful. I'll have to think on that.