by rtpg 4265 days ago
The thing is, if you build for Unicode support from the start, these conversations don't need to be had. The problem is that not enough people treat text as a black box from the start (I can understand unwillingness to support bigger things like RTL).
2 comments

Okay... let's go into this... how are the strings in excel encoded anyway?

I'd be willing to bet money that at least some of the formats in question aren't UTF-8, they are likely ASCII encoded against a character set or code page.

Then you have to read that codepage, and convert the necessary characters to their Unicode equivalents, and from there do you downcode to utf-8?
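
To make those two steps concrete, here is a minimal Python sketch: decode the raw bytes with the code page the file declares, then encode the resulting text as UTF-8. The function name `to_utf8`, the sample byte, and the two code pages are assumptions picked for illustration, not anything read from a real workbook.

```python
def to_utf8(raw: bytes, codepage: str) -> bytes:
    """Decode bytes from a legacy code page, then re-encode them as UTF-8."""
    text = raw.decode(codepage)   # step 1: code-page bytes -> Unicode text
    return text.encode("utf-8")   # step 2: Unicode text -> UTF-8 bytes


raw = b"\xe9"  # one byte whose meaning depends on the code page

# The same byte is a different character under each code page, which is
# why the code page has to be read from the file rather than guessed.
latin = to_utf8(raw, "cp1252")     # U+00E9, Latin small letter e with acute
cyrillic = to_utf8(raw, "cp1251")  # U+0439, Cyrillic small letter short i
print(latin, cyrillic)
```

The conversion itself is two calls; the hard part the comment points at is knowing which code page applies to which string.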

Does the language this library is written in support that translation? Are there modules to do that? Is the license for those module(s) necessarily compatible?

Who's going to go through the different document versions to confirm, and adjust for the various encodings for non-ascii characters?

It's not as simple as saying "don't choke on unicode".

> how are the strings in excel encoded anyway?

Length-prefixed byte arrays encoded using various code pages. There are a small number that excel uses: https://github.com/SheetJS/js-codepage/blob/master/excel.csv (the columns are CP#, mapping, single/double-byte)
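
As a rough illustration of what reading such a string involves, here is a Python sketch. It assumes a one-byte length prefix that counts bytes and a code page supplied by the caller; both are simplifications for the example, and the real record layouts differ between format versions. `read_prefixed_string` is a hypothetical helper, not part of any library mentioned here.

```python
import struct


def read_prefixed_string(buf: bytes, offset: int = 0, codepage: str = "cp1252"):
    """Read one length-prefixed string and return (text, next_offset).

    Assumes a 1-byte length prefix counting bytes (an illustrative layout).
    """
    (length,) = struct.unpack_from("<B", buf, offset)
    start = offset + 1
    end = start + length
    if end > len(buf):
        raise ValueError("truncated string record")
    return buf[start:end].decode(codepage), end


# Two strings back to back: 4 bytes of cp1252 text, then 2 bytes of ASCII.
record = bytes([4]) + b"caf\xe9" + bytes([2]) + b"ok"
first, pos = read_prefixed_string(record)
second, pos = read_prefixed_string(record, pos)
print(pos == len(record))  # the whole buffer was consumed
```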

> Does the language this library is written in support that translation? Are there modules to do that? Is the license for those module(s) necessarily compatible?

If we can put together an Apache2-licensed module in JS in an afternoon (https://github.com/SheetJS/js-codepage) it can be done in python.

> Who's going to go through the different document versions to confirm, and adjust for the various encodings for non-ascii characters?

Someone already did that: https://github.com/SheetJS/test_files/tree/master/biff5 has artifacts for every language type

> If we can put together an Apache2-licensed module in JS in an afternoon (https://github.com/SheetJS/js-codepage) it can be done in python.

I thought Python 2 was Unicode-unfriendly. So not as easy as JS.

> Does the language this library is written in support that translation? Are there modules to do that? Is the license for those module(s) necessarily compatible?

It's written in Python, which comes with support for pretty much every major encoding¹ out of the box, so yes.
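
One quick way to check that claim for a given set of code pages is to ask the standard library's codec registry. The sketch below assumes the codec is named `cp` plus the code page number, which holds for many Windows and DOS code pages but not for every encoding (some go by other names in Python). The numbers are illustrative picks, and `find_codecs` is a hypothetical helper.

```python
import codecs


def find_codecs(codepage_numbers):
    """Map each code page number to Python's codec name, or None if absent."""
    found = {}
    for number in codepage_numbers:
        try:
            found[number] = codecs.lookup(f"cp{number}").name
        except LookupError:
            found[number] = None
    return found


# 99999 is a made-up number, included to show what a missing codec looks like.
result = find_codecs([437, 932, 949, 1251, 1252, 99999])
for number, name in result.items():
    print(number, name)
```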

¹: https://hg.python.org/cpython/file/cb94764bf8be/Lib/encoding...

There just isn't a lot of pervasive experience in the development community with multi-language Unicode development. Also, xlrd is fairly old, although I don't know if that tool is part of what limits this to English.

In ten years it might be better.

Joel Spolsky said that ten years ago. The problem is that devs are afraid to learn Unicode. They treat it like learning a foreign language. It's not even a fun problem, like learning a new programming language, so nobody makes time for it. The only people who learn it are those who make it a point of pride to implement something correctly and handle corner cases.

Unicode isn't even hard: Use UTF-8. Don't try to measure the length of a string unless you're rendering that string and measuring the length in screen units like pixels. If you do those two things, that's 90% of the effort of making Unicode-safe software.
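
A small Python sketch of why "length" is slippery: the same visible character can have several different lengths depending on what is being counted. The helper name `length_views` is made up for this example.

```python
import unicodedata


def length_views(s: str) -> dict:
    """Return several different 'lengths' of the same string."""
    return {
        "code_points": len(s),                  # what len() counts in Python 3
        "utf8_bytes": len(s.encode("utf-8")),   # storage size in UTF-8
        "nfc_code_points": len(unicodedata.normalize("NFC", s)),
    }


# 'e' followed by U+0301 (combining acute accent): renders as one glyph.
views = length_views("e\u0301")
print(views)  # {'code_points': 2, 'utf8_bytes': 3, 'nfc_code_points': 1}
```

None of the three numbers is the width on screen, which is the only length the comment above says is worth measuring.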

I think both views are valid. Those who don't know how to write Unicode-safe software shouldn't feel shamed into learning Unicode before releasing open source work. Those who already know Unicode should feel happy that they're making other people's lives easier.

But these are file formats that may well not be encoded in UTF-8. The formats already exist; it isn't like he's creating a new spreadsheet format here. Some of them may well be encoded in something that works fine against Unicode/UTF-8, others not so much.

So you write FooToUTF8() and UTF8ToFoo(), where Foo is whatever the encoding is in the external format. Done.

As far as I know, UTF-8 will work 100% of the time, and is almost always the best internal representation for software you write due to how simple and uniform it is. If something is encoded in some other format, you can probably find a conversion function online.
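
For what it's worth, the FooToUTF8()/UTF8ToFoo() pair could look roughly like this in Python, with the standard codecs doing the work. This is a sketch under assumptions: "Foo" is passed as a codec name, cp1251 and cp1252 are arbitrary examples, and errors are left strict so that failures are visible.

```python
def foo_to_utf8(data: bytes, foo: str) -> bytes:
    """Convert bytes in encoding `foo` to UTF-8 bytes."""
    return data.decode(foo).encode("utf-8")


def utf8_to_foo(data: bytes, foo: str) -> bytes:
    """Convert UTF-8 bytes to encoding `foo`; raises if a character doesn't fit."""
    return data.decode("utf-8").encode(foo)


# A Russian greeting stored as cp1251 bytes (sample data for illustration).
original = b"\xcf\xf0\xe8\xe2\xe5\xf2"
round_trip = utf8_to_foo(foo_to_utf8(original, "cp1251"), "cp1251")
print(round_trip == original)

# Going from UTF-8 back to a legacy code page is not always possible:
# cp1252 has no way to represent these two CJK characters.
try:
    utf8_to_foo("\u65e5\u672c".encode("utf-8"), "cp1252")
    representable = True
except UnicodeEncodeError:
    representable = False
print(representable)
```

So the inbound direction is the easy one; the outbound direction needs a decision about what to do with characters the target encoding lacks.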

Okay, so why don't you fork the project, create your simple Foo/UTF8 methods, and confirm that they are the correct Foo/UTF8 methods for each of the document formats supported?

I'm not saying that it's really all that hard, but there are multiple document formats, and versions of those formats. The author obviously didn't need Unicode support, so didn't test for it. I'm sure test cases and a pull request would be welcome.