by tracker1 4270 days ago
Okay... let's go into this... how are the strings in Excel encoded anyway?

I'd be willing to bet money that at least some of the formats in question aren't UTF-8; they are more likely byte strings encoded against a legacy character set or code page.

Then you have to read that code page, convert the necessary characters to their Unicode equivalents, and from there do you encode down to UTF-8?
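For what it's worth, that read / convert / re-encode pipeline might look roughly like this in Python. This is only a minimal sketch: `transcode_to_utf8` is a made-up helper name, and cp1252 as the default is an assumption, since the real code page has to come from the file itself.

```python
def transcode_to_utf8(raw: bytes, codepage: str = "cp1252") -> bytes:
    """Decode code-page bytes to Unicode text, then re-encode as UTF-8.

    `codepage` is an assumed default; a real reader would take it from
    the document being parsed.
    """
    text = raw.decode(codepage)   # code page bytes -> Unicode code points
    return text.encode("utf-8")   # Unicode code points -> UTF-8 bytes


# 0xE9 is "é" in cp1252; in UTF-8 it becomes the two bytes 0xC3 0xA9.
print(transcode_to_utf8(b"caf\xe9"))
```

The same bytes decode to something else entirely under another code page (0xE9 is a Cyrillic letter in cp1251), which is why knowing the code page matters more than the conversion itself.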

Does the language this library is written in support that translation? Are there modules to do that? Is the license for those module(s) necessarily compatible?

Who's going to go through the different document versions to confirm and adjust for the various encodings of non-ASCII characters?

It's not as simple as saying "don't choke on Unicode".

2 comments

> how are the strings in Excel encoded anyway?

Length-prefixed byte arrays encoded using various code pages. Excel only uses a small number of them: https://github.com/SheetJS/js-codepage/blob/master/excel.csv (the columns are CP#, mapping, single/double-byte)
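To make "length-prefixed byte array" concrete, here is a rough Python sketch of pulling one such string out of a buffer. The helper name `read_prefixed_string`, the 2-byte little-endian prefix, and the idea that the length counts bytes are all assumptions for illustration; the actual record layouts differ between format versions and are not reproduced here.

```python
import struct


def read_prefixed_string(buf: bytes, offset: int = 0,
                         codepage: str = "cp1252",
                         prefix_fmt: str = "<H"):
    """Read one length-prefixed string and return (text, next_offset).

    Assumptions (not taken from any spec): the prefix is an unsigned
    little-endian integer described by `prefix_fmt`, and it counts bytes.
    """
    prefix_size = struct.calcsize(prefix_fmt)
    (length,) = struct.unpack_from(prefix_fmt, buf, offset)
    start = offset + prefix_size
    end = start + length
    if end > len(buf):
        raise ValueError("string runs past the end of the buffer")
    # Decode the payload bytes using the document's code page.
    return buf[start:end].decode(codepage), end


# 2-byte length (4), then four cp1252 bytes, then unrelated trailing data.
text, next_offset = read_prefixed_string(b"\x04\x00caf\xe9rest")
print(text, next_offset)
```

Returning the next offset lets a caller walk a run of consecutive strings without re-scanning the buffer.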

> Does the language this library is written in support that translation? Are there modules to do that? Is the license for those module(s) necessarily compatible?

If we can put together an Apache2-licensed module in JS in an afternoon (https://github.com/SheetJS/js-codepage) it can be done in Python.

> Who's going to go through the different document versions to confirm and adjust for the various encodings of non-ASCII characters?

Someone already did that: https://github.com/SheetJS/test_files/tree/master/biff5 has artifacts for every language type

> If we can put together an Apache2-licensed module in JS in an afternoon (https://github.com/SheetJS/js-codepage) it can be done in Python.

I thought Python 2 was Unicode-unfriendly. So not as easy as JS.

> Does the language this library is written in support that translation? Are there modules to do that? Is the license for those module(s) necessarily compatible?

It's written in Python, which comes with support for pretty much every major encoding¹ out of the box, so yes.

¹: https://hg.python.org/cpython/file/cb94764bf8be/Lib/encoding...
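A quick way to check that claim for a given code page number, as a hedged sketch: `codec_for_codepage` is a hypothetical helper, and it assumes the codec is registered under the `cp<number>` name, which holds for many Windows code pages but not for every code page a spreadsheet might reference.

```python
import codecs


def codec_for_codepage(cp_number: int):
    """Return the name of the stdlib codec for a code page number, or None.

    Assumes the codec is reachable as "cp<number>"; code pages that Python
    registers under some other name will come back as None here.
    """
    try:
        return codecs.lookup(f"cp{cp_number}").name
    except LookupError:
        return None


for cp in (1252, 1251, 932, 99999):
    print(cp, codec_for_codepage(cp))
```

Anything that comes back as `None` would need either a different codec name or a hand-built mapping table.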