by falsedan 3212 days ago

> Any benefit you get from using UTF-16 vanishes the moment you need to operate on it like a string, in other words.

So, don't decode to a string, and do all your character manipulation on the bytes.

> A better solution is to allow programmers to specify string encoding and default it to UTF-8.

Absolutely not: the internal representation of a string should be of no interest to a user of your language. The 'best' solution is to represent strings as a list of index lookups into a palette, and to update the palette as new graphemes are seen. This is similar to the approach Perl6 is using[0].

[0]: https://6guts.wordpress.com/2015/12/05/getting-closer-to-chr...
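
To make the palette idea concrete, here is a minimal C sketch. It assumes the input has already been split into graphemes by some library, and the names `palette` and `palette_intern` are made up for illustration. It is not how Perl 6 implements this, only one way to read "index lookups into a palette".

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical palette: each distinct grapheme (stored as its UTF-8 bytes)
 * gets a small integer index the first time it is seen. */
typedef struct {
    char **graphemes;
    size_t count;
    size_t capacity;
} palette;

/* Returns the index of grapheme g, adding it to the palette if it is new.
 * Returns -1 if allocation fails. Linear search keeps the sketch short. */
int32_t palette_intern(palette *p, const char *g) {
    for (size_t i = 0; i < p->count; i++) {
        if (strcmp(p->graphemes[i], g) == 0) return (int32_t)i;
    }
    if (p->count == p->capacity) {
        size_t cap = p->capacity ? p->capacity * 2 : 8;
        char **grown = realloc(p->graphemes, cap * sizeof *grown);
        if (grown == NULL) return -1;
        p->graphemes = grown;
        p->capacity = cap;
    }
    size_t n = strlen(g) + 1;
    char *copy = malloc(n);
    if (copy == NULL) return -1;
    memcpy(copy, g, n);
    p->graphemes[p->count] = copy;
    return (int32_t)p->count++;
}
```

A string is then an array of the returned indices, so indexing by grapheme takes constant time no matter how many bytes each grapheme occupies.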

dotancohen 3212 days ago

> So, don't decode to a string, and do all your character manipulation on the bytes.

WHAT?!? I suppose that you've only ever worked with Latin characters. Please show a code example of changing European to African in this sentence in your language of choice, working on the bytes in any multi-byte encoding:

מהי מהירות האווירית של סנונית ארופאית ללא משא?‏

Yes, that is a Hebrew Monty Python quote. Now try it with a smiley somewhere in the string (HN filtered out my attempt to post the string with a smiley).

Is each application to maintain its own dictionary of code points? If the map is to be in a library, then why not have it in the language itself?

falsedan 3212 days ago

I don't understand your complaints. You clearly have some task you have in mind that you wish to perform: why not tell me what it is?

> Please show a code example of changing European to African in this sentence in your language of choice, working on the bytes in any multi-byte encoding:

> מהי מהירות האווירית של סנונית ארופאית ללא משא?‏

I don't see the string 'European' in that sentence; it seems to be composed solely of Hebrew characters.

edit to attempt to answer your question:

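    #include <stddef.h>

    typedef struct {
        size_t start;  /* byte offset of the first matched byte */
        size_t end;    /* byte offset one past the last matched byte */
    } match;

    /* Byte-wise substring search: returns 1 and fills *m on a match, else 0. */
    int findsn(const char* str, const char* substr, match* m) {
        for( size_t c_i = 0; str[c_i] != '\0'; c_i++ ) {
            size_t s_i = 0;
            while( substr[s_i] != '\0' && str[c_i + s_i] == substr[s_i] ) s_i++;
            if( substr[s_i] != '\0' ) continue;  /* mismatch: try the next offset */
            m->start = c_i;
            m->end = c_i + s_i;
            return 1;
        }
        return 0;
    }

    /* Replaces bytes [start, end) of str with rpl; definition not shown. */
    void splicesn(char* str, size_t start, size_t end, const char* rpl);

    char* replacesn(char* str, const char* needle, const char* rpl) {
        match m;
        if( findsn(str, needle, &m) ) {
            splicesn(str, m.start, m.end, rpl);
        }
        return str;
    }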

splicesn should be obvious, and you normalise your strings before calling replacesn. This is just me crappily re-implementing a fraction of the wchar API without checking MSDN.
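
Since splicesn is left undefined, here is one possible sketch of it. It assumes a variant that allocates a new buffer instead of editing in place (the replacement may be longer than the match), and it leans on `strstr` for the search. The names `splice_bytes` and `replace_first_bytes` are hypothetical, not part of any API.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a newly allocated copy of str with the bytes
 * in [start, end) replaced by rpl. The caller frees the result.
 * Returns NULL if allocation fails. */
char *splice_bytes(const char *str, size_t start, size_t end, const char *rpl) {
    size_t len = strlen(str);
    size_t rpl_len = strlen(rpl);
    assert(start <= end && end <= len);

    char *out = malloc(len - (end - start) + rpl_len + 1);
    if (out == NULL) return NULL;

    memcpy(out, str, start);                                  /* bytes before the match */
    memcpy(out + start, rpl, rpl_len);                        /* the replacement */
    memcpy(out + start + rpl_len, str + end, len - end + 1);  /* tail plus the NUL */
    return out;
}

/* Hypothetical helper: replaces the first byte-wise occurrence of needle.
 * Returns a plain copy of str when needle does not occur. */
char *replace_first_bytes(const char *str, const char *needle, const char *rpl) {
    const char *hit = strstr(str, needle);
    if (hit == NULL) return splice_bytes(str, 0, 0, "");
    size_t start = (size_t)(hit - str);
    return splice_bytes(str, start, start + strlen(needle), rpl);
}
```

For the Hebrew sentence, calling `replace_first_bytes` on its UTF-8 bytes with needle "ארופאית" and replacement "אפריקאית" swaps the word without decoding anything. This works because UTF-8 is self-synchronising: a valid UTF-8 needle can only match at a code point boundary. Both strings still need to be in the same normalisation form.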

edit 2:

> Is each application to maintain their own dictionary of code points?

No, you use the system/standard library for composing/decomposing/normalising codepoints.

> If the map is to be in a library, then why not have it in the language itself?

Why not indeed? What a great idea.

dotancohen 3212 days ago

You win on the string replace, that was a bad example. Try a regex replace! But I will also mention that seeing properly indented code with clear identifier names is refreshing where I work!

> Why not indeed? What a great idea.

It sounded to me as though you were arguing that string manipulation functions do not need to be included in modern programming languages. You said: "don't decode to a string, and do all your character manipulation on the bytes."

falsedan 3212 days ago

OK, I see how what I said could be read that way. What I meant was: if the language's internal string representation gives poor performance or resource usage, it is better to avoid it and manipulate the undecoded bytes directly. Most languages let you control when loaded data is converted to strings; simply don't convert it, and, uh, reimplement the stdlib functions you need to work with your preferred encoding.
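
As a small example of a stdlib-style function that works on undecoded bytes: counting code points in UTF-8 needs no decoding at all, because every code point contains exactly one byte that is not a continuation byte (`10xxxxxx`). A sketch, assuming valid, NUL-terminated UTF-8 input; `utf8_count_code_points` is a hypothetical name:

```c
#include <stddef.h>

/* Counts code points by counting every byte that is not a UTF-8
 * continuation byte. Assumes the input is valid UTF-8. */
size_t utf8_count_code_points(const char *bytes) {
    size_t count = 0;
    for (const unsigned char *p = (const unsigned char *)bytes; *p != '\0'; p++) {
        if ((*p & 0xC0) != 0x80) count++;
    }
    return count;
}
```

Note that this counts code points, not graphemes: a base letter followed by a combining mark counts as two.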
