by dotancohen 3207 days ago
> So, don't decode to a string, and do all your character manipulation on the bytes.

WHAT?!? I suppose that you've only ever worked with Latin characters. Please show a code example of changing European to African in this sentence in your language of choice, working on the bytes in any multi-byte encoding:

מהי מהירות האווירית של סנונית ארופאית ללא משא?‏

Yes, that is a Hebrew Monty Python quote. Now try it with a smiley somewhere in the string (HN filtered out my attempt to post the string with a smiley).

Is each application to maintain its own dictionary of code points? If the map is to be in a library, then why not have it in the language itself?


I don't understand your complaints. You clearly have some task in mind that you wish to perform: why not tell me what it is?

> Please show a code example of changing European to African in this sentence in your language of choice, working on the bytes in any multi-byte encoding:

מהי מהירות האווירית של סנונית ארופאית ללא משא?‏

I don't see the string 'European' in that sentence; it seems to consist solely of Hebrew characters.

edit to attempt to answer your question:

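    #include <stddef.h>

    typedef struct {
        size_t start;   /* byte offset of the first matched byte */
        size_t end;     /* byte offset one past the last matched byte */
    } match;

    /* Not shown: replaces bytes [start, end) of str with rpl. */
    void splicesn(char* str, size_t start, size_t end, const char* rpl);

    /* Byte-wise search: returns 1 and fills *m on the first match, else 0. */
    int findsn(const char* str, const char* substr, match* m) {
        for( size_t c_i = 0; str[c_i] != '\0'; c_i++ ) {
            size_t s_i = 0;
            /* The NUL that ends str never equals a substr byte, so this stops in bounds. */
            while( substr[s_i] != '\0' && str[c_i + s_i] == substr[s_i] ) s_i++;
            if( substr[s_i] != '\0' ) continue;   /* mismatch: try the next offset */
            m->start = c_i;
            m->end = c_i + s_i;
            return 1;
        }
        return 0;
    }

    char* replacesn(char* str, const char* needle, const char* rpl) {
        match m;
        if( findsn(str, needle, &m) ) {
            splicesn(str, m.start, m.end, rpl);
        }
        return str;
    }

splicesn should be obvious, and you normalise your strings before calling replacesn. This is just me crappily re-implementing a fraction of the wchar API without checking MSDN.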
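
For anyone who wants the splice spelled out, here is a rough sketch of one way it could look. The names splice_copy and replace_first are made up for illustration and are not from the code above. It also swaps in two simplifications: it returns a freshly allocated copy instead of editing str in place (so the caller's buffer size is never an issue), and it uses the standard strstr for the byte search. It assumes both strings are valid UTF-8 and already normalised the same way.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a newly malloc'd copy of str with the
   bytes in [start, end) replaced by rpl. The caller frees the result.
   Returns NULL if allocation fails. */
char *splice_copy(const char *str, size_t start, size_t end, const char *rpl) {
    size_t len = strlen(str);
    size_t rpl_len = strlen(rpl);
    assert(start <= end && end <= len);

    char *out = malloc(len - (end - start) + rpl_len + 1);
    if (out == NULL) return NULL;

    memcpy(out, str, start);                  /* bytes before the match */
    memcpy(out + start, rpl, rpl_len);        /* the replacement bytes */
    memcpy(out + start + rpl_len, str + end,  /* tail, including the NUL */
           len - end + 1);
    return out;
}

/* Hypothetical helper: replace the first occurrence of needle with rpl.
   strstr compares raw bytes and knows nothing about the encoding.
   If needle is absent, an unmodified copy of str is returned. */
char *replace_first(const char *str, const char *needle, const char *rpl) {
    const char *hit = strstr(str, needle);
    if (hit == NULL) return splice_copy(str, 0, 0, "");
    size_t start = (size_t)(hit - str);
    return splice_copy(str, start, start + strlen(needle), rpl);
}
```

Calling replace_first(sentence, "ארופאית", "אפריקאית") on the UTF-8 bytes of the Hebrew sentence swaps the word without decoding anything. This works because a valid UTF-8 needle can only match at character boundaries, so a multi-byte character such as a smiley elsewhere in the string passes through untouched.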

edit 2:

> Is each application to maintain their own dictionary of code points?

No, you use the system/standard library for composing/decomposing/normalising codepoints.

> If the map is to be in a library, then why not have it in the language itself?

Why not indeed? What a great idea.

You win on the string replace, that was a bad example. Try a regex replace! But I will also mention that seeing properly indented code with clear identifier names is refreshing where I work!

> Why not indeed? What a great idea.

It sounded to me as though you were arguing that string manipulation functions do not need to be included in modern programming languages. You said: "don't decode to a string, and do all your character manipulation on the bytes"

OK, I see how what I said could mean that. What I meant was: if using the language's internal string representation gives poor performance/resource usage, better to avoid it and directly manipulate the undecoded bytes. Most languages allow you to control when loaded data is converted to strings; simply don't convert it, and uh reimplement stdlib functions to work with your preferred encoding.
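
As a small illustration of reimplementing a stdlib-style function over undecoded bytes, here is a sketch of a strlen-like count of code points for UTF-8. The name utf8_length is hypothetical, and the function assumes its input is valid, NUL-terminated UTF-8. It counts code points, not user-perceived characters, so combining sequences still need normalisation first.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count the code points in a NUL-terminated UTF-8
   byte string without decoding it. Assumes the input is valid UTF-8. */
size_t utf8_length(const char *s) {
    assert(s != NULL);
    size_t count = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p != '\0'; p++) {
        /* Continuation bytes have the form 10xxxxxx; every other byte
           starts a new code point. */
        if ((*p & 0xC0) != 0x80) count++;
    }
    return count;
}
```

The loop never builds a decoded string; it only inspects the top two bits of each byte, which is all UTF-8 needs to tell a lead byte from a continuation byte.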