by n1ghtm4n 4723 days ago
Some thoughts after spending ~100 hours with Go.

- Function overloading is a major convenience that you will miss. There are differently named versions of every function and you will call the wrong version with the wrong arguments all the time. The number of functions in the standard library could be reduced by at least 1/4 if they'd got this right. The official FAQ (http://golang.org/doc/faq#overloading) explains that leaving out overloading is "simpler", meaning simpler for them.

- Default parameters are a major convenience that you will miss. Using strings.Replace() to remove some chars from a string? Don't forget to pass the -1 at the end, asshole! The -1 says don't put a limit on the number of replacements. In Python there would be a max=None default parameter and this would never bite anyone.

- No named arguments, because fuck readability.

- Forcing me to handle errors is great. Having 20 different ways to do it is not great. Examples: fmt.Errorf(), fmt.Fprintf(os.Stderr), errors.New(), log.Fatal(), log.Fatalf(), log.Fatalln(), panic/recover...

- Using && and || for logical operators in this day and age is just ridiculous. Why do people keep inventing programming languages as if Python doesn't exist?

- Don't think that just because the Unicode guys invented Go that Unicode is going to be easy. Their solution is not to create an airtight abstraction layer between chars (or "runes" WTF?) and integers. Their solution is to provide almost no abstraction and force you to deal with the inherent integer-ness of all characters. Example:

In Python:

    len("нєℓℓσ")  # 5, because there are 5 chars
In Go:

    len("нєℓℓσ") // 12, because there are 12 bytes
    utf8.RuneCountInString("нєℓℓσ") // 5, plz kill me i am an abomination
tl;dr If you're inventing a programming language for human beings (not UNIX gods), try it out on a group of smart high school students first. It will be a humbling experience.

> between chars (or "runes" WTF?) and integers.

"Runes" were the original name, as implemented in Plan 9 by the same folks, for what the standards committee later decided to call the relatively blasé term "Unicode codepoints"--and which are not quite the same thing as characters.

(In fact, I would say that the notion of a Unicode "character" is ambiguous to the point of uselessness--there are glyphs composed from several codepoints (base glyph + combining accents), which should be treated as one "character"; there are ligatures that hold single codepoints, but which semantically are multiple "characters"; there are stacking languages where one "character", representing a whole word, will be composed together from several codepoint "radicals"; while in other ideographic languages, each pre-composed idea-part is its own "character" and has its own codepoint; and so forth.)

> there are glyphs composed from several codepoints (base glyph + combining accents), which should be treated as one "character"

The solution is to use Normalization Form C (NFC) (which combines accents with characters).

> there are ligatures that hold single codepoints, but which semantically are multiple "characters"

OK, so use Normalization Form KC (NFKC) (which splits ligatures, and combines accents with characters).

You're right that "length" of a unicode string is very ambiguous. Arguably, you shouldn't be able to call "length" without supplying an argument about what you are actually asking.

> Using && and || for logical operators in this day and age is just ridiculous

No it's not, it takes 10 minutes to learn that && means and and || means or (maybe a little longer to get the hang of it properly), and this knowledge transfers to many programming languages.

(This is a little like arguing "we shouldn't use + when English has a perfectly good word 'add'"; symbol reasoning is valuable.)

Plus, && and || are symbols which makes them stand out from variable names. This is, in my opinion, a benefit. Admittedly, though, I prefer {} to block delimiters over begin/end and their like.
That's 10 minutes where I could be... you know, living my life, man. Everybody knows what + does. You don't have to teach a high schooler that "if A or B" means "if either A or B is true". They just get it. But what the heck does "if A || B {}" mean?
Actually, you do have to teach most high schoolers (actually most people) what "A or B" means. The common usage reads that as "A xor B"
A very good point. I still think it's a big win if it makes it 1% easier for new programmers to understand.
I actually think "and" and "or" make it 1% more difficult for new programmers because it doesn't implicitly warn them about things like short-circuit evaluation and whatnot. "and" and "or" have a lot of nuances that experienced programmers take for granted that a new programmer won't know until they're taught.
'+' is damn near universal. '&&' && '||' !universal.
They're not universal, but they're essentially so; pretty much everyone who's used C(++), Java, C#, ... has seen && and ||, and knows what it is. And people who've only used languages with 'and' and 'or' will only take a few minutes to get up to speed (they have the option of spending longer complaining about it if they want).
I don't think the argument is that people can't understand '&&' and '||', but that using 'and' and 'or' is a better choice.
So what do you do about the bitwise and and or operations? How should they be expressed?

Python expresses the bitwise and and or using the & and | characters, the exact same characters as Go.

And that also explains why Go chooses to use && and || for the logical and and or operators.

The code is much more readable if the bitwise operators do not look almost identical to the logical ones.
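
To make the distinction concrete, here is a small Go sketch with both families side by side. The function names bothSet and maskFlags are made up for illustration.

```go
package main

import "fmt"

// bothSet uses the logical operator, which Go defines only on booleans.
func bothSet(a, b bool) bool {
	return a && b
}

// maskFlags uses the bitwise operator, which Go defines only on integers.
func maskFlags(flags, mask uint8) uint8 {
	return flags & mask
}

func main() {
	fmt.Println(bothSet(true, false))  // false
	fmt.Println(maskFlags(0x0D, 0x06)) // 4  (1101 & 0110 = 0100)
	fmt.Println(0x0D | 0x06)           // 15 (1101 | 0110 = 1111)
}
```

Go also rejects `true & false` at compile time, so the two families cannot be mixed up silently.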
> people who've only used languages with 'and' and 'or' will only take a few minutes to get up to speed

No. Cognitive overhead. You pay for it every time you parse these words in your brain. You pay for it by reducing the number of nested/combined clauses that you can parse on the fly.

(This is far from the only readability issue with Go, by the way, and you're right in that it's among the more superficial ones. The language is designed so well in all ways except the one that matters the most, it hurts.)

No, laziness.

&& is pronounced "and" but actually means "shortcircuit left-to-right-evaluated and".

If you're coming from a Pascal (or non-programming) background, you do not assume either left-to-right evaluation order, nor short circuit evaluation.

The cognitive overhead is always there, because whether you like to admit it or not, programming is applied math, and exact meaning is very important;

e.g.:

    if a == 0.0 or b/a > 3 then launch_missile();
Without the "cognitive overhead of knowing guaranteed left-to-right + short circuit", this code is wrong.

The hypothetical "newbie programmer who can write a working program but has cognitive overhead deciphering &&" is a mythical creature that does not actually exist.
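
A rough Go version of the same condition, as a sketch. It uses integers rather than floats on purpose: in Go, integer division by zero panics at run time, while float division by zero just yields an infinity, so the integer form is the one where short-circuiting actually matters. The name shouldLaunch is made up.

```go
package main

import "fmt"

// shouldLaunch relies on || evaluating left to right and skipping the
// right-hand operand when the left one is already true. Without that
// guarantee, b/a would panic whenever a == 0.
func shouldLaunch(a, b int) bool {
	return a == 0 || b/a > 3
}

func main() {
	fmt.Println(shouldLaunch(0, 10)) // true, and b/a is never evaluated
	fmt.Println(shouldLaunch(2, 10)) // true, because 10/2 = 5 > 3
	fmt.Println(shouldLaunch(5, 10)) // false, because 10/5 = 2
}
```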

I agree with you, but FWIW, if you're already used to C, then the cognitive overhead of && and || is probably negligible. Given Go's pedigree, that doesn't seem too surprising.
Thank you for expressing it so eloquently.
They are. Unfortunately, you need to have paid attention at school when they taught you logic operators.
> In Go:

    len("нєℓℓσ") // 12, because there are 12 bytes
    utf8.RuneCountInString("нєℓℓσ") // 5, plz kill me i am an abomination
I'm not sure I understand your objection. Bytes and UTF8 characters are different things, and you can't abstract away the difference. There are also times, perhaps the majority of times, when you will need the byte count of a UTF8 string. That means you need at least two different length functions for strings and they need different names.

Shouldn't UTF8-specific things live in the utf8 namespace? Some programs won't need any string handling, after all, and it would be a waste to include code they never used.

Assuming you can allow the utf8 namespace as sensible, would you feel better if there was a RuneLen() function aliased to RuneCountInString()?

If you are that upset about it, then my suggestion is to explain your rationale and submit a patch[1] to provide the alias. It's not like it would be hard to code. Perhaps you might convince people and get it in the next release.

[1] http://golang.org/doc/contribute.html

I think the majority of the time you want the number of utf8 characters and not the byte count. In fact I have never wanted the byte count. If I did I would expect something like byteLen and Len. Not the other way around. You should be optimizing the common case, not the exception. Obviously I'm not a language designer so perhaps I'm talking out of my ass but I've heard this complaint A LOT.
Have you never had to indicate how many bytes you are sending over a stream, say in the Content-Length of an HTTP response? Have you never put strings into a byte buffer?
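
A minimal sketch of the Content-Length case, assuming the body is held in a Go string. The contentLength helper is hypothetical; the point is only that the header wants bytes, which is what len() already returns.

```go
package main

import (
	"fmt"
	"strconv"
	"unicode/utf8"
)

// contentLength returns the value a Content-Length header needs:
// the number of bytes on the wire, not the number of runes.
func contentLength(body string) string {
	return strconv.Itoa(len(body))
}

func main() {
	body := "нєℓℓσ"
	fmt.Println(contentLength(body))          // 12
	fmt.Println(utf8.RuneCountInString(body)) // 5, which would truncate the response if used as the length
}
```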

But that doesn't matter. Let's say you are correct: when working with strings, you more often want the rune length. It still wouldn't be the right decision, given the other design decisions of Go, because it would have needlessly complicated things with only arguable benefits. Let me show you what I mean.

The len() function works with a whole lot of things: strings, arrays, slices, maps and channels. For the first three, len() returns the number of elements in the underlying array, and for a string those elements are bytes. This is because all three are backed by an array, and so sensibly have similar semantics. It would have violated the principle of least surprise for anyone who knew the language to have one array-backed type not return its element count. Both the language developers and the users of it would have to special-case strings, in code and in their brains.

Now, they could have decided to do it anyway, but then another surprise awaits. What happens when you take a slice of a string? Oh no, more special casing and more complication for everyone.

The Go developers do special-case where doing so would clearly be a win for their users. Consider range, which iterates by runes over a string, potentially moving the index on the underlying array forward by more than 1 on each pass. That is clearly going to be the most common use case the user is going to want and so was worth doing. It also eliminates many of the use cases where getting the length of a string in runes would matter to you. Not all, but a lot.
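
The range behaviour described above can be seen in a short sketch. The runeStarts helper is made up for illustration; it just records the index that range hands back on each pass.

```go
package main

import "fmt"

// runeStarts collects the byte index at which each rune of s begins.
// range decodes one rune per pass, so the index can advance by more than 1.
func runeStarts(s string) []int {
	var starts []int
	for i := range s {
		starts = append(starts, i)
	}
	return starts
}

func main() {
	s := "нєℓℓσ"
	fmt.Println(len(s))        // 12 bytes
	fmt.Println(runeStarts(s)) // [0 2 4 7 10]
	for i, r := range s {
		fmt.Printf("byte %d: %c\n", i, r)
	}
}
```

The loop body runs five times, once per rune, even though len(s) is 12.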

> What happens when you take a slice of a string?

What happens when you slice a unicode string in Go is that it cuts multi-byte characters right in half, unless you get the byte boundaries just right. I know real programmers keep the byte boundaries for all the chars in all their strings in their head at all times, but for people like me this basically makes string slicing unusable for non-ASCII text.

Python somehow magically slices unicode strings without chopping characters in half.

In Python:

    s = "нєℓℓσ"
    s[1:4]  # "єℓℓ"
In Go:

    s[1:4] // "�є"
>Python somehow magically slices unicode strings without chopping characters in half.

You need a byte offset to slice a string, and it's impossible to convert from a Unicode rune offset to a byte offset without parsing the entire string up until that point. I'm not all that familiar with Python, but if the language works as you implied, it is basically doing this behind the scenes in common string processing tasks:

  1. The user uses some kind of pattern matching function or whatever to find where they want to split the string. Python returns a rune index.
  2. The user tells Python to go split apart that string along a rune index. It promptly begins parsing the string all over again until it finds the right byte boundaries.
  3. The language then actually creates the new string in between the byte boundaries.
Sure, a Python implementation could statically optimize this, but... why should it have to in the first place? That's fucking stupid and should be considered a language bug when it could be doing this:

  1. User pattern matches blah blah blah and gets a byte index.
  2. User tells their sane language to split the string apart at the byte index and it just does so.
>I know real programmers keep the byte boundaries for all the chars in all their strings in their head at all times

When the hell would you have to remember the byte or rune boundaries for characters in the first place? Why would you be slicing up a string with magic number indices? If you're getting indices from pattern matching functions, you shouldn't care whether they're in bytes or bits or nibbles, you should just be passing them on to your language's split routines (or whatever else you wanted to do). Unless, of course, you're the one actually writing low-level string processing routines, in which case rune offsets are far less useful than byte offsets for the reason explained above.
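
As a sketch of that workflow in Go: strings.Index returns a byte offset, and slicing with it never cuts a character in half, because the match itself starts and ends on character boundaries. The after helper and the sample input are invented for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// after returns the part of s that follows the first occurrence of sep,
// or "" when sep does not occur. The index is a byte offset and is passed
// straight to the slice expression without any rune arithmetic.
func after(s, sep string) string {
	i := strings.Index(s, sep)
	if i < 0 {
		return ""
	}
	return s[i+len(sep):]
}

func main() {
	fmt.Println(after("ключ=нєℓℓσ", "=")) // нєℓℓσ
	fmt.Println(after("no separator", "=") == "") // true
}
```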

This Python "feature" seems to exist entirely to keep newbies from getting confused when they attempt to slice up strings in their REPL, for I cannot fathom a reason why anyone would write "s[1:4]" in production code. IIRC, Python was designed for pedagogy, so I'm not surprised that it would take on such a pointless implementation cost just to spare teachers from explaining why "s[1:4] gave me question marks"

You're right. You're not very familiar with Python. String slicing with numbered indices is used all the time. And you can slice more than just strings! It's one of the coolest features of Python and you're really missing out if your favorite language doesn't have that.

This might explain why my comments seem like heresy to you. I would point out that the OP is about Python programmers switching to Go.

> This Python "feature" seems to exist entirely to keep newbies from getting confused when they attempt to slice up strings in their REPL, for I cannot fathom a reason why anyone would write "s[1:4]" in production code.

Dealing with a format where data elements are defined to be fixed length in characters that happens to be encoded in Unicode?

> Python somehow magically slices unicode strings without chopping characters in half.

Well, that rather depends on what you mean by "character" and "in half".

    >>> s = u're\u0301sume\u0301'
    >>> print s
    résumé
    >>> len(s)
    8
    >>> for i in xrange(len(s)):
    ...     print s[i]
    ... 
    r
    e
    
    s
    u
    m
    e

    >>> print ' '.join(s)
    r e ́ s u m e ́
> Have you never had to indicate how many bytes you are sending over a stream...

I have and I generally like the syntax to be more explicit (since I so rarely work in bytes): string.getBytes().length

You make all fair points, and I guess it is my opinion that varies then but I think the rule of least surprise would involve returning the rune count for both strings and string slices. Also, wow, range iterates over runes but len returns byte count. That's messed up.

Aside from API, there's a convincing performance argument to have len() return the byte count rather than number of utf8 characters.

The implementation of strings in Go is a 2-word struct containing a pointer to the start of the string and the length (in bytes). Under this implementation, len(s) is O(1) and RuneCountInString(s) is O(n). It makes sense to have the default case also be the fast one, particularly since people appreciate Go for its performance.

Alternatively, you could store the rune-count in the 2-word struct to reverse the above runtimes. However, this is detrimental for the common operation of converting between []byte/string as well as writing a string to a buffer. Both of those operations are a simple memcpy with the actual Go implementation, but would be O(n) using this alternate implementation.

Perhaps you could make it a 3-word struct that contains both byte-length and rune-length; then all strings take up additional memory as well as requiring more overhead when used as function arguments.
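
The two-word layout is easy to check, at least with the standard gc toolchain (other implementations are free to differ, so treat this as a sketch rather than a guarantee). The stringHeaderWords helper is made up for illustration.

```go
package main

import (
	"fmt"
	"unsafe"
)

// stringHeaderWords reports how many machine words a string value occupies,
// not counting the bytes it points to.
func stringHeaderWords() uintptr {
	var s string
	return unsafe.Sizeof(s) / unsafe.Sizeof(uintptr(0))
}

func main() {
	fmt.Println(stringHeaderWords()) // 2 with the standard gc toolchain: pointer + byte length

	a, b := "", "нєℓℓσ"
	// The header size does not depend on the contents, which is why
	// len() can be answered in constant time.
	fmt.Println(unsafe.Sizeof(a) == unsafe.Sizeof(b)) // true
}
```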

Why not a 2-word struct with the rune count instead of the byte count? There's zero performance cost for many strings because the rune count is known at compile time. For the rest, most strings are too short for Big-O analysis to be relevant and I would guess (enlighten me if I'm wrong) that the cost of computing the bounds of each character is negligible on a modern processor. Multi-byte chars in a string are going to be adjacent in memory, adjacent in cache, and therefore trivial for today's not-at-all-instruction-bound CPUs. Again, correct me if I'm wrong.
It is very seldom that you really want to deal with a string as an array of runes. (If you actually do want to, Go makes it fairly easy: just use []rune rather than string.)

Consider a simple string: "école". How many runes does it contain? Possibly five:

    LATIN SMALL LETTER E WITH ACUTE
    LATIN SMALL LETTER C
    LATIN SMALL LETTER O
    LATIN SMALL LETTER L
    LATIN SMALL LETTER E
Possibly six:

    LATIN SMALL LETTER E
    COMBINING ACUTE ACCENT
    LATIN SMALL LETTER C
    LATIN SMALL LETTER O
    LATIN SMALL LETTER L
    LATIN SMALL LETTER E
If you normalize the string you can guarantee you have the first form, but not every glyph can be represented as a single rune.
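
The two spellings can be compared directly in Go with nothing but the standard library. This is only a sketch; the counts helper is made up, and the escapes \u00e9 and \u0301 stand for the precomposed letter and the combining accent listed above.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts returns the rune count and the byte count of s.
func counts(s string) (runes, bytes int) {
	return utf8.RuneCountInString(s), len(s)
}

func main() {
	composed := "\u00e9cole"    // LATIN SMALL LETTER E WITH ACUTE + "cole"
	decomposed := "e\u0301cole" // LATIN SMALL LETTER E + COMBINING ACUTE ACCENT + "cole"

	fmt.Println(counts(composed))       // 5 6
	fmt.Println(counts(decomposed))     // 6 7
	fmt.Println(composed == decomposed) // false, although both render as école
}
```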

Fortunately, you generally don't need to deal with any of this. If you're working with filenames, for example, you really only care about the path separator ('/' or '\' or whatever); everything else is just a bunch of opaque data. You can write a perfectly valid function to split a filename into components without understanding anything about combining characters. When you're dealing with data in this fashion, you rarely if ever care about the number of runes in a string; instead you care about the position of specific runes.

Thank you for the explanation! Converting to a rune slice and back does give me the behavior that I wanted. It still looks butt ugly to me, but at least it works.

In Go:

    fmt.Printf("%s", string([]rune("нєℓℓσ")[1:4]))
    // єℓℓ
In Python:

    print("нєℓℓσ"[1:4])
    # єℓℓ
You almost always need the byte count for protocols, buffers and stuff. For user-interfacing use-cases (e.g. editors) you probably want the glyph or grapheme count, not the rune count. Please remember, runes / code points, graphemes and glyphs are different things:

http://www.icu-project.org/docs/papers/forms_of_unicode/

Have you spent much time in C? Go is directly descended from C, not Python. Its requirements are different.

- If you are passing a number of related arguments, often a struct is a better data structure than default or named parameters.

- I am ambivalent about panic/recover, but have no problem with the rest.

- I do tend to agree that "and" and "or" would be more readable, but it's such a minor issue.

- Regarding your UTF-8 example, I imagine the Go authors believed most Go users would be spending more of their time dealing with bytes than characters. Go is not a language optimized for text manipulation, it is optimized for byte manipulation, like C. This is apparent in the lack of effort put toward optimizing the regular expression engine to date.
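
To illustrate the struct suggestion, here is one possible shape for it. ReplaceOptions and replaceWith are hypothetical names, not anything in the standard library; the sketch leans on zero values to play the role of defaults and on field names to play the role of named arguments.

```go
package main

import (
	"fmt"
	"strings"
)

// ReplaceOptions is a hypothetical options struct. Fields left out of the
// literal keep their zero value, which stands in for a default.
type ReplaceOptions struct {
	Old   string
	New   string
	Limit int // 0 (the zero value) means "no limit"
}

// replaceWith is a hypothetical wrapper around strings.Replace.
func replaceWith(s string, opts ReplaceOptions) string {
	n := opts.Limit
	if n <= 0 {
		n = -1 // strings.Replace treats a negative count as unlimited
	}
	return strings.Replace(s, opts.Old, opts.New, n)
}

func main() {
	fmt.Println(replaceWith("a-b-c", ReplaceOptions{Old: "-"}))                     // abc
	fmt.Println(replaceWith("a-b-c", ReplaceOptions{Old: "-", New: "+", Limit: 1})) // a+b-c
}
```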

> explains that leaving out overloading is "simpler", meaning _simpler for them_.

This also means your program code is simpler, and therefore, faster.

Function overloading usually means virtual method tables, and therefore indirect method calls. Depending on how deep your inheritance / overloading structure is, these vtables can get really messy.

(I had a class in university where we were given a C++ UML class diagram, and told to draw the vtables that resulted when one instance of a subclass was instantiated.)

Function overloading can be accomplished by name mangling at compile time.

  PROGRAMMER SEES        INTERNAL REPRESENTATION
  foo(int a, char b)     foo_int_char
  foo(int x)             foo_int
He did not mean overriding (methods in derived classes), but overloading (functions with the same name). You can resolve the latter at compile time, no indirection needed. For example you can have two overloaded functions println() and println(String).
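
Go itself has no overloading, so the closest sketch is to write by hand the distinct names a compiler with overloading might generate. The names fooInt and fooIntChar mirror the table above and are purely illustrative. Both calls below are ordinary direct calls resolved at compile time, with no table lookup involved.

```go
package main

import "fmt"

// fooInt plays the role of foo(int x).
func fooInt(x int) string {
	return fmt.Sprintf("foo(%d)", x)
}

// fooIntChar plays the role of foo(int a, char b).
func fooIntChar(a int, b rune) string {
	return fmt.Sprintf("foo(%d, %c)", a, b)
}

func main() {
	fmt.Println(fooInt(1))          // foo(1)
	fmt.Println(fooIntChar(1, 'x')) // foo(1, x)
}
```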