Hacker News
by barrkel 1534 days ago
Ruby doesn't have symbols because of AST or VM details.

Ruby has symbols in all probability because Lisp and Smalltalk have symbols.

It could get most of the same practical upside of symbols from interned strings - the important thing is being able to compare using pointer equality and look up hash tables without needing to walk a string. What symbols at the type level do is ensure that these string-like things have already been interned, that is, de-duplicated, when they hit lookup points like member access.
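
For a concrete picture of the pointer-equality point, here is a minimal Java sketch. It uses the JVM's built-in `String.intern()`; the class name `InternDemo` and the helper `samePointer` are made up for illustration, and Java stands in here for whatever a Ruby implementation would do internally.

```java
class InternDemo {
    // Reference comparison: one address check, no walk over the characters.
    static boolean samePointer(String a, String b) {
        return a == b;
    }

    public static void main(String[] args) {
        // Two distinct heap objects with equal contents.
        String a = new String("member_name");
        String b = new String("member_name");

        System.out.println(samePointer(a, b));                   // false: different objects
        System.out.println(a.equals(b));                         // true: contents match
        System.out.println(samePointer(a.intern(), b.intern())); // true: one canonical object
    }
}
```

Once both sides are known to be interned, the lookup point only needs the cheap `==` check, which is the guarantee a symbol type gives you up front.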

But the implementation could do something very similar behind the scenes by setting a bit on interned string values. Besides, symbols aren't enough for the more advanced dynamic language optimization techniques like you see in V8.

7 comments

I'd say that Ruby has symbols because Ruby has mutable strings.

If your strings are immutable and interned, they are as good as symbols; this is why Python does not have symbols.

ECMAScript introduced symbols because JavaScript strings, while immutable, are not necessarily interned. Symbols are much cheaper to compare for equality: you only need to compare the pointers / ids, not actual string bytes.

Lisp has symbols for the same reason: Lisp strings are vectors, which are also mutable.

Lisp has symbols, because they were used in symbolic expressions (s-expressions) as named entities. In the programming language Lisp these symbols serve also as identifiers for functions, variables and other things. Thus a symbol originally had an internal structure made of an association list (a list of keys and values). That association list then had various entries, including a print name -> the thing to print when a symbol gets externalized. Since symbols can serve as function names, these symbols also had functions stored in their association list. Different function types could be stored under different keys.

Since Lisp symbols serve a central role as identifiers and structured objects, they are not like what Ruby uses. Lisp uses symbols also for named interned things, but that is only one purpose.

In Common Lisp symbols have a name, a value, a function, a package and a property list (a list of keys and their values). By default in a call like (mult 1 2 3), the global function will be retrieved from the symbol and the function will be called with the arguments. The property list sometimes will be used by an IDE to store information about the symbol: like where it was defined, what its definition is and similar.

At least in V8 they are last I checked. The symbols feature is a property privacy feature. A symbol can be treated as a private secret owned by a library thus restricting access to a property on a shared object.
This is untrue. Symbols provide no privacy. They provide a mechanism to avoid collisions. You can even ask an object for its symbol properties: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...
JS symbols are the opposite of what you are saying.

If you have five instances of `:foo` in Ruby, you can guarantee they will be IDENTICAL.

If you have five instances of `Symbol('foo')` in JavaScript, you are guaranteed they will be completely DIFFERENT.

JS symbols are like Common Lisp's `gensym` which it uses to guarantee macro variables won't collide with existing variable names.

I feel quite confident that Ruby has Symbols because Strings are mutable, which causes issues when you hold on to something but also give out a reference.
> It could get most of the same practical upside of symbols from interned strings

I don't think so - unless you mean always interning all strings. The point of symbols is you can do a single address comparison. How can you do that if you could have two strings that are the same but have different addresses?

There is also a down-side of symbols - they by definition always escape the compilation unit since they're interned!

Always interning all strings is what Lua does, and the concept of a symbol in Lua is merely a particular string pattern which the parser will recognize. They aren't syntactically identical, you can replace any .field with ["field"] but you can't say `local ["field"] = value`, but there is no distinction in the types.

I get a lot of use out of both of those decisions (immutable strings and string/symbol identity), they work well together, and I'd (much) rather have the problem of string-builders than the problem of tracking references to strings and copying them if I need both the original and revision.

> Always interning all strings is what Lua does

That'd be catastrophic for performance in Ruby - every string allocation would always have to be reified, and would always need to access a shared data structure.

I'll take your word for that. It's the opposite in Lua, which is several times faster than stock Ruby. If your runtime takes it as a given that every string will be interned, there are all sorts of assumptions this enables which mutation invalidates.
> If your runtime takes it as a given that every string will be interned, there are all sorts of assumptions this enables which mutation invalidates.

Yes but if you're regularly creating strings, which is what Ruby web servers do all the time, then your intern table is going to become a white-hot hotspot, contended by all threads all the time.
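
To make the "shared data structure" concrete, here is a rough Java sketch of a hand-rolled intern table. The class `InternTable` is hypothetical and is not how MRI, Lua, or the JVM implement interning; it also never evicts entries, which a real runtime would have to handle.

```java
import java.util.concurrent.ConcurrentHashMap;

class InternTable {
    // One table shared by every thread in the process.
    private final ConcurrentHashMap<String, String> table = new ConcurrentHashMap<>();

    // Returns the canonical instance for s, registering s if it is the first
    // string seen with these contents. Every call hashes s and touches the map.
    String intern(String s) {
        String existing = table.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    int size() {
        return table.size();
    }

    public static void main(String[] args) {
        InternTable t = new InternTable();
        String a = t.intern(new String("GET /index"));
        String b = t.intern(new String("GET /index"));
        System.out.println(a == b);   // true: both resolve to the first instance
        System.out.println(t.size()); // 1
    }
}
```

If every string creation in the program had to go through `intern`, each one would pay for a hash plus a lookup in this single map, which is the cost being argued about here.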

That's one way of stating the problem, sure.

It's more accurate to say that Ruby has mutable strings as part of its semantics, and that can't be changed, and would interact disastrously with immutability.

Languages with immutable strings simply architect a web server around that semantic, the problem you're describing doesn't happen. A mutable string is replaced with (in Lua) a mutable array containing immutable string fragments, which are made into a string using the builtin table.concat.
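
The same pattern can be sketched in Java rather than Lua: collect immutable fragments in a mutable list and join them once at the end. `FragmentBuffer` is a made-up name for illustration, and `String.join` plays the role that `table.concat` plays in Lua.

```java
import java.util.ArrayList;
import java.util.List;

class FragmentBuffer {
    // The list is mutable; the fragments stored in it never change.
    private final List<String> parts = new ArrayList<>();

    FragmentBuffer add(String fragment) {
        parts.add(fragment);
        return this;
    }

    // A single concatenation at the end produces the final immutable string.
    String build() {
        return String.join("", parts);
    }

    public static void main(String[] args) {
        String greeting = new FragmentBuffer().add("Hello, ").add("world").build();
        System.out.println(greeting); // Hello, world
    }
}
```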

Always measure. I use Lua at $job, where it's processing millions of SIP messages (text! Lots of parsing) in real time (it's part of the call processing flow), and over the past seven years or so, I've only had to tune the code once in that time (maybe two, three years ago now?).
Honestly, many more language runtimes should perform a lazy version of interning: have String.equals(String) (or its equivalent in your language), when it determines that two Strings are equal, fix up the String internal representations to share more state and hit the fast path for future String comparisons.

In older Sun/Oracle JVMs, String looked something like

  class String {
    private char[] contents;
    private int length;  // substring() used to share contents with parent string
    private int offset;
  }
Since they switched to memoizing hashCode calculation, it looks something like

  class String {
    private char[] contents;
    private int length;  // Allows Substrings to share contents with parents if they are a prefix of the parent
    private int hashCode;
  }
The fast path for equals() should first check for address equality between the two Strings, and if that fails, check if length and contents are equal, and if that fails, have a fast path for returning false if they both have memoized non-equal hash codes. If both strings compare as equal in the slow path, then one of them should be patched up to save memory and also hit the fast path next time equals() is called on them.

When two Strings compare as equal, you probably want to use some heuristic to determine which one is likely to live longer into the future, and switch the other String to use the former's contents. Maybe in the Java case you'd only enable this optimization if length == contents.length to avoid some trickiness around prefix sharing optimizations. For a mark-sweep-compact garbage collector where lower addressed objects are more likely than not to have been allocated earlier, using the lower addressed contents as a tie breaker is a decent heuristic. Even in cases where address comparison is only as good as a coin flip in guessing age, it's a decent arbitrary tie breaker that in the long run leads to equal strings all converging on sharing the same contents char[].
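
Here is a rough, single-threaded Java sketch of that lazy unification. `LazyString` is a hypothetical class, not the JDK's `String`. It makes three simplifying assumptions: a hash of 0 means "not memoized yet", the memoized-hash shortcut runs before the content comparison instead of after it, and the receiver's array always wins, because Java code cannot compare object addresses to apply the lower-address heuristic.

```java
import java.util.Arrays;

final class LazyString {
    private char[] contents; // may be swapped for an equal array owned by another LazyString
    private int hash;        // 0 means "not computed yet" (sentinel, an assumption of this sketch)

    LazyString(String s) {
        this.contents = s.toCharArray();
    }

    // True once both objects point at the same backing array.
    boolean sharesContentsWith(LazyString other) {
        return contents == other.contents;
    }

    @Override
    public int hashCode() {
        if (hash == 0) {
            hash = Arrays.hashCode(contents);
        }
        return hash;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof LazyString)) return false;
        LazyString other = (LazyString) o;
        if (contents == other.contents) return true;                // fast path: already unified
        if (contents.length != other.contents.length) return false;
        if (hash != 0 && other.hash != 0 && hash != other.hash) return false; // memoized hashes differ
        if (!Arrays.equals(contents, other.contents)) return false; // slow path: walk the chars
        other.contents = this.contents; // patch up: drop the duplicate array, hit the fast path next time
        return true;
    }
}
```

After the first successful slow-path comparison, the two objects share one array, so later `equals` calls between them return at the pointer check.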

Note that maximum String length in Java is 2^31-1 chars, so the sign bit of String.length is free for use as a flag for either keeping track if hashCode is memoized (if you don't want to use a sentinel value to indicate non-memoized hashCode) or if this String's contents should be the preferred version (for instance, if its backing store is in a shared library's read-only data section).
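
The sign-bit trick can be sketched as a few static helpers in Java. `PackedLength` and its method names are invented for illustration; nothing here reflects the actual layout of `java.lang.String`.

```java
final class PackedLength {
    private static final int FLAG_BIT = Integer.MIN_VALUE;    // only the sign bit set
    private static final int LENGTH_MASK = Integer.MAX_VALUE; // the low 31 bits

    // Stores a non-negative length and a one-bit flag in a single int.
    static int pack(int length, boolean flag) {
        if (length < 0) {
            throw new IllegalArgumentException("length must be non-negative");
        }
        return flag ? (length | FLAG_BIT) : length;
    }

    // Recovers the length by masking off the sign bit.
    static int length(int packed) {
        return packed & LENGTH_MASK;
    }

    // The flag is set exactly when the packed value is negative.
    static boolean flag(int packed) {
        return packed < 0;
    }

    public static void main(String[] args) {
        int packed = pack(11, true);
        System.out.println(length(packed)); // 11
        System.out.println(flag(packed));   // true
    }
}
```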

In languages with mutable strings, if maximum string lengths are also 2^31-1 characters (or if String hash codes can be safely truncated to 31 bits, etc.), then a single bit could be found to keep track of which contents arrays are copy-on-write.

> I don't think so - unless you mean always interning all strings. The point of symbols is you can do a single address comparison. How can you do that if you could have two strings that are the same but have different addresses?

You intern all the literals (which includes lexical symbols) and are 99% of the way there.

> and are 99% of the way there

I don't understand - if you're not 100% of the way there then you can't rely on address comparison. 1% of your string comparisons would fail!

That's a better explanation. There is a clear Smalltalk influence in Ruby, especially around the object-oriented aspects of the language. The best example is how the language doesn't call a function but sends a message to an object. And also how everything is an object. Matz talked quite a bit about the various other languages that influenced the design, and Smalltalk and Lisp are part of that list (and Perl).
Prolog also has symbols (called "atoms"). Erlang too (influenced by Prolog).
Erlang has symbols because its strings are ridiculously expensive (and kinda shit), so while it does have immutable strings identifying objects based on that would be ridiculously costly.
Interned strings are fine if you don't have mutable strings, but for one, Ruby does have mutable strings, and for two, it's nice having that syntactic sugar! It makes it clear that some value is programmer-written, or at least programmer-endorsed. I don't use Python much, but I do wish there was an alternative syntax for strings I only plan on using like symbols.
> It could get most of the same practical upside of symbols from interned strings

Aren't symbols and interned strings the same thing? Of course you can get all the upside of symbols by having symbols...?

Yeah—to reference a symbol `:foobar` in MRI's C API, you even call `ID2SYM(rb_intern("foobar"))`