| HN Mirror



	by JoshTriplett 4133 days ago
	Seems like there should be a standard Python mechanism for constructing "atoms" or "symbols" that automatically get commoned up.

2 comments

tlb 4133 days ago

Check out intern(): https://docs.python.org/2/library/functions.html#intern


JoshTriplett 4133 days ago

Appears to be deprecated though.


icebraining 4133 days ago

Not really, it was just moved to the sys module: https://docs.python.org/3.4/library/sys.html#sys.intern
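For anyone who wants to see what that looks like in Python 3, here is a minimal sketch. The variable names and the example string are made up for illustration; the strings are built at runtime so that nothing relies on the compiler merging identical literals.

```python
import sys

# Build two equal strings at runtime (hypothetical example values).
a = "".join(["user", "_", "name"])
b = "".join(["user", "_", "name"])

# sys.intern returns the canonical copy from the interpreter's intern table.
ia = sys.intern(a)
ib = sys.intern(b)

print(ia == ib)  # equal contents
print(ia is ib)  # same object, so an identity check is enough
```

Both calls return the same object, so later comparisons between interned strings can use `is` rather than comparing characters.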


rat87 4133 days ago

I'm pretty sure symbols are not meant to be created from "user" input where the user is untrusted; can't this lead to denial-of-service attacks? The same goes for interning. De-duping doesn't have that risk.


munificent 4133 days ago

Lua has an interesting approach here. In Lua, all strings are interned. If you have "two" strings that consist of the same bytes, you are guaranteed that they have the same address and are the same object. Basically, every time a string is created from some operation, it's looked up in a hash table of the existing strings and if an identical one is found, that gets reused.

However, that hash table stores weak references to those strings. If nothing else refers to a string, the GC can and will remove it from the string table.

This gives you great memory use for strings and optimally fast string comparisons. The cost is that creating a string is probably a bit slower because you have to check the string table for the existing one first.

It's an interesting set of trade-offs. I think it makes a lot of sense for Lua, which uses hash tables for everything, including method dispatch, and where string comparison must be fast. I'm not sure how much sense it would make for other languages.


TheLoneWolfling 4132 days ago

A problem with that approach:

You can discover what internal strings are held in a web application via a timing attack.

Better hope you never hold onto a reference to internal credentials inside the application! (Say... DB username / password? Passwords before they're hashed? Etc.)


jarman 4133 days ago

Depends on symbol implementations and intended usage.

For example, Erlang symbols (atoms) are deeply ingrained in the language, and the VM doesn't even garbage-collect them, so creating symbols from user data basically gives the user a 'crash the VM' button.

On the other hand, if symbols are treated as just another data type, like strings with some optimizations, no such problems should arise.


michaelmior 4133 days ago

I think most JSON structures are unlikely to have user input be used as keys. This is also likely where there would be the most benefit from interning since keys are often repeated many times.
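A small sketch of what interning JSON keys could look like in Python, using the standard `object_pairs_hook` parameter of `json.loads`. The helper name `intern_keys` and the sample documents are made up, and whether this saves memory in practice depends on the decoder and the data, so it is worth measuring first.

```python
import json
import sys


def intern_keys(pairs):
    """object_pairs_hook: rebuild each JSON object with interned keys."""
    return {sys.intern(key): value for key, value in pairs}


# Two separately parsed documents (hypothetical sample data).
doc_a = json.loads('{"id": 1, "name": "a"}', object_pairs_hook=intern_keys)
doc_b = json.loads('{"id": 2, "name": "b"}', object_pairs_hook=intern_keys)

key_a = next(iter(doc_a))
key_b = next(iter(doc_b))
print(key_a is key_b)  # both dicts share one "id" key object
```

Only the keys are interned here; the values, which are more likely to carry user input, are left alone.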
