by jmoiron 4086 days ago
When I wrote the same kind of article in Nov 2011 [1], I came to similar conclusions; ujson was blowing everyone away.

However, after swapping a fairly large and json-intensive production spider over to ujson, we noticed a large increase in memory use.

When I investigated, I discovered that simplejson reused allocated string objects, so when parsing/loading you basically got string compression for repeated string keys.

The effects were pretty large for our dataset, which was all API results from various popular websites and featured lots of lists of things with repeating keys; for many large documents, the loaded in-memory object was sometimes 100 MB with ujson versus 50 MB with simplejson. We ended up switching back because of this.
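To make the key-reuse idea concrete, here is a rough sketch of per-document de-duplication using the standard `json` module as a stand-in parser. This is not simplejson's actual code; `loads_with_key_memo` is a hypothetical helper name, and recent CPython versions of `json` appear to do something similar internally for keys, so treat the hook as illustration only.

```python
import json


def loads_with_key_memo(text):
    """Parse JSON so that equal keys within one document share one str object.

    Hypothetical helper for illustration; not part of simplejson or ujson.
    """
    memo = {}  # lives only for this parse, so it cannot grow without bound

    def hook(pairs):
        # setdefault returns the first string object seen for each key value
        return {memo.setdefault(key, key): value for key, value in pairs}

    return json.loads(text, object_pairs_hook=hook)


doc = '[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 3, "name": "c"}]'
rows = loads_with_key_memo(doc)

# Six key slots across three dicts, but only two distinct key objects.
key_slots = sum(len(row) for row in rows)
distinct_key_objects = len({id(key) for row in rows for key in row})
```

With lists of thousands of similar records, each repeated key costs one pointer per record instead of a whole string object, which is the kind of saving described above.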

[1] http://jmoiron.net/blog/python-serialization/


Hey Jason. That's pretty interesting. I have noticed similar things, but in my case we needed faster loading/unloading in some cases, hence ujson.
I would love to see benchmarks in PyPy as well! I wonder how well a JIT would handle de/serialization.
Seems like there should be a standard Python mechanism for constructing "atoms" or "symbols" that automatically get commoned up.
Appears to be deprecated though.
Not really, it was just moved to the sys module: https://docs.python.org/3.4/library/sys.html#sys.intern
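For reference, a minimal sketch of what `sys.intern` gives you in Python 3. The helper `build_key` is made up for the example; it only exists to build equal strings at run time.

```python
import sys


def build_key():
    # Built at run time so the compiler cannot fold it into one shared constant.
    return "".join(["user", "_", "name"])


plain_a, plain_b = build_key(), build_key()
interned_a = sys.intern(build_key())
interned_b = sys.intern(build_key())

# Equal by value either way.
equal_values = plain_a == plain_b
# After interning, both names refer to the same object.
shared_after_intern = interned_a is interned_b
```

Whether `plain_a is plain_b` holds is an implementation detail, so the sketch does not rely on it; only the interned pair is guaranteed to be shared.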
I'm pretty sure symbols are not meant to be created from "user" input where the user is untrusted. Can't this lead to DDoS attacks? Same thing for interning. De-duping doesn't have that risk.
Lua has an interesting approach here. In Lua, all strings are interned. If you have "two" strings that consist of the same bytes, you are guaranteed that they have the same address and are the same object. Basically, every time a string is created from some operation, it's looked up in a hash table of the existing strings and if an identical one is found, that gets reused.

However, that hash table stores weak references to those strings. If nothing else refers to a string, the GC can and will remove it from the string table.

This gives you great memory use for strings and optimally fast string comparisons. The cost is that creating a string is probably a bit slower because you have to check the string table for the existing one first.

It's an interesting set of trade-offs. I think it makes a lot of sense for Lua which uses hash tables for everything, including method dispatch and where string comparison must be fast. I'm not sure how much sense it would make for other languages.
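A rough Python analogue of the weak string table described above, as a sketch only. Python's `str` cannot be weakly referenced, so this uses a small wrapper class; `Symbol`, `intern_symbol`, and `_table` are names invented for the example, and this is not how Lua implements it internally.

```python
import gc
import weakref


class Symbol:
    """Wrapper around a string; needed because str does not support weakrefs."""

    __slots__ = ("text", "__weakref__")

    def __init__(self, text):
        self.text = text


# Maps text -> Symbol without keeping the Symbol alive on its own.
_table = weakref.WeakValueDictionary()


def intern_symbol(text):
    # Look up an existing symbol first; create one only on a miss.
    sym = _table.get(text)
    if sym is None:
        sym = Symbol(text)
        _table[text] = sym
    return sym


a = intern_symbol("name")
b = intern_symbol("name")
same_object = a is b          # equality check is now a pointer comparison
live_before = len(_table)

del a, b                      # drop the last strong references
gc.collect()                  # entry removal timing is implementation-dependent
live_after = len(_table)
```

The extra lookup in `intern_symbol` is the creation cost mentioned above, and the weak table is what lets unused entries disappear instead of accumulating.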

A problem with that approach:

You can discover what internal strings are held in a web application via a timing attack.

Better hope you never hold onto a reference to internal credentials inside the application! (Say... DB username / password? Passwords before they're hashed? Etc.)

Depends on symbol implementations and intended usage.

For example, Erlang symbols are deeply ingrained into the language, and the VM doesn't even garbage collect them, so creating symbols from user data is basically giving the user a "crash VM" button.

On the other hand, if symbols are treated as just another data type, as a string with some optimizations, no such problems should arise.

I think most JSON structures are unlikely to have user input be used as keys. This is also likely where there would be the most benefit from interning since keys are often repeated many times.
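A minimal sketch of that idea with the standard `json` module: intern only the object keys and leave values alone. `intern_keys` and `load` are hypothetical helper names, and the sample documents are made up.

```python
import json
import sys


def intern_keys(pairs):
    # Called once per JSON object with its (key, value) pairs.
    # Only keys are interned; values, which may be user-supplied, are untouched.
    return {sys.intern(key): value for key, value in pairs}


def load(text):
    return json.loads(text, object_pairs_hook=intern_keys)


first = load('{"screen_name": "alice", "followers": 10}')
second = load('{"screen_name": "bob", "followers": 25}')

# Keys from two separate documents now refer to the same string object.
key_first = next(iter(first))
key_second = next(iter(second))
keys_shared = key_first is key_second
```

If keys can come from untrusted input, the per-document de-duping approach avoids putting attacker-chosen strings into a process-wide table.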