I have been hit by this many times. Do you know of any good resources on how to reduce this overhead? (Choosing other data structures, choosing other string types etc)
There's a whole range of techniques! Many of them are language dependent, and won't be applicable elsewhere.
A blunt instrument is running your code in 32-bit mode, which halves pointer sizes. The downside is it also limits your maximum data size, and blocks the use of the 64-bit instruction sets that generally speed up data processing.
Java has an interesting hybrid mode where it uses 32-bit pointers but runs in 64-bit mode.
Not storing the entire decoded document all at once is usually the recommended approach, via "streaming" parsers that give you one element at a time. This then lets you immediately throw away data that you're done with processing.
The downside is that it makes certain common idioms impossible or difficult. For example, MVC web architectures generally assume that the controller produces a "complete" model object that then gets passed to the view.
Unfortunately, language-level support for efficient data processing is generally lacking. Rust and C++ are okay, but have annoying gaps in their capabilities.
A common trick with something like parsing is to keep the original encoded string to be decoded "as-is", and then simply reference into it using integer indexes. I.e.: don't copy the strings out individually, instead just use "slices" into the original document string.
Most commonly, unique strings are detected using a hash table and not stored separately. This works better with garbage-collecting languages like C#, Java, and JavaScript. With Rust or C++ you have to use reference counting or other tricks.
A blunt instrument is running your code in 32-bit mode, which halves pointer sizes. The downside is it also limits your maximum data size, and blocks the use of the 64-bit instruction sets that generally speed up data processing.
Java has an interesting hybrid mode where it uses 32-bit pointers but runs in 64-bit mode.
Not storing the entire decoded document all at once is usually the recommended approach, via "streaming" parsers that give you one element at a time. This then lets you immediately throw away data that you're done with processing.
The downside is that it makes certain common idioms impossible or difficult. For example, MVC web architectures generally assume that the controller produces a "complete" model object that then gets passed to the view.
Unfortunately, language-level support for efficient data processing is generally lacking. Rust and C++ are okay, but have annoying gaps in their capabilities.
A common trick with something like parsing is to keep the original encoded string to be decoded "as-is", and then simply reference into it using integer indexes. I.e.: don't copy the strings out individually, instead just use "slices" into the original document string.
Most commonly, unique strings are detected using a hash table and not stored separately. This works better with garbage-collecting languages like C#, Java, and JavaScript. With Rust or C++ you have to use reference counting or other tricks.