| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by bluGill 807 days ago

There is a lot in that statement and it deserves challenge. Note two things: there is no universal right answer here, sometimes you are correct, sometimes you are wrong; my responses will be contradictory (and also contradict things I didn't think to write) each project will need to figure out which of the combinatorical points is important their project to come up with the right also. Not doing this is premature pessimization - getting your data structures wrong up front will force slow runtime on you in a way that no micro-optimization (which is what premature optimization is really referring to) can ever fix and getting out will cost a very expensive rewrite.

> 1. Such data structures (or more generally, std::vector<std::string, std::vector<std::string>> or something like that) are the natural way to represent e.g. dictionaries.

They are natural, and easy but that doesn't mean they are the right way.

Often I find with a little thinking that there is a enum key under it all that I can hard code - and the enum key is also type safe so that the compiler can prove my code is correct against some class of bugs in a way a string as key cannot.

Why are your keys and values std::string which (baring small string optimization) will allocate more? Often you can place a maximum length on one of both that is small enough that you can replace std::string with a struct containing a char array (or std::string_view - I still have to support C++14 so I haven't been able to try it) and avoid a lot of allocations.

Failing that, often the correct data structure is a database (I reach for sqlite first, but there are plenty of options with pros and cons) which is fast lookup, allows for more complex structures easially, it persists the data to disk so that next startup I don't have to spend all the time reconstructing all that data (realistically performance of creating that large data dictionary is of far greater concern that getting rid of it).

> 2. This general issue extends to virtually all collections. The idea that you should avoid large collections is not a practical solution to real world problems.

This article is about systems programming. In systems programming you rarely have such a large dictionary in that format. You often will in something like the filesystem, but there the point is to have the data persisted from disk and typically you only read the subset you care about. You may have a copy in cache (and cache invalidation becomes an issue), but the in memory portion of data itself is never in a dictionary of that format.

> 3. An alternative solution would be lazy destruction, but that comes with its own issues, such as a really bad worst-case memory overhead or making RC sufficiently more complex that it's not really a win over tracing GC anymore [1].

That is one option, there are more.

Intentionally leak that whole dictionary. There is no reason to care if all the destructors run for any of the examples you presented and realistically if you are getting rid of the whole thing you are probably in shutdown anyway so who cares if you leak memory? Many systems programs and libraries do this.

Move the data to a different thread that only handles reclaiming memory and then don't worry about it. (in discussion we talk about making that thread idle time priority so it won't affect performance - but in practice all schedulers I'm aware of do this poorly in some way)

Finally, if after reading all that and consideration all your options (including things I didn't think of!) if you still conclude a large dictionary of strings to strings is your correct answer, then you probably should demand good tracing garbage collector. I'm not against using them where they are appropriate - I'm just arguing that if you think about your data a little more you will discover they are not needed nearly as much as it seems - and this time spending thinking is usually worth it because it results in much faster programs anyway (even if there is one large dictionary in the code how many others did you get rid of?).

1 comments

rbehrends 807 days ago

> They are natural, and easy but that doesn't mean they are the right way.

> Often I find with a little thinking that there is a enum key under it all that I can hard code - and the enum key is also type safe so that the compiler can prove my code is correct against some class of bugs in a way a string as key cannot.

There's a fundamental misunderstanding here, I think. By dictionaries I mean language dictionaries, e.g. English -> French. You won't find an underlying "enum key" here. Nor am I sure how an enum would ever get large.

> This article is about systems programming. In systems programming you rarely have such a large dictionary in that format.

First, the article is, but the subthread that started this discussion was more general than that.

Second, even in systems programming, you will commonly have arrays or other collections. Caches are a very common thing in systems programming, after all.

> That is one option, there are more.

It is trivially true that there are alternative options, but all these are workarounds around the problem, i.e. you have to complicate your design or make silent assumptions that may not hold in the future (e.g. when disposing of memory at process exit does no longer work because you now need the code as a library).

link

bluGill 806 days ago

> By dictionaries I mean language dictionaries, e.g. English -> French. You won't find an underlying "enum key"

That is a niche case not a common one though. There are a lot of niches each either different requirements.

>It is trivially true that there are alternative options, but all these are workarounds around the problem, i.e. you have to complicate your design

This attitude of non-systems programmers is why people argue garbage collection is slow. Garbage collection can be faster, but people who work in garbage collected languages think that solves all problems and so they don't build efficient data structures. Sure they are not leaking memory, but garbage collection is not enough more efficient than reference counted as to make up for thousands of destruction's when the more complex reference counted version only had a couple. Yes the code is more complex, but it is also faster and that is a trade off I'll take.

link

rbehrends 805 days ago

> That is a niche case not a common one though.

It was an example meant as an illustration. The general case of "collection of objects" is hardly niche.

> This attitude of non-systems programmers is why people argue garbage collection is slow.

I started programming writing Z80 assembler in the 1980s, counting T-states in order to make hard realtime code work. I wrote a graphics driver for Atari ST/TT hardware not to soon after. I think I have a pretty good idea what working in a real-time and/or resource-constrained environment means.

> This attitude of non-systems programmers is why people argue garbage collection is slow.

That is an incorrect generalization. In fact, I see plenty of inefficient code in C++ and Rust (e.g. because a lot of the workarounds for not having GC require additional copying).

> Sure they are not leaking memory, but garbage collection is not enough more efficient than reference counted as to make up for thousands of destruction's when the more complex reference counted version only had a couple.

This is some really unclear statement. If you're trying to make a connection between absence of (tracing) GC and having value types, they are not inherently related. You can have tracing GC and value types (e.g. C#) or reference counting and a lack of value types (e.g. Python).

What is true is that in general memory allocation is relatively expensive, so you want to avoid unnecessarily allocating objects, but that's true regardless of whether you use malloc()/free() or a garbage collector and the strategies for dealing with that are the same in both cases.

> Yes the code is more complex, but it is also faster and that is a trade off I'll take.

Again, this is an untrue generalization.

link

neonsunset 805 days ago

C# has structs with generics/explicit layout/object references/fixed buffers/etc. and byrefs which are special pointers that can point to managed memory, stack or unmanaged one yet are still tracked by GC despite allowing pointer arithmetic on them, and regular C pointers yet successfully uses GC.

Much damage has been dealt to non-JVM languages by stereotypes and perception that the way JVM languages work is the way all GC languages work.

link