Hacker News new | ask | show | jobs
by salted-fry 930 days ago
I remembered where some of my old files are and re-tested; forward-gron was "only" about 7GB for the 15MB file. gron -u was the real killer, clocking in around 53GB.
1 comments

Yeah that's a bug, no amount of buffering can justify that amount of memory.

And gron -u in theory should use less memory than gron-ifying a JSON, as you just have to fill a data structure in memory as you go.

> as you just have to fill a data structure in memory as you go

You don't know the size, shape, or type of any of the levels in the data structure until you get to a line specifying one part of it. If you did, yep, it would be trivial!

But you do: if the first line is `users[15].name.family_name = "Foo"`, all you need to know is there: there is an array of users, each is a map containing a field called name, which is a map with a field called family_name.

If users[14] is a string, or there are 1500 users, the amount of memory usage to ungrok that line is exactly the same. Prove me wrong, I can't think of any way it would not be trivial, provided one uses the correct datastructures.

> there is an array of users

How big is it? All you know at this point is that it's at least 16 entries long. If the next line starts with `users[150]`, now it's 151 entries. Next line might make it 2000 entries long. You have no idea until you see the line.

> each is a map

But a map of what? `string -> object`? Ok, the next line is `users[15].flange[15]` which means your map is now `string -> (object|array)`.

Then the next line is `users[15].age = 15` and you've got `string -> (object|array|int)`. Each line can change what you've got and in Go this isn't a trivial thing to handle without resorting to `interface{}` (or `any`) all over the show and reflection to handle the management of the data structures.

> Prove me wrong, I can't think of any way it would not be trivial, provided one uses the correct datastructures.

All I can suggest is that you try to build `ungron` in Go and have it correctly handle disordered input. If you find a better way of doing it, I'd be happy to hear about it because I spent several months fighting Go in 2021-22 trying to optimise this without success.