Hacker News
by halayli 234 days ago
I don't know what kind of data you are dealing with, but it's illogical and against all best practices to have this many keys in a single object. It's equivalent to saying that having tables with 65k columns is very common.

On the other hand, most database decisions are about finding the sweet-spot compromise tailored to the common use case they are aiming for, but your comment sounds like you are expecting a magic trick.

Every pathological case you can imagine is something someone somewhere has done.

Sticking data into the keys is definitely a thing I've seen.

One I've done personally is dump large portions of a Redis DB into a JSON object. I could guarantee for my use case it would fit into the relevant memory and resource constraints but I would also have been able to guarantee it would exceed 64K keys by over an order of magnitude. "Best practices" didn't matter to me because this wasn't an API call result or something.

There are other things like this you'll find in the wild. Certainly some sort of "keyed by user" dump value is not unheard of and you can easily have more than 64K users, and there's nothing a priori wrong with that. It may be a bad solution for some specific reason, and I think it often is, but it is not automatically a priori wrong. I've written streaming support for both directions, so while JSON may not be optimal it is not necessarily a guarantee of badness. Plus with the computers we have nowadays sometimes "just deserialize the 1GB of JSON into RAM" is a perfectly valid solution for some case. You don't want to do that a thousand times per second, but not every problem is a "thousand times per second" problem.

Redis is a good point. I've made MANY maps with more than 64k keys there in the past, some up to half a million (and likely more if we hadn't rearchitected before we got bigger).
Re: storing data in keys

FoundationDB makes extensive use of this pattern, sometimes with no data on the key at all.

You seem to be assuming that a JSON object is a "struct" with a fixed set of application-defined keys. Very often it can also be used as a "map". So the number of keys is essentially unbounded and just depends on the size of the data.
Erm, yes, structs seem to be the use case this is consciously and very deliberately aiming at:

SICK: Streams of Independent Constant Keys

And "maps" seem to be a use case it is deliberately not aiming at.

Isn't the example in the readme just a smaller map-style object instead of a larger one?
Let's say you have a localization map: the keys are the localization key and the values are the localized string. 65k is a lot but it's not out of the question.

You could store this as two columnar arrays but that is annoying and hardly anyone does that.
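For illustration, the two-columnar-arrays layout could look roughly like this in Python. This is only a sketch: the `to_columns` / `from_columns` helpers and the `"keys"` / `"values"` field names are made up for the example and do not come from any library discussed here.

```python
import json

def to_columns(mapping):
    """Turn {key: value} into two parallel arrays (hypothetical layout)."""
    keys = list(mapping)
    return {"keys": keys, "values": [mapping[k] for k in keys]}

def from_columns(columns):
    """Rebuild the map; the two arrays must line up position by position."""
    keys, values = columns["keys"], columns["values"]
    if len(keys) != len(values):
        raise ValueError("keys and values must have the same length")
    return dict(zip(keys, values))

# A tiny localization map standing in for one with tens of thousands of entries.
strings = {"greeting.hello": "Hallo", "greeting.bye": "Tschüss"}
columns = to_columns(strings)
wire = json.dumps(columns)            # an object with exactly two keys
restored = from_columns(json.loads(wire))
```

The serialized object never has more than two keys, but every consumer has to rebuild the map before it can look anything up, which is the annoying part.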

A pattern I've seen is to take something like `{ "users": [{ "id": string, ... }]}` and flatten it into `{ "user_id": { ... } }` so you can deserialize directly into a hashmap. In that case I can see 65k+ keys easily, although for an individual query you would usually limit it.
Hmm... would all the user IDs be known beforehand in this use case?
I wouldn't get worked up about the actual names of things I used here, but there's no difference between having the key contained in the user data versus lifted up to the containing object... every language supports iterating objects by (key, value).

You would do a query like "give me all users with age over 18" or something and return a `{ [id: string]: User }`
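A minimal Python sketch of that flattening, assuming each user record carries a unique string `"id"` (the `key_by_id` helper and the sample fields are hypothetical):

```python
import json

def key_by_id(payload):
    """Lift each user's "id" out of the record and use it as the object key."""
    return {
        user["id"]: {k: v for k, v in user.items() if k != "id"}
        for user in payload["users"]
    }

# Shape before: { "users": [{ "id": ..., ... }] }
payload = json.loads('{"users": [{"id": "u1", "age": 30}, {"id": "u2", "age": 19}]}')

# Shape after: { "<user id>": { ... } }, which loads straight into a dict/hashmap.
by_id = key_by_id(payload)

# Iterating by (key, value) recovers the same information as the list form.
ages = {user_id: user["age"] for user_id, user in by_id.items()}
```

The number of keys in `by_id` equals the number of users returned, so it grows with the data rather than with the schema.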

There is a huge difference between a fixed, constant set of keys vs. the keys being an open-ended set that depends on user data.
Not really, this is a very minor difference that people exploit all the time to minimize the size of serialized data or make it more readable. This is a great example of bikeshedding.
Yes there is if you optimize for the former case.

If that optimization isn't for you, choose a different library.

If that optimization works for your use-case, it can make a huge difference.

> I don't know what kind of data you are dealing with but its illogical and against all best practices to have this many keys in a single object.

The whole point of this project is to parse "huge" JSON documents efficiently. If 65K keys is considered outrageously large, surely you can make do with a regular JSON parser.

> If 65K keys is considered outrageously large

You can split it yourself. If you can't, replace Shorts with Ints in the implementation and it would just work, but I would be very happy to know your use case.

Just bumping the pointer size to cover relatively rare use cases is wasteful. It can be partially mitigated with more tags and tricks, but it still would be wasteful. A tiny chunking layer is easy to implement and I don't see any downsides in that.
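As a rough idea of what such a chunking layer might look like on the caller's side, here is a Python sketch. `MAX_KEYS`, `chunk_object` and `merge_chunks` are hypothetical names, and the 65,535 bound is an assumption based on the 16-bit limit discussed in this thread, not a value taken from the library.

```python
from itertools import islice

MAX_KEYS = 65_535  # assumed per-object limit; check the library's actual bound

def chunk_object(obj, max_keys=MAX_KEYS):
    """Split one large dict into a list of dicts with at most max_keys keys each."""
    items = iter(obj.items())
    chunks = []
    while True:
        chunk = dict(islice(items, max_keys))
        if not chunk:
            return chunks
        chunks.append(chunk)

def merge_chunks(chunks):
    """Inverse of chunk_object: fold the chunks back into one dict."""
    merged = {}
    for chunk in chunks:
        merged.update(chunk)
    return merged

big = {f"user_{i}": i for i in range(200_000)}
chunks = chunk_object(big)          # each chunk stays under the assumed limit
roundtrip = merge_chunks(chunks)    # same keys and values as the original
```

One cost this sketch leaves out: a lookup by key now has to know which chunk to search, for example by scanning the chunks or by assigning keys to chunks with a hash.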

How wasteful?

Presumably 4 bytes dedicated to the keys would be dwarfed by any strings thrown into the dataset.

Regardless, other than complexity, would there be any reason not to support a dynamic key size? You could dedicate the first 2 bits of the key pointer to its length in bytes. 1 byte would work if there are only 64 keys (6 bits left), 2 bytes would give you 16k keys (14 bits), and 3 bytes 4M (22 bits). And if you wanted to, you could use a frequency table to order the pointers so that more frequently used keys get smaller values in the dictionary.
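To make the proposal concrete, here is a Python sketch of a 2-bit length prefix plus frequency ordering. It illustrates the idea in this comment only and is not the library's actual format; `encode_index`, `decode_index` and `assign_indices` are hypothetical names, and big-endian byte order is an arbitrary choice.

```python
from collections import Counter

def encode_index(i):
    """Encode a key index in 1 to 4 bytes; the top 2 bits store (length - 1)."""
    if i < 0:
        raise ValueError("index must be non-negative")
    for extra in range(4):                    # number of bytes after the first
        nbytes = extra + 1
        payload_bits = 8 * nbytes - 2         # 6, 14, 22 or 30 bits
        if i < (1 << payload_bits):
            return ((extra << payload_bits) | i).to_bytes(nbytes, "big")
    raise ValueError("index does not fit in 30 bits")

def decode_index(buf, pos=0):
    """Return (index, next position) for the encoded index starting at pos."""
    nbytes = (buf[pos] >> 6) + 1
    raw = int.from_bytes(buf[pos:pos + nbytes], "big")
    return raw & ((1 << (8 * nbytes - 2)) - 1), pos + nbytes

def assign_indices(key_occurrences):
    """Give the most frequent keys the smallest (shortest-encoding) indices."""
    counts = Counter(key_occurrences)
    return {key: idx for idx, (key, _) in enumerate(counts.most_common())}

table = assign_indices(["hp", "pos", "pos", "pos", "name", "name"])
stream = b"".join(encode_index(table[k]) for k in ["pos", "name", "hp"])
```

The trade-off is that entries are no longer fixed-width, so jumping to the n-th pointer requires decoding the ones before it or keeping a separate offset table.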

Most of the data the library was originally written for consists of small objects and arrays with high levels of duplication (think state of the world in a videogame with tons of slightly varying objects). Pointer sizes really matter.
That's like saying it's illogical to have 65k elements in an array.

What is the difference?

If the limitation affects your use case, you can chunk your structures.

The limitation comes with benefits.

Fair enough. Implementation details matter.

I was just responding to the “X is an absurd way to do JSON” claim, which seemed to single out objects vs. arrays.

Like in this case maybe, but I don’t see a reason to make that general statement.

I do not miss having to use “near” and “far” pointers in 16-bit mode C++ programming!
Data shape often outlives the original intentions.