Really curious…how do you meaningfully overlay any indexes on top of a big-ass JSON file? Technical details are appreciated, and no problem if it’s your secret sauce—just very curious how this is accomplished!
So, a funny thought experiment: what happens when you parse a JSON file? At the same time, you also index by primary key (the field name).
So, I mirror this thinking and have a table be just an object whose keys are the primary keys. Then I simply index all the children by their fields, based on hints from the developer via the index keyword.
So, if you have
record R { public int id; client int owner; int age; index age; }
table<R> rows;
then queries for age can be accelerated by the table.
For example, "iterate rows where age==42" will basically home in on the bucket for age==42. I currently only index clients by hash and integers.
The critical aspect that makes this work is that I monitor all mutations. When a child object has a field mutated, it is removed from all indices and placed into an unknown bucket. Every query also considers the unknown bucket, since the purpose of the index is simply to narrow the field. Once data changes are persisted, the index is updated and items are moved out of the unknown bucket. This works fairly well because the indices are primarily used during the privacy check phase.
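To make the unknown-bucket idea concrete, here is a rough Python sketch. It is not the actual implementation: the `IndexedTable` class and its method names are made up for illustration, it indexes a single field rather than all declared indices, and `commit()` stands in for "data changes are persisted".

```python
from collections import defaultdict


class IndexedTable:
    """Hypothetical sketch: rows keyed by primary key, one indexed field,
    plus an 'unknown' bucket for rows touched since the last commit."""

    def __init__(self, field):
        self.field = field
        self.rows = {}                   # primary key -> row dict
        self.buckets = defaultdict(set)  # field value -> primary keys
        self.indexed_as = {}             # primary key -> value it is filed under
        self.unknown = set()             # primary keys mutated since last commit

    def insert(self, pk, row):
        self.rows[pk] = row
        self.unknown.add(pk)             # not filed in a bucket until commit

    def mutate(self, pk, **changes):
        # Any mutation pulls the row out of the index and into 'unknown'.
        self.rows[pk].update(changes)
        self._unindex(pk)
        self.unknown.add(pk)

    def _unindex(self, pk):
        if pk in self.indexed_as:
            old = self.indexed_as.pop(pk)
            self.buckets[old].discard(pk)
            if not self.buckets[old]:
                del self.buckets[old]

    def commit(self):
        # Stand-in for persistence: refile every unknown row by its current value.
        for pk in self.unknown:
            value = self.rows[pk][self.field]
            self.buckets[value].add(pk)
            self.indexed_as[pk] = value
        self.unknown.clear()

    def where_eq(self, value):
        # The index only narrows the field: candidates are the matching bucket
        # plus everything unknown, and each candidate is still checked.
        candidates = self.buckets.get(value, set()) | self.unknown
        return sorted(pk for pk in candidates
                      if self.rows[pk][self.field] == value)


table = IndexedTable("age")
table.insert(1, {"id": 1, "age": 42})
table.insert(2, {"id": 2, "age": 30})
table.commit()
table.mutate(2, age=42)      # row 2 now sits in the unknown bucket
print(table.where_eq(42))    # [1, 2]
table.commit()               # row 2 is refiled under age == 42
```

The point of the design is that a query never trusts a stale bucket: a mutated row costs one extra check per query until the next commit, and in exchange the index never has to be kept exactly in sync mid-transaction.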
The normal way? You can implement whatever kind of index you like — b-tree index, bitmap index, hash index are all useful and conceptually simple if you're familiar with the backing data structures.
For example, if you want to index a "foreign key" id stored in each "record" in a JSON array of objects, you build a hash table from the FK id values to the JSON array indices of the objects that have that id. It can be as stupid simple as an `fk_index = defaultdict(set)` somewhere in your program, to use a Pythonism.
Now when someone wants JSON objects in that array matching a given FK id, they can just O(1) look in the index to know the position of records that match. Much better than an O(N) scan of every item in the array.
Of course you have to maintain the index as writes to the JSON happen, but that's not bad once you understand how things work. No real secret sauce.
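Spelled out, the `defaultdict(set)` version might look something like this. It is only a sketch: the `owner_id` field name, the sample data, and the helper functions are invented for the example.

```python
import json
from collections import defaultdict

# Sample data; "owner_id" is a made-up foreign key field.
doc = json.loads(
    '[{"id": 1, "owner_id": 7}, {"id": 2, "owner_id": 9}, {"id": 3, "owner_id": 7}]'
)

# Build the index: FK value -> set of array positions holding that value.
fk_index = defaultdict(set)
for pos, rec in enumerate(doc):
    fk_index[rec["owner_id"]].add(pos)


def find_by_fk(fk):
    """Hash lookup instead of scanning every item in the array."""
    return [doc[pos] for pos in sorted(fk_index.get(fk, ()))]


def set_fk(pos, new_fk):
    """Write path: move the position from the old bucket to the new one."""
    old_fk = doc[pos]["owner_id"]
    fk_index[old_fk].discard(pos)
    if not fk_index[old_fk]:
        del fk_index[old_fk]
    doc[pos]["owner_id"] = new_fk
    fk_index[new_fk].add(pos)


def append_record(rec):
    """Write path: a new record gets indexed at its array position."""
    doc.append(rec)
    fk_index[rec["owner_id"]].add(len(doc) - 1)


print(find_by_fk(7))
```

One caveat with indexing by array position: deleting from the middle of the array shifts every later position, so you would either rebuild the affected entries or index by a stable record id instead.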