| HN Mirror


by whilo 2156 days ago

Interesting, honestly speaking we have not thought about time series data a lot yet, but I think we should be able to provide custom indices and extend Datalog with more efficient query primitives, if this is necessary. Can you elaborate a bit? I have used HDF5 binary blobs for tensors of experimental recordings (parameter evolution in spiking neural networks) in Datomic a few years ago and it is definitely possible to integrate external index data structures, but eventually the query engine will need to be aware of how to join them efficiently.

W.r.t. security, our current approach is to shard access rights and encryption on a database level and just provide many databases, one for each user. This is obviously not the most space efficient, but the most general approach. If users can share access keys and data we can also do structural sharing between these instances and factorize further. We envision doing joins potentially over dozens of distributed Datahike instances in a global address space during single queries. Since the indices are amortized data structures it does not make too much sense to encrypt chunks of them for different users as this defeats the optimality guarantees of B+-trees, i.e. you could have very bad scan behaviour over huge ranges of encrypted Datoms. How have you tried to partition the data? This is an interesting problem.

We can also expose the datahike-server query endpoint directly and you can write static checks for access right restrictions. So far we only do this to limit the usage of builtin functions to safe ones, but you could do the same for more complex access controls. Some work in this direction has also been done for Datahike (see https://github.com/theronic/eacl). Doing this openly on the internet will also require a resource model to fend off denial-of-service attacks. Fortunately, Datalog engines can have powerful query planners, and we can restrict our runtime to limited budgets as well.
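To illustrate the kind of static check meant here, a rough Python sketch follows. It is not datahike-server's implementation: the query is modelled as nested Python lists instead of Clojure data, and `SAFE_FUNCTIONS`, `called_functions`, and `check_query` are hypothetical names. It only inspects top-level where clauses, so a real check would also need to descend into `or`, `not`, and rule bodies.

```python
# Hypothetical allowlist; not Datahike's actual set of safe builtins.
SAFE_FUNCTIONS = {"<", ">", "<=", ">=", "="}

def called_functions(where_clauses):
    """Yield the function symbol of each predicate or function clause.

    A data pattern is modelled as ["?e", ":user/name", "?n"].
    A call clause wraps the call in an inner list, e.g. [["<", "?t", 100]],
    mirroring the Datalog form [(< ?t 100)].
    """
    for clause in where_clauses:
        if clause and isinstance(clause[0], list):
            yield clause[0][0]

def check_query(where_clauses, safe=SAFE_FUNCTIONS):
    """Return the disallowed function symbols; an empty list means the query passes."""
    return sorted({f for f in called_functions(where_clauses) if f not in safe})

# A range predicate on a made-up attribute: allowed.
ok_query = [["?e", ":sensor/time", "?t"], [["<", "?t", 100]]]
# A clause that would read a file on the server: rejected.
bad_query = [["?e", ":file/path", "?p"], [["slurp", "?p"], "?content"]]

ok_violations = check_query(ok_query)
bad_violations = check_query(bad_query)
```

Rejecting the query before it reaches the engine keeps the check cheap. A runtime budget would still be needed for queries that are allowed but expensive.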

1 comments

synthc 2156 days ago

For time series data I encoded an [entityId, timestamp, attribute] tuple as a big integer, using an order-preserving mapping to ensure that the datoms are sorted by timestamp. This provided the right functionality (for example, using seek-datoms we could retrieve the datoms with timestamps within some range), but performance was poor. I think a custom index could help a lot here. We also had problems with the database growing too large, and needed to manually shard the database over time.
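A minimal sketch of such an order-preserving packing, in Python rather than Clojure and not the code actually used. The bit widths and the names `encode`, `decode`, and `scan_range` are assumptions for illustration. Each field gets a fixed-width slot, so comparing the packed integers gives the same order as comparing the tuples, and a time range for one entity becomes one contiguous key range.

```python
from bisect import bisect_left, bisect_right

# Assumed field widths; real data may need different sizes.
ENTITY_BITS, TS_BITS, ATTR_BITS = 32, 48, 16

def encode(entity_id, timestamp, attribute):
    """Pack (entity, timestamp, attribute) into one integer, most significant field first."""
    for value, bits in ((entity_id, ENTITY_BITS), (timestamp, TS_BITS), (attribute, ATTR_BITS)):
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{value} does not fit in {bits} bits")
    return (entity_id << (TS_BITS + ATTR_BITS)) | (timestamp << ATTR_BITS) | attribute

def decode(key):
    """Inverse of encode: unpack an integer key into (entity, timestamp, attribute)."""
    attribute = key & ((1 << ATTR_BITS) - 1)
    timestamp = (key >> ATTR_BITS) & ((1 << TS_BITS) - 1)
    entity_id = key >> (TS_BITS + ATTR_BITS)
    return entity_id, timestamp, attribute

def scan_range(sorted_keys, entity_id, t_start, t_end):
    """Return decoded tuples for one entity with t_start <= timestamp <= t_end."""
    lo = encode(entity_id, t_start, 0)
    hi = encode(entity_id, t_end, (1 << ATTR_BITS) - 1)
    i = bisect_left(sorted_keys, lo)
    j = bisect_right(sorted_keys, hi)
    return [decode(k) for k in sorted_keys[i:j]]

# A sorted list stands in for the index that seek-datoms would walk.
keys = sorted(encode(e, t, a) for e, t, a in [(1, 100, 7), (1, 250, 7), (2, 150, 7), (1, 175, 3)])
result = scan_range(keys, 1, 150, 300)
```

With the entity in the most significant position, this layout only serves per-entity range scans. A scan over a time window across all entities would need the timestamp first, which is one reason a dedicated index may be the better fit.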

A Datalog equivalent to TimescaleDB (which extends Postgres with time series optimizations and time-based table partitioning) would be great.

For client access I tried to define access rules based on attributes (similar to how many GraphQL frameworks handle this), and I tried to express this using Datalog rules. For example, users have permission to access :user/items and :items/blabla, so a user X can access [X :user/items Y] and [Y :items/blabla Z]. Some experiments were promising, but it was slow and I did not find a good way to integrate this.
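The attribute-based rule can be read as a reachability computation, sketched below in Python. This is not the original experiment: the datoms, the attribute :items/secret, and the function `accessible_datoms` are made up for illustration. It is also a simplification, because it lets any allowed attribute be followed from any reachable entity. The loop computes the same fixpoint a recursive Datalog rule would.

```python
# Datoms as (entity, attribute, value) triples; all data here is hypothetical.
DATOMS = {
    ("alice", ":user/items", "item-1"),
    ("item-1", ":items/blabla", "note-a"),
    ("item-1", ":items/secret", "hidden"),
    ("bob", ":user/items", "item-2"),
    ("item-2", ":items/blabla", "note-b"),
}

# Attributes a user may traverse, starting from their own entity.
ALLOWED_ATTRIBUTES = {":user/items", ":items/blabla"}

def accessible_datoms(user, datoms, allowed):
    """Collect the datoms reachable from `user` by following allowed attributes only."""
    by_entity = {}
    for e, a, v in datoms:
        by_entity.setdefault(e, []).append((e, a, v))

    visible, seen, frontier = set(), {user}, [user]
    while frontier:
        entity = frontier.pop()
        for e, a, v in by_entity.get(entity, []):
            if a not in allowed:
                continue  # the attribute is not covered by any access rule
            visible.add((e, a, v))
            if v not in seen:  # follow the reference at most once
                seen.add(v)
                frontier.append(v)
    return visible

alice_view = accessible_datoms("alice", DATOMS, ALLOWED_ATTRIBUTES)
```

Here `alice_view` holds only the two datoms on Alice's own path. Recomputing this set for every query is the kind of overhead that makes the approach slow unless the engine can specialize on such rules.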

whilo 2155 days ago

I see, so your problem was that you wanted to scan over all Datoms for one entity over a time period and you would have needed to have an EVAT index? In Datahike it would be fairly simple to add new indices like this.

Yes, access management must not incur a large overhead, which is why many systems have a separate, restricted way to express and track rules. My hunch is that it would still be better to keep it in Datalog and specialize the query engine so that it is fast on these (potentially restricted) rules and relations.