Hacker News
by jakebol 2157 days ago
Almost every (analytic) RDBMS can model sparse arrays. A sparse array is modeled by defining a clustered index on the table's "array" dimensions and a uniqueness constraint on that clustered index. This works well with columnar storage because the data needs to have (and is assumed to naturally have) a total sort order on the dimensions. E.g. Vertica, ClickHouse, BigQuery... all allow you to do this. TileDB allows for efficient range queries through an R-tree-like index on the specified dimensions.

Most real-world data, though, is messy, and defining a uniqueness constraint upfront (upon ingestion) is often limiting. So for practical use cases this gets relaxed to a multi-set rather than a sparse array model for storage, with uniqueness imposed in some way after the fact (if required).
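
To make the distinction concrete, here is a minimal sketch in Python. It is not how any of the systems above are implemented; the class names `SparseArray` and `MultiSetStore` and the "last write wins" rule are assumptions made purely for illustration. A sorted list of coordinate tuples stands in for the clustered index.

```python
import bisect


class SparseArray:
    """Cells kept sorted by coordinate tuple; each coordinate appears at most once."""

    def __init__(self):
        self.coords = []  # sorted coordinate tuples (stand-in for a clustered index)
        self.values = []  # values aligned with coords

    def insert(self, coord, value):
        i = bisect.bisect_left(self.coords, coord)
        if i < len(self.coords) and self.coords[i] == coord:
            # Uniqueness constraint enforced at ingestion time.
            raise KeyError(f"duplicate coordinate {coord}")
        self.coords.insert(i, coord)
        self.values.insert(i, value)


class MultiSetStore(SparseArray):
    """Same sort order, but duplicate coordinates are accepted at ingestion."""

    def insert(self, coord, value):
        # bisect_right keeps equal coordinates in insertion order.
        i = bisect.bisect_right(self.coords, coord)
        self.coords.insert(i, coord)
        self.values.insert(i, value)

    def dedup_last_wins(self):
        """Impose uniqueness after the fact (hypothetical rule: newest value wins)."""
        out = SparseArray()
        for coord, value in zip(self.coords, self.values):
            if out.coords and out.coords[-1] == coord:
                out.values[-1] = value
            else:
                out.coords.append(coord)
                out.values.append(value)
        return out
```

The only difference between the two is when uniqueness is checked: on every insert, or in a separate pass once the messy data has already landed.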

1 comments

I agree that in many use cases of sparse data, uniqueness of the dimensions can't be guaranteed, or you might not want to enforce it. With the recent TileDB 2.0 release we introduced support for duplicates in sparse arrays, which adds support for multi-sets [1].

[1] https://github.com/TileDB-Inc/TileDB/pull/1504

Just to note that the language "duplicates in sparse arrays" doesn't make sense: if you allow duplicates, it is no longer an array by definition.
Arrays (vectors) can have duplicates - don’t you mean a set?
I think the duplicates are in the coordinate dimension, like x[1] having more than one value, rather than x[1] and x[2] having the same value.
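
A tiny Python illustration of the two readings of "duplicates" may help. The data and variable names here are made up for the example and have nothing to do with TileDB's API.

```python
from collections import defaultdict

# Dense array: two different coordinates hold the same value. Always allowed.
x = [7, 7, 3]

# Sparse cells as (coordinate, value) pairs: coordinate 1 occurs twice.
cells = [(1, 7), (1, 9), (2, 7)]

# Group values by coordinate to find coordinates with more than one value.
by_coord = defaultdict(list)
for coord, value in cells:
    by_coord[coord].append(value)

duplicated_coords = sorted(c for c, vs in by_coord.items() if len(vs) > 1)
```

Only the second case breaks the "one value per coordinate" property that makes something an array, which is why it ends up being a multi-set.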