| HN Mirror



	by jakebol 2202 days ago
	Almost every (analytic) RDBMS can model sparse arrays. A sparse array is modeled by defining a clustered index on the table's "array" dimensions and a uniqueness constraint on that clustered index. This works well with columnar storage because the data needs to have (and is assumed to naturally have) a total sort order on the dimensions. E.g. Vertica, ClickHouse, BigQuery... all allow you to do this. TileDB allows for efficient range queries through an R-tree-like index on the specified dimensions. Most real-world data is messy, though, and defining a uniqueness constraint up front (upon ingestion) is often limiting, so for practical use cases this gets relaxed to a multi-set rather than a sparse array model for storage, with uniqueness imposed in some way after the fact (if required).
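A minimal sketch of the modeling described above, using Python's built-in sqlite3 module rather than any of the systems named in the comment. The table names, column names, and the "keep the max value" rule for deduplication are all assumptions made for illustration; a real columnar store would express the same idea with its own DDL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Sparse array: the dimensions (x, y) form the primary key. WITHOUT ROWID makes
# SQLite store rows in primary-key order, i.e. a clustered, unique index.
conn.execute(
    "CREATE TABLE sparse_array (x INTEGER, y INTEGER, value REAL, "
    "PRIMARY KEY (x, y)) WITHOUT ROWID"
)
conn.executemany(
    "INSERT INTO sparse_array VALUES (?, ?, ?)",
    [(1, 1, 0.5), (1, 7, 2.0), (4, 2, 3.5)],
)

# A second value for an existing cell violates the uniqueness constraint.
try:
    conn.execute("INSERT INTO sparse_array VALUES (1, 1, 9.9)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Range query over the dimensions.
in_range = conn.execute(
    "SELECT x, y, value FROM sparse_array "
    "WHERE x BETWEEN 1 AND 3 AND y BETWEEN 0 AND 10 ORDER BY x, y"
).fetchall()

# Relaxed multi-set model (hypothetical table): no uniqueness on ingestion...
conn.execute("CREATE TABLE cells (x INTEGER, y INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO cells VALUES (?, ?, ?)",
    [(1, 1, 0.5), (1, 1, 9.9), (4, 2, 3.5)],
)
# ...and uniqueness imposed after the fact, here by keeping the max per cell
# (an arbitrary rule chosen for this sketch).
deduplicated = conn.execute(
    "SELECT x, y, MAX(value) FROM cells GROUP BY x, y ORDER BY x, y"
).fetchall()
```

The first table rejects a repeated coordinate at insert time, while the second accepts it and leaves the choice of which value wins to query time.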

1 comments

Shelnutt2 2202 days ago

I agree that in many use cases of sparse data, uniqueness of the dimensions can't be guaranteed, or you might not want to enforce it. With the recent TileDB 2.0 release we introduced support for duplicates in sparse arrays, which adds support for multi-sets [1].

[1] https://github.com/TileDB-Inc/TileDB/pull/1504


jakebol 2202 days ago

Just to note that the language "duplicates in sparse arrays" doesn't make sense: if you allow for duplicates, it is no longer an array by definition.


DaiPlusPlus 2202 days ago

Arrays (vectors) can have duplicates - don’t you mean a set?


evanpw 2202 days ago

I think the duplicates are in the coordinate dimension, like x[1] having more than one value, rather than x[1] and x[2] having the same value.
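To make the distinction concrete, here is a small sketch in plain Python. The (coordinate, value) pair representation and the helper name has_duplicate_coordinates are hypothetical, chosen only to illustrate the two cases.

```python
from collections import Counter

# Hypothetical cells stored as (coordinate, value) pairs.
value_duplicates = [(1, "a"), (2, "a")]       # x[1] and x[2] hold the same value
coordinate_duplicates = [(1, "a"), (1, "b")]  # x[1] has more than one value


def has_duplicate_coordinates(cells):
    """Return True if any coordinate appears more than once."""
    counts = Counter(coord for coord, _ in cells)
    return any(n > 1 for n in counts.values())
```

Only the second list stops being a function from coordinates to values, which is the case the array-versus-multi-set objection is about.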
