Hacker News new | ask | show | jobs
by dandermotj 3160 days ago
Sparse heterogeneous data is often the type of data stored in NoSQL dbs. Modelled in a relational way, this produces many tables with many NULL fields, while keeping it in a key value format is neat and tidy.

I'd recommend the paper What Goes Around Comes Around[1], the first paper in Readings in Database Systems[2]

[1] https://scholar.google.com/scholar?cluster=73661829057771494... [2]redbook.io

2 comments

> Sparse heterogeneous data is often the type of data stored in NoSQL dbs.

I still can't imagine what sparse heterogeneous data exists in the world that makes sense to store. Any type of querying or processing requires some kind of structure (even if implicit in the code) which you can just put in different table structures.

You have to make sense of data to process it and that kind of implies a structure, doesn't it? Am I missing some obvious example of heterogeneous data?

Customer Analytical Record / Feature Engineering Store

One customer column, tens of thousands of attribute columns.

If you need everything about a customer it is a single, O(1) fetch operation which makes it perfect for driving chat bots, call centres, websites, operational decisioning engines, dashboards etc. Almost every large company will have one of these.

You can't really do it in relational systems properly because (a) you hit the column limit, (b) often it is sparse i.e. lots of NULLs everywhere, (c) you need this system to be distributed since it often gets a lot of load.

What would the attribute columns consist of? My experience has been with named columns defined individually by humans, of which I've never seen more than a few hundred; how do you get tens of thousands? Are they a different kind of thing?
Most companies who do it purely by humans can easily get into the thousands of attributes. Have seen it many times before where you hit the column limit of a SQL database.

But where you get into tens/hundreds of thousands is when you have machine learning models automatically selecting and storing important features from the data.

Tick (market) data is another good example of this. A given 'Tick' is just an event that can have any of up to thousands of different attributes set (often just a handful).
No.

EAVT is great as an intermediate format but it is absolutely useless to query for since most of the time you are trying to find a set of attributes for a given entity i.e. full table scan.

What you want is a "wide table". One entity column and all the attribute columns to the right. Often with most of the values set to null.

This is the dream use case for MongoDB since it you can ignore sparse values yet when you query it via their drivers it will appear as a wide table. You can't do this at all in PostgreSQL since you will hit a column limit.

> EAVT is great as an intermediate format but it is absolutely useless to query for since most of the time you are trying to find a set of attributes for a given entity i.e. full table scan.

This is what indexes are for. An index on the entity id should avoid any full table scans.

EAVT table with 100 million entities and 10000 attributes = one trillion row table.

And you want to build indexes on half the table ?

Good luck with that.

> Often with most of the values set to null.

Your math is at odds with your own requirements, null values don't need a row.

You clearly don't understand what you're talking about.

Sparcity is an issue for the wide table not the EAVT form.

> You can't do this at all in PostgreSQL since you will hit a column limit.

JSONB is designed for exactly this, isn’t it?

Sure. But MongoDB is far better at scaling, has infinitely better drivers (including Spark) and is about an order of magnitude faster than PostgreSQL for partial updates.

The lack of a Spark driver alone renders PostgreSQL useless for most companies.