Most storage and query efficient database to store 100B rows of data?

Y	Hacker News new \| ask \| show \| jobs

1 points by paperwhite 768 days ago

been looking into options for storing 50-60 billion rows of data in a database:

What I have so far:

- Mongo has a large storage overhead but query performance is good

Postgres is more storage efficient but query performance is poor at this scale

-- Haven't explored Apache yet because of the setup overhead but happy to explore options that are slightly more technical than our current skillset

I'm @harlakshS on Twitter in case anyone of you wants to DM instead. Thanks!

2 comments

redwood 768 days ago

Large storage overhead should be a thing of the past (8+ years ago) with wiredtiger's automatic compression.

link

paperwhite 767 days ago

Overhead comes from the BSON format

Even after the wiredtiger auto-compression, the basic overhead doesn't have a workaround

link

redwood 767 days ago

Can you point out what you mean? Are you worried about memory pressure for the decompressed filesystem cache or storage footprint, or perhaps something else like data returned to your application?

link

icsa 768 days ago

For what kind of data? For what kinds of queries?

If the columns are scalar then consider a column store.

link

paperwhite 767 days ago

We have Data heavy products

Typical use case:

Filtering billions of 'docs' or rows across 50 attributes in any number of combinations

A doc holds data on the techstack of a url - For one of the products. Ever evolving schema (hence why we picked Mongo)

Number of queries are low but the complexity of the queries are high

We've moved object storage from JSON to Parquet already before uploading to Mongo so that's already 1/4th the object storage

Question if something can match the querying performance of Mongo but has better storage efficiency

link

icsa 767 days ago

If the attributes are scalar, I would still suggest a column store that supports null values. Column compression will save you much space and give you excellent OLAP query performance.

As the schema evolves, simply add new columns.

link

paperwhite 767 days ago

Checked

Some attributes are scalar but others are not

There are array of objects in a lot of places. will need to modify the schema to be completely scalar

Will run a local test & see how this goes. Thanks a lot!

link

icsa 767 days ago

If the arrays of objects can conform to a single schema (across all scalar attributes), then make a second table to hold the objects in the arrays.

Now you have a two table schema with (at most) one join in a given query.

Check out Vertica. It does a great job at various forms of compression. In addition, DuckDB is an easy way to get started with efficient OLAP queries.

link