Hacker News new | ask | show | jobs
Most storage and query efficient database to store 100B rows of data?
1 points by paperwhite 768 days ago
been looking into options for storing 50-60 billion rows of data in a database:

What I have so far:

- Mongo has a large storage overhead but query performance is good

Postgres is more storage efficient but query performance is poor at this scale

-- Haven't explored Apache yet because of the setup overhead but happy to explore options that are slightly more technical than our current skillset

I'm @harlakshS on Twitter in case anyone of you wants to DM instead. Thanks!

2 comments

Large storage overhead should be a thing of the past (8+ years ago) with wiredtiger's automatic compression.
Overhead comes from the BSON format

Even after the wiredtiger auto-compression, the basic overhead doesn't have a workaround

Can you point out what you mean? Are you worried about memory pressure for the decompressed filesystem cache or storage footprint, or perhaps something else like data returned to your application?
For what kind of data? For what kinds of queries?

If the columns are scalar then consider a column store.

We have Data heavy products

Typical use case:

Filtering billions of 'docs' or rows across 50 attributes in any number of combinations

A doc holds data on the techstack of a url - For one of the products. Ever evolving schema (hence why we picked Mongo)

--

Number of queries are low but the complexity of the queries are high

We've moved object storage from JSON to Parquet already before uploading to Mongo so that's already 1/4th the object storage

Question if something can match the querying performance of Mongo but has better storage efficiency

If the attributes are scalar, I would still suggest a column store that supports null values. Column compression will save you much space and give you excellent OLAP query performance.

As the schema evolves, simply add new columns.

Checked

Some attributes are scalar but others are not

There are array of objects in a lot of places. will need to modify the schema to be completely scalar

Will run a local test & see how this goes. Thanks a lot!

If the arrays of objects can conform to a single schema (across all scalar attributes), then make a second table to hold the objects in the arrays.

Now you have a two table schema with (at most) one join in a given query.

Check out Vertica. It does a great job at various forms of compression. In addition, DuckDB is an easy way to get started with efficient OLAP queries.