Hacker News
by mikljohansson 2779 days ago
Our whole dataset exported as compressed JSON is likely not much more than 100 TB. The petabytes come from all the index data structures Elasticsearch/Lucene builds, as well as the high replication factor needed to keep up with the query throughput.

We index a lot of NLP and other enrichments on our documents, which adds a lot of storage on top of the base text. And like Karl mentioned, we have lots of small social media documents available for analytics (currently 34B, sometime next year upward of 230B). We also have 7+ billion news, blog and other long-text articles indexed: basically 10 years of all news media from the entire world, online and available for analytics.

We're building the https://fairhair.ai/ data science platform to let other companies access this massive dataset and our compute clusters, and run online search and analytics on top of them, for example to embed analytics over this dataset into their own SaaS products.

1 comment

In my experience, an Elasticsearch index is about triple the size of the data.

    First is the actual JSON data in quasi plain text.
    Second is the _source field, which duplicates the original input object (necessary for reindexing/rebuilding).
    Third is the _all field, which duplicates the JSON data as text (only used for some text searches; better to disable it).
Finally, the index is duplicated to replicas: at least one if you want any redundancy.
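
For what it's worth, here is a rough sketch of a create-index request body that follows the advice above: _all disabled, _source left alone because it is needed for reindexing, and one replica. It assumes a 5.x-era cluster, where mappings are still keyed by a type name. The type name "doc" and the helper build_index_body are made up for illustration, so check the settings against the docs for your version.

```python
import json


def build_index_body(type_name="doc", replicas=1):
    """Build a create-index request body (hypothetical helper, 5.x-era layout)."""
    return {
        "settings": {
            "index": {
                # One replica means two copies of every shard in total.
                "number_of_replicas": replicas,
            }
        },
        "mappings": {
            # type_name is an assumption; 5.x-era mappings are keyed by type.
            type_name: {
                # Drop the catch-all text copy of each document.
                "_all": {"enabled": False},
                # _source is deliberately left at its default (enabled),
                # since it is what reindexing reads from.
            }
        },
    }


body = build_index_body()
print(json.dumps(body, indent=2))
```

The printed JSON is what you would send as the body when creating the index; nothing here talks to a cluster.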

Index compression with LZ4 takes 20 to 30% off; it's a new feature in Elasticsearch v5.0 and on by default.
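
To put numbers on that rule of thumb, here is a back-of-the-envelope sketch. It assumes the three copies per document, the replica count and the compression saving simply multiply, which is a simplification, and that compression applies evenly to everything. The function name, its parameters and the 100 TB input are illustrative, not anyone's measured figures.

```python
TB = 10**12  # decimal terabyte, in bytes


def estimate_on_disk_bytes(raw_json_bytes, copies_per_doc=3, replicas=1,
                           compression_saving=0.25):
    """Rough on-disk footprint under the rule of thumb from this thread.

    copies_per_doc: 3 = JSON data + _source + _all (assumed equal in size).
    replicas: replica copies in addition to the primary.
    compression_saving: fraction removed by compression (0.2 to 0.3 above).
    """
    if replicas < 0:
        raise ValueError("replicas must be >= 0")
    if not 0.0 <= compression_saving < 1.0:
        raise ValueError("compression_saving must be in [0, 1)")
    # Size of one full copy of the index (primary shards only).
    per_copy = raw_json_bytes * copies_per_doc * (1.0 - compression_saving)
    # Primary plus each replica holds a full copy.
    return per_copy * (1 + replicas)


estimate = estimate_on_disk_bytes(100 * TB)
print(f"{estimate / TB:.0f} TB")  # prints "450 TB"
```

With _all disabled (copies_per_doc=2) the same input comes out at 300 TB, which shows how much that one setting matters at this scale.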