by mikljohansson
2779 days ago
Our whole dataset, exported as compressed JSON, is probably not much more than 100TB. The petabytes come from all the index data structures Elasticsearch/Lucene builds, as well as the high replication factor needed to keep up with the query throughput. We also index a lot of NLP and other enrichments on our documents, which adds a lot of storage on top of the base text. And like Karl mentioned, we have lots of small social media documents available for analytics (currently 34B, and upward of 230B sometime next year). We also have 7+ billion news, blog and other long-text articles indexed: basically 10 years of all the world's online news media, available for analytics. We're building the https://fairhair.ai/ data science platform to let other companies access this massive dataset and our compute clusters, and run online search and analytics on top of them, for example to embed analytics over this dataset into their own SaaS products.
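As a rough back-of-envelope sketch of how roughly 100TB of compressed JSON can turn into petabytes on disk: the decompression ratio, index overhead and replica count below are made-up placeholders for illustration, not figures from this thread, and the function name is hypothetical.

```python
def estimate_cluster_storage_tb(compressed_tb, decompression_ratio,
                                index_overhead, replicas):
    """Rough on-disk footprint in TB: one primary copy plus its replicas.

    All factors are illustrative assumptions:
      decompression_ratio: raw JSON size / compressed JSON size
      index_overhead:      indexed size (inverted index, doc values,
                           enrichments) / raw JSON size
      replicas:            number of replica copies per primary shard
    """
    raw_tb = compressed_tb * decompression_ratio
    primary_tb = raw_tb * index_overhead
    return primary_tb * (1 + replicas)


# Placeholder inputs: 100 TB compressed, 5x decompression,
# 2x index overhead, 2 replicas per primary.
total_tb = estimate_cluster_storage_tb(100, 5, 2, 2)
print(f"{total_tb} TB (~{total_tb / 1000:.1f} PB)")
```

With these placeholder factors the multiplication lands in the low petabytes, which is the general shape of the argument above; the real factors may differ substantially.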
|
Index compression with LZ4 takes 20 to 30% off. It's a new feature in Elasticsearch v5.0 and is on by default.
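For reference, the stored-field codec is chosen per index through the `index.codec` setting, where `default` uses LZ4 and `best_compression` trades some stored-field read speed for a smaller index by using DEFLATE. A minimal sketch of building the settings body for a create-index request follows; the helper name is hypothetical, and the index name and HTTP client are left out on purpose.

```python
import json

# Codec values documented for the index.codec setting.
ALLOWED_CODECS = ("default", "best_compression")


def index_settings(codec="default"):
    """Build a create-index request body that selects the index codec."""
    if codec not in ALLOWED_CODECS:
        raise ValueError(f"unsupported codec: {codec!r}")
    return {"settings": {"index": {"codec": codec}}}


# This body would be sent when creating the index (PUT /<index-name>);
# index.codec is a static setting, so it cannot be changed on an open index.
print(json.dumps(index_settings("best_compression"), indent=2))
```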