by rizkeyz 1560 days ago
At this point (2022), I don't consider anything below, say, 40TB to be big (textual) data at all. It can be compressed 40TB -> 10TB (or less), and that fits fine on a single 16TB drive.

For many questions, you won't need all the raw data, so you end up with some form of projection of the data that is maybe 1/10 in size, so 10TB -> 1TB. Heck, if you tune GNU sort a bit, it will blast through that TB quite quickly.
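To make "tune GNU sort a bit" concrete, here is a rough sketch of the knobs that usually matter: a large in-memory buffer (`-S`), several sort threads (`--parallel`), a fast scratch directory (`-T`), and byte-wise comparison via `LC_ALL=C`. The helper names, the 16G buffer, the 8 threads, and the `/tmp` scratch path are assumptions for illustration, not values from the comment above; pick them to match your machine.

```python
import os
import subprocess


def gnu_sort_command(src, dst, buffer_size="16G", threads=8, tmp_dir="/tmp"):
    """Build a GNU sort invocation with the common tuning flags.

    All defaults are illustrative assumptions, not recommendations.
    """
    return [
        "sort",
        "-S", buffer_size,          # main-memory buffer before spilling to disk
        f"--parallel={threads}",    # number of sorts run concurrently
        "-T", tmp_dir,              # where temporary merge files are written
        "-o", dst,                  # output file
        src,
    ]


def run_sort(src, dst, **kwargs):
    """Run the sort with LC_ALL=C so lines are compared as raw bytes."""
    env = dict(os.environ, LC_ALL="C")
    subprocess.run(gnu_sort_command(src, dst, **kwargs), env=env, check=True)
```

Note that `LC_ALL=C` changes the resulting order compared to locale-aware collation, so only use it when byte order is acceptable for the question you are asking.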

2 comments

The problem isn't necessarily (just) capacity. It's also the I/O bandwidth needed to read the data.
If you just want cold storage, you can put 10TB of compressed textual data on a spinning hard drive. If you want to run some processing of that data within a workday, you need multiple drives in parallel (still possible on a single server). However, if you want to process it in less than 30 mins you need a cluster.
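A back-of-the-envelope sketch of that bandwidth argument follows. The 200 MB/s sequential read rate per spinning drive is an assumed ballpark, not a figure from this thread, and the function names are made up for illustration; the model also ignores CPU, decompression, and controller limits.

```python
import math

TB = 10**12  # decimal terabyte, in bytes
MB = 10**6   # decimal megabyte, in bytes


def read_hours(data_bytes, drives, mb_per_s_per_drive):
    """Hours to read data_bytes once, striped evenly across `drives`."""
    throughput = drives * mb_per_s_per_drive * MB  # aggregate bytes per second
    return data_bytes / throughput / 3600


def drives_needed(data_bytes, deadline_s, mb_per_s_per_drive):
    """Minimum number of drives to read data_bytes within deadline_s seconds."""
    required = data_bytes / deadline_s  # bytes per second needed overall
    return math.ceil(required / (mb_per_s_per_drive * MB))


if __name__ == "__main__":
    data = 10 * TB
    rate = 200  # MB/s per drive, assumed
    print(f"1 drive:  {read_hours(data, 1, rate):.1f} h")
    print(f"8 drives: {read_hours(data, 8, rate):.1f} h")
    print(f"drives for a 30 min pass: {drives_needed(data, 30 * 60, rate)}")
```

Under those assumptions a single drive needs roughly 14 hours for one pass over 10TB, eight drives bring it under two hours, and a 30-minute pass needs about 5.6 GB/s in aggregate, i.e. a few dozen such drives.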