by roenxi 399 days ago
> As recently shown, the median scan in Amazon Redshift and Snowflake reads a doable 100 MB of data, and the 99.9-percentile reads less than 300 GB. So the singularity might be closer than we think.

This isn't really saying much. It is a bit like saying the 1-in-1000-year storm levee is overbuilt for 99.9% of storms. They aren't the storms the levee was built for, y'know. It wasn't set up with them close to the top of mind. The database might do 1,000 queries in a day, in which case a 99.9th-percentile query turns up about once a day.
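To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes queries are independent and that a fixed 0.1% of them land beyond the 99.9th percentile; `tail_queries` is a made-up helper name, and the 1,000 queries/day figure is just the one from the comment above.

```python
def tail_queries(queries_per_day: int, percentile: float = 0.999):
    """Return (expected tail queries per day, P(at least one tail query per day)).

    Assumes each query independently exceeds the percentile threshold
    with probability 1 - percentile.
    """
    p_tail = 1.0 - percentile
    expected = queries_per_day * p_tail
    # Complement of "every query today stayed under the threshold".
    p_at_least_one = 1.0 - percentile ** queries_per_day
    return expected, p_at_least_one


expected, p_any = tail_queries(1000)
print(f"expected tail queries/day: {expected:.2f}")
print(f"P(at least one tail query in a day): {p_any:.3f}")
```

Under those assumptions the "rare" scan is an everyday workload, which is why the tail matters for sizing.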

The focus for design purposes is really on the queries that live out on the tail - can they be done on a smaller database? How much value do they add? What capabilities does the database need to handle them? Etc. That is what should justify a Redshift database. Or you can provision one to hold your 1 TB of data because red things go fast and we all know it :/

3 comments

If you only have 1 TB of data then you can have it in RAM on a modern server.
AND even if you have 10 TB of data, NVMe storage is ridiculously fast compared to what disk used to look like (or S3...)
In the last few years, sure, but certainly not in 2012.
1 TB memory servers weren't THAT exotic even in, say, the 2014-2018 era either. I know, as I had a few at work.

Not cheap, but these were at companies with 100s of SWEs / billions in revenue / would eventually have multi-million dollar cloud bills for what little they migrated there.

You can take a different approach to the 1-in-1000 jobs. Like don't do them, or approximate them. I remember the time I wrote a program that would have taken a century to finish and then developed an approximation that got it done in about 20 minutes.
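The comment doesn't say which approximation was used, so the following is only one generic illustration of the idea: estimating a sum from a uniform random sample instead of scanning every row. The function name `approx_sum`, the synthetic data, and the sample size are all assumptions made for the sketch.

```python
import random


def approx_sum(values, sample_size, seed=0):
    """Estimate sum(values) from a uniform random sample without replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample = rng.sample(values, sample_size)
    # Scale the sample mean up to the full population size.
    return sum(sample) / sample_size * len(values)


data = list(range(1_000_000))          # stand-in for a large table column
exact = sum(data)                      # the "expensive" full scan
estimate = approx_sum(data, 10_000)    # touches only 1% of the rows
rel_err = abs(estimate - exact) / exact
print(f"relative error: {rel_err:.4%}")
```

Whether a sampled answer is acceptable depends on the job; heavily skewed data needs a larger or stratified sample than this uniform one.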
> This isn't really saying much.

On the contrary, it's saying a lot about sheer data size, that's all. The things you mention may be crucial to why Redshift and co. have been chosen (or not - in my org Redshift was used as the standard, so even small datasets were put into it because management wanted to standardize, for better or worse), but the fact remains that if you deal with smaller datasets all of the time, you may want to reconsider the solutions you use.