Hacker News new | ask | show | jobs
by drej 1671 days ago
What I find hilarious is that companies argue who can query 100 TB faster and try to sell this to people. I've been on the receiving end of offers by both of the companies in question and used both platforms (and sadly migrated some data jobs to them).

While they can crunch large datasets, they are laughably slow for the datasets most people have. So while I did propose we use these solutions for our big-ish data projects, management kept pushing for us to migrate our tiny datasets (tens of gigabytes or smaller) and the perf expectedly tanked compared to our other solutions (Postgres, Redshift, pandas etc.), never mind the immense costs to migrate everything and train everyone up.

Yes, these are very good products. But PLEASE, for the love of god, don't migrate to them unless you know you need them (and by 'need' I don't mean pimping your resume).

4 comments

I did work on making a database myself, and I must say that querying 100TB fast, let alone storing 100TB of data, is a real problem. Some companies (very few) don't have much choice but to use a DB that works on 100TB. If you do have small data, then you have a lot of options. But if your data is large, then you have very few options. So it is correct to be competing on how fast a DB can query 100TB of data; while at the same time being slow if you have just 10GB of data. Some databases are designed only for large data, and should not be used if your data is small.
The larger your data, the more that indexing and maintaining them hurt you. This is why they do much better at larger datasets vs small data sets. It’s all about trade offs.

To overcome this, they make use of cache and if the small data is frequently accessed, the performance is generally pretty good and acceptable for most use cases.

Did anyone else notice the surge of brand new accounts that are appearing on these discussions of Databricks with pro-Databrick opinions?

If we had access to IP address of the posters, I sure would be interested in looking at correlation among them.

What about my comment above is pro-Databricks? Snowflake works the same way. So do most large scale DW insert Exadata, Netezza, etc...

Does anyone else notice people questioning common sense?

with most people working from home, not sure if this heuristic works.

disclaimer: works for databricks, but not on spark, and first time posting in this thread

Resume driven development FTW!
Very true. You have to understand the actual capabilities and your actual requirements. We work with petabyte size datasets and BigQuery is hard to beat. Our other reporting systems are still all in MySQL though.
its my experience if its just 10s of GBs then use 'normal' solutions. if TB then spark is great for that. note I have only used DataBricks & Spark, no snowflake.
PostgreSQL and MySQL can handle a few TB just fine. It is when you reach over 10TB that you need something else.