Hacker News
by dreyfan 1824 days ago
Spark is this weird ecosystem of people who take absolutely trivial concepts in SQL, bury their heads in the sand and ignore the past 50 years of RDBMS evolution, and then write extremely complicated (or broken) and expensive-to-run code. But whatever it takes to get Databricks to IPO! Afterwards the hype will die down and everyone will collectively abandon it just like MongoDB, except for the unfortunate companies with so much technical debt they can't extricate themselves from it.
3 comments

There's certainly some of that, and I have experienced project managers asking me to put 5GB datasets in Spark... but there's definitely a set of problems where vertical scaling is a PITA and MPP generally breaks the SQL guarantees anyway, costs a million, requires rewrites, etc.

When you want to process N+1 TB/PB, it's hard to throw standard relational approaches at it IMO.

SQL is strings all the way down, and testing the database itself is often a shitshow...

While I agree that it can easily be "strings all the way down", the way folks make Spark testable is often only slightly more advanced than using views in the SQL world. Add an understanding of window functions, and some trivial assertions on expected query results go a long way.
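To make the "views plus trivial assertions" idea concrete, here is a minimal sketch. It uses Python's built-in sqlite3 as a stand-in for whatever database you actually run; the `orders` table, the `running_totals` view and the helper names are all made up for illustration, and it assumes an SQLite build recent enough (3.25+) to support window functions.

```python
import sqlite3


def build_db():
    """Create an in-memory database with a tiny fixture and the view under test."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Hypothetical fixture table, small enough to reason about by hand.
        CREATE TABLE orders (customer TEXT, day INTEGER, amount INTEGER);
        INSERT INTO orders VALUES
            ('alice', 1, 10),
            ('alice', 2, 5),
            ('bob',   1, 7);

        -- The logic under test lives in a view, so it can be queried like a table.
        CREATE VIEW running_totals AS
        SELECT customer,
               day,
               SUM(amount) OVER (PARTITION BY customer ORDER BY day) AS running_total
        FROM orders;
        """
    )
    return conn


def test_running_totals():
    """Assert the view returns exactly the rows we expect for the fixture."""
    conn = build_db()
    rows = conn.execute(
        "SELECT customer, day, running_total "
        "FROM running_totals ORDER BY customer, day"
    ).fetchall()
    assert rows == [("alice", 1, 10), ("alice", 2, 15), ("bob", 1, 7)]


test_running_totals()
```

The point is only that the view gives the query a name you can test against a known fixture; the same pattern should carry over to other engines, though the setup code will differ.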
Spark is far more testable and composable than SQL! And you even get static type checking. Plus I can read data from anywhere: local FS, S3, RDBMS, JSON, Parquet, CSV... An RDBMS could not compete.
Many (most?) DBs have no problem ingesting JSON, Parquet, CSV, etc. from S3. Some can query those formats without first ingesting them.
Is it best to just use spark.sql?