HN Mirror


by remilouf 2620 days ago

You changed my perspective a little bit by asking the right questions.

> Moves from an architecture that is clustered for scale (ie. spark) to one that only scales vertically

I did a quick estimate of the volume, and we won't reach 1 TB for more than 5 years. We're not in a line of business where the number of clients can increase dramatically, so growth is fairly predictable. I don't want to design for imaginary scaling issues.

> Potentially introduces yet more sources of truth for some data.

It is intended to replace the current mess, not to add another source alongside it.

> SQL is terrible language to write transformations in (its a query language, not an ETL pipeline)

Actually, this is the point that concerns me the most: the need to transform the data in non-trivial ways. But surely people didn't wait for Spark to do this?
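
To make that question concrete, here is a minimal sketch of a transformation written in plain SQL, with no Spark involved. It keeps only the most recent row per client. Everything in it is an assumption for illustration: the `orders` table, its columns, the sample rows, and the `latest_order_per_client` helper are hypothetical, and SQLite (via Python's standard `sqlite3` module) only stands in for whatever database would actually be used.

```python
import sqlite3


def latest_order_per_client(rows):
    """Return the most recent order for each client, computed in SQL.

    `rows` is an iterable of (client, day, amount) tuples. The schema is
    hypothetical and exists only to illustrate the idea.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (client TEXT, day TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)

    # The CTE finds the latest day per client; the join pulls the full row back.
    # ISO-formatted dates compare correctly as text, so MAX(day) works here.
    query = """
        WITH latest AS (
            SELECT client, MAX(day) AS day
            FROM orders
            GROUP BY client
        )
        SELECT o.client, o.day, o.amount
        FROM orders AS o
        JOIN latest AS l
          ON l.client = o.client AND l.day = o.day
        ORDER BY o.client
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


# Hypothetical sample data.
rows = [
    ("acme", "2019-01-01", 10.0),
    ("acme", "2019-02-01", 25.0),
    ("globex", "2019-01-15", 7.5),
]
print(latest_order_per_client(rows))
```

Whether this style stays readable for the genuinely messy transformations is exactly the open question; the sketch only shows that deduplication and joins of this kind do not need a cluster.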

> Unless you can very clearly demonstrate that what you're making is meaningfully better

This is a very good point, and I think I should come up with a quick POC to demonstrate and get buy-in.

> Could you perhaps find better way to orchestrate your spark tasks, eg. with airflow or ADF or AWS Glue or whatever?

I feel that it would just be solving the mess by adding more mess.

1 comment

mremes 2620 days ago

I disagree with the author of the parent comment about avoiding SQL and using Spark instead. I actually first wrote my "SQL advocacy" as a reply to that comment, but decided to leave that view as it is and write my own "rant" against complicating "big" data transformations with Spark, EMR (Hadoop Pig), or vendor-locked Spark tooling like AWS Glue.

That said, I agreed with the parent comment's author about pretty much everything up to the third bullet point of the second list. I'd like to hear more of the reasoning behind his SQL hate.
