| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by shadowmint 2620 days ago

It sounds like you already have an idea of what you want to do, but I think you should pause and think more deeply about what you have, vs. what you want.

What I would want in your situation is:

    - All the data in one place.
    - An easy way to explore the data. 
    - A single source of truth for transformed data.
    - Metadata to explain the data model (ie. documentation).

What you're proposing does some of those things, but it also:

    - Adds yet another maintain-forever technology to your stack.
    - Adds yet another pipeline (or set of pipelines) that does the same thing.
    - Moves from an architecture that is clustered for scale (ie. spark) to one that only scales vertically (postgres). 
    - Potentially introduces *yet more* sources of truth for some data.

> I was thinking that in a first iteration, data scientists would explore their denormalized, aggregated data and create their own feature with code.

^ Moving data into postgres doesn't make this somehow trivial, it just enables people to use a different SQL dialect. The spark API is, for anyone competent to be writing code, not meaningfully less complicated than using the postgres API.

I appreciate the naive attractiveness of having a traditional "data warehouse" in a SQL database, but there is actually a reason why people are moving away from that model:

    - it doesn't scale
    - SQL is terrible language to write transformations in (its a *query* language, not an ETL pipeline)
    - it's only vaguely better when you have many denormalised tables, vs. s3 parquet blobs
    - you have to invent data for schema changes (ie. new table schema, old data in the table) (ie. migrations are hard)

More tangibly, I know people who have done exactly what you're talking about, and regretted it. Unless you can very clearly demonstrate that what you're making is meaningfully better, it won't be adopted by the other team members and you'll have to either live forever in your silo, or eventually abandon it and go back to the old system. :/

So... I don't recommend it.

The points you're making are all valid, and for a small scale like this, if you were doing it from scratch it would be a pretty compelling option... but migrating entirely will be prohibitively expensive, and migrating partially will be a disaster.

Could you perhaps find better way to orchestrate your spark tasks, eg. with airflow or ADF or AWS Glue or whatever?

Personally I think that databricks offers a very attractive way to allow data exploration without a significant architecture change.

The architecture you're using isn't fundamentally bad, it just needs strong across the board data management... but that's something very difficult to drive from the bottom up.

1 comments

remilouf 2620 days ago

You changed my perspective a little bit by asking the right questions.

> Moves from an architecture that is clustered for scale (ie. spark) to one that only scales vertically

I did a quick estimate of the volume, and we won't reach 1Tb before > 5 years. We're not in a line of business where the number of clients can increase dramatically so it's fairly predictable. I don't want to design for imaginary scaling issues.

> Potentially introduces yet more sources of truth for some data.

It is more intended to replace the current mess.

> SQL is terrible language to write transformations in (its a query language, not an ETL pipeline)

Actually this is the point that concerns me the most. The need to transform the data in non-trivial ways. But surely people didn't wait for Spark to do this?

> Unless you can very clearly demonstrate that what you're making is meaningfully better

This is a very good point, and I think I should come up with a quick POC to demonstrate and get buy-in.

> Could you perhaps find better way to orchestrate your spark tasks, eg. with airflow or ADF or AWS Glue or whatever?

I feel that it would just be solving the mess by adding more mess.

mremes 2619 days ago

I disagree with the author of the parent comment in regards of using SQL and using Spark instead. I actually first wrote my "SQL advocation" as a reply to this comment but decided to leave leave this view for what it is and write my own "rant" against complicating "big" data transformations with Spark or EMR (Hadoop Pig) or vendor-locked Spark-instrumentations like AWS Glue.

But I agreed with the parent comment's author about pretty much anything until the third bullet point of the second list. I'd like to get more reasoning behind his SQL hate.