
by deepyaman 2613 days ago

> Is it a processing solution? Not really, since it suggests you can offload the heavy lifting to an engine, eg. spark. An orchestrator? Apparently not, because that's a complementary product. So... it's like, a configuration management tool?

I actually had the same questions when I was first introduced to Kedro! In my case, I didn't understand the value proposition over something like Apache Beam. After using it, I feel like Kedro provides:

    1. a consistent structure across analytics pipelines. It's easy to start and pick up other Kedro projects after you've
       used it once.
    2. convenient and consistent I/O via the data catalog. The fact that we can configure and swap out data sources with
       ease is a huge plus, and we also rely heavily on data versioning.
    3. easy integration with existing frameworks (PySpark, vanilla Pandas, Dask, Airflow, Luigi, etc.)
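
To make point 2 a bit more concrete, here's a rough sketch of the data catalog idea in plain Python. To be clear, this is not Kedro's API: `CSVDataSet`, `JSONDataSet`, `build_catalog` and `round_trip` are names I made up for illustration, and a real catalog does a lot more (versioning, credentials, etc.). The point is just that pipeline code asks for a dataset by name, and the config decides where and how it is stored.

```python
import csv
import json
import tempfile
from pathlib import Path


class CSVDataSet:
    """Hypothetical dataset that stores a list of dicts as a CSV file."""

    def __init__(self, filepath):
        self._path = Path(filepath)

    def load(self):
        with self._path.open(newline="") as f:
            return list(csv.DictReader(f))

    def save(self, rows):
        with self._path.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)


class JSONDataSet:
    """Hypothetical dataset that stores the same rows as a JSON file."""

    def __init__(self, filepath):
        self._path = Path(filepath)

    def load(self):
        return json.loads(self._path.read_text())

    def save(self, rows):
        self._path.write_text(json.dumps(rows))


# Maps the "type" key in the config to a dataset class (illustrative only).
DATASET_TYPES = {"csv": CSVDataSet, "json": JSONDataSet}


def build_catalog(config):
    """Turn a config dict into a name -> dataset mapping."""
    return {
        name: DATASET_TYPES[entry["type"]](entry["filepath"])
        for name, entry in config.items()
    }


def round_trip(kind, rows):
    """Save and reload rows via the catalog; only the config depends on `kind`."""
    with tempfile.TemporaryDirectory() as tmp:
        config = {"customers": {"type": kind, "filepath": f"{tmp}/customers.{kind}"}}
        catalog = build_catalog(config)
        catalog["customers"].save(rows)
        return catalog["customers"].load()
```

Calling `round_trip("csv", rows)` and `round_trip("json", rows)` runs exactly the same save/load code; swapping the storage format is a one-line config change. (Values are kept as strings in this toy version because the CSV reader returns strings.)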

Additionally, it aligns well with standards we have internally, like data layering. (edit: Apparently this is also part of the FAQ: https://kedro.readthedocs.io/en/latest/06_resources/01_faq.h... Who knew!)

> Personally, I feel mystified why you would use something like this rather than a more mature product like say, Spark, that natively supports clustering, etc, which is what I would really like to see in the FAQ.

I'd say 80-90% of projects at QuantumBlack use (Py)Spark, so we've built out `SparkDataSet`s, `pandas_to_spark` and `spark_to_pandas` utility decorators, etc. There's a brief integration tutorial here: https://github.com/quantumblacklabs/kedro/tree/develop/kedro...
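
If you're wondering what a conversion decorator like that boils down to, here's a generic sketch of the pattern. It is not the actual Kedro implementation: `convert_args` is a hypothetical name, and I'm using `list` and `tuple` as stand-ins for pandas and Spark DataFrames so it runs without a Spark session. The decorator converts any argument of the source type before the wrapped node function sees it.

```python
from functools import wraps


def convert_args(source_type, converter):
    """Build a decorator that converts every argument of `source_type`.

    Hypothetical helper: in a Spark setting, `source_type` would be the
    pandas DataFrame class and `converter` would create a Spark DataFrame.
    """

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Convert matching positional and keyword arguments, pass the rest through.
            new_args = [
                converter(a) if isinstance(a, source_type) else a for a in args
            ]
            new_kwargs = {
                k: converter(v) if isinstance(v, source_type) else v
                for k, v in kwargs.items()
            }
            return func(*new_args, **new_kwargs)

        return wrapper

    return decorator


# Stand-in usage: lists are turned into tuples before the function runs.
@convert_args(list, tuple)
def describe(data, label="x"):
    return f"{label}: {type(data).__name__} of {len(data)}"
```

With this, `describe([1, 2, 3])` receives a tuple even though the caller passed a list, so the function body only ever has to deal with one type.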

Full disclosure: I'm a data engineer at QuantumBlack (if it wasn't obvious already!)