Hacker News
by wokwokwok 2566 days ago
tldr, if you really dig past the marketing (from the FAQ (1)):

> We see Airflow and Luigi as complementary frameworks: Airflow and Luigi are tools that handle deployment, scheduling, monitoring and alerting. Kedro is the worker that should execute a series of tasks, and report to the Airflow and Luigi managers.

> Create the data transformation steps as pure Python functions

Personally, I'm mystified as to why you would use something like this rather than a more mature product like, say, Spark, that natively supports clustering, etc., which is what I would really like to see in the FAQ.

Is it a processing solution? Not really, since it suggests you can offload the heavy lifting to an engine, e.g. Spark. An orchestrator? Apparently not, because that's a complementary product. So... it's like, a configuration management tool?

It's pretty hard for me to see the use case.

1. https://kedro.readthedocs.io/en/latest/06_resources/01_faq.h...

3 comments

> Is it a processing solution? Not really, since it suggests you can offload the heavy lifting to an engine, e.g. Spark. An orchestrator? Apparently not, because that's a complementary product. So... it's like, a configuration management tool?

I actually had the same questions when I was first introduced to Kedro! In my case, I didn't understand the value proposition over something like Apache Beam. After using it, I feel like Kedro provides:

    1. a consistent structure across analytics pipelines. It's easy to start and pick up other Kedro projects after you've
       used it once.
    2. convenient and consistent I/O via the data catalog. The fact that we can configure and swap out data sources with ease
       is a huge plus, and we also rely heavily on data versioning.
    3. easy integration with existing frameworks (PySpark, vanilla Pandas, Dask, Airflow, Luigi, etc.)

Additionally, it aligns well with standards we have internally, like data layering. (edit: Apparently this is also part of the FAQ: https://kedro.readthedocs.io/en/latest/06_resources/01_faq.h... Who knew!)

> Personally, I'm mystified as to why you would use something like this rather than a more mature product like, say, Spark, that natively supports clustering, etc., which is what I would really like to see in the FAQ.

I'd say 80-90% of projects at QuantumBlack use (Py)Spark, so we've built out `SparkDataSet`s, `pandas_to_spark` and `spark_to_pandas` utility decorators, etc. There's a brief integration tutorial here: https://github.com/quantumblacklabs/kedro/tree/develop/kedro...

Full disclosure: I'm a data engineer at QuantumBlack (if it wasn't obvious already!)

Because running Spark to do anything that doesn't actually require a whole cluster is like using earthmoving equipment to assemble a series of small IKEA tables?

If you're doing something that trivial, you don't need anything more complicated than Airflow.

We experienced a big hit to our productivity when we were using Airflow, as there is significant overhead when running pipelines.

We think Kedro is easier than Airflow and needs less setup:

  - You don't need a scheduler, a database, or any initial setup. Instead, Kedro provides the `kedro new` command, which will create a project for you that runs out of the box (optionally with a small pipeline example).
  - You can run your pipelines as simple Python applications, making it easy to iterate in IDEs or terminals
  - Tasks are simple Python functions, instead of operators
  - Datasets are first-class citizens. You don't need to explicitly define dependencies between the tasks: they are resolved according to what each task produces/consumes

We also think that a big differentiating factor is the `DataCatalog`. Being able to define in YAML files where your data is and how it is stored/loaded means that the same code will run in any environment given the appropriate configuration files.

This makes testing & moving from development to production much easier.
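
To make the catalog idea concrete, here is a minimal stdlib-only sketch. It is not Kedro's actual `DataCatalog` API: `TinyCatalog`, `LOADERS`, the `type`/`filepath` keys and the sample files are all hypothetical names invented for illustration, and a plain dict stands in for the YAML file. The point is only that the task function never sees a path or a format.

```python
import csv
import json
import tempfile
from pathlib import Path


# Hypothetical loaders keyed by a "type" string (not Kedro's dataset classes).
def load_csv(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def load_json(path):
    with open(path) as f:
        return json.load(f)


LOADERS = {"csv": load_csv, "json": load_json}


class TinyCatalog:
    """Maps dataset names to {type, filepath} entries, as a config file might."""

    def __init__(self, config):
        self._config = config

    def load(self, name):
        entry = self._config[name]
        return LOADERS[entry["type"]](entry["filepath"])


def total_sales(rows):
    # Task code only sees loaded records, never paths or file formats.
    return sum(float(row["amount"]) for row in rows)


# Two "environments" that store the same data in different formats.
workdir = Path(tempfile.mkdtemp())
(workdir / "sales.csv").write_text("amount\n10\n5.5\n")
(workdir / "sales.json").write_text(json.dumps([{"amount": 10}, {"amount": 5.5}]))

dev_catalog = TinyCatalog(
    {"sales": {"type": "csv", "filepath": str(workdir / "sales.csv")}}
)
prod_catalog = TinyCatalog(
    {"sales": {"type": "json", "filepath": str(workdir / "sales.json")}}
)

# Same function, different configuration.
dev_total = total_sales(dev_catalog.load("sales"))
prod_total = total_sales(prod_catalog.load("sales"))
```

Swapping the storage backend here means editing the config entry only; `total_sales` stays untouched, which is roughly the property being described above.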

(Disclaimer - I am one of the lead developers of kedro)

We hope that you give it a try and give us feedback :)

I personally don't think it's that black and white. Not everyone has the same training in best practices for software engineering, and this tool looks like it places some constraints on the anarchy that can result, without requiring huge amounts of front-loading.

I personally find it simpler than Airflow, since there is less boilerplate required to construct DAGs and, in my opinion, less of a learning curve.

I think one of the big differences is that during development the pipeline DAG is inferred from the data catalog rather than explicitly coded, as it has to be in something like Airflow.

The logic is that once you've finished experimenting and iterating, it's much easier to move to Airflow.
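
For anyone wondering what "inferred" might look like in practice, here is a rough sketch in plain Python. It is not how Kedro is implemented and uses none of its API: the `tasks` tuples, the `run` helper and the dataset names are assumptions made up for the example. Each task only declares which named dataset it consumes and which it produces, and the execution order falls out of that.

```python
def clean(raw):
    return [x for x in raw if x is not None]


def double(cleaned):
    return [2 * x for x in cleaned]


def summarize(doubled):
    return sum(doubled)


# Each task: (function, input dataset name, output dataset name).
# Deliberately listed out of order; no task refers to another task.
tasks = [
    (summarize, "doubled", "total"),
    (double, "cleaned", "doubled"),
    (clean, "raw", "cleaned"),
]


def run(tasks, initial_data):
    """Run every task once its input dataset exists; return data and order."""
    data = dict(initial_data)
    pending = list(tasks)
    order = []
    while pending:
        # A task is ready when the dataset it consumes has been produced.
        ready = [task for task in pending if task[1] in data]
        if not ready:
            missing = ", ".join(task[1] for task in pending)
            raise ValueError("unsatisfiable inputs: " + missing)
        for func, source, target in ready:
            data[target] = func(data[source])
            order.append(func.__name__)
        pending = [task for task in pending if task not in ready]
    return data, order


results, order = run(tasks, {"raw": [1, None, 2, 3]})
```

With explicit wiring you would instead write something like "double runs after clean" by hand; here reordering or inserting a step only requires naming its input and output datasets.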