by rathboma 3668 days ago
Using Sqoop from something like Luigi as the ETL manager is a pretty great workflow - https://github.com/spotify/luigi

You can define dependencies between jobs based on their output files, which lets you re-run only the part of your pipeline whose output is missing.
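To make the output-file idea concrete, here is a minimal sketch using only the standard library. It is not Luigi's API: the `Task` class, the `build` function, and the task names are hypothetical stand-ins that only illustrate the rule "a task is done when its output file exists".

```python
import os
import tempfile


class Task:
    """Hypothetical stand-in for a Luigi-style task (not Luigi's API)."""

    def __init__(self, name, workdir, requires=()):
        self.name = name
        self.output = os.path.join(workdir, name + ".out")
        self.requires = list(requires)

    def complete(self):
        # A task counts as done when its output file exists.
        return os.path.exists(self.output)

    def run(self):
        # Placeholder for the real work (e.g. a Sqoop import).
        with open(self.output, "w") as f:
            f.write("output of " + self.name + "\n")


def build(task, ran):
    """Run upstream tasks first, skipping any task whose output already exists."""
    if task.complete():
        return
    for dep in task.requires:
        build(dep, ran)
    task.run()
    ran.append(task.name)


workdir = tempfile.mkdtemp()
extract = Task("sqoop_import", workdir)
transform = Task("transform", workdir, requires=[extract])
load = Task("load", workdir, requires=[transform])

first_run = []
build(load, first_run)      # nothing exists yet, so all three tasks run

os.remove(load.output)      # simulate a failed or invalidated final step
second_run = []
build(load, second_run)     # only the task with the missing output runs again
```

After the final output is deleted, the second `build` call skips the import and transform steps because their files are still on disk.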

2 comments

That's a great idea, but could you elaborate on how jobs are scheduled with Luigi? It does not have a scheduler like Airflow, so how do you schedule Luigi tasks?
Check out this Foursquare talk that goes through how we used to do scheduling -- basically you make jobs dependent on a date - http://www.slideshare.net/OpenAnayticsMeetup/luigi-presentat...
You have to use an external scheduler. We built one on top of APScheduler: https://apscheduler.readthedocs.io/en/latest/
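A rough sketch of how date-dependent jobs and an external trigger can fit together, using only the standard library. Everything here is an assumption made for illustration: the `.done` file layout, the `lookback` window, and the `tick` function that a cron entry or an APScheduler job would call are hypothetical and come from neither Luigi nor APScheduler.

```python
import datetime
import os
import tempfile


def output_path(base_dir, task_name, day):
    # Hypothetical layout: one output file per task per day.
    return os.path.join(base_dir, task_name, day.isoformat() + ".done")


def pending_days(base_dir, task_name, today, lookback):
    """Return the days in the lookback window whose output is missing, oldest first."""
    days = [today - datetime.timedelta(days=n) for n in range(lookback - 1, -1, -1)]
    return [d for d in days if not os.path.exists(output_path(base_dir, task_name, d))]


def run_for_day(base_dir, task_name, day):
    # Placeholder for the real work (e.g. a Sqoop import for that day).
    path = output_path(base_dir, task_name, day)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write("ok\n")


def tick(base_dir, task_name, today, lookback=3):
    """What the external scheduler would call on each trigger."""
    ran = pending_days(base_dir, task_name, today, lookback)
    for day in ran:
        run_for_day(base_dir, task_name, day)
    return ran


base = tempfile.mkdtemp()
today = datetime.date(2016, 1, 10)
first = tick(base, "sqoop_import", today)    # runs every day in the window
second = tick(base, "sqoop_import", today)   # nothing left to do
third = tick(base, "sqoop_import", today + datetime.timedelta(days=1))
```

Because each date maps to its own output, triggering the job twice on the same day is harmless, and a missed day is picked up on the next trigger as long as it is still inside the lookback window.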
+1 to this, we kick off our Sqoop jobs using Airflow - http://airbnb.io/projects/airflow/

Airflow is very similar to Luigi; we've been using it in production to schedule all of our workflows for ~4 months now and it's worked out really well for us.