


	by ktamura 3668 days ago
	Any good alternatives for Sqoop? I feel that an ETL tool just for HDFS is too limiting and leads to further fragmentation on the data pipeline.

5 comments

technofiend 3668 days ago

I feel like I've recommended it enough times that I'm turning into a shill, but Pentaho is an open-source, commercially supported ETL tool that will natively do what you want, or call Sqoop when you discover that's kinda slow. :-) And no, I definitely don't work there.


dn5 3668 days ago

Have a look at Kafka Connect (http://docs.confluent.io/2.0.0/connect). The JDBC connector will poll for database changes and push them to a Kafka topic. That means you should see all the changes in the database rather than a snapshot taken, say, once a day.
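
To make the polling idea concrete, here is a minimal stdlib-only sketch of incremental polling on an auto-incrementing ID column. It is not Kafka Connect code: the `users` table, the `poll_new_rows` helper, and the in-memory SQLite database are all assumptions made for illustration. Each poll fetches only rows past the last ID seen, which is roughly what a query-based poller has to do.

```python
import sqlite3


def poll_new_rows(conn, last_id):
    """Return rows with id > last_id, plus the updated high-water mark."""
    rows = conn.execute(
        "SELECT id, name FROM users WHERE id > ? ORDER BY id", (last_id,)
    ).fetchall()
    # If nothing new arrived, keep the previous offset.
    new_last_id = rows[-1][0] if rows else last_id
    return rows, new_last_id


# Hypothetical source table, held in memory for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("ann",), ("bob",)])

# First poll picks up everything inserted so far.
first_batch, offset = poll_new_rows(conn, 0)

# A later insert is the only thing the second poll returns.
conn.execute("INSERT INTO users (name) VALUES (?)", ("cy",))
second_batch, offset = poll_new_rows(conn, offset)
```

In a real pipeline each batch would be published to a topic instead of kept in a list. Note that polling on an ID column only sees inserts; updates and deletes need a different strategy.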


rathboma 3668 days ago

Using Sqoop from something like Luigi as the ETL manager is a pretty great workflow - https://github.com/spotify/luigi

You can define dependencies between jobs based on their output files, which allows you to re-run only part of your pipeline.
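
A rough stdlib-only sketch of that idea, for anyone who hasn't seen it. This is not Luigi's API; the `Task` class, the `build` function, and the file names are hypothetical. A task counts as complete when its output file exists, so a re-run skips anything already finished.

```python
import os
import tempfile


class Task:
    """Hypothetical task: complete once its output file exists."""

    def __init__(self, name, output_path, requires=()):
        self.name = name
        self.output_path = output_path
        self.requires = list(requires)

    def complete(self):
        return os.path.exists(self.output_path)

    def run(self):
        # Stand-in for real work (e.g. a Sqoop import); just writes a marker.
        with open(self.output_path, "w") as f:
            f.write(self.name + " done\n")


def build(task, ran):
    """Run missing dependencies first, then the task; record what ran."""
    if task.complete():
        return
    for dep in task.requires:
        build(dep, ran)
    task.run()
    ran.append(task.name)


workdir = tempfile.mkdtemp()
extract = Task("extract", os.path.join(workdir, "extract.txt"))
transform = Task("transform", os.path.join(workdir, "transform.txt"), [extract])
load = Task("load", os.path.join(workdir, "load.txt"), [transform])

first_run = []
build(load, first_run)   # nothing exists yet, so all three tasks run

os.remove(load.output_path)
second_run = []
build(load, second_run)  # only the task with the missing output runs again
```

Deleting one output and re-running rebuilds just that step, which is the "re-run only part of your pipeline" behaviour.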


machbio 3668 days ago

That's a great idea - but could you elaborate on the scheduling of jobs in Luigi? It does not have a scheduler like Airflow - how do you schedule Luigi tasks?


rathboma 3668 days ago

Check out this Foursquare talk that goes through how we used to do scheduling -- basically you make jobs dependent on a date - http://www.slideshare.net/OpenAnayticsMeetup/luigi-presentat...
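
One way to read "dependent on a date", as a stdlib-only sketch rather than anything taken from the talk: each job run is keyed by a day, and an external trigger such as cron simply asks for every day whose output is missing. The job name, the marker path layout, and the dates below are made up for the example.

```python
import datetime


def output_marker(job_name, day):
    """Hypothetical marker path for one job on one day."""
    return f"{job_name}/{day.isoformat()}/_SUCCESS"


def missing_days(job_name, start, end, existing):
    """List the days in [start, end] whose output marker is absent."""
    day = start
    missing = []
    while day <= end:
        if output_marker(job_name, day) not in existing:
            missing.append(day)
        day += datetime.timedelta(days=1)
    return missing


# Pretend May 1 already ran; the set stands in for a file-system listing.
existing = {output_marker("sqoop_import", datetime.date(2016, 5, 1))}

todo = missing_days(
    "sqoop_import",
    datetime.date(2016, 5, 1),
    datetime.date(2016, 5, 3),
    existing,
)
```

Because completed days are skipped, the trigger can fire as often as you like and backfills come for free.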


allengeorge 3668 days ago

You have to use an external scheduler. We built one on top of APScheduler: https://apscheduler.readthedocs.io/en/latest/


natekupp 3668 days ago

+1 to this, we kick off our Sqoop jobs using Airflow - http://airbnb.io/projects/airflow/

Airflow is very similar to Luigi; we've been using it in production to schedule all of our workflows for ~4 months now and it's worked out really well for us.


Joeri 3668 days ago

We've been trying out GoldenGate to get streaming replication, but it has proven rather unreliable. It stops replicating if you sneeze in its general vicinity. I wonder whether alternatives like SharePlex and Tungsten are more reliable.


capkutay 3668 days ago

You can try out Striim for streaming data integration (full disclosure, I work there):

http://www.striim.com/download-striim/


falaki 3668 days ago

Ever since Apache Spark's Data Sources API was released, I have been relying on different Spark Data Source packages for my ETL jobs.
