HN Mirror

by lgsilver 2616 days ago

Really frustrating. I went through this process recently. The data was a couple of orders of magnitude bigger, and so I tend to agree that just going straight to Redshift / BigQuery would probably work best, but here were our steps:

1.) Ensure that ingestion / S3 jobs are stabilized (in our case, the legacy jobs were in Informatica, and maintenance took up all the teams' time). We moved to Luigi for this, but Airflow is great too.

2.) Get Presto schemas defined and make Presto the interface for querying / basic pipelining.

3.) Add Mode Analytics or another basic query UI on top for ad-hoc queries. This cleared a massive bottleneck for our teams because analysts and data scientists now have direct access to data without technical help.

4.) Build "gold" records for specific sets/types that are valuable, and get them piped from S3 into Redshift/BigQuery (we built a streaming layer for this). This speeds up querying, makes governance easy, and is extremely reliable.
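
A minimal sketch of what the "gold" record step (4) could look like, assuming "gold" means a deduplicated, validated copy of the raw records. That reading, the `RawEvent` type, its field names, and `build_gold_records` are all hypothetical and not taken from the pipeline described above; the S3 read and the warehouse load are left out.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class RawEvent:
    # Hypothetical raw record as it might land in S3; field names are assumptions.
    entity_id: str
    updated_at: int  # epoch seconds
    value: Optional[str]


def build_gold_records(raw: Iterable[RawEvent]) -> List[RawEvent]:
    """Keep the newest valid record per entity_id."""
    latest: Dict[str, RawEvent] = {}
    for event in raw:
        # Drop records that fail a basic quality check (missing or empty value).
        if not event.value:
            continue
        current = latest.get(event.entity_id)
        # Keep the record with the greatest updated_at for each entity.
        if current is None or event.updated_at > current.updated_at:
            latest[event.entity_id] = event
    # Sort by entity_id so the load order is deterministic.
    return sorted(latest.values(), key=lambda e: e.entity_id)
```

Under these assumptions, the list returned by `build_gold_records` would be what gets loaded into the warehouse table, while the raw data stays in S3 for ad-hoc querying.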

Honestly, the hardest part here was the change management among our legacy teams. That said, it's incredible how widely this has been embraced now that we have it up and working.

1 comment

kwillets 2613 days ago

Mode missed the boat by not including ETL capabilities. I worked for a time on the tool that Mode was based on, and its ETL capability was the hidden hand that got data scientists to build and maintain the data pipeline.
