by itronitron 2811 days ago
Sounds like fun. You should probably look at Jupyter or Spark as a system for managing the data transformations, one that lets team members create and share scripts and workbooks.

Develop the ETL process so that it just pulls data and writes it out as flat files in your team's ideal form. Then write a separate process that pushes that data where and how you want it (because that can change in six months). Also develop automated processes for measuring and ensuring the quality of data being added to your system.
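
A rough sketch of that split, in Python since Jupyter came up. Everything here is an assumption for illustration: the function names (`extract_to_flat_file`, `check_quality`, `load`), the CSV format, and the "required fields must be non-empty" rule are all hypothetical, and a real pipeline would pull from an actual source and push to an actual destination.

```python
import csv


def extract_to_flat_file(rows, out_path):
    """Stage 1: write pulled records (dicts) to a CSV flat file.

    `rows` stands in for whatever the real extraction step returns.
    """
    rows = list(rows)
    if not rows:
        raise ValueError("nothing to write")
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


def check_quality(path, required_fields):
    """Automated check: report rows where a required field is empty.

    This rule is a placeholder; real checks depend on the data.
    """
    problems = []
    with open(path, newline="") as f:
        # The header is line 1 of the file, so data rows start at line 2.
        for line_no, row in enumerate(csv.DictReader(f), start=2):
            for field in required_fields:
                if not (row.get(field) or "").strip():
                    problems.append(f"line {line_no}: missing {field}")
    return problems


def load(path, sink):
    """Stage 2: push each staged row to a destination.

    `sink` is any callable taking one row, so the destination can
    change later without touching the extract stage.
    """
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            sink(row)
```

The point of passing `sink` in is that swapping the destination (a database today, something else in six months) only means writing a new callable, while the flat files and the quality checks stay the same.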

1 comments

Not familiar with Spark beyond being aware of it. We have a good amount of space allocated on a fairly powerful Oracle server, and I was thinking of storing all of our tables there. If an alternate system like Spark has big advantages, I could get it done, but I would have to get the IT software team on board with letting me install it, getting a server, etc.

I have used Jupyter, but I was thinking of having the team settle mainly on SAS code to build the ETL process, since that is the language most of them are familiar with (even though I personally HATE writing SAS).

If you want to test out Spark, then a trial account on databricks.com is probably a good place to start. If the team is used to SAS, though, I'd stick with that.