Hacker News
by 65 847 days ago
I built an ETL pipeline for a government client using just AWS, Node, and Snowflake, all in TypeScript. To cache the data, I store responses in S3: if a cached copy is available, we use the S3 data; if not, we fetch fresh data. We also clean out old cache entries occasionally with a cron job. Then we run the transforms and load the results into Snowflake. Sometimes we need to transform the data before caching it in S3 (e.g. adding a unique ID to CSV rows) or split giant CSV files into smaller files that can then be inserted into Snowflake (Snowflake has a 50 MB payload limit). We also have alerts, logging, and metadata set up in AWS and Snowflake. Most of this comes down to your knowledge of cloud data platforms.
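For illustration, the caching flow described above could look roughly like the sketch below. It is not the actual pipeline code: `ObjectStore`, `MemoryStore`, `getOrFetch`, and `preCache` are hypothetical names, and the in-memory store stands in for S3 so the example runs without AWS credentials. Expiring old cache entries (the cron job) is left out.

```typescript
// Minimal object-store interface (hypothetical). A real implementation
// would wrap S3 get/put calls behind the same two methods.
interface ObjectStore {
  get(key: string): Promise<string | undefined>;
  put(key: string, body: string): Promise<void>;
}

// In-memory stand-in for S3, used only to keep the sketch self-contained.
class MemoryStore implements ObjectStore {
  private objects = new Map<string, string>();

  async get(key: string): Promise<string | undefined> {
    return this.objects.get(key);
  }

  async put(key: string, body: string): Promise<void> {
    this.objects.set(key, body);
  }
}

// Cache-aside: return the cached body if present; otherwise fetch from the
// source, apply an optional pre-cache transform, store the result, return it.
async function getOrFetch(
  store: ObjectStore,
  key: string,
  fetchSource: () => Promise<string>,
  preCache: (raw: string) => string = (raw) => raw,
): Promise<{ body: string; fromCache: boolean }> {
  const cached = await store.get(key);
  if (cached !== undefined) {
    return { body: cached, fromCache: true };
  }
  const body = preCache(await fetchSource());
  await store.put(key, body);
  return { body, fromCache: false };
}
```

Keeping the store behind a small interface means the same function can be exercised in tests with `MemoryStore` and pointed at S3 in production.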

It's honestly not that difficult to build ETL pipelines from scratch. We're using a ton of different sources with different data formats as well. Using the Serverless framework to set up all the Lambda functions and cron jobs also makes things a lot easier.
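As a rough sketch of the pre-load steps mentioned in the post (adding a unique ID to each CSV row and splitting a large file into chunks that fit under a payload limit), something like the following could work. The function name `splitCsv`, the `row_id` column, and the `idPrefix` parameter are assumptions made for illustration, and the code assumes one record per line, so quoted fields containing newlines would need a real CSV parser.

```typescript
// Split CSV text into chunks whose UTF-8 size stays at or under maxBytes.
// Each chunk repeats the header, and each data row gets a row_id column
// built from idPrefix and the row's position in the original file.
// A CSV with only a header yields no chunks.
function splitCsv(csv: string, maxBytes: number, idPrefix: string): string[] {
  const encoder = new TextEncoder();
  const byteLength = (s: string): number => encoder.encode(s).length;

  const lines = csv.split(/\r?\n/).filter((line) => line.length > 0);
  if (lines.length === 0) return [];

  const header = "row_id," + lines[0] + "\n";
  const headerBytes = byteLength(header);
  const chunks: string[] = [];
  let current = header;
  let currentBytes = headerBytes;

  lines.slice(1).forEach((line, index) => {
    const row = `${idPrefix}-${index + 1},${line}\n`;
    const rowBytes = byteLength(row);
    if (headerBytes + rowBytes > maxBytes) {
      throw new Error(`Row ${index + 1} does not fit in ${maxBytes} bytes`);
    }
    // Start a new chunk when this row would push the current one over the limit.
    if (currentBytes + rowBytes > maxBytes) {
      chunks.push(current);
      current = header;
      currentBytes = headerBytes;
    }
    current += row;
    currentBytes += rowBytes;
  });

  // Emit the final chunk only if it holds at least one data row.
  if (currentBytes > headerBytes) chunks.push(current);
  return chunks;
}
```

Deriving the ID from the file prefix and row position keeps it stable across reruns of the same input, which helps if the load step is ever retried.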

1 comment

i appreciate you sharing all that, but it seems like we might be on similar levels of knowledge/experience. i've been a dev who does a lot of data engineering for 5 years. i'm looking more for best practices and theory about designing the pipeline: how to arrange the order of operations, how to separate each step, logging practices, how to make it reproducible, how to restart when it fails halfway through without going back to the beginning, how many retries to allow, what to do if a step gets stuck in a failed state, how to flag bad data, etc. so. many. questions. while i build these pipelines.
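To make a few of these questions concrete, here is one possible sketch (not an established best practice) of three of the ideas: checkpointing completed steps so a rerun resumes at the failed step, retrying a step a bounded number of times with exponential backoff, and setting invalid rows aside instead of failing the whole batch. Every name here (`Step`, `CheckpointStore`, `MemoryCheckpoints`, `withRetry`, `runPipeline`, `partitionRows`) and the default retry values are hypothetical, and the checkpoint store is in memory where a real one would have to be durable.

```typescript
type Step = { name: string; run: () => Promise<void> };

// Hypothetical checkpoint store. In practice this would need to survive a
// process restart, e.g. a database row or stored object keyed by run ID.
interface CheckpointStore {
  isDone(step: string): boolean;
  markDone(step: string): void;
}

class MemoryCheckpoints implements CheckpointStore {
  private done: { [step: string]: boolean } = {};
  isDone(step: string): boolean {
    return this.done[step] === true;
  }
  markDone(step: string): void {
    this.done[step] = true;
  }
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retry with exponential backoff: waits baseDelayMs, then 2x, 4x, ...
// Rethrows the last error once maxAttempts is reached.
async function withRetry(
  run: () => Promise<void>,
  maxAttempts: number,
  baseDelayMs: number,
): Promise<void> {
  for (let attempt = 1; ; attempt++) {
    try {
      await run();
      return;
    } catch (err) {
      if (attempt >= maxAttempts) throw err;
      await sleep(baseDelayMs * 2 ** (attempt - 1));
    }
  }
}

// Run steps in order, skipping any that are already checkpointed, so a rerun
// after a failure resumes at the failed step rather than the beginning.
async function runPipeline(
  steps: Step[],
  checkpoints: CheckpointStore,
  maxAttempts = 3, // assumed default, not a recommendation
  baseDelayMs = 100, // assumed default, not a recommendation
): Promise<void> {
  for (const step of steps) {
    if (checkpoints.isDone(step.name)) continue;
    await withRetry(step.run, maxAttempts, baseDelayMs);
    checkpoints.markDone(step.name);
  }
}

// Set invalid rows aside for later inspection instead of failing the batch.
function partitionRows<T>(
  rows: T[],
  isValid: (row: T) => boolean,
): { good: T[]; quarantined: T[] } {
  const good: T[] = [];
  const quarantined: T[] = [];
  for (const row of rows) {
    (isValid(row) ? good : quarantined).push(row);
  }
  return { good, quarantined };
}
```

Resuming this way is only safe if each step is idempotent, since a step that failed partway through will run again from its own start.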

i have figured out answers to these questions by seeing how more experienced devs do it and on my own, but i want to learn from a book or video series because you can only figure out so much yourself. eventually you need to seek out experts, and sometimes the experts around you also figured it out themselves, so you need to find an expert outside of your circle. unfortunately, a lot of the "ETL experts" teaching stuff online are trying to sell me on prefect or airflow or snowflake etc.