|
|
|
|
|
by 65
847 days ago
|
|
I built an ETL pipeline for a government client using just AWS, Node, and Snowflake. All Typescript. To cache the data I store responses in S3. If there's a cache available, use the S3 data, if not get the new data. We can also clean the old cache occasionally with a cron job. Then do transforms and put it in Snowflake. Sometimes we need to do transforms before caching the data in S3 (e.g. adding a unique ID to CSV rows), or doing things like splitting giant CSV files into smaller files that can then be inserted into Snowflake (Snowflake has a 50mb payload limit). We have alerts, logging, and metadata set up as well in AWS and Snowflake. Most of this comes down to your knowledge of cloud data platforms. It's honestly not that difficult to build ETL pipelines from scratch. We're using a ton of different sources with different data formats as well. Using the Serverless framework to set up all the Lambda functions and cron jobs also makes things a lot easier. |
|
i have figured out these questions by seeing how more experienced devs do it and on my own, but i want to learn from a book or video series because you can only figure out so much yourself, eventually you need to seek out experts and sometimes the experts around you also figured it out themselves and you need to find an expert outside of your circle. unfortunately a lot of the "ETL experts" teaching stuff online are trying to sell me on prefect or airflow or snowflake etc