by markus_zhang
1156 days ago
Regarding coding practices: I work as a data engineer and I have always wanted to enforce some rules for the team:

1. Every DAG must have a section that describes what it does, with a link to the Jira ticket and the documentation if applicable. Stakeholders should be named too.

2. Every DAG must have a "created by" and "modified by" section with a summary of what each person did. Yes, we can track changes in GitHub, but if we do a lift-and-shift upgrade of Composer instances, all trackable changes are gone.

3. Comment the SQL queries wherever they are hacky. Otherwise there is no way for me to know why the original author put certain conversions in the code.

4. Every ETL should check the source data before loading and check the loaded data afterwards. Put some test cases in there. For critical processes, a failure should always trigger emails and Slack messages to at least two places: the DE team and the related stakeholder team. God, there are SO MANY cases where we could have saved ourselves the trouble of picking through and cleaning the data if we had just rejected obviously wrong source data.

5. Think about potential errors. If you have to SUM a column of large integers, watch for overflows. If your column is nullable, watch for unexpected nulls. And so on.
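To make rules 4 and 5 concrete, here is a minimal sketch of the kind of before/after checks I mean. It is plain Python, not tied to any particular Airflow or Composer API; the function names, the column names, and the assumption that the target column is a signed 64-bit integer are all made up for illustration.

```python
# Hypothetical helpers for illustration only; rows are plain dicts.
INT64_MAX = 2**63 - 1  # assumed range of the target column type
INT64_MIN = -(2**63)


def check_source(rows, required, summed):
    """Pre-load check: return a list of problems (empty list means OK)."""
    problems = []
    if not rows:
        problems.append("source is empty")
    for col in required:
        # Count nulls in columns that are not supposed to have any.
        nulls = sum(1 for r in rows if r.get(col) is None)
        if nulls:
            problems.append(f"{col}: {nulls} unexpected null(s)")
    for col in summed:
        # Python ints do not overflow, so the exact total can be compared
        # against the range of the target type before the warehouse sums it.
        total = sum(r[col] for r in rows if r.get(col) is not None)
        if not INT64_MIN <= total <= INT64_MAX:
            problems.append(f"{col}: SUM would overflow a 64-bit integer")
    return problems


def check_loaded(source_rows, loaded_rows, summed):
    """Post-load check: row counts and column totals must match the source."""
    problems = []
    if len(source_rows) != len(loaded_rows):
        problems.append(
            f"row count mismatch: {len(source_rows)} source "
            f"vs {len(loaded_rows)} loaded"
        )
    for col in summed:
        src = sum(r[col] for r in source_rows if r.get(col) is not None)
        dst = sum(r[col] for r in loaded_rows if r.get(col) is not None)
        if src != dst:
            problems.append(f"{col}: sum mismatch ({src} vs {dst})")
    return problems


# Example: one null key and two amounts that together exceed 2**63 - 1.
rows = [
    {"order_id": 1, "amount": 2**62},
    {"order_id": None, "amount": 2**62},
]
source_problems = check_source(rows, required=["order_id"], summed=["amount"])
loaded_problems = check_loaded(rows, rows[:1], summed=["amount"])
```

In a real DAG, a non-empty problem list would fail the task, and the failure would then go out by email and Slack to both teams. How that alerting is wired up depends on your setup, so it is left out here.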