|
|
|
Ask HN: How to build data pipelines to continuously ingest customer files?
|
|
6 points
by SpeakerFrThDead
1075 days ago
|
|
I'm working on onboarding a customer who wants to deliver their data as a weekly dump of CSVs. We’re supposed to ingest that data and get it into our system to provide analytics for their team. I was initially thinking of just building a one-off ETL script, but I was warned the files may randomly break spec (new/renamed fields, etc) due to errors in the process that generates them.
Is there a standard way to handle this type of thing? |
|
> but I was warned the files may randomly break spec (new/renamed fields, etc) due to errors in the process that generates them. Is there a standard way to handle this type of thing?
This is a political decision. I think you have two options:
- Set a clearly defined schema, and validate against that. Whenever the input doesn't match the schema, put it in a separate 'error folder' and notify someone to edit the data so it can be retried.
- Accept schema changes. Either by having some sort of auto migrations, or by simply not defining a schema at all (like a document store).
Imho, the last option is not a good one if you want to build analytics on top of it, because you need to have some cleaned up and structured version of the data that you can rely on.