by benrutter
492 days ago
I'm definitely biased because my day job is writing ETL pipelines and supporting software, and my current side project is a data contracts library to help with the above[0]. Still, I'm not sure I see this happening. 80% of the focus of an ETL pipeline is on ensuring edge cases are handled appropriately (e.g. not producing models from potentially erroneous data, dead letter queueing unknown fields, etc.). I think an LLM would be great for "take this json and make it a pandas dataframe", but a lot less great for "interact with this billing API to produce auditable payment tables". For areas that are reliability focused, LLMs still need a lot more improvement to be useful.

[0] https://github.com/benrutter/wimsey
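To make the edge-case point concrete, here is a minimal sketch of what "dead letter queueing unknown fields" can look like in plain Python. Everything in it is an assumption for illustration: the `EXPECTED_FIELDS` contract, the field names, and the `route_records` helper are hypothetical and are not taken from wimsey or any billing API.

```python
from typing import Any

# Hypothetical contract for illustration: expected field name -> expected type.
EXPECTED_FIELDS: dict[str, type] = {"invoice_id": str, "amount_cents": int}


def route_records(
    records: list[dict[str, Any]],
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Split records into (clean, dead_letter) instead of guessing at bad rows."""
    clean: list[dict[str, Any]] = []
    dead_letter: list[dict[str, Any]] = []
    for record in records:
        # Fields the contract does not know about.
        unknown = sorted(set(record) - set(EXPECTED_FIELDS))
        # Fields the contract requires but the record lacks.
        missing = sorted(set(EXPECTED_FIELDS) - set(record))
        # Fields that are present but hold the wrong type.
        wrong_type = sorted(
            name
            for name, expected in EXPECTED_FIELDS.items()
            if name in record and not isinstance(record[name], expected)
        )
        if unknown or missing or wrong_type:
            # Keep the original record plus the reasons, so it can be audited later.
            dead_letter.append(
                {
                    "record": record,
                    "unknown": unknown,
                    "missing": missing,
                    "wrong_type": wrong_type,
                }
            )
        else:
            clean.append(record)
    return clean, dead_letter


records = [
    {"invoice_id": "A1", "amount_cents": 1200},
    {"invoice_id": "A2", "amount_cents": "12.00"},  # amount arrived as a string
    {"invoice_id": "A3", "amount_cents": 500, "discount": 0.1},  # unknown field
]
clean, dead_letter = route_records(records)
```

The point of the design is that nothing gets silently coerced: a record that doesn't match the contract is set aside along with the reason, which is the deterministic behaviour an auditable pipeline needs.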
|
Yeah, it's great... so long as you don't care that it randomly screws up the conversion 10% of the time.
My first thought, when I saw the post title, was that this is the 2025 equivalent of people using MapReduce for a 1MB dataset. LLMs certainly have good applications in data pipelines, but cleaning structured data isn't one of them.