by benrutter 492 days ago
I'm definitely biased because my day job is writing ETL pipelines and supporting software, and my current side project is a data contracts library for helping with the above[0]. Still, I'm not sure I see this happening.

80% of the focus of an ETL pipeline is on ensuring edge cases are handled appropriately (e.g. not producing models from potentially erroneous data, dead-letter queuing unknown fields, etc.).
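To make the dead-letter point concrete, here is a minimal sketch of routing records that don't match an expected schema into a dead-letter list instead of letting them through. The schema, field names and `route` helper are all hypothetical, made up for illustration; a real pipeline would use an actual queue and a proper validation library.

```python
from dataclasses import dataclass, field

# Hypothetical schema for a payment record: field name -> expected type.
EXPECTED_FIELDS = {"id": int, "amount": float, "currency": str}


@dataclass
class RoutedRecords:
    valid: list = field(default_factory=list)
    dead_letter: list = field(default_factory=list)  # (record, reason) pairs


def route(records):
    """Split records into valid rows and dead-lettered rows with a reason."""
    out = RoutedRecords()
    for record in records:
        unknown = sorted(set(record) - set(EXPECTED_FIELDS))
        missing = sorted(set(EXPECTED_FIELDS) - set(record))
        if unknown:
            out.dead_letter.append((record, f"unknown fields: {unknown}"))
        elif missing:
            out.dead_letter.append((record, f"missing fields: {missing}"))
        elif not all(isinstance(record[k], t) for k, t in EXPECTED_FIELDS.items()):
            out.dead_letter.append((record, "type mismatch"))
        else:
            out.valid.append(record)
    return out


records = [
    {"id": 1, "amount": 9.99, "currency": "GBP"},
    {"id": 2, "amount": "9.99", "currency": "GBP"},            # wrong type
    {"id": 3, "amount": 5.0, "currency": "USD", "memo": "x"},  # unknown field
]
routed = route(records)
```

Nothing is silently coerced or dropped here: every rejected record keeps a reason, so it can be audited later.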

I think an LLM would be great for "take this json and make it a pandas dataframe", but a lot less great for "interact with this billing API to produce auditable payment tables".

For areas that are reliability-focused, LLMs still need a lot more improvements to be useful.

[0] https://github.com/benrutter/wimsey

3 comments

> I think an LLM would be great for "take this json and make it a pandas dataframe", but a lot less great for "interact with this billing API to produce auditable payment tables".

Yeah, it's great....so long as you don't care that it randomly screws up the conversion 10% of the time.

My first thought, when I saw the post title, was that this is the 2025 equivalent of people using MapReduce for a 1MB dataset. LLMs certainly have good applications in data pipelines, but cleaning structured data isn't one of them.

Yes, LLMs are not always the best option, but they are an option. Sometimes the requirements of the project are such that they are also the best option.

There is one price matching example that uses a browser, which is impossible to do right now without a full-blown data science team: https://github.com/Pravko-Solutions/FlashLearn/tree/main/exa...

Inappropriate tools are always an option? I can cut a cake with a jackhammer, but....

Anyway, like I said, there are certainly good applications of LLMs, and this is probably one? I wouldn't describe "do market research on prices" as a traditional "data pipeline", but that's just me, I guess.

I think you'd tell the LLM to design the pipeline, not be the pipeline. That way you can see exactly what it's done and tweak as needed. Plus it should be way more cost-effective.

Hah. I remember being forced to use MapReduce for a tiny dataset, back in the early 2010s. Hadoop was all the rage.

"lemme just fire up a dbt workflow to analyse this CSV file"

You may have meant that sarcastically, but I just did that for two CSV files that needed a bunch of cleanups and joins before I could analyze them. With LLM help the whole adventure was easy.

What I really like to do for this is load it into SQLite; there are built-in commands for reading/writing CSV files. And the tables are queryable with SQL, which makes for a great jumping-off point for some basic cleaning, joining and analysis.

I'd argue this also makes the job easier with LLMs, since you can ask one to write a SQL query that you can validate and reason about, rather than relying on it to transform the data itself (which I've seen a lot under this post).
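A minimal sketch of that workflow, using only Python's standard `csv` and `sqlite3` modules rather than the SQLite command line. The CSV contents, table names and the `load_csv` helper are invented for illustration; the point is that the cleaning and joining logic lives in one query you can read and review, whoever (or whatever) wrote it.

```python
import csv
import io
import sqlite3

# Hypothetical CSV contents; in practice these would be files on disk.
ORDERS_CSV = "order_id,customer_id,amount\n1,10, 25.50\n2,11,\n3,10,4.50\n"
CUSTOMERS_CSV = "customer_id,name\n10,Ada\n11,Grace\n"


def load_csv(conn, table, text):
    """Create `table` from CSV text, storing every column as TEXT."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', data)


conn = sqlite3.connect(":memory:")
load_csv(conn, "orders", ORDERS_CSV)
load_csv(conn, "customers", CUSTOMERS_CSV)

# Cleaning (trim whitespace, skip blank amounts) and joining in one query.
QUERY = """
SELECT c.name, SUM(CAST(TRIM(o.amount) AS REAL)) AS total
FROM orders AS o
JOIN customers AS c ON c.customer_id = o.customer_id
WHERE TRIM(o.amount) <> ''
GROUP BY c.name
ORDER BY c.name
"""
totals = conn.execute(QUERY).fetchall()
```

The order with a blank amount is excluded explicitly by the `WHERE` clause, so the decision is visible in the query instead of hidden inside a model's output.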

Honestly, I spend ten times as much effort figuring out people's sloppy notebooks or pandas stuff as when they just use dbt and SQL. And 90% of the time SQL is all they needed.

For your wimsey library, using “pipe” to validate the contracts would seem to me to drastically slow down the Polars query because the UDF pushes the query out of Rust into Python. I think a cool direction would be to have a “compiler” which takes in a contract and spits out native queries for a variety of dataframe libraries (pandas/polars/pyspark). It becomes harder to define how to error with a test contract but that can be the secret sauce.

Actually you're almost 100% describing how Wimsey works! It's using native df code rather than a UDF of some kind. Under the hood it uses Narwhals, which converts Polars-style expressions into native pandas/polars/spark/dask code with minimal overhead.

If you're using a lazy dataframe (via polars, spark etc.), Wimsey will force collection, so that can have speed implications. The reason is that I haven't yet found a cross-language way of embedding assertions that fail later down the line.
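For readers wondering what "compile a contract into native queries" could look like, here is a rough sketch that targets SQL (via `sqlite3`) instead of a dataframe library. To be clear, this is not Wimsey's or Narwhals' API or implementation: the contract format, `VIOLATION_SQL` and `compile_contract` are all hypothetical, and SQLite is chosen only because it ships with Python.

```python
import sqlite3

# Hypothetical contract format (not Wimsey's): column -> list of (check, value).
CONTRACT = {
    "amount": [("min", 0), ("max", 1000)],
    "currency": [("not_null", None)],
}

# Each check maps to a SQL predicate that is true for a *violating* row.
VIOLATION_SQL = {
    "min": '"{col}" < {value}',
    "max": '"{col}" > {value}',
    "not_null": '"{col}" IS NULL',
}


def compile_contract(table, contract):
    """Return one SQL query that counts violations per check.

    Values are interpolated directly, so the contract is assumed to be trusted.
    """
    selects = []
    for col, checks in contract.items():
        for name, value in checks:
            predicate = VIOLATION_SQL[name].format(col=col, value=value)
            selects.append(
                f'SUM(CASE WHEN {predicate} THEN 1 ELSE 0 END) AS "{col}_{name}"'
            )
    return f'SELECT {", ".join(selects)} FROM "{table}"'


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (amount REAL, currency TEXT)")
conn.executemany(
    "INSERT INTO payments VALUES (?, ?)",
    [(10.0, "GBP"), (-5.0, "USD"), (500.0, None)],
)

query = compile_contract("payments", CONTRACT)
cursor = conn.execute(query)
names = [d[0] for d in cursor.description]
violations = dict(zip(names, cursor.fetchone()))
failed = {name: count for name, count in violations.items() if count}
```

All checks run in a single pass inside the engine, and the caller decides afterwards how to error on `failed`, which is the part the parent comment calls the hard bit.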

I believe that LLMs will become better and better in the near future, and that LLM-enriched pipelines will replace classic approaches and drastically simplify ETL flows.

Not that I don't love LLMs and playing with them and their potential, but if we don't get proper mechanisms that ensure quality and consistency, then they're not really a substitute for what we have.

It's very easy to produce something that seemingly works but you can't attest to its quality. The problem is producing something resilient, that is easy to adapt and describes the domain of what you want to do.

If all these things are so great, then why do I still need to do so many things to integrate a bigtech cloud agent with a popular tool? Why is it so costly or limited?

UX matters, validation matters, reliability matters, cost matters.

You can't simply wish for a problem not to happen. Someone owns the troubleshooting and the modification and they need to understand the system they're trying to modify.

Replacing scrapers with LLMs is an easy and obvious thing, especially when you don't care about quality to a high degree. Other systems, such as financial ones, don't have that luxury.

You may be right! I guess we'll find out soon.

One thing I'd be wary of is what "LLM-enriched pipelines" look like. If it's "write a sentence and get a pipeline" then I think that does massively simplify the amount of work, but there's another reality where people use LLMs to get more features out of existing data, rather than doing the same transformations we do now. Under that one, ETL pipelines would end up taking more time and being more complex.

But at what cost?

We're in an energy/environmental crisis, and we're replacing simple pipelines with (unreliable) gas factories?

Cost per token has cratered a thousand percent over the last two years, and that's not just VC money being lit on fire; efficiency gains are being made left and right.
How much do we need to progress before it becomes comparable in terms of energy to the (often already rather energy-inefficient) data pipelines we've been using so far?

Recall that while the cost per token may decrease, CoT multiplies the number of tokens by several orders of magnitude.

LLMs are not the most efficient way to solve the problem, but they can solve it.
They can do it, they're just slower, less reliable and orders of magnitude more energy-expensive.

But yes, they're potentially easier to set up.