by plaidfuji 492 days ago
This is where things are headed. All that ridiculous busywork that goes into ETL and modeling pipelines… it’s going to turn into “here’s a pile of data that’s useful for answering a question, here’s a prompt that describes how to structure it and what question I want answered, and here’s my oauth token to get it done.” So much data cleaning and prep code will be scrapped over the next few years…
8 comments

I'm definitely biased because my day job is writing ETL pipelines and supporting software, and my current side project is a data contracts library to help with the above [0]. Still, I'm not sure I see this happening.

80% of the focus of an ETL pipeline is on ensuring edge cases are handled appropriately (i.e. not producing models from potentially erroneous data, dead-letter queueing unknown fields, etc.).
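As a minimal sketch of what dead-letter queueing unknown fields can look like: the schema (`EXPECTED_FIELDS`) and the function name `route_records` below are made up for illustration and are not taken from any particular library.

```python
from typing import Any

# Hypothetical schema for illustration: field name -> expected Python type.
EXPECTED_FIELDS: dict[str, type] = {"id": int, "amount": float, "currency": str}


def route_records(records: list[dict[str, Any]]):
    """Split records into clean rows and a dead letter queue.

    A record is dead-lettered, with the reasons attached, if it has unknown
    fields, is missing an expected field, or holds a value of the wrong type.
    """
    clean, dead_letters = [], []
    for record in records:
        unknown = sorted(set(record) - set(EXPECTED_FIELDS))
        missing = sorted(set(EXPECTED_FIELDS) - set(record))
        wrong_type = sorted(
            name
            for name, expected in EXPECTED_FIELDS.items()
            if name in record and not isinstance(record[name], expected)
        )
        if unknown or missing or wrong_type:
            # Keep the original record so it can be inspected and replayed later.
            dead_letters.append(
                {
                    "record": record,
                    "unknown": unknown,
                    "missing": missing,
                    "wrong_type": wrong_type,
                }
            )
        else:
            clean.append(record)
    return clean, dead_letters
```

The point is that nothing is silently dropped or silently coerced: anything unexpected is set aside with a reason, and only records that match the schema flow downstream.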

I think an LLM would be great for "take this json and make it a pandas dataframe", but a lot less great for "interact with this billing API to produce auditable payment tables".

For areas that are reliability-focused, LLMs still need a lot more improvements to be useful.

[0] https://github.com/benrutter/wimsey

> I think an LLM would be great for "take this json and make it a pandas dataframe", but a lot less great for "interact with this billing API to produce auditable payment tables".

Yeah, it's great....so long as you don't care that it randomly screws up the conversion 10% of the time.

My first thought, when I saw the post title, was that this is the 2025 equivalent to people using MapReduce for a 1MB dataset. LLMs certainly have good applications in data pipelines, but cleaning structured data isn't it.

Yes, LLMs are not always the best option, but they are an option. Sometimes the requirements of a project are such that they are also the best option.

There is one example that uses a browser for price matching, which is impossible to do without a full-blown data science team right now: https://github.com/Pravko-Solutions/FlashLearn/tree/main/exa...

Inappropriate tools are always an option? I can cut a cake with a jackhammer, but....

Anyway, like I said, there are certainly good applications of LLMs, and this is probably one? I wouldn't describe "do market research on prices" as a traditional "data pipeline", but that's just me, I guess.

I think you'd tell the LLM to design the pipeline, not be the pipeline. That way you can see exactly what it's done and tweak as needed. Plus, it should be way more cost-effective.
Hah. I remember being forced to use MapReduce for a tiny dataset, back in the early 2010s. Hadoop was all the rage.
"lemme just fire up a dbt workflow to analyse this CSV file"
You may have meant that sarcastically, but I just did that for 2 CSV files that needed a bunch of cleanups and joins before I could analyze them. With LLM help the whole adventure was easy.
What I really like to do for this is load it into SQLite; there are built-in commands for reading/writing CSV files. The data is then queryable with SQL, which makes for a great jumping-off point for some basic cleaning, joining and analysis.

I'd argue this also makes the job easier with LLMs, since you can ask one to write a SQL query that you can validate and reason about, rather than relying on it to transform the data itself (which I've seen a lot under this post).
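A minimal sketch of that workflow, using Python's standard `sqlite3` and `csv` modules instead of the SQLite command line. The table names, columns and sample rows are invented for illustration.

```python
import csv
import io
import sqlite3


def load_csv(conn, table, csv_text):
    """Load CSV text into a new SQLite table, storing every column as TEXT."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', data)


# Made-up sample data standing in for two CSV files.
ORDERS_CSV = "order_id,customer_id,amount\n1,10, 19.99\n2,11,5.00\n3,10,\n"
CUSTOMERS_CSV = "customer_id,name\n10,Ada\n11,Grace\n"

conn = sqlite3.connect(":memory:")
load_csv(conn, "orders", ORDERS_CSV)
load_csv(conn, "customers", CUSTOMERS_CSV)

# The cleaning (trim whitespace, skip empty amounts) and the join live in one
# SQL query that a person can read and review before trusting the result.
QUERY = """
SELECT c.name, ROUND(SUM(CAST(TRIM(o.amount) AS REAL)), 2) AS total
FROM orders AS o
JOIN customers AS c ON c.customer_id = o.customer_id
WHERE TRIM(o.amount) <> ''
GROUP BY c.name
ORDER BY c.name
"""
totals = conn.execute(QUERY).fetchall()
```

Whether the query was written by hand or by an LLM, it is the query that gets reviewed, and the database does the actual transformation deterministically.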

Honestly, I spend ten times as much effort figuring out people's sloppy notebooks or pandas stuff as I do when they just use dbt and SQL. And 90% of the time SQL is all they needed.
For your wimsey library, using “pipe” to validate the contracts would seem to me to drastically slow down the Polars query because the UDF pushes the query out of Rust into Python. I think a cool direction would be to have a “compiler” which takes in a contract and spits out native queries for a variety of dataframe libraries (pandas/polars/pyspark). It becomes harder to define how to error with a test contract but that can be the secret sauce.
Actually you're almost 100% describing how Wimsey works! It's using native df code rather than a UDF of some kind. Under the hood it uses Narwhals, which converts polars-style expressions into native pandas/polars/spark/dask code with super minimal overhead.

If you're using a lazy dataframe (via polars, spark etc.), Wimsey will force collection, so that can have speed implications. The reason is that I can't yet find a cross-language way of embedding assertions that fail later down the line.

I believe that LLMs will become better and better in the near future, and that LLM-enriched pipelines will replace classic approaches and drastically simplify ETL flows.
Not that I don't love LLMs and play with them and their potential, but if we don't get proper mechanisms that ensure quality and consistency then it's not really a substitute for what we have.

It's very easy to produce something that seemingly works but you can't attest to its quality. The problem is producing something resilient, that is easy to adapt and describes the domain of what you want to do.

If all these things are so great, then why do I still need to do so many things to integrate a bigtech cloud agent with a popular tool? Why is it so costly or limited?

UX matters, validation matters, reliability matters, cost matters.

You can't simply wish for a problem not to happen. Someone owns the troubleshooting and the modification and they need to understand the system they're trying to modify.

Replacing scrapers with LLMs is an easy and obvious thing, especially when you don't care about quality to a high degree. Other systems such as financial ones don't have that luxury.

You may be right! I guess we'll find out soon.

One thing I'd be wary of is what "LLM-enriched pipelines" look like. If it's "write a sentence and get a pipeline" then I think that does massively simplify the amount of work, but there's another reality where people use LLMs to get more features out of existing data, rather than doing the same transformations we do now. Under that one, ETL pipelines would end up taking more time, and being more complex.

But at what cost?

We're in an energy/environmental crisis, and we're replacing simple pipelines with (unreliable) gas factories?

Cost per token has cratered a thousand percent over the last two years, and that's not just lighting VC on fire, efficiency gains are made left and right.
How much do we need to progress before it becomes comparable in terms of energy to the (often already rather energy-inefficient) data pipelines we've been using so far?

Recall that while the cost per token may decrease, CoT multiplies the number of tokens by several orders of magnitude.

LLMs are not the most efficient way to solve the problem, but they can solve it.
They can do it, they're just slower, less reliable and orders of magnitude more energy-expensive.

But yes, they're potentially easier to set up.

This is a head-scratcher of a take. Have you actually done any in-depth work on data pipelines and analytics tooling? If so, what precisely do you see LLMs making easier?

I tried using enterprise ChatGPT to write a query to load some JSON data into a data warehouse. I was impressed with how good a job it did, but it still required several rounds of refinement and hand-holding, and the end result was almost, but not quite, correct. So I'm not coming at this from the perspective of hating LLMs a priori, but I am unimpressed with the hype and over-selling of their capabilities. In the end, it was no faster than writing the query myself, but it wasn't slower either, so I can see it being somewhat helpful in limited conditions.

Unless the technology makes another quantum leap improvement at the same time the price drops like a stone, I don't see LLMs coming anywhere close to your claim.

That said, I expect to see a huge amount of snake oil and enterprise dollars wastefully burned on executive pipe dreams of "here's a pile of data now magic me a better business!" in the next few years of LLM over-hyped nonsense. There's always a quick buck to make in duping clueless execs drooling over replacing pesky, annoying, "over-paid" tech people.

Let me give you a complementary perspective. Same problems all of you have, but I work in a small lab team of PhD biologists who generate huge omics data sets and even larger lightsheet microscopy and MRI datasets but don’t know how to do a VLOOKUP in Excel. And who do not know the exotic acronyms: LIMS, QA, QC, or SQL. Yes, really.

What do we typically do in academic biomedical research in this situation?

The lead PI looks around the lab and finds a grad student or postdoc who knows how to turn on a computer and if very lucky also has had 6 months of experience noodling around with R or Python. This grad or postdoc is then charged with running some statistical analyses without any training whatsoever in data science. What is an outlier anyway, what do you mean by “normalize”, what is metadata exactly?

You get my drift: It is newbies in data science and programming (often 40- and 50-year-olds) leading novices (20- and 30-year-olds) to the slaughter. Might contribute to some lack of replicability ;-)

And it has been this way in the majority of academic labs since I started using CP/M on an Apple II in 1980 at UC Davis in an electrophysiology lab in Psychology, to the first Macs I set up at Yale in a developmental neurobiology lab in 1984, and up to the point at which I set up my own lab in neurogenetics at the University of Tennessee with a pair of Mac IIs in 1989 and $150,000 in set-up funds, just enough for me to hire one very inexperienced technician to help me do everything.

So in this context I hope all of you can appreciate that ANY help in bringing some real data science into mom-and-pop laboratories would be a huge huge boon.

And please god, let it be FOSS.

I feel you, and LLMs are no doubt a boon in tooling to help in this kind of scenario. I'm not poo-pooing LLMs in general; they are very cool! I wish they were allowed to just be very cool while we incorporate them into our tooling and workflows, rather than over-hyped.
You have more faith in LLMs than I do. The reality is it will probably get you 70 to 80% there, then you'll spend a ton of time debugging / fixing your pipelines, only to realize it would've been simpler, faster, and more reliable to not involve an LLM in the first place.
I believe that we'll learn how to incorporate LLMs to improve parts of data pipelines, particularly those that involve extracting unstructured or semistructured data into structured data, especially if it can provide a reliability score or confidence level with the extract. I'm much more skeptical of claims beyond that.
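A minimal sketch of how such a confidence level could be used downstream, assuming the extractor returns a score between 0 and 1. The `Extraction` shape, the `triage` function and the 0.9 threshold are hypothetical; no particular LLM API is implied.

```python
from dataclasses import dataclass


@dataclass
class Extraction:
    """One structured value pulled from unstructured text (hypothetical shape)."""

    field: str
    value: str
    confidence: float  # assumed to be in [0, 1], supplied by the extractor


def triage(extractions, threshold=0.9):
    """Accept extractions at or above the threshold; queue the rest for review."""
    accepted = [e for e in extractions if e.confidence >= threshold]
    needs_review = [e for e in extractions if e.confidence < threshold]
    return accepted, needs_review
```

This only helps if the score is actually calibrated; an extractor that is confidently wrong passes straight through a gate like this.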

I also think there are unanswered questions about reliability, cost (dollar and energy), and AI business models; I don't think OpenAI can burn $2+ to make a dollar forever.

Unless you can provide some "citation", I don't think you are right. I do this every day now and it gets me 99% there with very little debugging.
As always, "it depends." How simple are your pipelines? Single CSV? Sensible column names that are totally unambiguous? Consistent, clean data? Then LLMs are probably fine...
This is completely wrong. If anything, an increase in the usage of LLMs to generate small pipelines will lead to increased demand for professional pipelines to be built, because if any small thing breaks, the dashboards/features break, which is immediately noticeable. I think you'll see a big increase in the number of models a data scientist can create, but making those Python notebooks production-ready can't be done by an LLM. That's to say, as analysts create more potential use cases, there will be more demand to get those implemented.

There's so much that goes into ensuring the reliability, scalability and monitoring of production ready data pipelines. Not to mention the integration work for each use case. An LLM will give you short term wins at the cost of long term reliability - which is exactly why we already have DE teams to support DA and DS roles.

> This is completely wrong. If anything, an increase in the usage of LLMs to generate small pipelines will lead to increased demand for professional pipelines to be built, because if any small thing breaks, the dashboards/features break, which is immediately noticeable. I think you'll see a big increase in the number of models a data scientist can create, but making those Python notebooks production-ready can't be done by an LLM. That's to say, as analysts create more potential use cases, there will be more demand to get those implemented.

I agree. There is a lot of data people want that isn't made because of labor costs. Not just in quantity, but difficulty. If you can only afford to hire one analyst, and the analyst's time is only spent on cleaning data and generating basic sums, then that's all you'll get. But if the analyst can save a lot of time with LLMs, they'll have time to handle more complicated statistics using those counts like forecasts or other models.

> If you can only afford to hire one analyst, and the analyst's time is only spent on cleaning data and generating basic sums, then that's all you'll get. But if the analyst can save a lot of time with LLMs, they'll have time to handle more complicated statistics using those counts like forecasts or other models.

That applies to so many other jobs.

My productivity as a solo developer, building a rather large and complex system, skyrocketed when LLMs became actually useful (around the GPT-4 era).

Where I might have spent hours dealing with a bug, it now takes maybe 10 minutes, because my brain was overlooking some obvious issue that an LLM instantly spotted (or it gave suggestions that focused me on the issue).

Implementing features that might have taken days is reduced to a few hours.

The time taken to learn things is massively reduced because you can ask for specific examples. A lot of open source projects are poorly documented, missing examples, or just badly structured. Just ask the LLM and it points you in the right direction.

Now, ... this is all from the perspective of a dev with 25+ years of experience. What I fear more is people who are starting out, writing code but not understanding why or how things work. I remember people, before LLMs, coming in for senior jobs who did not even have a basic understanding of SQL, because they used ORMs non-stop. But they forgot that some (or a lot) of this knowledge was not transferable to different companies that used SQL or other ORMs that may work differently.

I suspect that we are going to see a generation of employees who are so used to LLMs doing the work that they do not understand how or why specific functions or data structures are needed. And then they get stuck in hours of LLM loop questioning because they cannot point the LLM to the actual issue!

At times I think, I wish this had been available 20 years ago. But then I question that statement very fast. Would I be the same dev today if I had relied non-stop on LLMs and not gritted my teeth on issues to develop this specific skillset?

I see more productivity from senior devs etc., more code turnout from juniors (or code monkeys), but a gap where the skills are an issue. And let's not forget the potential issue of LLM poisoning with years of data that feeds back on itself.

I see it as a gray area - long term there will be a need for both, and you will just have one more tool to choose from when presented with time-budget-quality constraints.
Yeah I can also see it very much depending on the demands - I'm definitely not saying every pipeline has to be the most reliable, scalable piece of software ever written.

If a small script works for you and your use case / constraints there's nothing I can say against it, but when you do grow past a certain point you'll need pipelines built in a proper way. This is where I see the increased demand since the scrappy pipelines are already proving their value.

Exactly, scale after you need to.
This would require massively more compute than regular pipelines...
(1) that delta will decrease quickly, and (2) corporations will gladly pay for compute over headcount to maintain fragile data pipelines
> (1) that delta will decrease quickly

Is your data pipeline O(n^3) in the number of tokens? If not, then no, it won't.

The price will go down, but LLMs reaching 100% accuracy and reliability is another story. We are nowhere close right now.
If your problem is compute, you are already optimizing. This is here for all the steps before you start thinking latency-compute. Not all use cases are made equal.
no, not so simple.. the simplicity of this idea is like a gravitational pull for your human mental model mind. Meanwhile, LLMs are like a non-reproducible cotton-candy machine. Quality will be an elusive light at the end of the tunnel, not a result, for non-trivial systems IMHO. Simple systems? sure, but economics will assign low-skill humans to the task, and other problems emerge.

What is the intoxication that assumes the engineering disciplines are now suddenly auto-automatable?

Not data pipelines, not yet at least, since those usually require a high degree of accuracy (depending on the company, of course). Where I see it (already) moving in is data exploration, which is effectively the data pipeline before data pipelines are developed.
Good point! LLMs are best when you are starting from point 0.
Exactly!