by timr 493 days ago
> I think an LLM would be great for "take this json and make it a pandas dataframe", but a lot less great for interact with this billing API to produce auditable payment tables.

Yeah, it's great... so long as you don't care that it randomly screws up the conversion 10% of the time.

My first thought, when I saw the post title, was that this is the 2025 equivalent of people using MapReduce for a 1MB dataset. LLMs certainly have good applications in data pipelines, but cleaning structured data isn't one of them.

3 comments

Yes, LLMs are not always the best option, but they are an option. Sometimes the requirements of the project are such that they are also the best option.

There is one example that uses a browser for price matching, which is impossible to do right now without a full-blown data science team: https://github.com/Pravko-Solutions/FlashLearn/tree/main/exa...

Inappropriate tools are always an option? I can cut a cake with a jackhammer, but...

Anyway, like I said, there are certainly good applications of LLMs, and this is probably one? I wouldn't describe "do market research on prices" as a traditional "data pipeline", but that's just me, I guess.

I think you'd tell the LLM to design the pipeline, not be the pipeline. That way you can see exactly what it's done and tweak as needed. Plus, it should be way more cost-effective.
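
A minimal sketch of that split, assuming a made-up nested JSON payload: the function below is the kind of small, deterministic converter an LLM could draft once, which a person then reads, tweaks, and runs on every record. The field names (`id`, `customer`, `amount`) and the `flatten` helper are hypothetical, not taken from any project in this thread.

```python
import json


def flatten(record, prefix=""):
    """Flatten nested dicts into one flat dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested objects, extending the key prefix.
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


# Hypothetical payload; the same code behaves identically on every record.
payload = '[{"id": 1, "customer": {"name": "Ada"}, "amount": 9.5}]'
rows = [flatten(r) for r in json.loads(payload)]
print(rows)
```

Because the conversion is ordinary code, a bad result can be traced to a specific line and fixed once, and no model call is paid for per record.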

Hah. I remember being forced to use MapReduce for a tiny dataset, back in the early 2010s. Hadoop was all the rage.

"lemme just fire up a dbt workflow to analyse this CSV file"

You may have meant that sarcastically, but I just did that for 2 CSV files that needed a bunch of cleanups and joins before I could analyze them. With LLM help the whole adventure was easy.

What I really like to do for this is load it into SQLite; there are built-in commands for reading/writing CSV files. And the tables are queryable with SQL, which makes for a great jumping-off point for some basic cleaning, joining, and analysis.

I'd argue this also makes the job easier with LLMs, since you can ask for a SQL query that you can validate and reason about, rather than relying on the LLM to transform the data itself (which I've seen a lot under this post).
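
A rough sketch of that workflow, using Python's standard `sqlite3` and `csv` modules rather than the SQLite shell. The CSV contents, table names, and the `load_csv` helper are all hypothetical; the query stands in for something an LLM might draft and a person would check before running.

```python
import csv
import io
import sqlite3

# Hypothetical CSV contents standing in for two files on disk.
ORDERS_CSV = "order_id,customer_id,amount\n1,10,9.50\n2,11,20.00\n3,10,5.25\n"
CUSTOMERS_CSV = "customer_id,name\n10, Ada \n11,Grace\n"


def load_csv(conn, table, text):
    """Create `table` from CSV text, using the header row as column names."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    columns = ", ".join(f'"{c}"' for c in header)
    marks = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    # Remaining rows are inserted as text; casts happen in the query.
    conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', reader)


conn = sqlite3.connect(":memory:")
load_csv(conn, "orders", ORDERS_CSV)
load_csv(conn, "customers", CUSTOMERS_CSV)

# Cleaning (TRIM, CAST), joining, and aggregating in one readable query.
QUERY = """
SELECT TRIM(c.name) AS name, SUM(CAST(o.amount AS REAL)) AS total
FROM orders AS o
JOIN customers AS c ON c.customer_id = o.customer_id
GROUP BY c.customer_id
ORDER BY name
"""
totals = conn.execute(QUERY).fetchall()
print(totals)
```

The point of the design is that the only LLM-written artifact is the query text, which is short enough to review line by line, while the data itself never passes through the model.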

Honestly, I spend ten times as much effort figuring out people's sloppy notebooks or pandas stuff as I do when they just use dbt and SQL. And 90% of the time SQL is all they needed.