Hacker News
by owenthejumper 492 days ago
This hits home. I am helping someone analyze medical research data. When I helped before a few years ago we spent a few weeks trying to clean the data, figure out how to run the basic analysis (linear regression, etc), only to arrive at "some" results that were never repeatable because we learned as we built.

I am doing it again now. I used Claude to import the data from CSV into a database, then asked it to help me normalize it, which output a txt file with a lot of interesting facts about the data. Next, I asked it to write a "fix data" script that would fix all the issues I told it about.

Finally, I said "give me univariate analysis, output the results into CSV / PNG and then write a separate script to display everything in a jupyter notebook".
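For readers wondering what the univariate step might look like, here is a minimal sketch using only the Python standard library. It is not the script Claude produced (the thread does not show it). The column names and sample rows are hypothetical, and it covers only the CSV summary, not the PNG plots or the notebook.

```python
import csv
import io
import statistics

FIELDS = ["column", "n", "missing", "mean", "median", "stdev"]

def univariate_summary(rows, column):
    """Summarize one numeric column; blank or non-numeric cells count as missing."""
    values, missing = [], 0
    for row in rows:
        cell = (row.get(column) or "").strip()
        try:
            values.append(float(cell))
        except ValueError:
            missing += 1
    return {
        "column": column,
        "n": len(values),
        "missing": missing,
        "mean": statistics.mean(values) if values else None,
        "median": statistics.median(values) if values else None,
        "stdev": statistics.stdev(values) if len(values) > 1 else None,
    }

def write_summaries(summaries, out):
    """Write one summary row per column to a CSV file-like object."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(summaries)

# Hypothetical sample standing in for the real export.
SAMPLE = "age,weight_kg\n34,70.5\n51,\n47,82.0\nn/a,64.5\n"
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
summaries = [univariate_summary(rows, c) for c in ("age", "weight_kg")]

out = io.StringIO()
write_summaries(summaries, out)
print(out.getvalue())
```

Reporting the missing count next to each statistic makes it easier to notice when a cleaning step silently dropped values.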

Weeks of work into about 2 hours...

8 comments

we've built a business[0] around this workflow, but in cases where the source data isn't as simple as a CSV. Think Stripe, Hubspot, Salesforce, etc. where you'd normally need to write a ton of API calls or buy something like Fivetran. The flow for Definite is:

1. Add your sources (Postgres, S3, CRM, Quickbooks, Google Sheets, etc.)

2. We deploy standard, pre-baked data models (e.g. how do you calculate ARR using Stripe data)

3. AI answers questions using the standard models and starts updating the model with SQL for anything that's not already answered.

We spin up a datalake to store all the data (similar to this one[1]) for our customers, so it's very cost effective.

0 - https://www.definite.app/

1 - https://www.definite.app/blog/cloud-iceberg-duckdb-aws

>Weeks of work into about 2 hours...

Only if the output from Claude is correct. If not...

This. I get why people have started using LLMs for this and I think it's great in theory, but the black box nature and possibility of hallucination make it a non-starter for me. Having the LLM generate scripts which you can then validate for correctness seems more plausible.

I also worry that this approach will lead to a sort of further reification of data science. While things have already trended this way, data science is not about applying a few routine formulas to a data set. Done properly, it is far more exploratory and all about building an understanding of the unique properties and significance of a particular data set. I worry the use of these tools will greatly reduce the exploratory phase and lead to analyses that simply confirm biases or typical conclusions rather than yielding new insight.

The output is not black box. I always see myself as responsible for the output. The models give hints.
Definitely the right way to approach this. You already need to know what you're doing (for validation and error checking), but if you do, it can be faster. As long as P != NP the validation is faster than coming up with the solution. My only concern is how far from a "good" solution the quick LLM + check is compared with the expert solution. It may be worth more to use human expertise over 2 weeks than a validated LLM solution in 2 hours. (And I'd question whether work that traditionally takes 2 weeks can be validated well in 2 hours.)

There's going to be a lot of moving fast and breaking things coming. Hopefully less breaking than moving.

> Only if the output from Claude is correct. If not...

Had a task at work to clear unused metrics.

Exported a whole dashboard, thought about regexes to extract metrics out of XML (bad, I know), and asked ChatGPT to produce the one-liners to extract the data.

Got 22 used metrics.

Next day I just gave ChatGPT the whole file and asked it to spit out all the used metrics.

46 used metrics.

Asked Claude, Deepseek and Gemini the same question. Only Gemini messed it up by missing some, duplicating some.

Re-checked the one-liners ChatGPT produced. Turns out it/I messed up when I told it to generate a list of unique metrics from a file containing just the metric names, one per line. What I wanted was a script/one-liner that would print all the metric names just once (de-duplicate), and ChatGPT ad litteram produced a script that only prints metrics that show up exactly once in the whole file.
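The two readings of "unique" are easy to confuse, so a small sketch may help. The thread does not show the actual one-liners; as an assumption, the mix-up resembles the difference between `sort -u` (each distinct line once) and `sort | uniq -u` (only lines that are never repeated). The metric names below are made up.

```python
from collections import Counter

def dedupe(names):
    """Every distinct name once, in first-seen order (what was wanted)."""
    return list(dict.fromkeys(names))

def appearing_once(names):
    """Only names that occur exactly once in the input (what was produced)."""
    counts = Counter(names)
    return [name for name in names if counts[name] == 1]

# Hypothetical metric names, one per line in the original file.
names = ["cpu_usage", "mem_free", "cpu_usage", "disk_io", "mem_free", "net_rx"]

print(dedupe(names))          # ['cpu_usage', 'mem_free', 'disk_io', 'net_rx']
print(appearing_once(names))  # ['disk_io', 'net_rx']
```

Any metric referenced in more than one panel disappears under the second reading, which would explain a count that comes out too low.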

In the end, just asking LLMs to simply extract the names from the Grafana dashboard worked better, parsing out expressions, only producing unique metric names and all that, but there was no way to know for sure, just that 3/4 of the LLMs producing the same output meant it was most likely correct.

I fixed the programmatic approach and got the same result, but it was a very weird feeling asking the LLMs to just give me the result of what for me was a whole process of many steps.
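The cross-check described here can be made mechanical. The following is a rough sketch, assuming each model's answer has already been parsed into a list of names; the source labels and metric names are hypothetical. Agreement is only evidence, not proof, since several models can share the same mistake.

```python
from collections import Counter

def compare_outputs(outputs):
    """Find the set of names most sources agree on and report each source's deviations.

    outputs maps a source label to the list of names it returned.
    Ties between equally large groups are broken by first occurrence.
    """
    groups = {}
    for source, names in outputs.items():
        groups.setdefault(frozenset(names), []).append(source)
    majority_set, majority_sources = max(groups.items(), key=lambda kv: len(kv[1]))

    report = {}
    for source, names in outputs.items():
        report[source] = {
            "missing": sorted(majority_set - set(names)),
            "extra": sorted(set(names) - majority_set),
            "duplicates": sorted(n for n, c in Counter(names).items() if c > 1),
        }
    return sorted(majority_sources), report

# Hypothetical answers from four sources.
outputs = {
    "model_a": ["cpu_usage", "mem_free", "disk_io"],
    "model_b": ["disk_io", "mem_free", "cpu_usage"],
    "model_c": ["cpu_usage", "mem_free", "disk_io"],
    "model_d": ["cpu_usage", "cpu_usage", "mem_free"],
}
agreeing, report = compare_outputs(outputs)
print(agreeing)
print(report["model_d"])
```

Comparing as sets ignores ordering, while the separate duplicate count still catches a source that repeats names.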

Are you sure you didn't also have a bunch of typos in your prompts? ;)
Unlike humans, LLMs seem to deal surprisingly well with typos.

Freed from the "the other human must not be up to my exquisite eloquence" and given that it's a machine that I'm talking to (20 years of "the compiler is never wrong") -- I've learned more about my communication inadequacies through talking with LLMs in the past 2 years than 40 years of talking to humans.

But I am not giving Claude a csv and saying 'clean it up'. I am asking it to write me a python script to clean it up. That way I can validate the script myself.
Think about it logically: Are you really sure you can validate the script yourself? If it takes you weeks to do what Claude does in a few hours, that seems like misplaced confidence in your capabilities.
There is, in fact, a large body of work studying classes of problems which are hard to solve but easy to verify. So I'm not sure why this kind of usage is a surprise to so many people.
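Subset sum is a textbook instance of this asymmetry, sketched below as an illustration only. Checking a proposed answer takes one pass, while the brute-force search shown here may try every subset. The numbers are arbitrary, and nothing in this sketch says whether validating an LLM-written script falls on the easy side.

```python
from itertools import combinations

def verify(numbers, indices, target):
    """Check a proposed answer: distinct indices whose values sum to the target."""
    return len(set(indices)) == len(indices) and sum(numbers[i] for i in indices) == target

def solve(numbers, target):
    """Search for an answer by brute force; worst case tries 2**len(numbers) - 1 subsets."""
    for size in range(1, len(numbers) + 1):
        for indices in combinations(range(len(numbers)), size):
            if sum(numbers[i] for i in indices) == target:
                return list(indices)
    return None

numbers = [3, 34, 4, 12, 5, 2]
answer = solve(numbers, 9)
print(answer, verify(numbers, answer, 9))
```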
I'm not sure that source code verification is such a problem. It feels like it's definitely easier to write code to solve a problem than to verify some code written by someone else is correct and fault free.
All processes and by extension code tolerate some level of error, even our most reliable systems. Whether LLM produced output is within that tolerance is up to each practitioner to test and verify.

I think AI has revealed that there is a lot of low hanging fruit that is very tolerant of errors across many disciplines that isn’t met by our current supply of software engineers. In my own day to day that’s a lot of low impact bash scripts that automate personal things while at work it’s sales and lead gen where it’s not a big deal if a salesperson cold calls someone who couldn’t use our product (other than the temporary embarrassment it causes both parties).

It's a lot easier to check the code / check the output of the code / spot verify than it is to do the work itself... if I'd write my own code, I'd still have to verify (bc I trust my own coding ability even less than Claude lol)
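One way to spot-verify a cleaning script's output is to assert a few invariants between the raw and cleaned rows. This is a minimal sketch with hypothetical column names, sample rows, and plausibility ranges; real checks would depend on the dataset.

```python
def check_cleaning(raw_rows, clean_rows, key, numeric_ranges):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    raw_keys = [row[key] for row in raw_rows]
    clean_keys = [row[key] for row in clean_rows]

    if len(set(clean_keys)) != len(clean_keys):
        problems.append("duplicate keys in cleaned data")
    dropped = set(raw_keys) - set(clean_keys)
    if dropped:
        problems.append(f"rows dropped: {sorted(dropped)}")
    invented = set(clean_keys) - set(raw_keys)
    if invented:
        problems.append(f"rows invented: {sorted(invented)}")

    # Flag cleaned values that fall outside a plausible range.
    for row in clean_rows:
        for col, (lo, hi) in numeric_ranges.items():
            value = row.get(col)
            if value is not None and not (lo <= value <= hi):
                problems.append(f"{key}={row[key]}: {col}={value} outside [{lo}, {hi}]")
    return problems

# Hypothetical raw input and the output of a cleaning script.
raw = [{"id": "p1", "age": "34"}, {"id": "p2", "age": "340"}, {"id": "p3", "age": ""}]
clean = [{"id": "p1", "age": 34.0}, {"id": "p2", "age": 340.0}]

for problem in check_cleaning(raw, clean, key="id", numeric_ranges={"age": (0, 120)}):
    print(problem)
```

Checks like these do not prove the script is correct, but they catch silent row loss and implausible values cheaply.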
Are you aware of this tool? https://openrefine.org
I’ve come to this same conclusion. Been able to code up something that would’ve taken me a week to do back in the day with Claude in 2 hours. I’ve given canvas csvs and seen it run analysis on them in minutes that would’ve taken me a day to do when I used to run R scripts and throw them into slides. This is probably just the beginning too…
What happens when that 'weeks of work' is just shifted into the future, as you find out the LLM made things up and you have to figure out what went wrong?
Humans make mistakes too.

I find this "LLMs can be wrong" argument a bit tiresome, and also a bit lazy.

I feel like we have been here before. With wikipedia. With stack overflow. Or with the whole debate about c/assembler vs garbage collected languages.

> Humans make mistakes too.

Well, yes, but fortunately, we build computers to automate things using simple algorithms to remove the risk of such mistakes.

Except when we use LLMs, in which case we increase the risk of mistakes.

> I feel like we have been here before. With wikipedia. With stack overflow. Or with the whole debate about c/assembler vs garbage collected languages.

Well, Wikipedia is a great tool, but it is permanently weaponized.

C/Assembler vs. garbage-collected languages was about decreasing the risk (at the cost of increasing the resource requirement), so, unless I misunderstand what you write, it kinda feels like you're arguing against your side?

Funny you mention Wikipedia, since in most professional settings (particularly research roles) you can't just cite Wikipedia. Maybe in high school that's okay, but when there are actual stakes on the table, putting some effort into your research beyond reading the Wikipedia article is probably necessary.
For my ETL pipelines I have not had this issue.
“I am doing it again now” is the operative phrase here I think. I’ve found LLMs are quite good at helping me build things much better and faster in this case. Maybe not so much for stuff I haven’t done before, where I don’t really know what I’m trying to accomplish or what a good solution looks like.
Can I ask you to beta test my product? I'm building something like this and I want to focus on medical data (from omics to RCTs)
I really don't mean this in a rude way, but if it took you a few weeks to do that on your own, you are really bad at googling for tutorials and walkthroughs. You could have watched a one hour bootcamp video and learned how to do it yourself.

What you are saying Claude helped you do is like 15 lines of python. A few weeks? 120 hours of effort?

the task above is not 15 lines of python with a real world dataset.

the tutorials you reference? yes, 15 lines of python when you're starting with the titanic.csv. But a real world dataset normally takes hours or days of cleaning before it's ready to run any statistical analysis on.

Data cleaning is hard. That is not what OP said they had Claude do. They just said Claude normalized it. Normalizing data does not take days unless you are learning to do both statistics and programming for the first time ever.