HN Mirror


by llm_trw 497 days ago

>Digging into use cases you’d find that for a particular question you needed to not just get all the rows from a column, you needed to do some obscure JOIN ON operation. This fact was only known by 2 data scientists in charge of writing the report.

>I still work on AI powered products and I don’t see even a little line of sight on this problem. Everyone’s data is immensely messy and likely to remain so.

I've worked in the space as well, and completely unstructured data is better than whatever you call a database with a dozen ad hoc tables, each storing information somewhat differently from the others, built for reports written by a dozen different people over a decade.

I have a benchmark for an agentic system which measures how many joins between tables the system can do before it goes off the rails. But there is nothing off the shelf that does it and, for whatever reason, no one is talking about it in the open. I know there are companies working to solve it in the background, since I've worked with three so far.
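To make the idea concrete, here is a rough sketch of how such a score could be computed. It is not the actual benchmark: the function name, the pass-rate threshold, and the example outcomes are all made up for illustration, and it assumes you already have pass/fail results for test questions grouped by how many joins each one needs.

```python
from typing import Dict, List


def max_reliable_join_depth(
    outcomes: Dict[int, List[bool]],
    min_pass_rate: float = 0.8,
) -> int:
    """Return the largest join count n such that every depth 1..n
    meets min_pass_rate. Returns 0 if even single-join questions fail
    or no results are available."""
    depth = 0
    n = 1
    # Walk upward from one join; stop at the first depth that is
    # missing, empty, or below the required pass rate.
    while n in outcomes and outcomes[n]:
        pass_rate = sum(outcomes[n]) / len(outcomes[n])
        if pass_rate < min_pass_rate:
            break
        depth = n
        n += 1
    return depth


# Hypothetical results, NOT measured data:
# number of joins required -> pass/fail for each test question.
example_outcomes = {
    1: [True, True, True, True],
    2: [True, True, True, False],
    3: [True, False, False, False],
    4: [True, True, True, True],
}

print(max_reliable_join_depth(example_outcomes, min_pass_rate=0.75))  # prints 2
```

Stopping at the first failing depth is a deliberate choice in this sketch: a system that fails at three joins but happens to pass at four has still gone off the rails at three.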

Without documentation giving some grounding about what the table is doing, you're left with hoping the database is self documenting enough for the agent to figure out what the column names mean and if joining on them makes sense - good luck doing it on id1, id2, idCustomerLocal, id_customer_foreign though.

1 comments

Magmalgebra 497 days ago

Descriptions of tables are insufficient (we had them) - you also need descriptions of the systems writing to the tables.

My favorite example was a report that was only accurate if generated on a Tuesday or Thursday due to when the ETL pipeline ran. A small config change on the opposite side of a code base completely altered the semantics of the data!
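As a minimal sketch of what a machine-readable description of the writing system might look like, assume each table carries a note recording which weekdays its data can be trusted. The class, field names, table name, and pipeline name below are hypothetical, not from any real system.

```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet


@dataclass(frozen=True)
class TableWriterNote:
    """Hypothetical metadata about the system that writes a table."""
    table: str
    written_by: str
    # Weekdays on which a report over this table is accurate,
    # using date.weekday() numbering: Monday=0 ... Sunday=6.
    valid_weekdays: FrozenSet[int]


def report_is_trustworthy(note: TableWriterNote, run_date: date) -> bool:
    """True if a report generated on run_date falls on a valid weekday."""
    return run_date.weekday() in note.valid_weekdays


# Assumed example: data is only consistent on Tuesdays (1) and Thursdays (3).
note = TableWriterNote(
    table="sales_summary",      # hypothetical table name
    written_by="nightly_etl",   # hypothetical pipeline name
    valid_weekdays=frozenset({1, 3}),
)

print(report_is_trustworthy(note, date(2024, 1, 2)))  # Tuesday -> True
print(report_is_trustworthy(note, date(2024, 1, 3)))  # Wednesday -> False
```

The catch is that the note has to be updated whenever the pipeline schedule changes, which is exactly the kind of distant config change that broke the report.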


llm_trw 497 days ago

If you're interested, please drop me an email. I've only worked deeply with pipelines extracting data from documents and I'd be interested in hearing what the challenges with databases are.
