Hacker News
by rglullis 291 days ago
Wrote this about one month ago here at https://news.ycombinator.com/item?id=44839132

I'm completely out of time and energy for any side project at the moment, but if someone wants to steal my idea: please take an LLM and fine-tune it so that it can take any question and turn it into a SPARQL query for Wikidata. Also, make a web crawler that reads a page and turns it into a set of RDF triples or QuickStatements for any new facts it presents. This would effectively be the "ultimate information organizer" and could potentially turn Wikidata into most people's entry page to the internet.
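To make the QuickStatements half of the idea a bit more concrete, here is a minimal sketch of the last step only: serializing already-extracted facts as tab-separated QuickStatements V1 lines. The function name, the tuple layout, and the sample facts are my own assumptions for illustration, and it does not handle escaping, dates, quantities, or qualifiers.

```python
import re

# Matches Wikidata entity IDs such as Q5 or P31.
ENTITY_ID = re.compile(r"^[QP]\d+$")


def to_quickstatements(facts):
    """Serialize (subject, property, value) tuples as QuickStatements V1 lines.

    Hypothetical helper: entity IDs are emitted as-is, and any other value is
    treated as a plain string and wrapped in double quotes.
    """
    lines = []
    for subject, prop, value in facts:
        if not ENTITY_ID.match(value):
            value = f'"{value}"'
        lines.append("\t".join((subject, prop, value)))
    return "\n".join(lines)


# Illustrative facts only, using the Wikidata sandbox item Q4115189.
sample = [
    ("Q4115189", "P31", "Q5"),
    ("Q4115189", "P373", "Sandbox"),
]
batch = to_quickstatements(sample)
print(batch)
```

The hard part (deciding which facts on a page are new and mapping them to the right items and properties) is exactly what the crawler and model would have to do; this only shows the output format.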

2 comments

I asked "Which country has the most subway stations?" and got the query

  SELECT ?country (COUNT(*) AS ?stationCount) WHERE {
    ?station wdt:P31 wd:Q928830.
    ?station wdt:P17 ?country.
  }
  GROUP BY ?country
  ORDER BY DESC(?stationCount)
  LIMIT 1
https://query.wikidata.org/#SELECT%20%3Fcountry%20%28COUNT%2...

which is not unreasonable as a quick first attempt, but doesn't account for the fact that many things on Wikidata aren't tagged directly with a country (P17) and instead you first need to walk up a chain of "located in the administrative territorial entity" (P131) to find it, i.e. I would write

  SELECT ?country (COUNT(DISTINCT ?station) AS ?stationCount) WHERE {
    ?station wdt:P31 wd:Q928830.
    ?station wdt:P131*/wdt:P17 ?country.
  }
  GROUP BY ?country
  ORDER BY DESC(?stationCount)
  LIMIT 1
https://query.wikidata.org/#SELECT%20%3Fcountry%20%28COUNT%2...

In this case it doesn't change the answer (it only finds 3 more subway stations in China), but sometimes it does.
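For anyone unfamiliar with property paths, here is a toy Python sketch of what the two queries do differently. The entity names and the tiny graph are made up for illustration (they are not Wikidata data); the point is only that `wdt:P131*/wdt:P17` follows zero or more "located in" hops before looking for a country.

```python
from collections import Counter

# Made-up toy graph: P131 = "located in", P17 = "country".
P131 = {"station_a": ["city_x"], "city_x": ["province_y"]}
P17 = {
    "station_b": ["country_1"],
    "province_y": ["country_1"],
    "station_c": ["country_2"],
}
STATIONS = ["station_a", "station_b", "station_c"]


def reachable_via_p131(entity):
    """Entities reachable by zero or more P131 hops, including the start."""
    seen, stack = set(), [entity]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(P131.get(node, []))
    return seen


def countries_direct(station):
    """First query: only a P17 statement on the station itself counts."""
    return set(P17.get(station, []))


def countries_via_path(station):
    """Second query: P17 on the station or on anything up the P131 chain."""
    return {c for node in reachable_via_p131(station) for c in P17.get(node, [])}


def count_by_country(lookup):
    """Count distinct stations per country, like COUNT(DISTINCT ?station)."""
    counts = Counter()
    for station in STATIONS:
        for country in lookup(station):
            counts[country] += 1
    return counts


direct = count_by_country(countries_direct)
via_path = count_by_country(countries_via_path)
print(direct, via_path)
```

Here `station_a` has no country of its own, so the direct lookup misses it, while the path version finds `country_1` through `province_y`.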

Even without tuning, Claude is pretty solid at this: just give it the SPARQL endpoint as a tool call. Claude can generate this integration too.
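As a rough sketch of what such a tool could look like on the Python side, assuming the public Wikidata endpoint at https://query.wikidata.org/sparql and only the standard library: the function names and the User-Agent string are placeholders of mine, and wiring this into a particular model's tool-calling API is left out.

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"
# Placeholder: replace with a User-Agent that identifies your own tool.
USER_AGENT = "sparql-tool-sketch/0.1 (contact: you@example.org)"


def build_sparql_request(query, endpoint=WIKIDATA_ENDPOINT):
    """Build (but do not send) a GET request asking for JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return urllib.request.Request(
        f"{endpoint}?{params}",
        headers={
            "User-Agent": USER_AGENT,
            "Accept": "application/sparql-results+json",
        },
    )


def run_sparql(query, timeout=60):
    """Send the query and flatten each result row to {variable: value}."""
    request = build_sparql_request(query)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return [
        {name: cell["value"] for name, cell in row.items()}
        for row in payload["results"]["bindings"]
    ]


# Building the request does not touch the network.
request = build_sparql_request("SELECT * WHERE { ?s ?p ?o } LIMIT 1")
print(request.full_url)
```

The model would then be given `run_sparql` as a tool that takes a query string and returns rows, so it can see errors or empty results and revise the query.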

But the idea of tuning the model for this task is to get a model that is more efficient, cheaper to operate, and doesn't require $BILLIONS of infrastructure going into the hands of NVDA and AMZN.

I've built an MCP server for SPARQL and RDF. I used Claude on an iPhone to turn pictures of information signs at archaeological sites into a transcription, then an ontology, then RDF, then an ER model and SQL statements, and then used the MCP tool and Claude Desktop to save the data into Parquet files on blob storage and the ontology graph into a graph database. Then I used it to query the data from Parquet (using DuckDB), where Sonnet 4 used the RDF graph to write better SQL statements. Works quite well. I'm now using Sonnet 4 to find the optimal system prompt for Qwen Coder to also handle RDF and SPARQL: I've given Sonnet 4 access to Qwen Coder through an MCP tool, so it can trial-and-error different system prompt strategies. Results are promising, but can't compete with the quality of Sonnet 4.

Graph database vendors are now trying to convince you that AI will be better with a graph database, but what i've seen so far indicates that the LLM just needs the RDF, not an actual database with data stored in triplets. Maybe that's because these were small tests; if you need to store a large number of ID mappings, it may be different.

>what i've seen so far indicates that the LLM just needs the RDF, not an actual database with data stored in triplets

While I guarantee you know much more than I do about graph databases and RDFs, in practice, what is the difference between an RDF graph database and an RDF? They're both a set of text-based triplets, no?