by rglullis 318 days ago
Uh, the author came so close to the same realization I had while working on a project [0] for the Wikimedia Foundation: we wouldn't need search engines if we had better tooling to query semantic databases like Wikidata.

However, what the author might be missing is that the semantic web already exists. [1] The problem is that the tools we could use to access it are not being developed by Big Tech. Remember Freebase? Remember that Google could easily have kept it around, but decided to fold it and shove it into its structured query results? That's because Google is not interested in "organizing the world's information and making it universally accessible" unless it can do so in a way that lets it justify itself as the data broker.

I'm completely out of time and energy for any side project at the moment, but if someone wants to steal my idea: please take an LLM and fine-tune it so that it can take any question and turn it into a SPARQL query for Wikidata. Also, make a web crawler that reads a page and turns any new facts it presents into a set of RDF triples or QuickStatements. This would effectively be the "ultimate information organizer" and could potentially replace Wikidata as most people's entry page to the internet.
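
To make the crawler half a bit more concrete, here is a rough sketch of the last step only: turning already-extracted (subject, property, value) triples into tab-separated QuickStatements V1 rows. The function name `to_quickstatements` is made up for illustration, the extraction itself is assumed to happen elsewhere, and only item IDs and plain strings are handled (no dates, quantities, qualifiers, or references).

```python
import re

# Matches Wikidata item IDs such as Q42.
ITEM_ID = re.compile(r"Q\d+")


def to_quickstatements(triples):
    """Render (subject, property, value) triples as QuickStatements V1 rows.

    Hypothetical helper: item IDs are written bare, any other value is
    wrapped in double quotes as a string. No escaping is attempted.
    """
    rows = []
    for subject, prop, value in triples:
        if not ITEM_ID.fullmatch(value):
            value = '"' + value + '"'
        # V1 commands are one statement per line, fields separated by tabs.
        rows.append("\t".join((subject, prop, value)))
    return "\n".join(rows)


if __name__ == "__main__":
    # Q42 = Douglas Adams, P31 = instance of, Q5 = human.
    print(to_quickstatements([("Q42", "P31", "Q5")]))
```

The output could be pasted into the QuickStatements batch input, but a human review step in between would probably still be wise.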

[0]: https://meta.wikimedia.org/wiki/QuickStatements_3.0

[1]: https://guides.library.ucla.edu/semantic-web/datasets

2 comments

DBpedia Spotlight and entity-fishing already do something similar to your idea: they extract structured data from text and link it to knowledge bases. Combining these with LLM-based query translation to SPARQL could indeed bridge the gap between the semantic web's structure and natural language interfaces.

ChatGPT etc. do an OK job at SPARQL generation. Try something like "generate a list of all supermarkets, including websites, country, description" and you get usable queries out.
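
For a sense of what "usable" looks like, here is a hand-written sketch of the kind of query that prompt tends to produce, plus a small helper that builds a GET URL for the public Wikidata Query Service. This is not actual model output. P31, P279, P856 and P17 are the standard instance-of, subclass-of, official-website and country properties; Q180846 is assumed to be the "supermarket" item and is worth double-checking. `wdqs_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

SUPERMARKETS_QUERY = """
SELECT ?item ?itemLabel ?itemDescription ?website ?countryLabel WHERE {
  # Instance of supermarket, or of any subclass of it (QID is an assumption).
  ?item wdt:P31/wdt:P279* wd:Q180846 .
  OPTIONAL { ?item wdt:P856 ?website . }
  OPTIONAL { ?item wdt:P17 ?country . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 100
"""


def wdqs_url(query, endpoint=WDQS_ENDPOINT):
    """Build a GET URL that asks the query service for JSON results."""
    return endpoint + "?" + urlencode({"query": query, "format": "json"})


if __name__ == "__main__":
    # Fetch this URL with any HTTP client; send a descriptive User-Agent.
    print(wdqs_url(SUPERMARKETS_QUERY))
```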

In a much, much more limited way, this is what I was dabbling with in alltheprices: queries to pull data from Wikidata, crawling sites to pull out the schema.org Product and offers, and publishing the aggregate.
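
The schema.org part can be sketched with just the standard library, assuming the site embeds its product data as JSON-LD in `<script type="application/ld+json">` tags. This is not the alltheprices code; `JsonLdCollector` and `extract_products` are illustrative names, and microdata, RDFa, `@graph` wrappers and list-valued `@type` are all ignored.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the raw text of every application/ld+json script tag."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.blocks.append("".join(self._chunks))


def extract_products(html):
    """Return the top-level JSON-LD nodes whose @type is exactly "Product"."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    products = []
    for raw in collector.blocks:
        try:
            doc = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks rather than failing the crawl
        for node in doc if isinstance(doc, list) else [doc]:
            if isinstance(node, dict) and node.get("@type") == "Product":
                products.append(node)
    return products
```

Each returned dict keeps whatever `offers` data the page published, so price and currency can be read from there before aggregating.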