Hacker News
by moxious 2852 days ago
Can you expand on the major ways this will make the content different (for example, Wikipedia is curated and pages for non-notable people get thrown out, so if you're reading all of the web, presumably you'd know about non-notable people), and why it's better?
2 comments

Founder here. There are many differences in the result when an automated system builds a Knowledge Graph vs. when humans curate one.

The obvious one is scale. Wikipedia has on the order of 10M entities and represents the work of thousands of humans, whereas the Diffbot KG has 10B entities, is discovering about 120M each day, and is largely limited by the number of machines running the algorithms in the datacenter. The properties and facts indexed about each entity are also a superset, because they are not limited to those that would be worthwhile for a human to curate. Lastly, it can be more accurate than facts found in any single source, because the automated system uses multiple occurrences of a fact across the web to estimate the probability that the fact is accurate.
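To make the multi-source point concrete, here is a minimal sketch of one way such an estimate could work: reliability-weighted voting over the values different sources report for the same fact. This is an illustration only, not a description of Diffbot's actual algorithm. The function name, the per-source weights, and the sample values are all made up for the example.

```python
from collections import defaultdict


def estimate_fact_probabilities(observations):
    """Estimate a probability for each candidate value of one fact.

    observations: iterable of (value, source_weight) pairs, where
    source_weight is a hypothetical reliability score for the source
    that reported the value. Returns {value: probability}.
    """
    totals = defaultdict(float)
    for value, weight in observations:
        if weight <= 0:
            raise ValueError("source weights must be positive")
        totals[value] += weight

    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}  # no observations at all
    # Normalize so the probabilities over all candidate values sum to 1.
    return {value: w / grand_total for value, w in totals.items()}


# Made-up observations for a hypothetical "founding year" fact:
# two sources agree on one value, a third reports a different one.
observations = [
    ("2008", 0.9),
    ("2008", 0.6),
    ("2009", 0.5),
]
probs = estimate_fact_probabilities(observations)
best = max(probs, key=probs.get)
```

With these inputs the value reported by the two agreeing sources gets 1.5 of the 2.0 total weight. A production system would presumably also need to learn the source weights and account for sources that copy each other, which this sketch ignores.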

The result is a Knowledge Graph that is more useful for work and business, because it covers the entities you interact with day to day, not the "head" entities that optimize for popularity and the constraints of human curation.

Fantastic question. A major component of a machine-generated ontology has to be a notability score; otherwise it would be practically impossible to store all entities (and their relationships).

Further, for this to scale, Diffbot has to have a way to align its entity IDs with IDs from other notable graphs like Wikipedia, Wikidata, Freebase, WordNet, or even Yelp and the like; otherwise the data could be of diminished value.

How would I know that the "Cardi B" that's in my database with ID 321 and wikidata ID Q29033668 is the same as Diffbot's "Cardi B" with ID 561?
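For illustration, here is a minimal sketch of the kind of alignment I mean, assuming both sides expose a shared external identifier. Whether Diffbot records actually carry a Wikidata ID is an assumption on my part; the `wikidata_id` field, the `build_crosswalk` function, and the second remote record are hypothetical.

```python
def build_crosswalk(local_records, remote_records, key="wikidata_id"):
    """Map local entity IDs to remote entity IDs via a shared external ID.

    Only unambiguous matches are kept: a local record is linked when
    exactly one remote record carries the same external ID.
    """
    # Index remote IDs by external identifier, skipping records without one.
    remote_by_key = {}
    for rec in remote_records:
        ext = rec.get(key)
        if ext:
            remote_by_key.setdefault(ext, []).append(rec["id"])

    crosswalk = {}
    for rec in local_records:
        ext = rec.get(key)
        matches = remote_by_key.get(ext, []) if ext else []
        if len(matches) == 1:
            crosswalk[rec["id"]] = matches[0]
    return crosswalk


# IDs 321, 561 and Q29033668 come from the question above;
# the second remote record is invented to show an unmatched entity.
local = [{"id": 321, "name": "Cardi B", "wikidata_id": "Q29033668"}]
remote = [
    {"id": 561, "name": "Cardi B", "wikidata_id": "Q29033668"},
    {"id": 562, "name": "Cardi B (tribute act)", "wikidata_id": None},
]
crosswalk = build_crosswalk(local, remote)
```

Here my ID 321 links to remote ID 561 because both carry the same Wikidata ID, and the name-only lookalike is left unlinked. Without a shared identifier like this, you are stuck with fuzzy matching on names and attributes, which is exactly the problem I'm asking about.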