| HN Mirror



	by moxious 2854 days ago
	Wanna thread other known attempts at this? Maybe someone will jump in with extra detail about how these approaches are different, or what extra value we'd expect.
	- OpenCyc http://www.cyc.com/opencyc/
	- DBPedia https://wiki.dbpedia.org/

1 comments

miket 2854 days ago

Founder here. OpenCyc and Freebase are human efforts to enter and curate a structured knowledge base. Likewise, DBPedia is a set of scripts that extract Wikipedia infoboxes (semi-structured data that is also crowd-sourced by humans).

The Diffbot Knowledge Graph is built by applying computer vision and natural language processing techniques to read all the pages on the web (which can be in any structure and any human language) and extract their content into a structured form, with no human annotation in the build pipeline.


moxious 2854 days ago

Can you expand on the major points of how this will make the content different (for example, Wikipedia is curated and non-notable people pages get thrown out, so if you're reading all of the web, presumably you'd know about non-notable people) -- and why it's better?


miket 2854 days ago

Founder here. There are many differences in the result when you have an automated system building a Knowledge Graph vs. a human one.

The obvious one is scale. Wikipedia has on the order of 10M entities and represents the work of thousands of humans, whereas the Diffbot KG has 10B entities, is discovering about 120M more each day, and is largely limited by the number of machines running the algorithms in the datacenter. The properties and facts indexed about each entity are also a superset, because they are not limited to those that would be worthwhile for a human to curate. Lastly, it can be more accurate than facts found in a single source, because the automated system uses multiple occurrences of a fact found across the web to estimate the probability that the fact is accurate.

The result is a Knowledge Graph that is more useful for work and business because it covers the entities you interact with day to day, not just the "head" entities that result from optimizing for popularity under the constraints of human curation.
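
To make the multi-source accuracy estimate above concrete, here is a minimal sketch. It is not Diffbot's actual method, which the comment does not describe. It assumes each source asserting a fact is independent and has a known reliability (the chance it is right), and combines them naive-Bayes style in log-odds space. The function name, the `prior` parameter, and the reliability numbers are all hypothetical.

```python
import math


def fact_probability(source_reliabilities, prior=0.5):
    """Estimate P(fact is true) given independent sources that all assert it.

    Assumption (not from the thread): a source with reliability r asserts the
    fact with probability r if it is true and 1 - r if it is false.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    log_odds = math.log(prior / (1.0 - prior))
    for r in source_reliabilities:
        if not 0.0 < r < 1.0:
            raise ValueError("each reliability must be strictly between 0 and 1")
        # Each agreeing source multiplies the odds by r / (1 - r).
        log_odds += math.log(r / (1.0 - r))
    # Convert log-odds back to a probability.
    return 1.0 / (1.0 + math.exp(-log_odds))


single = fact_probability([0.7])
triple = fact_probability([0.7, 0.7, 0.7])
print(f"one source: {single:.3f}, three agreeing sources: {triple:.3f}")
```

Under these assumptions, several moderately reliable sources that agree yield a higher estimate than any one of them alone. Real web sources often copy from each other, so the independence assumption will overstate confidence in practice.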


subhobroto 2854 days ago

Fantastic question. A major component of a machine-generated ontology has to be a notability score, otherwise it would be practically impossible to store all entities (and their relationships).

Further, for this to scale, Diffbot has to have a way to align its entity IDs with IDs from other notable graphs like Wikipedia, Wikidata, Freebase, WordNet, or even Yelp and the like; otherwise the data could potentially be of diminished value.

How would I know that the "Cardi B" that's in my database with ID 321 and wikidata ID Q29033668 is the same as Diffbot's "Cardi B" with ID 561?
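
One common way to answer this kind of question is a crosswalk keyed on a shared external identifier. The sketch below assumes both datasets carry a Wikidata ID per record; whether Diffbot exposes one is an assumption, and the record layout, field names, and `build_crosswalk` function are hypothetical.

```python
def build_crosswalk(local_records, remote_records, key="wikidata_id"):
    """Map local entity IDs to remote entity IDs via a shared external ID."""
    # Index the remote records by the shared identifier, skipping records without one.
    remote_by_key = {r[key]: r["id"] for r in remote_records if r.get(key)}
    # Keep only the local records whose shared identifier also appears remotely.
    return {
        rec["id"]: remote_by_key[rec[key]]
        for rec in local_records
        if rec.get(key) in remote_by_key
    }


# IDs taken from the example in the comment above.
local = [{"id": 321, "name": "Cardi B", "wikidata_id": "Q29033668"}]
remote = [{"id": 561, "name": "Cardi B", "wikidata_id": "Q29033668"}]

crosswalk = build_crosswalk(local, remote)
print(crosswalk)  # {321: 561}
```

Entities with no shared identifier (the non-notable ones discussed earlier in the thread) would not be covered by this and would need fuzzier matching on names and attributes.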
