by Eridrus 3085 days ago
Definitely not solved. I wouldn't even say search is a well defined problem.

In any case, PageRank is a method for estimating the quality of a page from its inbound links (how many there are, and how highly ranked the linking pages are), not a solution to all of search.
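To make the "inbound links" idea concrete, here is a minimal sketch of PageRank as a power iteration. The function name `pagerank`, the damping factor of 0.85, the fixed iteration count, and the toy graph are all assumptions for illustration, not anything a real search engine is confirmed to use.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank. `links` maps each page to the list of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start from a uniform distribution over pages.
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page gets a small baseline share (the "random jump").
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current rank evenly among its outlinks.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page (no outlinks): spread its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" has the most inbound links, "d" has none.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
```

Note that "a" ends up ranked well above "b" even though each has exactly one inbound link, because the link into "a" comes from the highly ranked "c". That is the difference from a plain inbound-link count.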

But the link graph is a property of the web at the time, not something universal to the search problem; e.g. it's not a statistic that exists if you want to search books.

I think question answering (given a question and a document that answers the question, provide a concise answer) is where a lot of interesting work is being done, both in academia and at Google, with the snippets of web pages it provides.


In particular, PageRank can't be used for corpora whose documents don't explicitly link to each other somehow, and outside of hypermedia, no corpus has such links.
Academic papers, patents, textbooks: anything with citations works too.
The problem with these kinds of citations is that they are time-ordered: paper X can never come to reference paper Y if paper X was published first. So in a non-hypermedia citation system, resources that exist first will accrue more rank.

In order to rank papers, I think you would instead have to rank people, so that the authors of Paper X can, in a later Paper Z, reference the authors of Paper Y.

But of course it would need a CiteRank of equivalent quality to PageRank to be at all useful.

This is sort of true, but the graph of citations is basically a DAG - very unlike the graph of hyperlinks on the web. From what I've seen it's not obvious that PageRank on a DAG tells you anything super interesting.
That makes me wonder if there are actually any cycles in paper citations. I can imagine this scenario:

A paper X has been updated to reference a response or later work Y, which itself referenced X from the start in such a way as to make the 'version' of X in the reference unknown. Citation-trawling software might bite hard on a loop like that :P

Anyway, I also wonder why having cycles makes PageRank useful and lacking them makes it less so -- you can still count inbound links and such with a DAG, and huge amounts of the content of the web would exist in DAG-equivalent subtrees, wouldn't they? I could have this pretty wrong; I haven't looked at the paper in years and should go do so!

PageRank is overkill if you don't have cycles, since you can just count inbound links on the DAG directly.
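A rough sketch of why the iteration is unnecessary without cycles: on a DAG, a PageRank-style score can be computed in a single pass if papers are processed in topological order (citing papers before the papers they cite). This is a simplified variant that assumes no duplicate citations and drops the usual redistribution of rank from papers that cite nothing, so the scores do not sum to 1 and it is not identical to standard PageRank. The name `dag_scores` and the toy citation graph are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def dag_scores(cites, damping=0.85):
    """One-pass PageRank-style scores. `cites` maps paper -> papers it cites."""
    papers = set(cites)
    for targets in cites.values():
        papers.update(targets)
    n = len(papers)

    # Invert the graph: for each paper, the set of papers that cite it.
    citers = {p: set() for p in papers}
    for paper, targets in cites.items():
        for t in targets:
            citers[t].add(paper)

    # TopologicalSorter takes node -> predecessors, so passing `citers`
    # yields every paper only after all papers that cite it.
    # It raises graphlib.CycleError if the citations are not acyclic.
    score = {}
    for paper in TopologicalSorter(citers).static_order():
        incoming = sum(score[c] / len(cites[c]) for c in citers[paper])
        score[paper] = (1.0 - damping) / n + damping * incoming
    return score


# Hypothetical citations: "x" is the oldest paper and is cited by all others.
toy_cites = {"z": ["y", "x"], "y": ["x"], "w": ["x"]}
scores = dag_scores(toy_cites)
in_degree = {p: sum(p in targets for targets in toy_cites.values()) for p in scores}
```

Here the plain inbound count (`in_degree`) and the one-pass score agree on the ordering, with the oldest paper "x" on top, which also illustrates the point upthread about earlier papers accruing more rank.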
Interesting, how explicit is explicit? So much of literature is implied references. Assuming that later works are always “linking to” earlier works, could you not use PageRank for, say, rap lyrics?