

	by s-c-h 3085 days ago
	I am curious to know what major innovations in search engines happened since the page rank algorithm, or were there only incremental improvements? Also is search considered a solved problem?

6 comments

Eridrus 3085 days ago

Definitely not solved. I wouldn't even say search is a well defined problem.

In any case, PageRank is a method for estimating the quality of a page based on the number of inbound links, not a solution to all of search.

But it's a property of the web at the time, not something universal to the search problem, e.g. it's not a statistic that exists if you want to search books.
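
For reference, here is a minimal sketch of the link-based estimate described above, written as the textbook power iteration. It is an illustration only, not what any search engine actually runs: the function name `pagerank`, the `links` dictionary, the damping factor of 0.85 (the commonly quoted value) and the iteration count are all assumptions made for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate page rank by power iteration.

    `links` maps each page to the list of pages it links to
    (a hypothetical input format chosen for this sketch).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Rank held by pages with no outbound links is spread evenly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            if not targets:
                continue
            # Each page splits its current rank among the pages it links to.
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Tiny made-up web: "c" has the most inbound links, "d" has none.
example = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
for page, score in sorted(pagerank(example).items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

In this toy graph "c" comes out on top and "d" at the bottom, which matches the intuition about inbound links. Note that the input is nothing but the link graph, so a corpus without links gives the method nothing to work with.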

I think question answering (given a question and a document that answers it, provide a concise answer) is where a lot of interesting work is being done, both in academia and at Google with the snippets of web pages it provides.

aisofteng 3085 days ago

In particular, PageRank can't be used for corpora whose documents do not link to each other explicitly somehow, and outside of hypermedia such explicit links exist nowhere.

hn_throw_1234 3085 days ago

Academic papers, patents, textbooks, anything with citations also works

bryanrasmussen 3084 days ago

The problem with these kinds of citations is that they are time determined, that is to say paper X will never come to reference paper Y because paper X came first, thus resources that exist first will accrue more rank using a non-hypermedia citation system.

In order to rank papers I think you would instead have to rank people, so that the people who have written on Paper X can in Paper Z reference the authors of Paper Y.

But of course it would need a CiteRank of equivalent quality to pagerank to be at all useful.

1787 3084 days ago

This is sort of true, but the graph of citations is basically a DAG - very unlike the graph of hyperlinks on the web. From what I've seen it's not obvious that PageRank on a DAG tells you anything super interesting.

wcarss 3084 days ago

That makes me wonder if there are actually any cycles in paper citations. I can imagine this scenario:

A paper X has been updated to reference a response or later work Y, which itself referenced X from the start in such a way as to make the 'version' of X in the reference unknown. Citation-trawling software might bite hard on a loop like that :P

Anyway, I also wonder why having cycles makes PageRank useful and lacking them makes it less so -- you can still count inbound links and such with a DAG, and huge huge amounts of the content of the web would exist in DAG-equivalent subtrees, wouldn't they? I could have this pretty wrong, haven't looked at the paper in years and should go do so!

Eridrus 3084 days ago

PageRank is overkill if you don't have cycles, since you can just trivially count the DAG.
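
One possible reading of "count the DAG" is a plain tally of inbound citations, with no iteration at all. The sketch below assumes a hypothetical `cites` dictionary mapping each paper to the papers it cites; the function name and input format are made up for illustration.

```python
from collections import Counter


def citation_counts(cites):
    """Count direct inbound citations for every paper.

    `cites` maps each paper to the list of papers it cites
    (a hypothetical input format chosen for this sketch).
    """
    counts = Counter({paper: 0 for paper in cites})
    for paper, cited in cites.items():
        # Use a set so a paper citing the same work twice counts once.
        for target in set(cited):
            counts[target] += 1
    return dict(counts)


# Z cites X and Y, Y cites X, X cites nothing.
print(citation_counts({"Z": ["X", "Y"], "Y": ["X"], "X": []}))
```

This is a single pass over the edges. It also shows the time bias raised earlier in the thread: the oldest paper, X, can only gain citations, and the newest, Z, starts at zero.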

zachrose 3085 days ago

Interesting, how explicit is explicit? So much of literature is implied references. Assuming that later works are always “linking to” earlier works, could you not use page rank for, say, rap lyrics?

colechristensen 3085 days ago

My impression is that "search" isn't so much of a solved problem, but the question is changing.

The next - very unsolved - problem is being able to "understand" natural language queries and "understand" source materials such that a user can ask for something and get it.

"Understand" is in quotes because it means something rather specific.

fh973 3085 days ago

Is this what is missing to get something like Gmail search working at the same level as web search?

bsder 3085 days ago

> Also is search considered a solved problem?

Ha, not only is search not a solved problem, I would posit that search is getting WORSE.

Computer knowledge is a particularly good example for how search is degrading with time.

Try to figure out how to do X on the Beaglebone Black (I presume the Raspberry Pi has a similar problem, but it's not something I'm that familiar with).

The problem is that the Linux implementation for the Beaglebone went from a weird distribution (Angstrom) to mainline Debian, with Linux kernel 3.8 -> 4.4 -> 4.14, in a VERY short time, so the number of links to new stuff stayed flat.

Consequently, the old Angstrom stuff almost always fills the initial search positions for quite a ways even though it's completely useless.

This is occurring in other things, as well. Stack Overflow, for example, has no way to mark an answer as "This was correct 5 years ago but is now wrong."

Effectively, the web is becoming sclerotic and search engines are following it.

I REALLY miss old AltaVista's feature where it would give you a graphical representation of the clusters in your search so you could drill down into a less popular grouping. The fact that nobody has recreated this makes me wonder ...

MaxBarraclough 3084 days ago

> Stack Overflow, for example, has no way to mark an answer as "This was correct 5 years ago but is now wrong."

Not counting comments? What more could you ask for?

bsder 3084 days ago

Candide would be proud of you...

What more could I ask for?

Which comment is the correct one? There are always multiple "No, that isn't correct, this is the one true way" comments. One posted 5 years after the flurry is unlikely to get many votes.

How about bad information not showing up in my search at all?

How about ageing out votes so it makes sense to come back to a topic and revote?

And this doesn't even account for the information that is simply wrong but nobody cares enough (or has enough karma) to fix.

Curation isn't always bad.

MaxBarraclough 3083 days ago

> One posted 5 years after the flurry is unlikely to get many votes.

Interesting point - perhaps a hybrid scheme to decide the ordering of the answers, that balances upvotes and submission date.

It would need to be carefully tuned though - for some questions, answers will age badly ("What's the best way to do parallel stream processing in Java?"), but for others, they essentially won't age at all ("Why is there a small numerical error in this floating-point calculation?").

Perhaps it could be tuned by tag, as a means to estimate how the answers will age.
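
A rough sketch of what such a hybrid could look like, assuming an exponential decay with a per-tag half-life. Everything here is hypothetical: the tag names, the half-life values and the function name are invented for illustration and are not how Stack Overflow orders answers.

```python
# Hypothetical half-lives in days, per tag; the values are made up.
HALF_LIFE_DAYS = {
    "java": 730.0,                     # answers assumed to age quickly
    "floating-point": float("inf"),    # answers assumed not to age
}
DEFAULT_HALF_LIFE_DAYS = 1825.0


def answer_score(upvotes, age_days, tag):
    """Discount an answer's upvotes by its age, using the tag's half-life."""
    half_life = HALF_LIFE_DAYS.get(tag, DEFAULT_HALF_LIFE_DAYS)
    # With an infinite half-life the exponent is 0.0, so nothing is discounted.
    return upvotes * 0.5 ** (age_days / half_life)


# A fresh 40-vote answer vs. a four-year-old 100-vote answer under "java".
print(answer_score(40, 0, "java"), answer_score(100, 1460, "java"))
# The same old answer under a tag that is assumed not to age.
print(answer_score(100, 1460, "floating-point"))
```

Under these made-up numbers the fresh answer outranks the old one for the fast-ageing tag, while the old answer keeps its full weight for the tag that does not age. The hard part, as noted, would be choosing the half-lives.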

> How about bad information not showing up in my search at all?

You don't want the system to be overly sensitive to undeserved downvotes.

> Curation isn't always bad.

Of course, but traditional curation isn't on the cards simply because of scale - Stack Overflow isn't like an academic journal - and we're whining about a system that works incredibly well.

Look at YouTube comments, or Yahoo Answers, and you see what a shitshow it can be when the Internet tries to have a conversation. It's a small miracle that intellectually worthwhile forums like this one can ever work. Stack Overflow does a lot right.

vinn124 3085 days ago

> I am curious to know what major innovations in search engines happened since the page rank algorithm, or were there only incremental improvements?

a ton has happened! since pagerank, there have been a ton of advances around nlp that have changed the way queries are processed prior to information retrieval. for example, google's rankbrain seems to do a lot of the heavy lifting around word similarity.

saagarjha 3085 days ago

> Also is search considered a solved problem?

I certainly wouldn't, since I still encounter things that I know are on the internet but Google can't find. It's possible that the next advance won't be actually indexing the web but rather figuring out what the user wants rather than what they requested.

busyant 3085 days ago

> what the user wants rather than what they requested.

Amusing anecdote regarding this issue.

  - I teach an introductory online chemistry class. 
  - If the students are determined enough, they can/do cheat on their quizzes.
  - In one of my quizzes, I give the students a formula for a pretend material and ask them to compute its molar mass.
  - If you perform the calculation, the molar mass works out to something like 108 grams / mole.
  - If you try to Google the answer, Google is smart enough to know that my compound is unstable. 
  - Instead, Google provides the molar mass for a _related_ material (86 grams / mole)
  - Each semester, I find a handful of students who dutifully tell me the answer is 86 g / mole.

DenisM 3085 days ago

Magnificent.

Reminds me of my metal work craft classes in school. We were making our own wrenches from scratch, and that requires a bit of geometry and drafting to make the blueprint (and the template). A couple of guys decided to cheat by pressing an existing wrench against the paper and using a pencil to copy the shape. Sounds like a fine idea in theory, but the result is obviously fake, and also distorted enough to be unusable (the pencil-surface angle and pencil thickness will not let you have exact measurements, and the inability to maintain a stable angle distorts the shape). Didn’t end well for them, got nearly kicked out.

osrec 3085 days ago

Google does this to some extent. Recently I've found that Google uses my past queries to make cogent suggestions for my next search (essentially to predict what I want next based on prior info). E.g., if I've just searched for "sully" and a short time later type "t" into Google, the first suggestion is Tom Hanks. I've only noticed this in the last few months.

leggomylibro 3085 days ago

My favorite example of this is the Bullet physics library.

Google eventually learned to give me documentation in response to stuff like 'bullet collision', but for a while it was big on YouTube links and gun ranges.

Swizec 3085 days ago

I’ve experienced this when learning new programming languages and other domains. When you start, Google results are mostly garbage with a few useful bits. A few months later, it feels like the web is awash in materials to answer your every question and you can always find an answer for your problem in the first 5 results.

Part of it is learning which phrases to use when searching, but I’m sure a big part of it is also Google figuring out what you want.

ianai 3085 days ago

Google's monetization strategy blinds the results. My guess for the next advance would be either a way to search beyond Google or a way to force it to give results that aren’t manipulated somehow.

sehugg 3085 days ago

There's Google Panda, which was meant to combat content farms, but unfortunately generates a lot of false positives: https://en.wikipedia.org/wiki/Google_Panda