| HN Mirror



	by Benjammer 3325 days ago
	> There's nothing to index

	Huh? What do you mean? Google indexes HTML web page content from the entire public internet using web crawlers...

1 comments

logicallee 3325 days ago

I'm confused. By "clever indexing" I thought they meant, in the database sense of the word.

The reason my search took 30 seconds is that it started by getting a list of every site with "from" on it, every site with "what" on it, and so on, then intersecting them all. That's how it ended up finding my quote. How else do you think it did it?

----- edit:

To find the string "from what it is to a" which occurs only hidden in the middle of Shakespeare's texts -- what do you think they do?

In my opinion they combine the lists of sites that contain each word, starting with the least common words. It's easier if you search for something that has a few uncommon words: then you start with a small list and only have to combine it with other small lists.
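A minimal sketch of that strategy, assuming a toy in-memory inverted index (this is only an illustration, not a claim about what Google actually runs). The function names `build_index` and `intersect_rarest_first` and the sample documents are invented for the example.

```python
from collections import defaultdict


def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def intersect_rarest_first(index, words):
    """Intersect posting lists, starting with the shortest (rarest) one."""
    postings = sorted((index.get(w, set()) for w in words), key=len)
    if not postings:
        return set()
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
        if not result:  # nothing can match any more, so stop early
            break
    return result


# Invented sample documents, for illustration only.
docs = {
    1: "to be or not to be",
    2: "changed from what it is to a shadow",
    3: "what is it to a man",
}
index = build_index(docs)
candidates = intersect_rarest_first(index, "from what it is to a".split())
```

Starting from the shortest list keeps the intermediate result small, which is why a query with even one uncommon word is cheap and a query made only of very common words is not.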

When every word in the phrase has billions of sites (there are billions of pages that have the word "to" on them, same for "from", "what", "it", "is", "a"), you have to combine them all. Then you have to do a string search within the resulting set, since I put it in quotation marks. There is no easy strategy. Hence the long search time.
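The second step described here, the string search inside the already-intersected set, might look roughly like the sketch below. It assumes the candidate ids come from an earlier intersection step; `contains_phrase`, `phrase_filter`, and the sample documents are hypothetical names and data for illustration.

```python
def contains_phrase(text, phrase):
    """True if the words of `phrase` appear consecutively in `text`."""
    words = text.lower().split()
    target = phrase.lower().split()
    n = len(target)
    if n == 0:
        return True
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))


def phrase_filter(docs, candidate_ids, phrase):
    """Keep only the candidates whose text contains the exact phrase."""
    return sorted(d for d in candidate_ids if contains_phrase(docs[d], phrase))


# Invented sample documents: both contain all six words,
# but only document 2 has them in the quoted order.
docs = {
    1: "it is what it is from a to z",
    2: "fallen from what it is to a lesser state",
}
candidates = {1, 2}  # assumed output of a prior intersection step
matches = phrase_filter(docs, candidates, "from what it is to a")
```

With common words the candidate set stays large, so this per-document scan is where most of the work would go.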

what else could they be doing?

Benjammer 3325 days ago

I'm curious how else you think large-scale data is stored other than in an index in the database sense of the word as well. You think Google has some kind of massive heap-like, unstructured data-store that they run search queries against? That doesn't make sense to me, but I've also never worked in global scale web search, soooo idk.

Benjammer 3325 days ago

You said "There's nothing to index," as if Google is making web requests to every domain in existence, parsing the document responses, and seeing which sites have these words on them, all at runtime when you type a search query. Google obviously indexes the web in the sense that they store their own cached versions of web pages "locally," on top of which they then build an insanely complicated, web-facing, search architecture.

logicallee 3325 days ago

We're talking past each other. sova referred to this meaning - https://en.wikipedia.org/wiki/Database_index - when they said "clever indexing."

The sense you mean is a different sense of the word "index", meaning to crawl. Yes, of course it does that too.

sova 3325 days ago

I was not referring to database indexes. That is not pertinent here. I was thinking about the index that Google creates, its locally cached version, that it queries. If you have a locally cached version, you are not going to rifle through them one by one until you find matches, nor are you going to rifle through them and find partial matches and then intersect them all to see if any overlap in your final product. Among other weird assumptions, that final method assumes there is a solution for every query.

Google, no doubt, has a very sophisticated way of querying against their cache of the WWW and it has probably evolved over time. However, it is inappropriate to say Google does a join over the entire internet for one query. It is much more reasonable to say that Google checked your query string against their gigantic index of terms, and it took a while to dig that deep into the pile. The performance hit such a complex query takes is more like unzipping a large archive to get a specific megabyte's worth of info than like smashing all the files together and then searching for the exact term the way Notepad would.

Anyway, think about it for a while, it's clearly a cool issue in search, and programs and algorithms do not have to visually search things as humans must.

falsedan 3325 days ago

> what else could they be doing?

I recommend reading the Stanford paper[0] (page 12), which spells out in a lot of detail exactly what they were doing.

In short, your pathological query would have searched for every document which contained one of your words, discarded those which didn't match all of them, and then sorted by word proximity. I expect for a literal phrase search, there would be a final pass to look for the exact phrase in order.

[0]: http://ilpubs.stanford.edu:8090/361/1/1998-8.pdf
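The steps summarized above can be sketched with a small positional index. This is a toy reading of that summary, not the paper's actual implementation: the proximity score (smallest window of words covering every query term) and all names and sample documents here are assumptions made for the example.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map word -> {doc_id: [positions where the word occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index


def smallest_window(position_lists):
    """Width of the smallest window holding one position from each list."""
    events = sorted((p, i) for i, plist in enumerate(position_lists) for p in plist)
    counts = [0] * len(position_lists)
    covered, left, best = 0, 0, None
    for pos, term in events:
        counts[term] += 1
        if counts[term] == 1:
            covered += 1
        while covered == len(position_lists):
            width = pos - events[left][0] + 1
            if best is None or width < best:
                best = width
            left_term = events[left][1]
            counts[left_term] -= 1
            if counts[left_term] == 0:
                covered -= 1
            left += 1
    return best


def has_phrase(words, phrase):
    """True if the list `phrase` occurs as a consecutive run in `words`."""
    n = len(phrase)
    return any(words[i:i + n] == phrase for i in range(len(words) - n + 1))


def search(index, docs, query, exact_phrase=False):
    terms = list(dict.fromkeys(query.lower().split()))  # unique, in order
    # 1. Every document containing at least one query word.
    any_match = set()
    for t in terms:
        any_match |= set(index.get(t, {}))
    # 2. Discard documents that do not contain all of the words.
    all_match = [d for d in any_match if all(d in index.get(t, {}) for t in terms)]
    # 3. Sort by proximity (smaller window first, doc id as tie-breaker).
    ranked = sorted(
        all_match,
        key=lambda d: (smallest_window([index[t][d] for t in terms]), d),
    )
    # 4. Optional final pass for the exact phrase in order.
    if exact_phrase:
        phrase = query.lower().split()
        ranked = [d for d in ranked if has_phrase(docs[d].lower().split(), phrase)]
    return ranked


# Invented sample documents, for illustration only.
docs = {
    1: "what a day it is to walk from here",
    2: "changed from what it is to a shadow",
    3: "it is what it is",
}
index = build_positional_index(docs)
ranked = search(index, docs, "from what it is to a")
exact = search(index, docs, "from what it is to a", exact_phrase=True)
```

Storing positions alongside document ids is what makes both the proximity sort and the phrase pass possible without rereading the full page text from scratch.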
