| HN Mirror


by 1vuio0pswjnm7 1629 days ago

"(BTW, any good search engines these days that aren't indirectly using Google or Bing ?)"

The code for Gigablast is open-source, including the crawler.

I could be wrong but I do not think search.marginalia.eu nor wiby.me use Google or Bing.

The comment about "hundreds of millions" is interesting. Assume hypothetically a search engine claimed to be searching millions of sites for a given query, but in truth it was only searching 120 sites that it had determined answered this query (i.e., were the most popular answer sources) for the majority of users. How would a user verify that the search engine's claim about searching millions of sites was true? What if the search engine only allowed the user to retrieve a maximum of about 230 results, no matter how many sites it claimed to search?

2 comments

jerf 1629 days ago

"How would a user verify the search engine's claim about searching millions of sites was true."

Search for things specifically on those pages, by very specific phrases and such.

Of course you have to find them yourself first for that verification.

I can say having set up some very teeny tiny websites here and there that the googlebot is hooked up to a lot of stuff. I'm not even sure how it found a couple of them as quickly as it did. Things like "if someone adds an RSS feed to Feed.ly" seem to do the trick. None of them were sites trying to "hide" or anything and I expected them to be found eventually, but they got found much faster than I expected. Or maybe they just scan new domain registrations, though it seemed to me it wasn't that that triggered it.
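The phrase check described above could be sketched roughly like this. It is only an illustration: `search` is a hypothetical callable that takes a query string and returns a list of result URLs (no real search engine API is assumed), and `probes` is a list of pages you already found yourself, each paired with a very specific phrase from that page.

```python
def probe_coverage(search, probes):
    """Check which known pages an engine returns for exact-phrase queries.

    search: hypothetical callable, query string -> list of result URLs.
    probes: list of (phrase, expected_url) pairs collected by hand.
    Returns (hit_rate, missed_urls).
    """
    found = []
    missed = []
    for phrase, expected_url in probes:
        # Quote the phrase so the query asks for an exact match.
        urls = search('"' + phrase + '"')
        if expected_url in urls:
            found.append(expected_url)
        else:
            missed.append(expected_url)
    hit_rate = len(found) / len(probes) if probes else 0.0
    return hit_rate, missed
```

A low hit rate on obscure pages would suggest the index (or what the engine is willing to return) is smaller than claimed, though it cannot prove how many sites were searched.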


1vuio0pswjnm7 1629 days ago

Imagine searching for something quite common that will produce a large number of results, but the user can only retrieve, say, 230 results total. How does the user verify that all of the "millions of sites" that contain results were actually searched when the user submitted her query?

A search engine can tell users that some large number of sites were searched at the time of the user's query and that some large number of results exist, but what if it does not allow the user to actually view all the results?

To put it another way, the question is not what Google has discovered about the www,^1 but what Google is willing to let the user search and retrieve. If retrieving the 963rd result for a common string is not allowed, then it is impossible for the user to verify that the site containing that result was searched when the user submitted her query. Even if the search produced a 963rd result, what difference does it make if the user cannot retrieve it? What is the point of the search engine locating the 963rd result if it never has to show this result to the user querying a common string?

1. What Google has discovered about the www^2 and what Google users are able to discover about the www through Google may be two different things.^3 Google has its own interests to pursue in the name of online advertising and these may conflict with users' interests. "Censorship" is one concept that often carries negative connotations, but there are many more subtle forms of filtering and manipulation that are possible here, including unintentional ones.

2. The most important focus would be what is "popular".

3. Some users might care less about what is "popular". Such users would, by and large, be less interesting to an advertising company. Individual interests might become subverted in favour of "popular" interests, to the extent they conflict. An advertising company (that runs a search engine) will favour the larger audience.
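To make the gap between "results claimed" and "results retrievable" concrete, here is a minimal sketch. `fetch_page` is a hypothetical callable (offset, page size) -> list of results, standing in for whatever paging a given engine exposes; the 230 cap in the usage lines is just the figure from the example above, not a measured value.

```python
def count_retrievable(fetch_page, page_size, max_pages=1000):
    """Page through results until the engine returns a short or empty page.

    fetch_page: hypothetical callable (offset, page_size) -> list of results.
    Returns the number of results the user could actually retrieve.
    """
    total = 0
    for page in range(max_pages):
        results = fetch_page(page * page_size, page_size)
        total += len(results)
        if len(results) < page_size:
            break  # the engine has stopped handing out results
    return total


def unverifiable_share(claimed_total, retrievable):
    """Fraction of the claimed results the user has no way to inspect."""
    if claimed_total <= 0:
        return 0.0
    return max(0.0, (claimed_total - retrievable) / claimed_total)


# Simulated engine that claims many results but caps retrieval at 230.
def capped_engine(offset, page_size):
    return list(range(offset, min(offset + page_size, 230)))

retrievable = count_retrievable(capped_engine, 100)
share = unverifiable_share(1000, retrievable)
```

With a claimed total of 1000 and a cap of 230, more than three quarters of the claimed results cannot be checked by the user at all.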


imachine1980_ 1629 days ago

Gigablast results tend to be full of trash, in my short experience with it.


1vuio0pswjnm7 1629 days ago

All the search engines have trash. I retrieve results from a variety of search engines and mix them into a simplified SERP with zero cruft that can be read very quickly. Some call searching multiple search engines "meta-search". The main differences with mine are 1. it is all done client side (there is no remote "meta-search" engine) and 2. searches can be "continued" where they left off at any time. This allows one to avoid rate limits. There are always trash results, every search engine has them in their SERPs, but I find that the more results and the more varied the results, the better the chance of finding useful, non-trash ones. Gigablast allows returning at least 100 results at a time. Few search engines allow 100 results at a time anymore. Google still allows it but will not allow a user to retrieve more than 200-something results total.
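For anyone curious what a client-side merge with "continuable" searches might look like, here is a rough sketch. It is not the actual script described above: the state layout, the function names and the URL normalization rule are all assumptions made for illustration, and fetching from the engines is left out entirely.

```python
import json
from urllib.parse import urlsplit


def normalize(url):
    """Comparison key for a URL: host without "www.", path without trailing slash."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    key = host + parts.path.rstrip("/")
    if parts.query:
        key += "?" + parts.query
    return key


def new_state(query):
    """Fresh search state; plain dicts and lists so it can be saved as JSON."""
    return {"query": query, "offsets": {}, "seen": [], "merged": []}


def merge_page(state, engine, results, page_size):
    """Fold one page of (url, title) results from `engine` into the merged list.

    Duplicates across engines are dropped, and the per-engine offset is
    advanced so a later run can continue where this one stopped.
    """
    seen = set(state["seen"])
    for url, title in results:
        key = normalize(url)
        if key not in seen:
            seen.add(key)
            state["merged"].append({"url": url, "title": title, "engine": engine})
    state["seen"] = sorted(seen)
    state["offsets"][engine] = state["offsets"].get(engine, 0) + page_size
    return state


# Usage: merge two pages, save the state, and resume from it later.
state = new_state("example query")
merge_page(state, "engine_a", [("https://www.example.com/a/", "A")], 100)
merge_page(state, "engine_b", [("http://example.com/a", "A again"),
                               ("https://example.org/b", "B")], 100)
saved = json.dumps(state)      # could be written to a local file
resumed = json.loads(saved)    # next request starts at resumed["offsets"][engine]
```

Keeping the state in a local file is what makes it possible to spread requests out over time instead of hitting a rate limit in one session.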
