by rickdeveloper 1623 days ago
I built this website a couple of months ago because I was annoyed by how hard it was to find useful things on Google. As "Google no longer producing high quality search results in significant categories" [0] is currently #1 on the front page I figured I'd share this project again. I hope it's useful to some people.

'No Trash Search' is very focussed on STEM and not "for daily use". It's surprisingly good when you're looking for certain kinds of information. Under the hood it's little more than a programmable search engine [1] with a whitelist of ~120 sites.

[0] https://news.ycombinator.com/item?id=29772136

[1] http://programmablesearchengine.google.com
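For anyone curious what "a whitelist of sites" amounts to, here is a minimal Python sketch of the idea. It is not how the programmable search engine [1] works internally, and the three sites in `ALLOWED_SITES` are placeholders I picked for illustration, not the project's actual list of ~120.

```python
from urllib.parse import urlparse

# Hypothetical allowlist for illustration only; the real ~120 sites are not listed here.
ALLOWED_SITES = {"stackoverflow.com", "en.wikipedia.org", "arxiv.org"}


def is_allowed(url, allowed=ALLOWED_SITES):
    """Return True if the URL's host is an allowed site or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == site or host.endswith("." + site) for site in allowed)


def filter_results(urls, allowed=ALLOWED_SITES):
    """Keep only results hosted on allowlisted sites, preserving their order."""
    return [u for u in urls if is_allowed(u, allowed)]
```

Matching on the full host or a dot-prefixed suffix avoids letting a lookalike domain such as `notstackoverflow.com` slip through a plain substring check.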

6 comments

> Under the hood it's little more than a programmable search engine [1] with a whitelist of ~120 sites

So back to what web search was in the 1990s, roughly: an index from a curated selection of sites.

120 sites is pretty hilarious and sad. "Here you go, the worthwhile part of the internet!"
While I can understand the appeal, restricting your search engine to only ~120 websites out of hundreds of millions (?) is basically giving up on the Web.

(BTW, any good search engines these days that aren't indirectly using Google or Bing ?)

> restricting your search engine to only ~120 websites out of hundreds of millions (?) is basically giving up on the Web.

Sure - the web is now a cesspool optimized for advertising and attention. The traditional search engines made a lot more sense at the dawn of the internet when it was more about discovery. Now, for the most part, it's closer to an information retrieval tool, where a finite list of established sites have the bulk of what one is looking for. It only makes sense to have a tool that lets one navigate the established, legit internet, and not have to deal with all the crap.

That doesn't mean there is no use case for google as it is, but some more focused competition is a no brainer.

There's http://yandex.com . It's great if you want to search controversial subject matter and controversial results that Google wouldn't give you. The reverse image search is also amazing.
The reverse image search in particular is very, very good.

Far better than Bing or Google. It's not obvious why theirs is so terrible, unless that product is not a moneymaker for them, in which case it explains everything.

I should have mentioned : ideally from the EU.

Big Russian or Chinese software is even more out of the question than the GAFAMs (if they're big, they definitely have authorities messing with the results).

Hmm, what about Baltic or Ukrainian or Israeli search engines ?

Which results are different than Google's?
Most. Yandex is great, especially for programming searches. It generally ranks GitHub, Stack Overflow and other content-heavy sites highly. Google has been taken over by weird clones of GitHub and SO lately; Yandex has no such trash.

It completely boggles my mind that the useless GitHub and SO clones rank first page on Google. Do engineers at Google not use their own product?

Regarding Stack Overflow, there is a fair chance they can congratulate themselves:

If I am right, they played stupid games and won stupid prizes. More specifically, they have allowed rampant deletionism for years. So while I am fairly certain the questions and answers originated on Stack Overflow, it wouldn't surprise me if a good number of those aren't visible on Stack Overflow anymore, which would explain why the clones rank higher in Google.

Done right this would actually be a service.

Sadly, some of them seem to mix together various questions and answers on the same page to generate text matches for unusual queries.

Engineers at Google build what the ads and sales teams tell them to.
Don't have time to mess with it right now, but does it normally return about half results in Russian or is that something my phone/browser is doing?
I usually get about 10-25% in Russian.
Frankly, no. It's kind of a running joke that you can't Google any of your problems at Google because everything is internal.
> Google has been taken over by weird clones of GitHub and SO lately

Do you have an example search leading to a GitHub clone?

French is my mother tongue, but I've quickly learned during my studies that using English keywords in my STEM-related searches would simply lead me to better (and more abundant) results.

A few weeks/months ago, however, while I was trying to solve an issue with a colleague who would search using French keywords, I noticed that some websites featured on the first page of the Google results were off.

In short, they were machine-translated versions of Stack Overflow threads. And they would appear in most of the searches using French keywords.

Those websites also appeared rarely in my searches while I was using English keywords, but most of the time I never bothered opening them. But now I notice them every time.

Some examples: When searching for "wget set http proxy" on Google, the fourth result leads me to qastack.fr and the ninth to it-swarm-fr.com; both are websites featuring scraped and machine-translated threads from Stack Overflow.

When searching deliberately in French for "Eclipse CDT stdout ne s'affiche pas" ("Eclipse CDT stdout not displayed [in console]"), the first result leads me to askcodez.com and the fourth one to qastack.fr (askcodez is the same kind of site as the other two).

I have never stumbled upon GitHub clones yet, however.

I don't have an example search, although I'll try to remember to update this comment the next time it happens. On average I come across these things at least once a day, but it depends what I'm working on. It tends to be when searching for more obscure bugs, for which there is a GitHub issue but it's not ranked highly on Google for whatever reason, but these spam sites are ranked highly.

GitMemory is probably the most well-known example; it's just a thin layer over the GitHub API with a completely garbage UI, yet it often ranks higher than GitHub itself.

Try searching for movie name + torrent for example
yep it's always a bunch of movie subscription sites instead of the torrent. it's almost like Google's search engine is predominantly focused on collecting advertising dollars...?
"(BTW, any good search engines these days that aren't indirectly using Google or Bing ?)"

The code for Gigablast is open-source, including the crawler.

I could be wrong but I do not think search.marginalia.eu nor wiby.me use Google or Bing.

The comment about "hundreds of millions" is interesting. Assume hypothetically a search engine claimed to be searching millions of sites for a given query but in truth it was actually only searching 120 sites that it had determined answered this query (i.e., was the most popular answer source) for the majority of users. How would a user verify the search engine's claim about searching millions of sites was true? What if the search engine only allowed the user to retrieve a maximum of about 230 results, no matter how many sites it claimed to search?

"How would a user verify the search engine's claim about searching millions of sites was true?"

Search for things specifically on those pages, by very specific phrases and such.

Of course you have to find them yourself first for that verification.

I can say having set up some very teeny tiny websites here and there that the googlebot is hooked up to a lot of stuff. I'm not even sure how it found a couple of them as quickly as it did. Things like "if someone adds an RSS feed to Feed.ly" seem to do the trick. None of them were sites trying to "hide" or anything and I expected them to be found eventually, but they got found much faster than I expected. Or maybe they just scan new domain registrations, though it seemed to me it wasn't that that triggered it.

Imagine searching for something that is quite common that will produce a large number of results but the user can only retrieve, say, 230 results total. How does the user verify that all of the "millions of sites" that contain results were actually searched when the user submitted her query?

A search engine can tell users some large number of sites were searched at the time of the user's query and some large number of results exist, but what if it does not allow the user to actually view all the results?

To put it another way, the question is not what Google has discovered about the www,^1 but what Google is willing to let the user search and retrieve. If retrieving the 963rd result for a common string is not allowed, then it is impossible for the user to verify that the site containing that result was searched when the user submitted her query. Even if the search produced a 963rd result, what difference does it make if the user cannot retrieve it? What is the point of the search engine locating the 963rd result if it never has to show this result to the user querying a common string?

1. What Google has discovered about the www^2 and what Google users are able to discover about the www through Google may be two different things.^3 Google has its own interests to pursue in the name of online advertising and these may conflict with users' interests. "Censorship" is one concept that often draws negative connotations but there are many more subtle forms of filtering and manipulation that are possible here, including unintentional ones.

2. The most important focus would be what is "popular".

3. Some users might care less about what is "popular". Such users would, by and large, be less interesting to an advertising company. Individual interests might become subverted in favour of "popular" interests, to the extent they conflict. An advertising company (that runs a search engine) will favour the larger audience.

Gigablast results tend to be full of trash, in my short experience with it.
All the search engines have trash. I retrieve results from a variety of search engines and mix them into a simplified SERP with zero cruft that can be read very quickly. Some call searching multiple search engines "meta-search". The main differences with mine are 1. it is all done client side (there is no remote "meta-search" engine) and 2. searches can be "continued" where they left off at any time. This allows one to avoid rate limits. There are always trash results, every search engine has them in their SERPs, but I find that the more results and the more varied the results, the better the chance of finding useful, non-trash ones. Gigablast allows returning at least 100 results at a time. Few search engines allow 100 results at a time anymore. Google still allows it but will not allow a user to retrieve more than 200-something results total.
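To make the "mix them into a simplified SERP" step concrete, here is a rough Python sketch of one way to do it. It is only an assumption about how such a merge could work, not the actual tool described above: the `merge_results` name, the round-robin order, and the trailing-slash normalization are all made up for illustration. It takes result lists that have already been fetched and leaves out the fetching and the "continue where you left off" part.

```python
from itertools import zip_longest


def merge_results(*result_lists):
    """Interleave ranked result lists round-robin, dropping duplicate URLs."""
    seen = set()
    merged = []
    # zip_longest pads shorter lists with None so longer lists are not truncated.
    for tier in zip_longest(*result_lists):
        for url in tier:
            if url is None:
                continue
            # Crude normalization (an assumption): treat a trailing slash as insignificant.
            key = url.rstrip("/")
            if key not in seen:
                seen.add(key)
                merged.append(url)
    return merged
```

Round-robin keeps each engine's top hits near the top of the merged list, so a site one engine buries can still surface early if another engine ranks it well.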
Try Mojeek https://blog.mojeek.com/2021/03/to-track-or-not-to-track.htm... Disclosure: team member. Feedback good or bad appreciated
Check out marginalia[1], made by another user on HN.

[1]: https://search.marginalia.nu/

Yeah I do my own crawling, and offer results from around 200k sites (although it's indexed 700k domains, most of which are crap).
I think https://www.qwant.com/ uses its own index. I just started using it, so I can't really say much about it other than it seems alright compared to ddg and google(?)
Last time I checked, it just used an old index from Bing ?
You might want to add cppreference.com to your list of programming sites.
Seems to be in there now :)
FYI, I think this is just the case where you should prefix the submission title with “Show HN:”. Can mods update it so it shows with the others? @dang?

https://news.ycombinator.com/show

https://news.ycombinator.com/showhn.html

I emailed this suggestion to the mods.
Woot! It's updated now. Thanks!
Hey I was looking for something like this. Thanks.