by throwaway13337 3570 days ago
Alternative general purpose search engines are an exciting idea.

It feels a lot like we're back at the time when Yahoo was dominant and searching was sort of awful. When you searched, what ranked highest was marketing-driven stuff.

Right now, for topics normal people (not techies) search for, all you get are content-farm sites with JS popups asking for your email address. Try searching for anything health related, for example. We've regressed.

My own half-solution is to look only for sites that are discussions - reddit, hn, etc. It could be better. A search engine that favored non-marketing content could really steal some thunder.

This doesn't look like that, but maybe it's a start?

10 comments

Speaking of which, it seems possible for a computer to detect content which is mostly marketing, versus content which is not (based on how spam filters work). The search engine should just show a "marketing index" score right next to the result. Even better would be to whitelist certain sites (Wikipedia, popular .edu and .org domains) to begin with and prioritize those results.
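For the curious, here is a minimal sketch of what such a "marketing index" could look like if it borrowed the naive Bayes approach that classic spam filters use. Everything in it is an assumption for illustration: the training sentences are made up, the two classes are given equal priors, and the function names (`train`, `marketing_index`) are hypothetical rather than anything the submitted site does.

```python
import math
from collections import Counter

# Toy training data, invented purely for illustration.
MARKETING = [
    "buy now and save 50% limited time offer",
    "subscribe today for exclusive deals free shipping",
]
INFORMATIONAL = [
    "the study measured blood pressure in 200 adults",
    "symptoms usually appear two days after infection",
]


def tokenize(text):
    return text.lower().split()


def train(docs):
    """Count how often each word appears in one class of documents."""
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    return counts


def marketing_index(text, mkt_counts, info_counts):
    """Return a score in (0, 1); higher means more marketing-like."""
    vocab_size = len(set(mkt_counts) | set(info_counts))
    mkt_total = sum(mkt_counts.values())
    info_total = sum(info_counts.values())
    log_odds = 0.0  # equal priors assumed, so we start at even odds
    for word in tokenize(text):
        # Laplace smoothing so unseen words do not zero out the score.
        p_mkt = (mkt_counts[word] + 1) / (mkt_total + vocab_size)
        p_info = (info_counts[word] + 1) / (info_total + vocab_size)
        log_odds += math.log(p_mkt / p_info)
    return 1 / (1 + math.exp(-log_odds))


mkt_counts = train(MARKETING)
info_counts = train(INFORMATIONAL)
```

A page made only of words the model has never seen scores exactly 0.5 here, which is one reason a real system would need far more training data than this.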

It would likely be really niche, but it could become the anti-Google, which would be great when people actually seek an alternative to all the noise.

I think it has to be a non-profit like Wikipedia itself, I cannot imagine a model where it can also make money. The submitted site is a candidate but it has to improve the search quality as other commenters pointed out.

> seems possible for a computer to detect content which is just mostly marketing

But based on the spam race, marketers will then tune content so that it doesn't trip those filters. Paid news and journal articles, etc.

Extremely relevant XKCD:

https://xkcd.com/810/

But yes, even here on HN there are problems distinguishing between legitimate articles and paid news.

I agree. It's a pretty tough problem, but it's fine to cross that bridge when we come to it. If the search engine stays really niche, it may not even be worth the spammers' while, and it could still do enough to cater to its somewhat self-selecting audience. For example, the number of people who want to get to the front page of HN is likely to be a really minuscule fraction of the people wanting to get to the top of search results.

Also, I wonder if it is possible to detect promotional content by analyzing things like calls to action and such?
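As a rough sketch of the call-to-action idea: count how many stock CTA phrases show up per 100 words. The phrase list below is a hypothetical starting point, not a vetted one, and `cta_density` is a made-up name.

```python
import re

# Hypothetical phrase list; a real one would need a lot of tuning.
CTA_PATTERNS = [
    r"\bbuy now\b",
    r"\bsign up\b",
    r"\bsubscribe\b",
    r"\badd to cart\b",
    r"\bfree trial\b",
    r"\benter your email\b",
]
CTA_RE = re.compile("|".join(CTA_PATTERNS), re.IGNORECASE)


def cta_density(text):
    """Calls to action per 100 words (0.0 for empty text)."""
    words = text.split()
    if not words:
        return 0.0
    hits = len(CTA_RE.findall(text))
    return 100.0 * hits / len(words)
```

A fixed phrase list like this is exactly the kind of filter marketers would learn to write around, so at best it would be one signal among several.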

Interesting that if this were a parsing problem (i.e. https://news.ycombinator.com/item?id=12478538), folks would immediately suggest accepting known good output, instead of trying to blacklist specific problems. The analogue in search would be something that looks more like a directory than an internet wide search engine.

Of course what killed directories in the early web is that they had no hope of scaling.

If they hide their sell links, shopping baskets, closing pages and such then they'll kill the sales though.

You can have paid news, but it's not doing anything if the mark can't buy the product afterwards because you had to remove all associations with selling to get the news to rank.

It seems that you could just use Google's algorithms and modify the site trust metric using a front-page spam-score, whilst reducing the effect of link-juice from links with associated marketing keywords ("buy the doohickey on this link", or whatever).
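To make that concrete, here is a small sketch of the two adjustments: scaling a site's trust by a spam score, and discounting link juice when the anchor text contains marketing keywords. The keyword set, the 0.2 discount, and the assumption that trust and spam scores live in [0, 1] are all invented for illustration; nothing here reflects how Google actually computes trust.

```python
# Hypothetical keyword list for spotting marketing anchor text.
MARKETING_ANCHOR_WORDS = {"buy", "order", "discount", "deal", "cheap"}


def link_weight(anchor_text, marketing_discount=0.2):
    """Full weight for neutral anchors, reduced weight for salesy ones."""
    words = set(anchor_text.lower().split())
    return marketing_discount if words & MARKETING_ANCHOR_WORDS else 1.0


def inbound_link_score(links):
    """links: iterable of (anchor_text, juice) pairs pointing at a page."""
    return sum(juice * link_weight(anchor) for anchor, juice in links)


def adjusted_trust(base_trust, spam_score):
    """Scale an existing trust metric (assumed in [0, 1]) by a spam score in [0, 1]."""
    return base_trust * (1.0 - spam_score)
```

With these numbers, a "buy the doohickey on this link" anchor passes a fifth of the juice that a plain review link does.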

Keeping marketing sites high in your SERPs would make you way more money on referrals though.

Solving algorithmic tasks by just building an ML model of your competitor's algorithm seems like a funny way to start. I imagine this to be the way "programming" will stop being a thing in a few hundred years.

At the moment for web search it probably would not work, because I imagine from feature extraction to result there are several models involved to create intermediate results.

This search engine seems to use only a tf-idf inverted index for its searches and then a vector space model for ranking by similarity.

A search for "java twitter bot" places more emphasis on "bot" than on "java", and then on "twitter", which is what tf-idf would do.

A good start, like you said, but it's miles away even from Yahoo or Bing.
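To see why tf-idf leans on the rarest query word, here is a minimal sketch of an inverted index with cosine-similarity ranking over a toy corpus. The corpus and function names are made up, and the exact ordering of term weights depends entirely on the corpus (in this one it comes out bot > twitter > java), so this only shows the mechanism, not what the submitted engine actually indexes.

```python
import math
from collections import Counter, defaultdict

DOCS = {  # toy corpus, invented for illustration
    "d1": "java tutorial for beginners",
    "d2": "java twitter client library",
    "d3": "java twitter bot example",
    "d4": "java collections guide",
    "d5": "twitter api rate limits",
}


def build_index(docs):
    """Inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index


def idf(index, term, n_docs):
    """Rare terms get a large weight, common terms a small one."""
    df = len(index.get(term, {}))
    return math.log(n_docs / df) if df else 0.0


def doc_norm(text, index, n_docs):
    """Length of the document's tf-idf vector."""
    counts = Counter(text.lower().split())
    return math.sqrt(sum((tf * idf(index, t, n_docs)) ** 2 for t, tf in counts.items()))


def search(query, docs, index):
    """Rank matching docs by cosine similarity (query norm is constant, so it is dropped)."""
    n = len(docs)
    scores = defaultdict(float)
    for term in set(query.lower().split()):
        weight = idf(index, term, n)
        for doc_id, tf in index.get(term, {}).items():
            # Query weight (idf) times document weight (tf * idf).
            scores[doc_id] += weight * (tf * weight)
    ranked = []
    for doc_id, dot in scores.items():
        norm = doc_norm(docs[doc_id], index, n)
        if norm > 0:
            ranked.append((dot / norm, doc_id))
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in ranked]


INDEX = build_index(DOCS)
```

Because "bot" appears in only one document, its idf dominates the score, and a page that mentions "bot" once can outrank pages that are much more about Java or Twitter.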

Wow, the contrast between what this engine returns for that query and what Google returns is amazing. Literally zero relevant links from the former and only relevant links from the latter. Search relevance is a serious high-science research problem, and it's going to be tough to compete with established players that have probably man-centuries' worth of proprietary research IP and some of the world's best scientists.

> Search relevance is a serious high-science research problem

We are working on this scenario at http://www.shoten.xyz using document clustering and Apache Spark GraphX + Giraph.

It hasn't been an easy task so far, but we have made some headway.

I'd love a search engine which only indexes forums. Something I've been thinking of doing for years, but it'd be a lot of work.

There were a few attempts at that in the past, one being http://omgili.com/ that now seems to return pretty much garbage.

BTW About 12 years ago I was building this search engine, and I was toying with the idea of building a classifier that classifies web pages based on their "genre" rather than category, so you can limit your search for shopping websites, forums, blogs, news sites, social media, etc. It was a bitch to train, and my classifier's algorithm was pretty crappy, but it showed some potential.

I think modern search engines do that behind the scenes today, and try to diversify the results to include pages from multiple genres, but they usually don't let you choose.

Heh, classifying by "genre" is exactly what I was thinking of doing.

Had some debate with myself if I should start by focusing on training for shopping pages (product pages & product reviews) - because that might make some money; or start by training for forums - which I'd enjoy a lot more. Or build a more general system which would definitely never work and never get finished.

Google actually let you filter by "discussions" until a few years ago, so they certainly do this kind of classification. It didn't work perfectly but sometimes did the trick. Don't know why they removed that feature.

Google removed it because they aim at the mass market.

Another perspective: people who find answers in forums are less likely to be interested in ads. And who knows, maybe by making search shitty (in so many ways, not just forums), ad revenues rise?

IIRC I found that the easiest was to train on shops, forums, and porn. But another tricky bit was conceptual - genre and category overlap sometimes (e.g. porn). Anyway I couldn't get it to yield proper results. But today we have things we didn't back then like opengraph and schema.org tags, that give more semantic info.
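As a sketch of how those tags might help, the snippet below pulls Open Graph properties out of a page with the standard-library HTML parser and maps `og:type` to a coarse genre. The mapping table is hypothetical (and "product" is a commonly seen value rather than one of the core Open Graph types), and plenty of forums set no `og:type` at all, so this would only ever be one feature for a classifier.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collects <meta property="og:..." content="..."> pairs."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        prop = attr_map.get("property") or ""
        if prop.startswith("og:") and attr_map.get("content") is not None:
            self.properties[prop] = attr_map["content"]


# Hypothetical mapping from og:type values to coarse genres.
OG_TYPE_TO_GENRE = {
    "article": "blog/news",
    "product": "shopping",
    "video.other": "video",
}


def guess_genre(html, default="unknown"):
    """Return a genre guess based only on the page's og:type tag."""
    parser = OpenGraphParser()
    parser.feed(html)
    return OG_TYPE_TO_GENRE.get(parser.properties.get("og:type"), default)
```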

IIRC Gigablast had such a feature.

I thought about doing the same thing, found http://boardreader.com/ and then moved on. No idea how useful it is.

Often times when searching for something, I would love this as well. I usually add phpbb, forums or discussion to the search keywords, but it's never 100%.

I would love something like this as well...

Oh, is this the search wish-list thread? OK: I just want a search engine that only indexes the sites that would be of interest to people like me.

For most of my searches over the last year, Google has been so broken that it's almost unusable. At this point, to get any relevant results, I have to anticipate how Google will work, and then trial and error 10-15 times until I find what I'm looking for.

But, if I happen to be looking for incorrect information from 2005, Google works like a charm.

Another product (Discussions) that Google discontinued.

Probably the easiest way would be to use Google Search APIs and do custom queries and/or filtering.

That's been my pet idea, as well. Would love to support it however I can. boardreader.com is OK but could be a lot better.

I always had this, maybe stupid, idea of a vouching-referrals search engine. You pick a few sites you know are good and they (their 'webmasters') would vouch for new ones, and those could vouch for new ones, etc. The catch is, if one or two (whatever) of your child or grandchild vouched sites screw up, then they're toast, out of the index, but so are you. That way you would pick wisely.

The same idea would probably work for online commenting. Vouch with a chain of responsibility. That's essentially how PageRank did its thing, but with no repercussions, and the vouching was automatic, based on links from an initial seed of what they thought was good. I'd do it with humans.
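A minimal sketch of that chain of responsibility, to pin down the rules: every non-seed site has exactly one voucher, a reported site is dropped, and its parent and grandparent each take a strike and are dropped once they hit a threshold. The class name, the two-strike threshold, and the choice not to cascade a voucher's ban further up the chain are all assumptions made for illustration.

```python
class VouchIndex:
    """Toy index where every site (except seeds) has exactly one voucher."""

    def __init__(self, seeds, max_strikes=2, depth=2):
        self.voucher = {site: None for site in seeds}  # site -> who vouched
        self.strikes = {site: 0 for site in seeds}
        self.banned = set()
        self.max_strikes = max_strikes  # strikes before a voucher is dropped
        self.depth = depth              # 2 = parent and grandparent

    def is_indexed(self, site):
        return site in self.voucher and site not in self.banned

    def vouch(self, voucher, new_site):
        if not self.is_indexed(voucher):
            raise ValueError(f"{voucher} cannot vouch: not in the index")
        if new_site in self.voucher:
            raise ValueError(f"{new_site} was already vouched for")
        self.voucher[new_site] = voucher
        self.strikes[new_site] = 0

    def report(self, site):
        """Drop a misbehaving site and give its vouchers a strike each."""
        self.banned.add(site)
        ancestor = self.voucher.get(site)
        for _ in range(self.depth):
            if ancestor is None:
                break
            self.strikes[ancestor] += 1
            if self.strikes[ancestor] >= self.max_strikes:
                self.banned.add(ancestor)
            ancestor = self.voucher[ancestor]
```

For example, if `a.org` vouches for `b.com` and `c.com`, and `b.com` vouches for `d.com`, then reporting `d.com` and later `c.com` gives `a.org` two strikes and removes it, while `b.com` (one strike) stays in.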

This is very similar to PageRank except "vouch" is accomplished by linking to the other page.

Key difference being there's a responsibility of recommendations with consequences.

I always had an inkling for a Search Engine that ONLY indexed the root page of every domain. Not sure if I'm right about this, but it sure would sort the chaff from the wheat for general purpose queries.

Seems like that would just give you all those made-for-SEO sites that tend to have second-rate content at best. I.e., you search for 'best electric lawn mower', and you'll get bestelectriclawnmowers.com, 10bestelectricmowers.com, etc. Those sort of sites exist for every imaginable topic, and in my experience are rarely worth visiting.

I would almost want the opposite. The best content on most topics I've found tends to be a page on a discussion forum of some sort, followed by blogs and more general editorial sites.

So what we really want is a more granular search system, i.e. 'only forums/blogs', 'not shopping sites', etc.?

What we really want is a system that classifies your query as being one of "forum/blog/shopping" and then makes a scan only over that class of pages.

So on launch-day you would have checkboxes. On the one-year anniversary you'd have those checkboxes removed.

Google probably implemented this a couple of decades ago, though, so what we _really_ need is someone to come up with a new business model more attractive than Google's.
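The classify-the-query-then-scan-one-class idea from this comment can be sketched in a few lines. The hint words, the genre labels on pages, and the fallback to "blog" are all hypothetical; a real system would presumably learn the classifier rather than hard-code keywords.

```python
# Hypothetical hint words per genre; a real classifier would be trained.
GENRE_HINTS = {
    "shopping": {"buy", "price", "cheap", "deal"},
    "forum": {"how", "why", "help", "anyone", "problem"},
}


def classify_query(query, default="blog"):
    """Pick the genre whose hint words overlap the query the most."""
    words = set(query.lower().split())
    best = max(GENRE_HINTS, key=lambda genre: len(words & GENRE_HINTS[genre]))
    return best if words & GENRE_HINTS[best] else default


def search(query, pages):
    """pages: url -> (genre, text). Scan only pages in the query's genre."""
    genre = classify_query(query)
    terms = set(query.lower().split())
    return [
        url
        for url, (page_genre, text) in pages.items()
        if page_genre == genre and terms & set(text.lower().split())
    ]
```

The checkboxes from launch day would just be a manual override for `classify_query`.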

Ad-free is quite refreshing.

The sheer scale required to attempt a new search engine is pretty staggering... it seems like one area where decentralisation might actually be worthwhile; the key obstacle being everyone's interest in gaming search results. I wonder if there's a useful application of ledgers in there somewhere...

Why not have a search engine with "sub-reddits" that can be subscribed to...

Whereby - a site would self-identify as being in a particular genre, say "healthcare" - and I could launch a tab to the engine and set my sub to "health, health-tech, healthcare, medicine, etc.." and then do my search and only those sites that set their category will show up in that search - but if I don't find what I'm searching for, I can then easily slide out to other areas that I may not have thought what I was looking for would have identified with. Further - any post by any company/site could individually be given a topic to self-declare as... thus even if the company or site isn't necessarily in that space - their page or object could at least be a part of that result ranking....

Or has this been tried/found to be stupid?

> Or has this been tried/found to be stupid?

You are describing the keywords meta tag.

It is often claimed that Google's competitors did not use anything like PageRank. That is not true, but Google's PageRank algorithm was better and cheaper than the competitors' approaches, and it effectively killed your idea 20 years ago.

Appreciate the insight....

But I find it slightly ironic that people are bitching about PageRank having some issues with respect to the specificity of what they are searching for...

meaning that even though it "killed this idea twenty years ago" we are coming back to the same problem...

Is that perhaps just due to the volume of info that is available on the web and the much more complex way we have categorized (mentally, not digitally) all the knowledge and information that's out there now?

I've started appending "forum" to many of my queries when I am looking for answers from users.

A few years ago Google had a "discussions" category that you could pick alongside "images", "videos", etc. I wonder why they removed it. It was indexing forums, Google and Yahoo groups, etc.

I would love to have a search engine that would allow you to supply your own ranking function.

How would this work? You could boost a query term, for instance, the way it is possible to boost a column's score in PostgreSQL, but that is about all. Otherwise, letting users provide their own ranking function (which is itself an art) would not be practical performance-wise. It should be noted that the search engine's interface, the search box, is already a DSL for the underlying algorithm, one that supports OR/AND and NOT.

It's not Postgres or an RDBMS for text search. It's usually big data. For example, we use Apache Spark to query Parquet files on HDFS.

That is a lovely idea. Unfortunately, a scoring scheme has one foot in the indexing process (that thing that Googlebot does) and another in the querying part, so switching schemes would often mean you would need to re-index your data to cater for the new metrics you now need for a new type of scoring.
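For what it's worth, here is a toy sketch of what "supply your own ranking function" could mean at the API level, including the per-term boosting mentioned above. The function names and the `boosts` parameter are invented for illustration. It computes everything at query time over an in-memory dict, which sidesteps the re-indexing concern only because the corpus is tiny; it says nothing about whether this is practical at web scale.

```python
def term_frequency_score(doc_terms, query_terms, boosts=None):
    """Default ranking: boosted count of query terms in the document."""
    boosts = boosts or {}
    return sum(boosts.get(term, 1.0) * doc_terms.count(term) for term in query_terms)


def search(docs, query, rank_fn=term_frequency_score, **rank_kwargs):
    """docs: doc_id -> text. rank_fn is the user-supplied ranking function."""
    query_terms = query.lower().split()
    scored = []
    for doc_id, text in docs.items():
        score = rank_fn(text.lower().split(), query_terms, **rank_kwargs)
        if score > 0:
            scored.append((score, doc_id))
    # Highest score first; ties broken by doc_id for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


def presence_only(doc_terms, query_terms):
    """Example custom ranking: count distinct query terms present."""
    return len(set(doc_terms) & set(query_terms))
```

Calling `search(docs, "java bot", boosts={"bot": 3.0})` uses the default scorer with a boost, while `search(docs, "java bot", rank_fn=presence_only)` swaps in a different function entirely.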
Neither indexing nor querying does the ranking. Ranking is done after indexing and can use tf-idf, PageRank, or a combination of the two. Once the document's similarity to the query is calculated, for example by a vector space model, the documents are ranked by PageRank.

What OP is saying is that instead of PageRank we can have other ranking methods, which is surely plausible.
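To illustrate the second stage described here, below is a small power-iteration PageRank plus a rerank step over candidates that some similarity stage is assumed to have already selected. The damping factor of 0.85 and the fixed 50 iterations are conventional defaults picked for the sketch, and the link graph is made up.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: page -> list of pages it links to. Returns page -> score."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for other in pages:
                    new_rank[other] += damping * rank[page] / n
        rank = new_rank
    return rank


def rerank(candidates, rank):
    """Order candidates (already matched by similarity) by PageRank."""
    return sorted(candidates, key=lambda page: rank.get(page, 0.0), reverse=True)
```

Swapping in another ranking method would mean replacing the `rank` dict with some other per-page score.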

Sure, but what I was saying was: what good is a new ranking method when you only have at your disposal the same set of metrics as the method you are trying to replace? A new ranking would quite often mean adding new metrics. For example, when Lucene went from tf-idf to BM25 they added lots of new metrics to be able to cater for the new algorithm.
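The point about needing different statistics shows up in the textbook formulas. A bare tf-idf weight needs only the term frequency, the document frequency, and the corpus size, while Okapi BM25 also needs the document's length and the corpus-wide average length. The sketch below uses the standard formulas with the commonly cited defaults k1 = 1.2 and b = 0.75; it is not a claim about exactly which statistics Lucene stores.

```python
import math


def tfidf_score(tf, df, n_docs):
    """Textbook tf-idf weight for one term in one document."""
    return tf * math.log(n_docs / df)


def bm25_score(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Okapi BM25 weight for one term in one document."""
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    # Longer-than-average documents are penalised, shorter ones rewarded.
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)
```

Unlike the tf-idf weight, which grows linearly with `tf`, the BM25 weight saturates: repeating a term many times approaches but never exceeds `idf * (k1 + 1)`.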
Did Lucene go from tf-idf to Okapi BM25? Surprising. Need to look into it.

We use tf-idf too, but augment it with PageRank and clustering. That gets more relevant docs.

The "ideal" search engine is probably not possible without some sort of AI having access to all the content on the internet.

> some sort of AI having access to all the content on internet

Isn't this a description of any decent search engine?