by boomer918 1543 days ago
These solutions don't answer any of the fundamental problems with Google:

- who pays for the service (ads? users pay? Average user will never use a paid service if a free one is available)

- how to resist attacks against the algorithm (Google has been fighting spam for decades)

- how to personalize without invading privacy, e.g. Google had an option to search through your email in Google search...it's gone now, I wonder why?

8 comments

Kagi is "users pay". Yes, average users won't pay, but I don't see how that matters to me as a Kagi user.
Another important feature of Kagi that I'm paying close attention to is: they are currently privately bootstrapped as far as funding goes.

The fact that Kagi is not currently VC funded is huge for me as far as adoption goes. Every customer-facing VC-funded startup I've worked at inevitably starts to institute increasingly anti-user practices while grinning and talking about "customer first!"

I know it's a huge ask, but if Kagi remains privately funded/non-VC I'll happily pay the moment I can.

I've only been using the service for a short while now but have been enjoying it a lot. The ability to blacklist domains has already dramatically improved my search.

Yes, I noticed this too.

I'm afraid of what will happen to 1Password since they got big funding recently.

A paid search engine? Bold. I'd pay if it could do anything close to what the old google code search could do. I miss it every day.
It's actually so good I plan to pay when they start charging.
It's really good in some ways, and very lacking (for me) in others. The search is fine. Good enough that I would switch. However, if I was out and about and quickly needed directions to Walmart, my normal flow with Google is

Pull out phone -> Safari -> type "Walmart" -> click on the map -> Maps app opens and starts guiding me

When I switched to Kagi, the flow went like

Pull out phone -> Safari -> type "Walmart" -> top result is Walmart.com... no address to nearest Walmart to be found -> Close Safari -> Open Maps app -> search for Walmart

And it got so annoying to have those extra steps. I know I can change my workflow and get used to it. But it wasn't just directions. It was other basic searches, like if I needed a phone number for a local business. It drove me crazy. I really hope Kagi gets better at those sorts of things. I want them to succeed. But it was just too much friction for me.

We found similar issues at https://you.com a while back. We just had to be good for more query families. Now we have both the walmart locations in a map app, eg https://you.com/search?q=walmart and coding related useful results, eg. https://you.com/search?q=how%20do%20I%20find%20all%20files%2...

or https://you.com/search?q=pyspark%20filter%20array%20element

Too bad you.com does not work in Firefox with the "beacon" function disabled (beacon.enabled set to false in about:config).
I agree this disrupts my muscle memory, but if we're honest, that particular workflow isn't necessarily optimal. It's just what Google has trained us to do. If you're seeking directions to Walmart, the most efficient (and one might argue intuitive) method is to open your maps app directly, and type in "Walmart".
Kagi supports DDG bangs. So just type `Walmart !g` and you will get the google search output
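For anyone curious how a bang like `!g` could be handled, here is a minimal sketch. It is not Kagi's or DDG's actual implementation: the `BANGS` table and the `resolve_bang` function are hypothetical names, and only the `!g` bang itself comes from the comment above.

```python
from urllib.parse import quote_plus

# Hypothetical bang table; only "g" (Google) is taken from the comment above.
BANGS = {"g": "https://www.google.com/search?q="}


def resolve_bang(query):
    """Return a redirect URL if the query contains a known bang, else None."""
    words = query.split()
    for word in words:
        if word.startswith("!") and word[1:] in BANGS:
            # Drop the bang itself and forward the remaining words.
            rest = " ".join(w for w in words if w != word)
            return BANGS[word[1:]] + quote_plus(rest)
    return None


print(resolve_bang("Walmart !g"))
```

A query without a known bang returns `None`, so the engine would fall back to its own results.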
I’ll do the same as well. Kagi has literally improved my life to a significant degree.
Same. I switched to it for a day just to give it a try and I never switched back. Absolutely willing to pay if the quality stays the same.
It's free while they're in beta. There's a "waitlist" but put your email on it and you'll get an invite within a week. Give it a try.

I've been using it for a couple weeks now on my work laptop and for programming-related searches it's been great so far. And the usual annoyances that show up at the top of Google and DDG don't show up on Kagi (geeks4geeks, etc).

The cost per search query is incredibly small. You could probably have <1% premium users. If your search engine is used by a billion people a day, 10 million paying users are still enough to pay a few thousand people developing your product (depending on your location).
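The arithmetic above can be written out as a back-of-the-envelope sketch. The billion users and the 1% premium share come from the comment; the monthly price and the headcount are assumptions picked purely for illustration.

```python
TOTAL_USERS = 1_000_000_000  # from the comment: a billion users
PREMIUM_PERCENT = 1          # from the comment: "<1%", taken here as the upper bound
PRICE_PER_MONTH = 5          # assumption: USD per paying user per month
HEADCOUNT = 3_000            # assumption: one reading of "a few thousand people"

paying_users = TOTAL_USERS * PREMIUM_PERCENT // 100
monthly_revenue = paying_users * PRICE_PER_MONTH
revenue_per_employee = monthly_revenue / HEADCOUNT

print(f"paying users: {paying_users:,}")
print(f"monthly revenue: ${monthly_revenue:,}")
print(f"revenue per employee per month: ${revenue_per_employee:,.0f}")
```

Whether that covers the free tier depends on the real cost per query, which the sketch does not model.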

I believe products can work that way, offer premium service and features for the very few that need it and a basic service to anyone else. In the end, the free tier is cheaper than what you'd have to spend on marketing otherwise.

I am a Kagi beta user, and I intend to pay when they start charging. I have wanted a paid ad/tracker-free Google for a decade now. I want to be the customer, not the product.
I guess it depends on whether you consider "The Next Google" to imply that it becomes the dominant company in the space, used by every "average user", or if it's enough to be a niche solution for highly technical users who prefer it to Google.
So the question will be "Will the average user prefer to feed on free junk food when healthy food is ten bucks a month?"

Chances are there will be both.

Also important for anyone actually thinking of taking Google on, very few of the features listed are things Google can't easily do, too. Attacking their strengths is crazy. You better have something both crazy good and hard to replicate by someone with more money than god.

Whatever replaces Google will be doing something that Google can't without causing them other problems. The first thing that comes to mind is make them choose traffic vs. advertisers (I don't know, if I had an idea of how to, I would not be writing this), but they're big enough that other wedges could start chipping away at their margins.

Actually, you are spot on. One simple feature of the Neeva app is that it shows inline search results as you type into the URL bar. This is because we aren't trying to show you ads, so we don't need you to visit the search results page (where Google and others show you those ads). We just show you the results straight away in the suggest experience. Now, this isn't going to show you everything you care about and you can still click to see the search results page. It is just handy to be able to quickly get to where you are trying to go, especially if it is likely to match what you are looking for (e.g., a Wikipedia link). This is something Google cannot bring itself to do because it would cost way too much in terms of lost ad revenue. There are other examples like this where Google and other ad-supported search engines just can't innovate, can't change the search experience. The current way of searching is too lucrative and there is too much business inertia around it. That's why Neeva is interesting and why I left Google to join and help :)
But that is exactly how Google searches worked on desktop platforms for more than half a decade (Instant Search), not some kind of a new idea. Given how long they kept that feature on, it seems pretty obvious that it can't have been the kind of revenue killer you suggest. If you can serve and display search results for a given possibly partial query, you can obviously serve ads too.
I was talking about mobile. As for desktop, Instant Search was serving up full page results instantly, which included ads. That's a different thing altogether, and of course, in the case of Instant Search there was plenty of room for both sponsored results as well as real results. On mobile there isn't.
I think this comment is a bit strange in the present, considering search engines like duckduckgo, which is basically Bing promoted with a "we don't track" advertising campaign (also hashbangs are pretty cool). DDG is not at google numbers, I know, but you don't need google numbers to make money. I don't think privacy is a very special angle to advertise from either, promising to remove amazon-affiliate blog-spam from results for example, would be a major feature in this space as far as I'm concerned. Being able to edit searches, and potentially gain some intuition for how the search space is set up, might be a much more significant feature, depending on how people take to it. It might flop but atm I'm excited to check it out
> but you don't need google numbers to make money

The article is titled "The Next Google." I was responding to that, not "A Profitable Also-Ran".

While "The Next Google for Wall Street" is one interpretation of "The Next Google", I am more interested in "The Next Google for me".
I've long wondered about a search engine that doesn't index a page that has AdSense or similar code. There'd be a lot of collateral damage, but it would knock out a lot of annoying made-for-adsense sites at least. Basic thinking, but along the lines of what you suggested.
> - who pays for the service (ads? users pay? Average user will never use a paid service if a free one is available) - how to resist attacks against the algorithm (Google has been fighting spam for decades)

The solution for both of these might actually be a paid service. If you have a paid service, there is a possibility of it being profitable with far fewer users. As an example, let’s say you have 1,000,000 users at $10/month: that is $10,000,000/month, which might be enough to run the service and provide a comfortable profit.
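Spelled out as a tiny sketch, using only the two figures from the example above (`monthly_revenue` is just an illustrative helper name):

```python
def monthly_revenue(paying_users, price_per_month):
    """Gross monthly subscription revenue, before any costs."""
    return paying_users * price_per_month


revenue = monthly_revenue(1_000_000, 10)
print(f"${revenue:,}/month")  # prints "$10,000,000/month"
```

Running costs are not modeled here, so this is gross revenue only.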

With regards to the spam issue, the fact that you have a small user base would be to your advantage. Because there are so many Google users, it is in websites’ economic interests to spend money to try to game the algorithms. With much fewer users, your paid search users may not be worth it for the sites to spend money trying to game your algorithms.

It will still pay to put ads into paying customers' feeds. It's more or less inevitable if the customers tolerate it. If you think that's impossible I'd point to how streaming services are now serving up ever more ads.
The spam issue can trivially be addressed by implementing actual penalties for rule-breakers. If it takes a long time to acquire a good reputation & ranking on the search engine, you're unlikely to risk it by doing something nasty in fear of your domain, keyword or brand name being banned for a long time.
Adding on to this, customization is nice but customization is not why DuckDuckGo isn't as good as Google. The reason nothing is as good as Google is because Google indexes way more content than every other service that I'm aware of
Lately Google has gotten so bad that I've even occasionally brought up Bing (which I've often referred to as the Zune of search engines) and gotten better results. It's looking more and more to me like Google has stopped trying to improve things and are simply milking their dominance for all it's worth.
Is that the key though? There's a lot of stuff I'd be quite happy if they didn't index!
IMO yes, because that’s the main difference between Google search and everyone else. Out-index Google, and it becomes possible to beat.

That said, I do like the feature of ignoring entire domains. Google used to allow that

Seems the first of these can be solved by reducing the scope. Do you really need a data center to run a search engine?

Overall it seems very rare anyone ever considers this an engineering problem. Really, what's stopping you from running a search engine?

Really? Or are you the one I should have refrained from feeding.

But if you must know:

First you need to collect a lot of content from the internet. From many different sites. With very different types of code structure. Broken html. More often than not behind some SPA JS code. Behind robots.txt files and bot protection efforts.

So the first problem to solve would be building a crawler at scale. That is able to crawl anything your users might want to visit but don't know of yet.

Then storage and retrieval. You need to store and update all this content your crawler collected. You need to enrich it with metadata and organize it for efficient retrieval, so that you can surface it to your users when they use your search engine. Indexing, structure, building connections between content pieces. A lot of interesting things to think about.
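To make the indexing and retrieval step a bit more concrete, here is a toy inverted index. It is only a sketch of the general idea (word to document ids, queried with AND semantics) and ignores everything that makes this hard at scale. The sample documents and the function names are made up for illustration.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample documents.
docs = {
    1: "pyspark filter array element",
    2: "find all files",
    3: "filter files by name",
}
index = build_index(docs)
print(search(index, "filter files"))
```

Ranking, stemming, and on-disk storage are all left out on purpose.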

Then there is the front end. Make it easy to search, to refine. Surface relevant content for search queries.

OH maybe I forgot, but you probably need to do a bit of engineering to make your system understand the users' search intent.

This is relatively straightforward for a limited search and document space up to a few million entries in your DB. A few million documents should be doable with off the shelf parts.

Bigger than that? I would applaud you if you did it with orders of magnitude fewer resources than Google. Anyone would.

All of this is a long series of solvable problems. I should know, I've dabbled in solving most of them. This is why I suggest actually taking a stab at it before you dismiss it as impossible.

There are some problems that aren't as big as they seem. Parts of an SPA can't be reliably linked to anyway even if you find interesting text there, so you can just leave them out of the index.

Likewise, there isn't as great of a need to keep a fresh index as it may seem. The odds of a document having changed since you last crawled it are proportional to how frequently it changes. This is a bit of a paradox, where even if you crawl really aggressively, the most frequently changing documents will still always be out of date. Most documents are relatively stable over time. You can actually use how often you see changes to a document or website to modulate how often you crawl it.
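One simple way to do that modulation is sketched below. This is an assumption about how it could work, not a description of any particular crawler: the halve/double rule, the bounds, and the `next_interval` name are all illustrative.

```python
def next_interval(current_hours, changed, min_hours=1.0, max_hours=24.0 * 30):
    """Pick the next recrawl interval for a document.

    If the document changed since the last crawl, come back sooner
    (halve the interval); if it did not, back off (double it).
    The result is clamped to [min_hours, max_hours].
    """
    if changed:
        new_hours = current_hours / 2
    else:
        new_hours = current_hours * 2
    return max(min_hours, min(max_hours, new_hours))


# A document that never changes drifts toward the maximum interval.
interval = 24.0
for _ in range(10):
    interval = next_interval(interval, changed=False)
print(interval)
```

Stable documents end up being visited rarely, which leaves crawl budget for the ones that actually change.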

The bad HTML is quite manageable. You really just need to flatten the document to get at the visible text. Even with really broken formatting, that's manageable.
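As a rough illustration of flattening a document to its visible text, here is a sketch using Python's standard-library `html.parser`, which tolerates unclosed tags. The `TextExtractor` class and `visible_text` function are illustrative names, and skipping only `script` and `style` is a simplifying assumption.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of script and style tags."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def visible_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered at end of input
    return " ".join(parser.parts)


broken = "<p>Hello <b>world<p>unclosed<script>var x = 1;</script> tags"
print(visible_text(broken))
```

The parser never checks that tags are balanced, which is exactly why broken markup is not a problem here.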

The storage demands are also not as bad as you might think (most documents are tiny, under 10 KB), and there are ways to lessen the blow on top of that. Both text and indexes can compress extremely well. Since you're paying for disk access by the block, you might as well cram more stuff into a block.
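One common way index data gets compressed is delta encoding plus variable-length integers: store the gaps between sorted document ids and spend fewer bytes on small gaps. The sketch below shows that general technique; it is not necessarily what any particular engine uses, and the function names are illustrative.

```python
def encode_postings(doc_ids):
    """Delta-encode sorted doc ids and pack each gap as a varint (7 bits per byte)."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        gap = doc_id - prev
        prev = doc_id
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)  # high bit set: more bytes follow
            gap >>= 7
        out.append(gap)
    return bytes(out)


def decode_postings(data):
    """Inverse of encode_postings."""
    doc_ids = []
    current = 0
    gap = 0
    shift = 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            current += gap
            doc_ids.append(current)
            gap = 0
            shift = 0
    return doc_ids


ids = [3, 7, 1000, 1001, 500000]
packed = encode_postings(ids)
print(len(packed), "bytes instead of", 4 * len(ids), "as 32-bit integers")
```

Dense posting lists have small gaps, so most entries fit in a single byte.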

Most of the crawling concerns, in general, can be gotten around by starting off with Common Crawl (even if I do my own crawling, which also is finicky but manageable).

> This is relatively straightforward for a limited search and document space up to a few million entries in your DB. A few million documents should be doable with off the shelf parts.

Right, so shouldn't the question be how to find the documents that are even candidates for being search results? Most documents are not ever going to be relevant to any query ever. Get rid of that noise and your hardware goes a lot longer.

I'm running a search engine on consumer hardware out of my living room that can index 100 million documents. Go a bit higher budget than a consumer PC, and you've got 5 billion. That goes a long way.

> Get rid of that noise and your hardware goes a lot longer.

What qualifies? What defines signal, what noise? I agree, that a lot (probably nearly all) pages will receive very, very little traffic/search requests. But are these therefore not relevant?

> I'm running a search engine on consumer hardware out of my living room that can index 100 million documents.

That's extremely cool. I would love to know more. To me an impressive feat already.

I think I was editing the comment while you were replying. Sorry about that. I was just adding to it though, didn't really rug pull on your response so I think it's fine.

> What qualifies? What defines signal, what noise? I agree, that a lot (probably nearly all) pages will receive very, very little traffic/search requests. But are these therefore not relevant?

Now this is a proper difficult problem with (probably) fairly subjective answers. I do however think it's something that warrants serious investigation. It's probably a decent candidate for a machine learning model combined with some manual tweaking for sites similar to wikipedia or github that have absurd amounts of parallel historical content.

Developing heuristics for this is a bit of a hobby horse of mine. It feels tantalizingly almost doable with just a little bit more resources and time than I have.

> That's extremely cool. I would love to know more. To me an impressive feat already.

Yeah it's at <https://search.marginalia.nu/>. I've built all the software myself from scratch in Java[1], and I'm doing my own crawling and indexing. The machine it's on is a Ryzen 3900X with 128 GB RAM. Most of the index is on a single 1 TB consumer grade SSD.

I do use a MariaDB database for some metadata, but I think it will have to go as its hardware demands are becoming a serious bottleneck.

[1] That's despite using Java, I should say, regarding the index. This approaches sunk cost at this point. Building a search engine index is not something Java is at all suitable for; its limited low-level I/O capabilities are incredibly handicapping.

> Building a search engine index is not something Java is at all suitable for

Worth pointing out that Lucene/Solr, the biggest open source player, is also Java!

> I [...] didn't really rug pull on your response so I think it's fine.

No you didn't. All good. And I learned a lot from the extended answer. So I am thankful for the explanation.

> Developing heuristics for this is a bit of a hobby horse of mine. It feels tantalizingly almost doable with just a little bit more resources and time than I have.

I can totally understand the feeling. There are quite a few things that I'd like to go deeper into either at work or in private. But alas time.

> Now this is a proper difficult problem with (probably) fairly subjective answers.

I agree. And I don't have answers ready. A lot boils down to preference. Personally, for example, I prefer written content over video, except in a few areas where I like (some) explanatory videos. To me it comes down to the question of how easily I can skim the content when I am looking for an answer.

On the other hand - for deep immersion into a topic I use multiple media formats.

In terms of web search I sadly nowadays need to sift through a lot of seo-fied content that is there either to build a (personal) brand or to attract clicks for advertising revenue/affiliate revenue.

So in principle I agree with you on the noise problem. Still, I also believe that there are real gems to be found in the long tail. I still feel like I came late to the party, but when I started out on the web in '97 there were so many lovely, quirky sites. So many places that people had put a lot of time, energy and thought into. And sites so packed full of information that I came away not only with more knowledge, but in awe that somebody would give this knowledge away for free.

There also were quite a number of horrible sites (my first ones probably included). So there was a noise vs. signal problem back then too. Maybe not to the extent of today, though.

> The machine it's on is a Ryzen 3900X with 128 GB RAM. Most of the index is on a single 1 TB consumer grade SSD.

Call me impressed. Sounds absolutely cool.

So even with a raid setup for redundancy this is doable.

May I ask how you decide to add new content? Do you follow links? Do you use other search engines' results as a starting point?

I could probably shoot many more questions, but don't want to be a nuisance.

Thanks for your time already.

You should try looking at people's profiles on HN - just click on the username.
Why? I don't change my reply based on the author. I reply to a statement to the best of my knowledge regardless of the author behind it.

And I learned already a lot in this thread after the explanations unfolded.

The initial statement sounded exactly like the armchair "experts" one so often encounters. Actually, this was the first time in a long while that there was a person with substantial experience in the problem space behind such a statement.

Out of curiosity, how much disk space does your index currently use, and what's the storage hardware (SSD or spinning rust)?
The reverse index is 180 GB, on an SSD. I do think using SSDs is a major part of why this is possible on consumer hardware. I'd need a lot of spinning rust to get the sub-100ms response times I can get when the index is warmed up.

Should be said I do wear through this SSD at a pretty alarming rate. I'm at 193 TBW on this disk since I started using it as an index less than a year ago.

I do have a bunch of mechanical drives I use for archiving and as intermediate working areas as well, but the index itself is on an SSD.

Thanks - I'd be keen to try this at some point, if anything just for personal usage. I've got more than enough hardware CPU & RAM-wise, if all it takes is getting a few TBs worth of solid-state storage it seems like a no-brainer.
Since you didn't seem to notice the username: you are replying to a person who developed a search engine alone. (So be prepared to applaud.)
I do think there's actually some space opening up for paid services.

From what I'm seeing, if you could create a bot-free ecosystem, people will pay for it.

The question is "can you make it bot free". This is gonna be the next trillion dollar company.

Raising the cost of spam would be a good first step.

At the moment, spamming Google seems to be trivial with no long-term penalties if you get caught doing something nasty.

A simple rule (manually enforced on a case-by-case basis) that would ban your brand/domain for a year if you get caught breaking the rules would get Pinterest into compliance from day 1 for example.

Using ads/analytics/affiliate links as a negative ranking signal would make a lot of blogspam/listicles/clickbait disappear if their only funding method immediately makes them rank much lower below where they are no longer profitable.
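A very rough sketch of what such a negative signal could look like. Everything here is hypothetical: the marker strings are a tiny illustrative sample, the multiplicative penalty is an arbitrary choice, and `adjusted_score` is not any real engine's ranking function.

```python
# Hypothetical markers for ad, analytics, and affiliate code in a page's HTML.
# A real list would be far longer and would need to avoid false positives.
AD_MARKERS = (
    "pagead2.googlesyndication.com",  # ad script host
    "google-analytics.com",           # analytics script host
    "?tag=",                          # crude stand-in for an affiliate link parameter
)


def adjusted_score(base_score, html, penalty_per_marker=0.25):
    """Scale a page's relevance score down once for each kind of marker found."""
    hits = sum(1 for marker in AD_MARKERS if marker in html)
    return base_score * (1 - penalty_per_marker) ** hits


plain = "<p>Just an article.</p>"
monetized = '<script src="https://pagead2.googlesyndication.com/x.js"></script><p>Top 10 gadgets</p>'
print(adjusted_score(1.0, plain), adjusted_score(1.0, monetized))
```

Plain substring matching like this is trivial to evade, so it only illustrates the shape of the idea.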

This would be easily exploitable by a competitor. For example, search engines (used to) rank back links - that is other domains pointing to your domain. Some bad actors took advantage of this by creating rings of sites that voted each other up. Google responded by punishing the behavior. Then, competitors started taking advantage of this punishment by creating a network of sites that backlinked to a competitor, so they would get punished instead.

This isn’t a hypothetical example - Google actually includes in their webmaster tools a “disavow links” capability so sites can avoid getting punished for bad actors trying to make them look bad. But you can imagine if the penalties were even more severe other folks may get caught up in an unforgiving dragnet with no judge or jury and no way to appeal.

My main point is that people will find ways to game the system, and usually sharp edges (“harsh punishments”) on any system will be taken advantage of by actors, and unfairly penalize others.

Agreed, I'm not saying this is the end-game or that it will be perfect. But a simple rule (that's actually enforced) saying that you are forbidden to serve a different experience to the Google bot vs a normal visitor would take care of Pinterest for example, and they're not even doing that despite it being a major complaint especially in tech-circles where Googlers no doubt lurk.
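Checking that rule could, in the simplest case, mean fetching a page twice (once as the crawler, once as a normal visitor) and comparing what comes back. The sketch below only does the comparison step on two strings. The 0.8 threshold and the `looks_cloaked` name are assumptions, and nothing here reflects how Google actually detects this.

```python
from difflib import SequenceMatcher


def looks_cloaked(bot_html, visitor_html, threshold=0.8):
    """Flag a page whose crawler and visitor versions differ substantially."""
    similarity = SequenceMatcher(None, bot_html, visitor_html).ratio()
    return similarity < threshold


same = looks_cloaked("<p>same page</p>", "<p>same page</p>")
different = looks_cloaked(
    "full article text about walmart directions",
    "please log in to continue",
)
print(same, different)
```

Legitimate differences such as rotating ads or timestamps would need to be filtered out before comparing, which this sketch ignores.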
Average users may not pay, but specialized users may pay and pay more than enough to subsidize some sort of free tier.

Not to mention, if free search engines keep devolving into an endless sea of spam, people may have no choice but to start paying. There's plenty of things out there people pay for not necessarily by choice but because there's nothing else out there that would accomplish the task at hand.

> Average users may not pay, but specialized users may pay and pay more than enough to subsidize some sort of free tier.

If I can, I'm happy to pay so that others don't need to. I don't understand why I'm in the minority and most people think only about themselves.

I'm happy to pay because for many years I also had to use free services, paid for by others (or by ads, though I block ads however I can).

I would pay for an app that searches my stuff and provides some kind of intelligent agent for web knowledge.