	by nobodywasishere 22 days ago
	> didn't realize Kagi had no aspirations to build their own general purpose index

	Kagi employee here. We're actively working on building our own indexes beyond the limited ones we have now, not just a general index but also purpose built indexes for things like programming, etc.

8 comments

rpdillon 22 days ago

I did not intend to spread misinformation here, and would like to hear more about the general-purpose index Kagi is working on. I had based my comment on several Kagi pages, but mostly https://help.kagi.com/kagi/search-details/search-sources.htm..., which mentions Teclis as Kagi's own index, but https://teclis.com/ makes it pretty clear that it's a "small web"-focused tool:

> Teclis is an attempt to surface the less known web, the web of creativity and self expression, the more humane web.

> Teclis includes its own crawl as well as results from Kagi Small Web index and results with permission from Marginalia Search.

> Teclis works best with broad queries such as 'machine learning', 'vegan diet', 'religion' etc..

Is there another crawler doing the general-purpose stuff?

nobodywasishere 22 days ago

Not in production yet, so will have more to share in time.

rpdillon 22 days ago

Fair enough, thanks for replying!

esperent 22 days ago

How broad will they be? Do you aim to ever have large scale indexing of the web?

Xunjin 22 days ago

Hey do you guys have posts or sharing about it? It would be awesome to see what you are trying to accomplish, maybe it's time to post on HN ;)

echelon 22 days ago

Hurry. Google might give up the ghost on its search product and maintaining indices on anything not geared for LLMs.

I'm not sure antitrust will help you.

tokai 22 days ago

So you will stop buying Yandex data at some point?

sph 22 days ago

I care more about breadth of the dataset than politics in my search engine, thank you very much.

For everybody else there's Google I guess.

ljm 22 days ago

What are the challenges of doing that when so much of the internet has turned itself into SEO slop to fit Google's algorithms?

I imagine there is still a whole load of stuff out there on the internet that Google would never surface because it doesn't have enough adsense or whatever. Are you finding that?

account42 21 days ago

There is a very easy 90% solution to that: massively downrank everything with ads with some exceptions added as needed.
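
A minimal sketch of what that heuristic could look like, not anything a real search engine is known to use. The `Result` fields, the `AD_PENALTY` factor, and the `ALLOWLIST` entries are all hypothetical, and detecting ads in the first place is assumed to happen elsewhere:

```python
from dataclasses import dataclass

# Hypothetical values chosen for illustration only.
AD_PENALTY = 0.1              # multiply the score of ad-carrying pages by this
ALLOWLIST = {"example.org"}   # domains exempted from the penalty


@dataclass
class Result:
    domain: str
    score: float    # base relevance score from the ranker
    has_ads: bool   # assumed to come from a separate ad detector


def adjusted_score(result: Result) -> float:
    """Downrank results that carry ads unless their domain is allowlisted."""
    if result.has_ads and result.domain not in ALLOWLIST:
        return result.score * AD_PENALTY
    return result.score


def rerank(results: list[Result]) -> list[Result]:
    """Return results sorted by adjusted score, highest first."""
    return sorted(results, key=adjusted_score, reverse=True)


results = [
    Result("adfarm.example", 0.9, True),
    Result("blog.example", 0.5, False),
    Result("example.org", 0.8, True),
]
ranked = rerank(results)
```

Here the ad-heavy page with the highest base score ends up last, while the allowlisted domain keeps its score despite having ads.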

nobodywasishere 22 days ago

That's what SlopStop is for :) developing methodologies that scale for detecting slop.

I mean it sounds like that already has a lot of overlap with our Small Web indexing efforts, so that part of our indexing efforts could be an extension of that. A lot of this is still in development though so I can't speak on specifics just yet.

Dwedit 22 days ago

How do you build a search index in the days of Anubis pages everywhere?

account42 21 days ago

Anubis is easy, just use a whitelisted user agent or a headless browser if some sites disable that - you need one to index web app abominations anyway. Cloudflare and Google reCaptcha are bigger problems.

majorchord 20 days ago

https://roundproxies.com/blog/how-to-bypass-anti-bots/

xnx 22 days ago

> We're actively working on building our own indexes

Lip service. You'll have some token index of Wikipedia or something so you can say your results are "a blend of our own index and other sources".

freehorse 22 days ago

Wikipedia is prob in "other sources", as they actually say they have a direct license for it.

https://blog.kagi.com/waiting-dawn-search#:~:text=Wikipedia,...

account42 21 days ago

Lol everyone has a direct license for Wikipedia.

nobodywasishere 22 days ago

It's funny you say that as we just switched the Wikipedia widget over to our own internal index. We don't intend to stop there either.