by parhamn 482 days ago
I think the magic of Grok's implementation of this is that they already have most of the websites cached (guessing via their Twitter crawler), so it all feels very snappy. Bing/Brave search don't seem to offer that in their search APIs. Does such a thing exist as a service?
4 comments

I’ve been wondering about this and searching for solutions too.

For now we’ve just managed to optimize how quickly we download pages, but haven’t found an API that actually caches them. Perhaps companies are concerned that they’ll be sued for it in the age of LLMs?
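To make "optimize how quickly we download pages" concrete: one common approach (not necessarily what we or anyone else in this thread actually runs) is to fetch the result URLs in parallel instead of one by one. A minimal sketch using only the Python standard library; `fetch_url` and `fetch_all` are hypothetical names, and there is no JS rendering or retry logic here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Optional, Tuple
import urllib.request


def fetch_url(url: str, timeout: float = 10.0) -> str:
    """Download one page with the standard library (no JS rendering)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def fetch_all(
    urls: Iterable[str],
    fetch: Callable[[str], str] = fetch_url,
    max_workers: int = 8,
) -> Dict[str, str]:
    """Fetch URLs in parallel; pages that fail to download are skipped."""

    def safe_fetch(url: str) -> Tuple[str, Optional[str]]:
        try:
            return url, fetch(url)
        except Exception:
            # One blocked or slow site should not sink the whole batch.
            return url, None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(safe_fetch, urls)
        return {url: body for url, body in results if body is not None}
```

Total latency is then roughly that of the slowest page rather than the sum of all of them, which helps but still is not the same as having the pages cached up front.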

The Brave API provides ‘additional snippets’, meaning that you at least get multiple slices of the page, but it’s not quite a substitute.

Web search APIs can't present the full document due to copyright. They can only present the snippet contextual to the query.

I wrote my own implementation using various web search APIs and a puppeteer service to download individual documents as needed. It wasn't that hard but I do get blocked by some sites (reddit for example).
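For anyone wanting to try the same thing, the "download as needed" part can be paired with a small cache so repeat lookups feel snappy. This is only a rough sketch under my own assumptions, not the actual implementation: `PageCache` is a made-up name, and `fetch` stands in for whatever does the real download (a puppeteer service, a plain HTTP client, etc.).

```python
import time
from typing import Callable, Dict, Tuple


class PageCache:
    """In-memory cache so repeat lookups skip the slow download."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch        # hypothetical downloader, injected by the caller
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, url: str) -> str:
        now = self._clock()
        hit = self._store.get(url)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]          # fresh copy, no network round trip
        body = self._fetch(url)    # miss or stale entry: download as needed
        self._store[url] = (now, body)
        return body
```

Usage would be something like `cache = PageCache(my_fetch)` and then `cache.get(url)` for each search result you want to read in full. It does nothing about getting blocked, and it only helps for pages someone has already requested once.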

Google and Bing's Cache, Archive.org, Archive.is, CommonCrawl... many services have previously or currently presented the full document.

Google and Bing removed their cache features when LLMs started taking off – as I said in a sibling comment, I wonder if they felt that that regime was finally going to be challenged in court as people tried to protect their data.

That being said, "can't present the full document due to copyright" seems at odds with all of the above examples existing for years.

(founder here) We are working on the problem of providing a deeper level of search, especially on proprietary/copyrighted datasets (think reference works, books, papers, etc.). We are working with a number of large publishers on this.

We started off with arXiv papers to test out the product and would love to get feedback :)

https://exchange.valyu.network/

Is this true? Wouldn't all the "site to markdown" type services be infringing then?

exa is your answer i think? https://latent.space/p/exa

the common crawl dataset is rather massive, though I can't speak to how well it would perform here

http://commoncrawl.org
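
If you want to experiment with it as a page cache, the rough flow as I understand it is: ask the Common Crawl URL index where a page was captured, then pull just that record out of the archive file with an HTTP range request. The sketch below only builds the requests and parses the index reply (no network calls). The endpoint shape and the `filename`/`offset`/`length` fields are from my reading of their index server, so treat them as assumptions and check the current docs; the crawl ID is one you pick from their published list.

```python
import json
from typing import Dict, List, Tuple
from urllib.parse import urlencode

# Hosts as I understand them; verify against the Common Crawl docs.
INDEX_HOST = "https://index.commoncrawl.org"
DATA_HOST = "https://data.commoncrawl.org"


def index_query_url(page_url: str, crawl_id: str) -> str:
    """Build the index lookup URL for one page within one crawl."""
    params = urlencode({"url": page_url, "output": "json"})
    return f"{INDEX_HOST}/{crawl_id}-index?{params}"


def parse_index_response(text: str) -> List[Dict[str, str]]:
    """The index replies with one JSON object per line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def warc_request(record: Dict[str, str]) -> Tuple[str, Dict[str, str]]:
    """Return the archive URL and Range header covering a single capture."""
    start = int(record["offset"])
    end = start + int(record["length"]) - 1   # HTTP byte ranges are inclusive
    url = f"{DATA_HOST}/{record['filename']}"
    return url, {"Range": f"bytes={start}-{end}"}
```

What comes back from the range request should be a gzip-compressed WARC record that you still have to decompress and parse, and the capture may be months old, so it is a trade of freshness for not having to crawl the site yourself.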