by binarymax 485 days ago
Web search APIs can't present the full document due to copyright. They can only present the snippet contextual to the query.

I wrote my own implementation using various web search APIs and a Puppeteer service to download individual documents as needed. It wasn't that hard, but I do get blocked by some sites (Reddit, for example).
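For anyone curious what that shape looks like, here is a rough sketch, not my actual code. It assumes the search API hands back a URL plus a snippet, and that `fetch_page` is some callable (hypothetical here; in my case it would wrap the Puppeteer service) that returns the page text, or returns nothing or raises when the site blocks you. The names `SearchHit`, `Document`, and `hydrate` are made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class SearchHit:
    url: str
    snippet: str  # query-contextual excerpt returned by the search API


@dataclass
class Document:
    url: str
    text: str
    full: bool  # True if the whole page was fetched, False if only the snippet


def hydrate(
    hits: Iterable[SearchHit],
    fetch_page: Callable[[str], Optional[str]],  # hypothetical page fetcher
) -> List[Document]:
    """Try to replace each snippet with the full page; keep the snippet if blocked."""
    docs: List[Document] = []
    for hit in hits:
        try:
            body = fetch_page(hit.url)
        except Exception:
            # Treat any fetch failure (403, timeout, bot wall) as "blocked".
            body = None
        if body:
            docs.append(Document(hit.url, body, True))
        else:
            docs.append(Document(hit.url, hit.snippet, False))
    return docs
```

The fallback to the snippet is the part that matters: a blocked site degrades the result instead of failing the whole query. To fetch only "as needed", pass in just the hits you actually want expanded.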

3 comments

Google's and Bing's caches, Archive.org, Archive.is, CommonCrawl... many services have previously or currently presented the full document.

Google and Bing removed their cache features when LLMs started taking off – as I said in a sibling comment, I wonder if they felt that that regime was finally going to be challenged in court as people tried to protect their data.

That being said, "can't present the full document due to copyright" seems at odds with all of the above examples existing for years.

(founder here) We are working on that problem of providing a deeper level of search, especially on proprietary/copyrighted datasets (think reference works, books, papers, etc.). We are working with a number of large publishers on this.

We started off with arXiv papers to test out the product. Would love to get feedback :)

https://exchange.valyu.network/

Is this true? Wouldn't all the "site to markdown" type services be infringing then?