| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by pharmakom 1180 days ago
	OpenAI is actively blocking the scraping use case. Does this work around that?

6 comments

_5hxt 1180 days ago

Couldn't find any mention of this, please provide a source. Their ToS mentions scraping but it pertains to scraping their frontend instead of using their API, which they don't want you to do.

Also - this library requests the HTML by itself [0] and ships it as a prompt but with preset system messages as the instruction [1].

[0] - https://github.com/jamesturk/scrapeghost/blob/main/src/scrap...

[1] - https://github.com/jamesturk/scrapeghost/blob/main/src/scrap...

link

transitivebs 1180 days ago

I don't think this is correct at all. It's one of the main use cases for GPT-4 – so long as the scraped data or outputs from their LLMs aren't used to train competing LLMs.

link

timhigins 1180 days ago

What do you mean by this, and what would be their reason for doing so? I've tested a few prompts for scraping and there have been no problems.

link

pharmakom 1180 days ago

Ran into issues asking for JSON output

link

simonw 1180 days ago

What kind of issues?

link

dragonwriter 1180 days ago

> OpenAI is actively blocking the scraping use case.

How? And since when? Scraping is identical to retrieval except in terms of what you do with the data after you have it, and to differentiate them when you are using the API, OpenAI would need to analyze the code calling the API, which doesn’t seem likely.

link

yinser 1180 days ago

Workaround: use another tool to scrape the markdown then hand the text to OpenAI

link

sagarpatil 1180 days ago

OpenAI - scrapes the whole World Wide Web. When I ask for a script to scrape a website, you might be breaking our ToS lol.

link