


	by arkmm 308 days ago

> There was one surprise when I revisited costs: OpenAI charges an unusually low $0.0001 / 1M tokens for batch inference on their latest embedding model. Even conservatively assuming I had 1 billion crawled pages, each with 1K tokens (abnormally long), it would only cost $100 to generate embeddings for all of them. By comparison, running my own inference, even with cheap Runpod spot GPUs, would cost on the order of 100× more expensive, to say nothing of other APIs.

I wonder if OpenAI uses this as a honeypot to get domain-specific source data into its training corpus that it might otherwise not have access to.

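For anyone who wants to sanity-check the arithmetic in the quote, here is a minimal sketch. The function name and parameters are hypothetical, and the price is just the figure quoted above, not a verified current rate.

```python
from decimal import Decimal


def embedding_cost_usd(pages: int, tokens_per_page: int,
                       usd_per_million_tokens: Decimal) -> Decimal:
    """Total cost of embedding `pages` documents at a flat per-token price."""
    total_tokens = pages * tokens_per_page
    return Decimal(total_tokens) / Decimal(1_000_000) * usd_per_million_tokens


# Figures taken from the quote: 1 billion pages, 1K tokens each,
# $0.0001 per 1M tokens (quoted price, assumed for illustration).
quoted_cost = embedding_cost_usd(1_000_000_000, 1_000, Decimal("0.0001"))

# The quote puts self-hosting "on the order of 100x" higher; this is a
# rough order-of-magnitude figure, not a measurement.
self_hosted_estimate = quoted_cost * 100

print(f"API batch cost:       ${quoted_cost:,.2f}")
print(f"Self-hosted (approx): ${self_hosted_estimate:,.2f}")
```

Under those assumptions, the $100 figure in the quote is internally consistent: 1 trillion tokens is 1 million units of 1M tokens each.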
3 comments

magicalhippo 307 days ago

> OpenAI charges an unusually low $0.0001 / 1M tokens for batch inference on their latest embedding model.

Is this the drug dealer scheme? Get you hooked, then jack up prices later? After all, the alternative would be regenerating all your embeddings, no?


cedws 308 days ago

I don’t think OpenAI train on data processed via the API, unless there’s an exception specifically for this.


dpoloncsak 307 days ago

Maybe I misunderstand, but I'm pretty sure they offer an option for cheaper API costs (or maybe it's credits?) if you allow them to train on your API requests.

To your point, I'm pretty sure it's off by default, though.

Edit: From https://platform.openai.com/settings/organization/data-contr...

Share inputs and outputs with OpenAI

"Turn on sharing with OpenAI for inputs and outputs from your organization to help us develop and improve our services, including for improving and training our models. Only traffic sent after turning this setting on will be shared. You can change your settings at any time to disable sharing inputs and outputs."

And I am 'enrolled for complimentary daily tokens.'


trhway 307 days ago

I wouldn't rule out an approach where, instead of training directly on the data, they train on a very high-dimensional embedding of it (or some other similarly "anonymized", yet still very semantically rich, representation of the data).


dannyw 308 days ago

Can you truly trust them though?


cedws 308 days ago

Yes, it would be disastrous for OpenAI if it got out they are training on B2B data despite saying they don’t.


dweinus 307 days ago

We're both talking about the company whose entire business model is built on top of large scale copyright infringement, right?


dymk 307 days ago

It's not the same when the people you infringe on can sue you into the dirt.


reasonableklout 307 days ago

Have they said they don't? (actually curious)


gkbrk 307 days ago

Yes, they have. [1]

> Your data is your data. As of March 1, 2023, data sent to the OpenAI API is not used to train or improve OpenAI models (unless you explicitly opt in to share data with us).

[1]: https://platform.openai.com/docs/guides/your-data


mattigames 307 days ago

Yeah, so many companies have been completely ruined after similar PR disasters /s


j33zusjuice 307 days ago

Their terms of service say they won’t use the data for training, so it wouldn’t just be a PR disaster; it’d be a breach of contract. They’d be sued into oblivion.


johnthescott 308 days ago

I am too lazy to ask OpenAI.


anothernewdude 307 days ago

If that is the case, it'd be a way to put crap or poisoned data into their training data. I wouldn't.
