Hacker News
by barathr 1091 days ago
Rather than there being lawsuit after lawsuit of this sort, we wrote an op-ed this morning that says there should be a simple, compulsory licensing fee that AI companies pay to the public -- something we called the AI Dividend: https://www.politico.com/news/magazine/2023/06/29/ai-pay-ame...
3 comments

The order of magnitude of suggested pricing is really interesting: $0.001/word is significantly more expensive than, say, OpenAI's pricing of GPT-3.5-turbo ($0.002/1k tokens, ~750 words, so ~$0.000003/word, assuming I got my zeros correct). So this would increase the per-word cost of running GPT-3.5-turbo by a factor of roughly 300-400.
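The comparison above can be checked in a few lines. This is only a back-of-the-envelope sketch using the figures quoted in the comment; the constant and function names are made up for illustration, and the 750-words-per-1k-tokens conversion is the rough estimate given above, not an exact rate.

```python
# Figures quoted in the comment above (all in USD).
PROPOSED_FEE_PER_WORD = 0.001   # suggested licensing fee per generated word
PRICE_PER_1K_TOKENS = 0.002     # GPT-3.5-turbo price as quoted above
WORDS_PER_1K_TOKENS = 750       # rough tokens-to-words conversion (assumption)


def inference_cost_per_word(price_per_1k_tokens: float, words_per_1k_tokens: float) -> float:
    """Approximate inference cost of one generated word."""
    return price_per_1k_tokens / words_per_1k_tokens


base_cost = inference_cost_per_word(PRICE_PER_1K_TOKENS, WORDS_PER_1K_TOKENS)

# Ratio of the proposed fee to the current per-word inference cost.
fee_ratio = PROPOSED_FEE_PER_WORD / base_cost

print(f"inference cost per word: ${base_cost:.7f}")
print(f"fee / inference cost:    {fee_ratio:.0f}x")
```

Without rounding the per-word cost up to $0.000003, the ratio comes out at about 375x rather than 333x, which is the same order of magnitude either way.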

In terms of implementation, I wonder about a few things:

Do models trained on more data have to pay more? LLaMA was trained on up to ~1.4T tokens; the original GPT-3 was trained on ~300B tokens. And this is only partially related to model quality: LLaMA 13B and LLaMA 65B were trained on the same data sources, but the 65B model is better. What's the incentive to ever use the 13B model, if the licensing cost is 100x-1000x the model inference cost?

Who defines a word? Each model uses a different tokenizer. I'm personally amused by the idea of a government-mandated tokenizer.

What about generations that never see human eyes? As an NLP researcher, I've generated millions of tokens for training and automatic evaluation purposes -- are those subject to licensing as well?

Yeah, the idea is that it's much more expensive than current OpenAI pricing but much less expensive than what even a low-end marketing copy writer would charge per word. Its side effect would be to push such tools towards more valuable uses.

The idea is to keep it simple, so it wouldn't be based upon the specifics of training, just whether or not it used public data. Anything else would require companies to divulge trade secrets and that won't fly. And words are defined here as, well, words -- English words. There'd be a separate fee per pixel/voxel, and then a catchall for non-language/non-image models.

1. How would this not make tools like GitHub Copilot exorbitantly expensive? Why should I have to pay a tax to everyone else in the United States to use something that was disproportionately trained on my own data?

2. Given that the internet is global, is every country supposed to make their own versions of this? Will I have to pay the EU tax to use models that might have been trained on data that Europeans posted online?

To your first question, it would incentivize training of models on one's own data exclusively -- companies could train something like Copilot on their own code, for instance. To your second question, there's no way to have an international policy like this, so yes, each jurisdiction would do it independently -- just as they do with thousands of other similar things.

I don't think a model trained on a single company's data would be nearly as helpful as a model trained on all publicly licensed code on the internet. But suppose it were...

What if I'm not a massive corporation with millions of lines of code to train on and I want to pay for an AI coding assistant? Doesn't this make it effectively illegal for me to purchase such a product for a reasonable price when big companies will presumably be able to use it without paying the tax?

Another situation - let's say you're a company that contributes heavily to open source, but also accepts external contributions. Could Facebook train a model on the React codebase, for example, without having to pay the AI tax?

Another situation - suppose I start an LLM coding assistant and sell it to my friend. Presumably I don't have to pay the tax as a "low revenue" company. Then I get acquired or get some huge seed round and suddenly my customers have to pay the AI tax. Doesn't this just nuke all my customers?

Anyway, as a software engineer, I personally want people to use my code for whatever they want to use it for, without having to pay me for it. I indicate that by using an MIT license. Why throw that precedent out the window?

The policy would exempt all except big companies from the fees. So if you set up your own, you don't pay. And the effect of the revenue threshold creating an advantage for small businesses is commonplace in policy across the board -- SMBs don't have many of the same costs and obligations as larger companies.

And this would not prevent you from explicitly licensing your code or writing to let people train on it. But what it would do is say that if someone didn't explicitly license it, then it is covered under the policy.

Also regarding international policy - good luck getting Chinese citizens to pay the US AI tax. Effectively you'd be nerfing anyone under US jurisdiction.

Not really -- it's the same as selling any service into the US. Yes, people cheat on, say, sales tax, just like Amazon did in the early years, but eventually, once they are big enough, companies end up having to adhere to the policy.

Can't wait for the deluge of AI generated content dumped en masse on the internet purely to harvest "AI Dividends".

The dividend isn't paid to generated content but for generated content -- so generating content (using, say, ChatGPT) means you're paying into the AI Dividend fund, not receiving money from it.