In my opinion, the copyright should be based on the training data. Scraped the internet for data? Public domain. Handpicked your own dataset created completely by you? The output should still belong to you. Seems weird otherwise.
I think excluding all AI creations from copyright might be one part of a good solution to all this. But you’re right that something has to be done at the point of scraping and training. Perhaps training should be considered outside of fair use and a copyright violation (without permission).
This would make obtaining training data extremely expensive. That effectively makes AI research impossible for individuals, educational institutions, and small players. Only tech giants would have the kind of resources necessary to generate or obtain training data. This is already a problem with compute costs for training very large models, and making the training data more expensive only makes the problem worse. We need more free/open AI and less corporate-controlled AI.
That problem has been solved many times over. Go check out the Google Maps API as an example. Small-scale usage is free, with a generous enough margin for startups and academics. And there is a special arrangement that can be made for nonprofit use, by approval.
Totally. And if scraped, they must be able to provide the source. I don’t care if it costs them money/compute time. They are allowed to grow with fake money after all.
The issue here is that you’d need to prove where your data came from. So the default should be public domain unless you can prove the source of all the training data.
Of course, just because material is on the internet does not mean that material is public domain.
So AI is likely the worst of both worlds: it can infringe copyright, and the publisher can be held liable for the infringement, but it offers no protection in and of itself down the line.
I think the next big thing is going to be proving the provenance of training data. Kinda like being able to track a burger back to the farm(s) to prevent the spread of disease.
There was an OnlyFans creator on a chat group for one of the less restricted machine learning image generators a while ago.
They provided a load of their content, and there was a cash prize for generating content that was indistinguishable from theirs.
Provided they were sure that the dataset was only their content, they might be able to claim copyright under this.
I can start to imagine some ways that we might get a company like OpenAI to play nice, but this software is going to be in so many hands in the coming years, and most of them won’t be good actors with an enterprise business behind them.
Over 30 years ago, someone tried to claim copyright on a phone book. Their phone book had required manually collecting and verifying data from various sources, and the author of the phone book believed all that effort should be rewarded with a copyright on their product.
The Supreme Court rejected that argument. They established that copyright is not a general reward for the “sweat of the brow”. It is only meant to protect human creative expression. A phone book is not creative expression, and neither is any other handpicked database.
That’s not the take (although in a sense I agree training data should influence it, especially if the output materially reproduces training samples).
Instead, the argument is that individual outputs from ML can only be copyrighted if they carry human expression (because that’s what the law is specifically meant to cover), that is, if there’s enough creative height in the inputs to result in an output carrying that expression.
Compare to photography - photographs aren’t protected automatically just because a button is pressed and an image is captured; rather, you gain copyright protection as a result of your choice of motif (subject), which carries your expression.
Prompts to ML models that are too simple would, under this ruling, be considered comparable to uncopyrightable lists of facts (like a recipe), and thus the corresponding output is also not protected.