by mdp2021 423 days ago
Obviously the training data should preferably be high quality - but there you have a problem with "copyright" (a pseudo-problem, as I have also insisted elsewhere, citing the right to have read whatever is in any public library).

If there is some advantage in quantity, though, then achieving high quality raises questions about tradeoffs and workflows - sources where authors are "free participants" could let odd data seep in.

And whether such data may be reflected in outputs remains an open question (probably tackled by some work I have not read... Ars longa, vita brevis).