Engadget

It turns out you can train AI models without copyrighted material

AI companies have long claimed that their tools couldn't exist without training on copyrighted material, but a new study suggests otherwise. Researchers from 14 institutions collaborated to build an 8 TB dataset using only public domain and openly licensed material, then trained a seven-billion-parameter large language model (LLM) on it. The model performed comparably to Meta's Llama 2-7B from 2023. The process was labor-intensive, requiring manual annotation of the data and legal clearance for each website scanned. The resulting LLM is less powerful but more ethical, serving as a counterpoint to the industry's claims.

The study contradicts statements from OpenAI and Anthropic, which claimed that training AI models without copyrighted material would be impossible. While it won't change the trajectory of AI companies, it pokes a hole in one of their common arguments and may be cited in future legal cases and regulation debates. The findings show that AI models can be built in a way that respects copyright and intellectual property rights, albeit with more effort, and they may influence the development of more ethical AI practices.
engadget.com