AI firms say they can’t respect copyright. These researchers tried.
-
A coalition of over two dozen researchers—spanning EleutherAI, MIT, CMU, and the University of Toronto—assembled an 8-terabyte dataset composed exclusively of public-domain and openly licensed texts. They then trained a 7-billion-parameter language model, Comma v0.1, that matches the performance of Meta's Llama 2-7B, demonstrating that ethically sourced training is feasible.
-
Ethical But Labor-Intensive
The process required manual verification of licensing and formatting, showing that while copyright-compliant data sourcing is possible, automating it at scale remains impractical.
-
Policy Implications
This research serves as a counterpoint to major AI firms (OpenAI, Anthropic), which argue that licensing copyrighted content at scale is infeasible.
It enters the debate amid legal action (e.g., Reddit v. Anthropic) and legislative developments in the US and UK over AI's use of copyrighted data.
-
Copyright Office Involvement
At the same time, Part 3 of the U.S. Copyright Office's AI report ("Generative AI Training") was pre-published in May. It analyzes fair use, infringement risks, and licensing practicality, placing it squarely at the center of these ongoing disputes.