Market News - AntiPiracy

AI firms say they can’t respect copyright. These researchers tried.

  • A coalition of more than two dozen researchers from institutions including EleutherAI, MIT, CMU, and the University of Toronto assembled an 8-terabyte dataset composed exclusively of public-domain and openly licensed text. They then trained a 7-billion-parameter language model, Comma v0.1, that matches the performance of Meta’s Llama 2-7B, demonstrating that ethical training is feasible.

  • Ethical But Labor-Intensive

    The process required manual verification of licensing and formatting, showing that while copyright-compliant data sourcing is possible, automating it at scale remains impractical.

  • Policy Implications

    This research serves as a counterpoint to major AI firms such as OpenAI and Anthropic, which argue that licensing copyrighted content is infeasible.

    It enters the debate amid legal action (e.g., Reddit vs. Anthropic) and legislative developments in the US and UK regarding AI’s use of copyrighted data.

  • Copyright Office Involvement

    At the same time, Part 3 of the U.S. Copyright Office’s AI report (“Generative AI Training”) was pre-published in May. It analyzes fair use, infringement risks, and the practicality of licensing, all of which sit at the core of these ongoing disputes.

View the original full article here: https://www.washingtonpost.com/politics/2025/06/05/tech-brief-ai-copyright-report/
