2 Apr 2024

French AI start-up claims training AI models on non-copyrighted data is possible

French startup Common Corpus challenges OpenAI’s claim, introducing a public dataset for training language models. Led by Pierre-Carl Langlais, it aims to foster open science and competition.

A French startup called Common Corpus has challenged OpenAI’s claim that AI tools such as ChatGPT require access to copyrighted data for development. Amid mounting legal conflicts, notably the New York Times’ lawsuit against OpenAI and Microsoft, Common Corpus presents itself as a potential alternative. It has introduced what it describes as the largest publicly available dataset for training large language models (LLMs), with the aim of promoting open science and fostering competition. Led by Pierre-Carl Langlais and coordinated by Pleias, the initiative involves collaborations with AI organizations including Hugging Face, Occiglot, Eleuther, and Nomic AI.

Supported by LANGU:IA, a project under the French culture ministry, Common Corpus aims to facilitate access to data in French and other languages for LLMs. The corpus includes 180 billion words in English and substantial datasets in other languages, but its reliance on non-copyrighted material means it contains little recent text. Langlais emphasizes the importance of synthetic data and open administrative data for improving the corpus’s quality and diversity. Despite these limitations, the project is committed to ongoing enhancement, promoting collaboration and inclusivity in AI while challenging the dominance of copyrighted data.

Why does it matter?

This announcement comes shortly after OpenAI defended the necessity of using copyrighted material in training advanced AI tools like ChatGPT, citing its crucial role in innovation. Amid legal disputes, including actions brought by the New York Times and by authors such as George R.R. Martin, OpenAI argues that contemporary AI models rely heavily on copyrighted content. The company contends that restricting training data to public domain sources would lead to inferior AI models. Despite allegations of appropriation, OpenAI and other AI firms justify using copyrighted material under the legal doctrine of fair use.