The Open Source Initiative (OSI) has released its official definition of “open” artificial intelligence, setting the stage for conflict with tech giants like Meta – whose models don't play by the rules.
OSI has long set the industry standard for open source software, but AI systems include elements not covered by traditional licenses, such as model training data. For an AI system to be considered truly open source, it must offer the following:
- Access to details about the data used to train the AI, so others can understand and replicate it
- The complete code used to build and run the AI
- The settings and weights from training that help the AI achieve its results
This definition directly challenges Meta's Llama, which is widely touted as the largest open source AI model. Llama is publicly available for download and use, but its license restricts commercial use (for applications with more than 700 million users) and provides no access to the training data, so it falls short of OSI's standards for unrestricted freedom to use, modify, and share.
Meta spokeswoman Faith Eischen told The Verge that while the company "agrees on many things" with its partner OSI, it does not agree with this definition. "There is no single open source AI definition, and defining it is challenging because previous open source definitions do not capture the complexity of today's rapidly evolving AI models."
“We will continue to work with OSI and other industry groups to make AI more accessible and free, regardless of technical definitions,” Eischen added.
For 25 years, the OSI definition of open source software has been widely accepted by developers who want to build on the work of others without fear of lawsuits or licensing traps. Now, as AI reshapes the landscape, tech giants face a crucial decision: embrace these established principles or reject them. The Linux Foundation also recently made an attempt to define “open source AI,” signaling a growing debate about how to adapt traditional open source values to the AI era.
"Now that we have a solid definition, we may be able to take more aggressive action against companies that engage in 'open washing' and declare their work to be open source when in reality it is not," Simon Willison, an independent researcher and creator of the open source multi-tool Datasette, told The Verge.
Clément Delangue, CEO of Hugging Face, called OSI's definition “a great help in shaping the discussion about openness in AI, especially when it comes to the crucial role of training data.”
OSI executive director Stefano Maffulli says the definition took two years to refine, in a collaborative process that drew on advice from experts around the world, including machine learning and natural language processing researchers, philosophers, content creators from the Creative Commons world, and more.
While Meta cites security concerns for restricting access to its training data, critics see a simpler motive: minimizing its legal liability and protecting its competitive advantage. Many AI models are almost certainly trained on copyrighted material; in April, The New York Times reported that Meta internally acknowledged that its training data contained copyrighted content "because we have no way not to collect it." There are a multitude of lawsuits against Meta, OpenAI, Perplexity, Anthropic, and others for alleged infringement. But with rare exceptions, such as Stable Diffusion, which discloses its training data, plaintiffs currently have to rely on circumstantial evidence to prove their work was scraped.
Meanwhile, Maffulli sees open source history repeating itself. Meta is "making the same arguments" that Microsoft made in the 1990s, when it viewed open source as a threat to its business model, Maffulli told The Verge. He recalls Meta telling him about its heavy investment in Llama and asking him, "Who do you think will be able to do the same?" Maffulli recognized a familiar pattern: a tech giant using cost and complexity to justify keeping its technology under wraps. "We're going back to the beginning," he said.
“That’s their secret sauce,” Maffulli said of the training data. “It’s the valuable intellectual property.”