To paraphrase the late John F. Kennedy, we choose to define open-source AI not because it is easy, but because it is hard; because that goal will serve to organize and measure the best of our energies and skills.

Stefano Maffulli, executive director of the Open Source Initiative (OSI), told me that existing open-source licenses are a bad fit for the mix of software and data that makes up artificial intelligence (AI). “Therefore,” said Maffulli, “we need to make a new definition for open-source AI.”

Firefox’s parent organization, the Mozilla Foundation, agrees. 

The big tech giants, a Mozilla representative explained, “have not necessarily adhered to the full principles of open source regarding their AI models.” Also, a new definition “will help lawmakers working to develop rules and regulations to protect consumers from AI risks.”  

The OSI has been working diligently on creating a comprehensive definition for open-source AI, similar to the Open Source Definition for software. This effort addresses the growing need for clarity at a time when many companies claim their AI models are open source without really being open at all; Meta’s Llama 3.1 is a case in point.

The latest OSI Open-Source AI Definition draft, 0.0.9, makes several significant changes. One of them draws on the Linux Foundation’s Model Openness Framework (MOF).

As Linux Foundation executive director Jim Zemlin explained at KubeCon + Open Source Summit China, the MOF “is a way to help evaluate if a model is open or not open. It allows people to grade models.”

Within the MOF, Zemlin added, there are three tiers of openness. “The highest level, level one, is an open science definition where the data, every component used, and all of the instructions need to actually go and create your own model the exact same way. Level two is a subset of that where not everything is actually open, but most of them are. Then, on level three, you have areas where the data may not be available, and the data that describe the data sets would be available. And you can kind of understand that — even though the model is open — not all the data is available.”

These three tiers of model openness, along with a similarly tiered approach to training data, will be hard for some open-source purists to accept. As the debate continues over which AI and machine learning (ML) systems are truly open and which are not, arguments over both the models and the training data will keep emerging.
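Zemlin’s three tiers can be pictured as a simple classification over which artifacts a model release actually shares. The sketch below is purely illustrative and is not an official MOF tool; the component names and the exact classification rules are my own simplifications of his description:

```python
# Illustrative sketch only (NOT an official MOF implementation): classify a
# model release into one of the three openness tiers Zemlin describes, based
# on which artifacts are shared. Component names are hypothetical.

FULL_OPENNESS = {"code", "weights", "training_data", "data_docs", "instructions"}

def mof_tier(shared: set[str]) -> int:
    """Return 1 (most open) through 3.

    Tier 1: everything needed to recreate the model the exact same way.
    Tier 2: most, but not all, components are open.
    Tier 3: the model is open and the data is described, but the data
            itself is not available.
    """
    if shared >= FULL_OPENNESS:
        return 1  # open science: fully reproducible
    if "training_data" not in shared and "data_docs" in shared:
        return 3  # data documented but withheld
    return 2      # most components open

print(mof_tier({"code", "weights", "training_data", "data_docs", "instructions"}))  # 1
print(mof_tier({"code", "weights", "data_docs"}))                                   # 3
```

The point of a graded scheme like this is that “open” stops being a yes/no label: a release that withholds its training data can still score a defined tier rather than simply failing.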

The Open Source AI Definition has been built collaboratively with diverse stakeholders worldwide. These include, among many others, Code for America, the Wikimedia Foundation, Creative Commons, the Linux Foundation, Microsoft, Google, Amazon, Meta, Hugging Face, the Apache Software Foundation, and the UN’s International Telecommunication Union.

The OSI has held numerous town halls and workshops to gather input, ensuring that the definition is inclusive and representative of various perspectives. The process is still ongoing. 

The definition will continue to be refined via worldwide roadshows and the collection of feedback and endorsements from diverse communities.

OSI’s Maffulli knows not everyone will be happy with this draft of the definition. Indeed, before this version’s appearance, AWS Principal Open Source Technical Strategist Tom Callaway posted on LinkedIn, “It is my strong belief (and the belief of many, many others in open source) that the current Open Source AI Definition does not accurately ensure that AI systems preserve the unrestricted rights of users to run, copy, distribute, study, change, and improve them.”

Now that the draft has seen the light of day, I’m sure others will get their say. The OSI hopes to present a stable version of the definition at the All Things Open conference in October 2024. If all goes well, the result will be a definition that most, if not all, can agree promotes transparency, collaboration, and innovation in open-source AI systems.