Open LLMs: Transparency or 'Openwashing'?
Researchers argue that access to source code alone doesn’t make large language models truly open
A group of AI researchers from Cornell University, the Signal Foundation, and the AI Now Institute argue in Nature that widely known "open" large language models (LLMs) are not as open as they claim.
The researchers, David Gray Widder, Meredith Whittaker, and Sarah Myers West, assert that merely releasing an LLM's source code does not amount to true openness: the underlying training data remains inaccessible, and training such models requires resources far beyond most users' reach.
While LLM creators promote transparency by sharing their models publicly, this openness is limited. Unlike traditional software, downloading an LLM's code does not grant access to the vast knowledge embedded in the model, which is derived from proprietary training data. Most users also lack the computational resources to train or retrain LLMs independently.
The researchers highlight three key factors affecting openness:
Transparency: Many LLMs described as open, such as Llama 3, provide only API access rather than the full model, a practice the authors term "openwashing."
Reusability: How easily shared code can be reused depends on how it is written and documented.
Extensibility: Users’ ability to modify the code to fit their needs is often constrained.
They conclude that until users have open access to training hardware and the original training datasets, LLMs cannot genuinely be considered open.