Digital fossils: The curious case of 'vegetative electron microscopy'
A meaningless phrase, born from a scanning error, is now showing up in published research—revealing the hidden risks of unchecked AI training data...
The phrase “vegetative electron microscopy” has started showing up in scientific papers and AI-generated text, even though it means nothing. It sounds technical, but it is simply a mistake, born of scanning errors and a poor translation, that was later absorbed into the training data of large AI models.
Over time, it began appearing more widely, becoming what experts call a “digital fossil”—an error that’s been preserved and keeps spreading.
The mistake began with two scientific papers from the 1950s. When they were scanned, text from separate columns was accidentally joined, fusing “vegetative” from one column with “electron microscopy” from the other into a phrase that had never been written. Decades later, a translation error from Farsi to English, where the words for “vegetative” and “scanning” look very similar, reintroduced the term in several Iranian scientific articles.
By 2025, Google Scholar showed 22 papers using the phrase. Researchers found that newer AI models such as GPT-3, GPT-4o, and Claude 3.5 often reproduced the term when prompted with relevant questions, while older models like GPT-2 and BERT did not, suggesting that the error entered training data only in recent years. The most likely source is Common Crawl, the massive collection of scraped web pages used to train many large models.
Fixing these kinds of errors is very difficult. The datasets are huge—far too big for most researchers to access or clean. On top of that, AI companies don’t reveal what exactly goes into their training data. Efforts to trace or remove problematic content are often blocked by legal issues, like copyright claims. Even simple fixes like banning certain keywords can backfire, removing both the errors and valid articles that talk about them.
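To make the keyword problem concrete, here is a minimal, hypothetical sketch of naive phrase filtering; the phrase list and sample documents are invented for illustration and do not come from any real training pipeline:

```python
# Hypothetical sketch: naive keyword filtering of a training corpus.
# The banned-phrase list and documents below are illustrative only.

BANNED_PHRASES = ["vegetative electron microscopy"]

documents = [
    "Cells were imaged by vegetative electron microscopy at 20 kV.",            # contaminated paper
    "The nonsense term 'vegetative electron microscopy' is a digital fossil.",  # legitimate article about the error
    "Spores were examined using scanning electron microscopy.",                 # clean paper
]

def keep(doc: str) -> bool:
    """Drop any document containing a banned phrase."""
    text = doc.lower()
    return not any(phrase in text for phrase in BANNED_PHRASES)

filtered = [doc for doc in documents if keep(doc)]
# Only the third document survives: the filter removes the contaminated paper,
# but it also throws away the article that correctly identifies the error.
print(filtered)
```

In this toy example the filter discards the very article that documents the mistake, which is exactly the backfire described above.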
This raises bigger questions about the trustworthiness of AI-generated content. As AI becomes more common in research and publishing, errors like this may go unnoticed or even be reinforced over time. Some journals have corrected or retracted papers containing the phrase, but not all acted quickly: Elsevier initially defended its use before issuing a correction.
This isn’t an isolated case. AI-generated writing has also introduced strange phrases like “counterfeit consciousness” instead of “artificial intelligence,” or even left behind clues like “I am an AI language model” in published papers. Tools like the Problematic Paper Screener can catch known issues like “vegetative electron microscopy,” but they can’t catch brand-new ones.
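For a sense of how such screening works in principle, consider a rough, hypothetical fingerprint check; the phrase list and code below are illustrative and are not the Problematic Paper Screener’s actual implementation:

```python
# Illustrative sketch of a fingerprint-style screen for known bad phrases.
# The phrase set is hypothetical and deliberately tiny.

KNOWN_FINGERPRINTS = {
    "vegetative electron microscopy",
    "counterfeit consciousness",
    "i am an ai language model",
}

def screen(paper_text: str) -> list[str]:
    """Return any known fingerprint phrases found in the paper."""
    text = paper_text.lower()
    return [phrase for phrase in KNOWN_FINGERPRINTS if phrase in text]

# A brand-new garbled phrase would return an empty list here,
# which is why this kind of check only catches errors that are already known.
print(screen("Samples were characterised by vegetative electron microscopy."))
```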
In the end, these persistent errors highlight a deeper issue: when AI tools go unchecked, they can lock mistakes into our shared knowledge. These “digital fossils” show why it’s important for researchers, publishers, and tech companies to work together to make sure information remains reliable. That means more transparency from AI developers, better editorial review, and stronger critical thinking from researchers.