Libraries release historical book datasets for AI training
Public domain texts offer researchers a legal and culturally rich alternative to copyrighted data
Tech companies are turning to centuries-old public domain texts to train AI models, moving beyond internet-based data.
Harvard University has released Institutional Books 1.0, a dataset of nearly one million digitized books spanning 254 languages and dating back to the 15th century, as part of its Institutional Data Initiative. Supported by Microsoft and OpenAI, the project aims to provide a rich, legally uncontroversial dataset while empowering libraries to reclaim a central role in shaping AI development.
Alongside Harvard’s release, the Boston Public Library is digitizing archives such as historic newspapers and government records, including rare French-language publications from New England. These efforts offer original, structured, and linguistically diverse content that contrasts with the copyrighted and often uncredited material typically used to train AI.
By using public domain works, the initiative sidesteps the legal challenges tech firms face over scraping copyrighted content. Its curators also argue that the texts, rooted in scientific, philosophical, and educational traditions, could improve AI models' reasoning and accuracy. However, they acknowledge the presence of outdated or harmful content and are offering guidance to mitigate its misuse.
The Institutional Books 1.0 dataset is being released to the public on the Hugging Face platform, where AI researchers and developers can freely access and use it for training new models.
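For readers who want to explore the corpus, here is a minimal sketch of loading it with the Hugging Face `datasets` library. The repository ID and field names below are assumptions for illustration; check the dataset card on Hugging Face for the actual values.

```python
from datasets import load_dataset

# Stream the dataset rather than downloading ~1M books up front.
# The repo ID below is hypothetical -- see the Institutional Data
# Initiative's Hugging Face page for the exact identifier.
books = load_dataset(
    "institutional/institutional-books-1.0",  # assumed repo ID
    split="train",
    streaming=True,
)

# Inspect the first record; the field names depend on the release schema.
first = next(iter(books))
print(first.keys())
```

Streaming mode is the practical choice here, since it lets researchers sample or filter the collection without committing disk space to the full million-book corpus.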
More news!
How AI overviews are changing Google search—and what you can do about it