Google, OpenAI Heavily Weight News Content in AI Training, for Free

According to a new study from Ziff Davis, artificial intelligence giants like Google, OpenAI and Meta are placing more importance on content from reputable news sources when training large language models.

The findings could help the public understand where chatbots get their information and give media companies like Ziff Davis more leverage when seeking copyright protection or payment for their material when it is ingested by AI.

“Our work shows that LLM training’s main datasets are disproportionately composed of high-quality content owned by commercial publishers of news and media websites,” the study says. “Major LLM companies have quantitatively prioritized this content in the training of the most important LLMs throughout the technology’s short history.”

Ziff Davis is the parent company of PCMag. The study was conducted by the company’s lead AI attorney, George Wukoson, and chief technology officer, Joey Fortuna. It examined open-source copies of datasets that AI companies have acknowledged using, including Common Crawl, C4, OpenWebText and OpenWebText2.

OpenAI admits it gives more weight to datasets it considers high-quality, including news media, copyrighted books, and links embedded in popular Reddit posts. This is a way to sort all the content that LLMs scrape from the web with the goal of producing better responses for users.

For example, OpenAI gave WebText2 a 22% weight in GPT-3 training even though it accounted for only 3.8% of the tokens. Nearly 13.5% of URLs embedded in WebText2 come from a group of 15 major media publishers, including News Corp, The New York Times, Gannett, Ziff Davis, Vox Media, Axel Springer, Alden Capital, Hearst, The Washington Post, BuzzFeed, Future, IAC and Bustle.
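To put those two WebText2 figures side by side, the short Python sketch below divides the dataset's weight in the training mix by its share of the underlying data. The function name `oversampling_factor` and its parameters are illustrative assumptions, not anything taken from the study or from OpenAI, and the only inputs are the two percentages reported above.

```python
def oversampling_factor(mix_weight: float, data_share: float) -> float:
    """Return how many times more often a dataset is sampled in training
    than its share of the raw data alone would suggest.

    Both arguments are fractions between 0 and 1 (hypothetical helper,
    not part of any published tooling).
    """
    if data_share <= 0:
        raise ValueError("data_share must be positive")
    return mix_weight / data_share


# Figures as reported in the article for WebText2 in GPT-3 training.
webtext2_factor = oversampling_factor(mix_weight=0.22, data_share=0.038)
print(f"WebText2 was sampled roughly {webtext2_factor:.1f}x its share of the data")
```

A factor well above 1 means the dataset was emphasized relative to its size. A factor near 1 would mean it was sampled in proportion to its size.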

The content of datasets also changes over time. For example, OpenAI placed a high emphasis on content from The Washington Post in OpenWebText, but reduced its importance for the release of OpenWebText2.

Figure: WebText2 content (Credit: Ziff Davis)

Ziff Davis says the findings show how important news media is to the future of AI chatbots, even though AI companies are under no obligation to pay publishers for it. This “long-term exploitation of high-quality publisher content (extremely profitable for LLM companies) [implies] lost licensing revenue from some of the world’s most respected companies.”

Without payment for their content, publishers could go out of business, threatening the continued flow of high-quality information in the age of AI.

The report comes after a federal judge dismissed a lawsuit against OpenAI by Raw Story and AlterNet, which said the AI company used their content to train LLMs without permission, Reuters reports. A related lawsuit from The New York Times is ongoing. OpenAI has also signed licensing agreements with many major media companies.

OpenAI’s latest product launch, ChatGPT search, now cites some of its sources in addition to summarizing the content within them.

About Emily Dreibelbis Forlini

Senior reporter

I’m PCMag’s expert on all things electric vehicles and AI. I’ve written hundreds of articles on these topics, including product reviews, daily news, CEO interviews, and in-depth reporting features. I also cover other topics within the tech industry, keeping a pulse on emerging technologies that could shape the way we live and work.
