According to a new study from Ziff Davis, artificial intelligence giants like Google, OpenAI and Meta are placing more importance on content from reputable news sources when training large language models.
The findings could help the public understand where chatbots get their information and give media companies like Ziff Davis more leverage in seeking copyright protection or payment for material ingested by AI.
“Our work shows that LLM training’s main datasets are disproportionately composed of high-quality content owned by commercial publishers of news and media websites,” the study says. “Major LLM companies have quantitatively prioritized this content in the training of the most important LLMs throughout the technology’s short history.”
Ziff Davis is the parent company of PCMag. The study was conducted by the company’s chief AI advocate, George Wukoson, and chief technology officer, Joey Fortuna. It examined open-source copies of datasets that AI companies have admitted to using, including Common Crawl, C4, OpenWebText and OpenWebText2.
OpenAI admits it gives more weight to datasets it considers high-quality, including news media, copyrighted books, and links embedded in popular Reddit posts. This is a way to sort all the content that LLMs scrape from the web with the goal of producing better responses for users.
For example, OpenAI gave WebText2 a 22% weight in GPT-3 training even though it accounted for only 3.8% of tokens. Nearly 13.5% of URLs embedded in WebText2 come from a group of 15 major media publishers, including News Corp, The New York Times, Gannett, Ziff Davis, Vox Media, Axel Springer, Alden Capital, Hearst, The Washington Post, BuzzFeed, Future, IAC and Bustle.
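To put those two percentages side by side, here is a minimal Python sketch of the arithmetic. The function name `upweight_factor` and its parameter names are illustrative assumptions, not terms from the study; the only inputs are the 22% training weight and 3.8% data share quoted above.

```python
def upweight_factor(training_weight: float, data_share: float) -> float:
    """Ratio of a dataset's weight in the training mix to its share of the raw data.

    Hypothetical helper for illustration; a value above 1.0 means the dataset
    is sampled more often than its size alone would imply.
    """
    if data_share <= 0:
        raise ValueError("data_share must be positive")
    return training_weight / data_share


# Figures quoted in the article for WebText2 in GPT-3 training.
webtext2_factor = upweight_factor(training_weight=0.22, data_share=0.038)
print(f"WebText2 upweighting factor: {webtext2_factor:.1f}x")
```

By this rough measure, WebText2 was drawn on several times more heavily than its size would suggest, which is the sense in which the study describes the content as prioritized.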
The content of datasets also changes over time. For example, OpenAI placed a high emphasis on content from The Washington Post in OpenWebText but reduced its importance for the release of OpenWebText2.
Ziff Davis says the findings show how important news media is to the future of AI chatbots, which have no obligation to pay publishers for it. This “long-term exploitation of high-quality publisher content (extremely profitable for LLM companies) [implies] lost licensing revenue from some of the world’s most respected companies.”
Without payment for their content, publishers could go out of business, threatening the continued flow of high-quality information in the age of AI.
The report comes after a federal judge dismissed a lawsuit against OpenAI by Raw Story and AlterNet, which said the AI company used their content to train LLMs without permission, Reuters reports. A related lawsuit from The New York Times is ongoing. OpenAI has also signed licensing agreements with many major media companies.
OpenAI’s latest product launch, ChatGPT search, now cites some of its sources in addition to summarizing the content within them.