OpenAI, the developer of ChatGPT, is trying to get media companies to sign licensing agreements that would simplify AI training and help it avoid copyright disputes.
According to The Information, OpenAI is offering between $1 million and $5 million a year to license copyrighted news articles for training AI models. An earlier report said that Apple is willing to pay media companies at least $50 million for multi-year use of such materials.
OpenAI’s figures are broadly in line with earlier, non-AI licensing deals. When Meta launched the Facebook News tab (which it recently removed in Europe), it allegedly offered up to $3 million a year to license news stories, headlines and previews.
It is not known whether the overall totals approach figures such as Google’s. In 2020, Google announced that it would invest $1 billion in partnerships with news organizations. Under pressure from a new law, the company also recently agreed to pay Canadian publishers a total of $100 million a year in exchange for links to their articles.
Modern large language models, as far as is publicly known, are trained mostly on information taken from the Internet. Datasets vary in price, and some are free, such as LAION, which is used by Stable Diffusion (though the dataset was temporarily pulled because of child sexual abuse content).
AI developers also use web crawlers, which collect training information from the Internet, and hire people to verify and label it (which is often quite expensive). Meanwhile, some media outlets, such as The New York Times and The Verge’s parent company, Vox Media, have blocked OpenAI’s web crawler, GPTBot, from accessing their sites.
Several organizations also claim that training models on their data amounts to copyright infringement. The New York Times, among others, has sued OpenAI and Microsoft, alleging that ChatGPT and Microsoft Copilot can generate responses that quote its work almost verbatim.
Meanwhile, Axel Springer (the parent company of Politico and Business Insider) and The Associated Press have already struck deals with OpenAI to license articles to train models like GPT-4 and develop technologies for news gathering.