The well for AI training data is running dry. Big Tech heavyweights are taking extraordinary measures to deal with the drought.

Contrary to popular perception, digital data that’s suitable for training AI models is a finite resource. Bumping up hard against this reality, three of the field’s top players—OpenAI, Google and Meta—have been acting as if they had no choice but to cut some ethical corners, skirt their own policies and weigh the pros and cons of bending the law.

This is not an opinion. It’s the conclusion of five of the sharpest tech reporters on staff at The New York Times. The nation’s unofficial newspaper of record launched the investigation after it sued OpenAI and Microsoft last year, accusing the companies of using copyrighted news articles to train AI models. The companies called their repurposing of the content “fair use.”

In the resulting investigative article, published April 6 and topping 3,000 words, lead reporter Cade Metz and colleagues lay out the business pressures that led OpenAI, Google and Meta to err on the side of aggressive data collection. Embedded in the article are courses of action company leaders appear to have settled on. Four examples:

1. Transcribe audio from more than a million hours of YouTube videos to harvest conversational text for model training. The Times reports this is what OpenAI did, according to several of its employees, who evidently spoke with the newspaper on condition of anonymity.

Some of these people “discussed how such a move might go against YouTube’s rules,” the reporters write. “YouTube, which is owned by Google, prohibits use of its videos for applications that are ‘independent’ of the video platform.” More:

‘The text was then fed into a system called GPT-4, which was widely considered one of the world’s most powerful A.I. models and was the basis of the latest version of the ChatGPT chatbot.’

2. Discuss buying a publishing house to procure long works. Meta did just this, the Times reports, basing the assertion on recordings of internal meetings the newspaper obtained. Meta, which owns Facebook and Instagram, “also conferred on gathering copyrighted data from across the internet, even if that meant facing lawsuits. … Negotiating licenses with publishers, artists, musicians and the news industry would take too long, they said.” More:

‘Like OpenAI, Google transcribed YouTube videos to harvest text for its A.I. models, five people with knowledge of the company’s practices said. That potentially violated the copyrights to the videos, which belong to their creators.’

3. Broaden terms of service in order to tap publicly available documents, restaurant reviews and other online materials for AI training. It’s not hard to guess that this one goes to Google, which can draw on Google Docs, Google Maps and the like for data. All three companies’ actions, the reporters add, “illustrate how online information—news stories, fictional works, message board posts, Wikipedia articles, computer programs, photos, podcasts and movie clips—has increasingly become the lifeblood of the booming AI industry.” More:

‘Creating innovative systems depends on having enough data to teach the technologies to instantly produce text, images, sounds and videos that resemble what a human creates.’

4. Learn from pools of digital text spanning as many as 3 trillion words. That would be roughly twice the number of words stored in Oxford University’s Bodleian Library, Metz and co-authors note, adding that the Bodleian has been collecting manuscripts since 1602.

“For years, the internet—with sites like Wikipedia and Reddit—was a seemingly endless source of data,” the authors point out. “But … Google and Meta, which have billions of users who produce search queries and social media posts every day, [have been] largely limited by privacy laws and their own policies from drawing on much of that content for AI.” More:

‘Their situation is urgent. Tech companies could run through the high-quality data on the internet as soon as 2026, according to Epoch, a research institute. The companies are using the data faster than it is being produced.’

There’s a lot more. Read the whole thing.


Dave Pearson

Dave P. has worked in journalism, marketing and public relations for more than 30 years, frequently concentrating on hospitals, healthcare technology and Catholic communications. He has also specialized in fundraising communications, ghostwriting for CEOs of local, national and global charities, nonprofits and foundations.