Big Tech and data depletion | Industry watcher’s digest | Partner news

News You Need to Know Today

Tuesday, April 9, 2024

The well for AI training data is running dry. Big Tech heavyweights are taking extraordinary measures to deal with the drought.

Contrary to popular perception, digital data suitable for training AI models is a finite resource. Bumping up hard against this reality, three of the field’s top players, OpenAI, Google and Meta, have behaved as if they had little choice but to cut some ethical corners, skirt their own policies and weigh the pros and cons of bending the law.

This is not an opinion. It’s the conclusion of five of the sharpest tech reporters on staff at The New York Times. The nation’s unofficial newspaper of record launched a journalistic investigation after its legal team sued OpenAI and Microsoft last year. At that time the Times accused the companies of using copyrighted news articles to train AI models. The companies called their repurposing of the content “fair use.”

In the present investigative article, published April 6 and topping 3,000 words, first author Cade Metz and colleagues lay out the business challenges that led OpenAI, Google and Meta to potentially err on the side of aggressiveness. Embedded in the article are directives company leaders seem to have decided to run with. Four examples:

1. Transcribe audio from more than a million hours of YouTube videos so as to scrape conversational text for model training. The Times reports this is what OpenAI did, according to several of its employees who evidently spoke with the newspaper on condition of anonymity.

Some of these people “discussed how such a move might go against YouTube’s rules,” the reporters write. “YouTube, which is owned by Google, prohibits use of its videos for applications that are ‘independent’ of the video platform.” More:

‘The texts were then fed into a system called GPT-4, which was widely considered one of the world’s most powerful A.I. models and was the basis of the latest version of the ChatGPT chatbot.’

2. Discuss buying a publishing house to procure long works. Meta did just this, the Times reports, basing the assertion on recordings of internal meetings the newspaper obtained. Meta, which owns Facebook and Instagram, “also conferred on gathering copyrighted data from across the internet, even if that meant facing lawsuits. … Negotiating licenses with publishers, artists, musicians and the news industry would take too long, they said.” More:

‘Like OpenAI, Google transcribed YouTube videos to harvest text for its A.I. models, five people with knowledge of the company’s practices said. That potentially violated the copyrights to the videos, which belong to their creators.’

3. Broaden terms of service in order to tap publicly available documents, restaurant reviews and other online materials for AI training. It’s not hard to guess that this one goes to Google, which has Google Docs, Google Maps and the like for data harvesting. All three companies’ actions, the reporters add, “illustrate how online information—news stories, fictional works, message board posts, Wikipedia articles, computer programs, photos, podcasts and movie clips—has increasingly become the lifeblood of the booming AI industry.” More:

‘Creating innovative systems depends on having enough data to teach the technologies to instantly produce text, images, sounds and videos that resemble what a human creates.’

4. Learn from pools of digital text spanning as many as 3 trillion words. That would be around twice the word count of the bookshelves at Oxford University’s Bodleian Library, Metz and co-authors note, adding that the Bodleian has been collecting manuscripts since 1602.

“For years, the internet—with sites like Wikipedia and Reddit—was a seemingly endless source of data,” the authors point out. “But … Google and Meta, which have billions of users who produce search queries and social media posts every day, [have been] largely limited by privacy laws and their own policies from drawing on much of that content for AI.” More:

‘Their situation is urgent. Tech companies could run through the high-quality data on the internet as soon as 2026, according to Epoch, a research institute. The companies are using the data faster than it is being produced.’

There’s a lot more. Read the whole thing.

The Latest from our Partners

Large language models hallucinate, which is a huge bottleneck in applying AI to biomedical data for use cases such as drug discovery, patient record summarization or search, as well as in developing models to describe medical data. Activeloop has recently introduced a feature that improves knowledge retrieval accuracy by up to 22.5% on average. Learn how it was used in one of Brazil's largest hospitals for better patient data management.

Industry Watcher’s Digest

Buzzworthy developments of the past few days.

Parkland Health in Dallas is rightly proud of its status as a public institution with outsize plans for healthcare AI. Another historic institution in Big D is giving Parkland its due for the visionary stance. “Even before AI was a buzzword, Parkland officials said the hospital system was an early adopter among its public hospital peers in using electronic health records,” write the editors of the 139-year-old Dallas Morning News. “This allowed the hospital system to collect valuable patient data and use machine learning and predictive analytics before those technologies were common.” For some, Parkland will always be remembered as the hospital to which President John F. Kennedy was taken that fateful November day in 1963. But its status as a community-forward tech innovator to reckon with may help it turn that page. “Parkland leaders say that being a public hospital actually helped them implement AI strategically,” the newspaper reports. “The hospital system’s budget doesn’t leave much room for experimental resources, so it uses existing AI tools that are highly vetted or collaborates with Parkland Center for Clinical Innovation on new technologies.” Read the rest.
AI is high on the list of nontrivial pursuits at Kettering Health system in Ohio. The system just opened its Center for Clinical Innovation at Ridgeleigh Terrace. The last part of its name comes from the man who once made the place his home. That would be Charles F. Kettering, the inventor, engineer and business leader whose own name went to the health system. “Built in 1914, the house was the first in the United States to have electric air conditioning using freon, one of Kettering’s inventions,” Kettering Health informs. “The history and legacy of Charles F. Kettering’s focus on innovation and the future makes Ridgeleigh Terrace a fitting site for the innovations launched by Kettering Health physicians.”
Nvidia, Microsoft and Alphabet. There you have the answer to the question: “What three little shops is The Motley Fool picking as tops in the race for AI preeminence?” Nvidia’s GPUs are “the must-have component in the AI revolution,” the Fool notes, while Microsoft is “leveraging its partnership with OpenAI in multiple ways” and Alphabet has been “investing in AI for years, from large language models to autonomous vehicles.” More here.
The World Health Organization would like you to meet Sarah. No surname because she’s not a person but a GenAI-powered avatar. With apologies if this disappoints you, her name is an acronym for Smart AI Resource Assistant for Health. She’s charged with promoting good health digitally. She speaks eight languages and will be happy to answer your questions, albeit in a disembodied way, on all sorts of health topics any time of day or night. Read the background here and/or get acquainted with Sarah here.
Accelerating drug development is one of healthcare AI’s core competencies. This week The Week guides a tour through a number of fronts on which the use case is advancing most promisingly. Article here.
A million and a half people have used the virtual therapist named Woebot. That probably figured in the thinking of producers at 60 Minutes when they sent medical correspondent Jon LaPook, MD, to get the story. One of the subject matter experts he interviews is Woebot’s main creator, psychologist Allison Darcy, PhD. “Our field hasn’t had a great deal of innovation since the basic architecture was sort of laid down by Freud in the 1890s,” Darcy tells LaPook. “We have to modernize psychotherapy.” View video or read transcript here.
ChatGPT and Google’s Bard are … useful toys. “But the appropriateness of these platforms in frontline healthcare, and especially in specific guidance, is questionable—particularly when you consider that the free version of ChatGPT’s dataset is from early 2022.” So states Simon Noel, chief nursing informatics officer at Oxford University Hospitals NHS Foundation Trust in the U.K. “Healthcare,” Noel adds, “moves on quickly.”
Research roundup:
- MIT: When an antibiotic fails: MIT scientists are using AI to target ‘sleeper’ bacteria
- Georgetown: Virtual reality sessions lessen cancer pain in clinical trial
- Utrecht University: Neural implants face ethical hurdles, study finds

Innovate Healthcare thanks our partners for supporting our newsletters.
Sponsorship has no influence on editorial content.

© Innovate Healthcare, a TriMed Media brand
