A report claims that OpenAI may have used more than 1 million hours of transcribed YouTube videos to train its latest artificial intelligence (AI) model, GPT-4. It further states that the ChatGPT maker turned to YouTube after exhausting the available text sources for training its model. The allegation, if true, could create new problems for the AI company, which already faces multiple lawsuits over its use of copyrighted data. Notably, a report last month highlighted that its GPT Store contained mini chatbots that violated the company’s guidelines.

In a report, The New York Times said that after running out of unique text sources for training AI models, the company developed an automatic speech recognition tool called Whisper, used it to transcribe YouTube videos, and fed the resulting text into its model. OpenAI publicly launched Whisper in September 2022, saying it was trained on 680,000 hours of “multilingual and multitask supervised data collected from the web.”
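Whisper was later released as open-source software, and a transcription step like the one the report describes can be sketched in a few lines of Python. The example below assumes the openai-whisper package (plus ffmpeg) and a placeholder audio file name; it illustrates the tool in general, not OpenAI’s internal pipeline.

```python
# Minimal sketch: transcribing an audio track with the open-source Whisper model.
# Assumes `pip install openai-whisper`, ffmpeg on the system path, and a local
# file named "video_audio.mp3" (a hypothetical placeholder).
import whisper

# Load a small pretrained checkpoint; larger checkpoints trade speed for accuracy.
model = whisper.load_model("base")

# Transcribe the audio; the result includes the full text and timestamped segments.
result = model.transcribe("video_audio.mp3")

print(result["text"])
```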

The report, citing unnamed people familiar with the matter, said OpenAI employees discussed whether using YouTube data would violate the platform’s rules and expose them to legal trouble. Notably, Google prohibits applications independent of the platform from using its videos.

Eventually, the company reportedly went ahead with the plan, transcribing more than 1 million hours of YouTube videos and feeding the text into GPT-4’s training. Additionally, The New York Times report claimed that OpenAI president Greg Brockman was directly involved in the process and personally helped collect data from the videos.


Google spokesperson Matt Bryant told The Verge that the company had seen only unconfirmed reports of such activity, saying: “Both our robots.txt files and Terms of Service prohibit unauthorized scraping or downloading of YouTube content.” OpenAI spokesperson Lindsay Held told the publication that the company uses numerous sources, including publicly available data and partnerships for non-public data. She added that the AI company is also exploring the use of synthetic data to train its future models.
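The robots.txt rules mentioned in that quote are machine-readable, and a well-behaved crawler checks them before fetching a page. Below is a minimal sketch of such a check using Python’s standard urllib.robotparser module; the user-agent string and video URL are hypothetical placeholders, and this is a general illustration rather than anyone’s actual crawling code.

```python
# Minimal sketch: checking whether a URL may be fetched under a site's robots.txt rules.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("https://www.youtube.com/robots.txt")
parser.read()  # Download and parse the robots.txt rules.

# Hypothetical crawler name and placeholder video URL.
url = "https://www.youtube.com/watch?v=EXAMPLE_ID"
print(parser.can_fetch("ExampleCrawler/1.0", url))  # True only if the rules allow this agent to fetch it.
```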


