A report claims that OpenAI may have used more than 1 million hours of transcribed YouTube videos to train its latest artificial intelligence (AI) model, GPT-4. It further states that the ChatGPT maker turned to YouTube after exhausting the available text sources for training its model. The allegation, if true, could create new problems for the AI company, which already faces multiple lawsuits over its use of copyrighted data. Notably, a report last month highlighted that its GPT Store contained mini chatbots that violated the company’s guidelines.

In a report, The New York Times said that after running out of unique text sources for training AI models, the company developed an automatic speech recognition tool called Whisper, used it to transcribe YouTube videos, and fed the resulting text into its model. OpenAI publicly launched Whisper in September 2022, saying it was trained on 680,000 hours of “multilingual and multitask supervised data collected from the web.”
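Whisper was later released as open-source software, and a transcription step like the one the report describes can be sketched in a few lines of Python. The example below assumes the openai-whisper package (plus ffmpeg) and a placeholder audio file name; it illustrates the tool in general, not OpenAI’s internal pipeline.

```python
# Minimal sketch: transcribing an audio track with the open-source Whisper model.
# Assumes `pip install openai-whisper`, ffmpeg on the system path, and a local
# file named "video_audio.mp3" (a hypothetical placeholder).
import whisper

# Load a small pretrained checkpoint; larger checkpoints trade speed for accuracy.
model = whisper.load_model("base")

# Transcribe the audio; the result includes the full text and timestamped segments.
result = model.transcribe("video_audio.mp3")

print(result["text"])
```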

The report, citing unnamed people familiar with the matter, said OpenAI employees discussed whether using YouTube data would violate the platform’s rules and expose them to legal trouble. Notably, Google prohibits applications independent of the platform from using its videos.

Eventually, the company reportedly went ahead with the plan, transcribing more than 1 million hours of YouTube videos and feeding the text into GPT-4’s training. Additionally, The New York Times report claimed that OpenAI president Greg Brockman was directly involved in the process and personally helped collect data from the videos.


Google spokesperson Matt Bryant told The Verge that the company had seen only unconfirmed reports of such activity, saying: “Both our robots.txt files and Terms of Service prohibit unauthorized scraping or downloading of YouTube content.” OpenAI spokesperson Lindsay Held told the publication that the company uses numerous sources, including publicly available data and partnerships for non-public data. She added that the AI company is also exploring the use of synthetic data to train its future models.
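The robots.txt rules mentioned in that quote are machine-readable, and a well-behaved crawler checks them before fetching a page. Below is a minimal sketch of such a check using Python’s standard urllib.robotparser module; the user-agent string and video URL are hypothetical placeholders, and this is a general illustration rather than anyone’s actual crawling code.

```python
# Minimal sketch: checking whether a URL may be fetched under a site's robots.txt rules.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("https://www.youtube.com/robots.txt")
parser.read()  # Download and parse the robots.txt rules.

# Hypothetical crawler name and placeholder video URL.
url = "https://www.youtube.com/watch?v=EXAMPLE_ID"
print(parser.can_fetch("ExampleCrawler/1.0", url))  # True only if the rules allow this agent to fetch it.
```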


