The source of Sora's training data is "unknown" to OpenAI's Mira Murati.

In a recent interview with The Wall Street Journal, Mira Murati evaded questions regarding the source of data for OpenAI’s Sora model.

Mira Murati, chief technology officer of OpenAI, says she is unsure where the training data for Sora, the company’s upcoming video-generating artificial intelligence model, came from.

In a Wall Street Journal interview published on March 13, Murati evaded questions regarding the data source for Sora, the company’s model that turns text instructions into videos.

“We trained our upcoming model using both licensed and publicly available data,” Murati responded when asked what data the $80 billion company had used to train Sora.

The Journal’s Joanna Stern then asked whether Sora had been trained on data from social media platforms such as Facebook, Instagram and YouTube. Murati answered, “I’m actually not sure about that,” adding:

“You know, if they were openly usable, then they would be available to the public. However, I’m not certain. I have doubts about it.”

Before moving on to another subject, Stern brought up OpenAI’s collaboration with Shutterstock and asked whether the stock photo company’s data had been used to train Sora. “I just won’t get into specifics about the data that was utilised,” Murati said, noting only that the data was licensed or publicly accessible. She later told the Journal that Shutterstock data had been used to train Sora.

Training data sets are vast collections of data that are used to teach AI models how to identify patterns, anticipate outcomes and comprehend language.

Since joining OpenAI in 2018, Murati has overseen several of the company’s best-known projects, such as the speech-recognition tool Whisper, the image-generator model DALL-E 3 and GPT-4, the most recent iteration of its chatbot model. She briefly served as acting CEO in November 2023 following Sam Altman’s dismissal by OpenAI’s board.

OpenAI faces several lawsuits over the data used to train its AI models. Writers Richard Kadrey, Christopher Golden and Sarah Silverman sued the firm in July 2023, claiming that ChatGPT uses copyrighted content to create summaries of their works.

The New York Times filed a similar copyright infringement lawsuit against Microsoft and OpenAI in December 2023, claiming the companies trained AI chatbots using content from the newspaper. A separate class-action case was brought in California, claiming that OpenAI trained ChatGPT by scraping private user data from the internet without users’ consent.
