Sora is a diffusion model: it generates a video by starting with one that looks like static noise and gradually transforming it by removing the noise over many steps.
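Sora's model and sampler are not public, but the denoising process described above is the standard diffusion loop. Below is a minimal sketch of DDPM-style ancestral sampling; the `denoiser` argument is a hypothetical stand-in for a learned model that predicts the noise present in its input at a given step, and the linear noise schedule is an assumption borrowed from the original DDPM paper.

```python
import torch

def ddpm_sample(denoiser, steps=1000, shape=(16, 3, 64, 64)):
    """Generate a video tensor (frames, channels, H, W) by iteratively
    removing predicted noise, starting from pure Gaussian noise."""
    betas = torch.linspace(1e-4, 0.02, steps)  # linear schedule (assumption)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape)  # start from static-like noise
    for t in reversed(range(steps)):
        eps = denoiser(x, t)  # model's estimate of the noise in x
        # Subtract the predicted noise component (DDPM posterior mean).
        x = (x - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:  # add fresh noise on all but the final step
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x

# Runnable demo with a dummy stand-in; a trained model replaces this.
clip = ddpm_sample(lambda x, t: torch.zeros_like(x), steps=100, shape=(4, 3, 8, 8))
print(clip.shape)  # torch.Size([4, 3, 8, 8])
```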
Sora can generate an entire video all at once, or extend a generated video to make it longer. By giving the model foresight over many frames at a time, Sora solves the challenging problem of keeping a subject consistent even when it temporarily goes out of view, as the sketch below illustrates.
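One generic way a diffusion model can extend a clip is inpainting-style "replacement" conditioning: at every denoising step the known frames are swapped for correctly re-noised copies of themselves, so the model denoises the new frames jointly with the old ones and attends to the whole clip at once. Sora's actual conditioning mechanism is not public; this is an illustrative sketch reusing the `ddpm_sample` schedule above.

```python
import torch

def extend_video(denoiser, known, new_frames=16, steps=1000):
    """Append `new_frames` to an existing clip `known` (T, C, H, W) by
    denoising the full sequence while clamping the known frames."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    t_k = known.shape[0]
    x = torch.randn(t_k + new_frames, *known.shape[1:])
    for t in reversed(range(steps)):
        # Clamp known frames to their correctly noised versions, then
        # denoise the whole clip as one sequence.
        x[:t_k] = torch.sqrt(alpha_bars[t]) * known + \
                  torch.sqrt(1 - alpha_bars[t]) * torch.randn_like(known)
        eps = denoiser(x, t)
        x = (x - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    x[:t_k] = known  # restore the original frames exactly
    return x
```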
Like the GPT models, Sora uses a transformer architecture, which unlocks superior scaling performance.
Sora represents videos and images as collections of smaller units of data called patches, each of which is akin to a token in GPT. By unifying how data is represented, diffusion transformers can be trained on a wider range of visual data than was possible before, spanning different durations, resolutions, and aspect ratios.
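As an illustration, a raw video tensor can be cut into spacetime patches and flattened into a token-like sequence. The patch sizes below (2 frames by 16x16 pixels) are assumptions for demonstration only; per the technical report, Sora first compresses videos into a latent space before patchifying.

```python
import torch

def to_spacetime_patches(video, pt=2, ph=16, pw=16):
    """Split a video tensor (T, C, H, W) into a flat sequence of
    spacetime patches, each one analogous to a token in GPT."""
    t, c, h, w = video.shape
    x = video.reshape(t // pt, pt, c, h // ph, ph, w // pw, pw)
    x = x.permute(0, 3, 5, 1, 2, 4, 6)      # patch-grid dims first
    return x.reshape(-1, pt * c * ph * pw)  # (num_patches, patch_dim)

video = torch.randn(16, 3, 256, 256)  # 16 frames of 256x256 RGB
tokens = to_spacetime_patches(video)
print(tokens.shape)  # torch.Size([2048, 1536])
```

Because the sequence length simply grows or shrinks with a clip's duration, resolution, and aspect ratio, the same transformer can consume all of them.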
Sora builds on past research on the DALL·E and GPT models. It uses the recaptioning technique from DALL·E 3, which involves generating highly descriptive captions for the visual training data. As a result, the model can follow the user's text instructions in the generated video more faithfully.
In addition to generating a video solely from text instructions, the model can take an existing still image and generate a video from it, animating the image's contents accurately and with attention to small detail. The model can also take an existing video and extend it or fill in missing frames. Learn more in our technical report.
Sora serves as a foundation for models that can understand and simulate the real world, a capability we believe will be an important milestone toward achieving AGI.