Sora System Card
Alexandra García Pérez, Arjun Singh Puri, Caroline Friedman Levy, Dani Madrid-Morales, Emily Lynell Edwards, Grant Brailsford, Herman Wasserman, Javier García Arredondo, Kate Turetsky, Kelly Bare, Matt Groh, Maximilian Müller, Naomi Hart, Nathan Heath, Patrick Caughey, Per Wikman Svahn, Rafael González-Vázquez, Sara Kingsley, Shelby Grossman, Vincent Nestler
Overview of Sora
Sora is OpenAI's video generation model, designed to take text, image, and video inputs and generate a new video as an output. Users can create videos at up to 1080p resolution and up to 20 seconds long in various formats, generate new content from text, or enhance, remix, and blend their own assets. Users will be able to explore the Featured and Recent feeds, which showcase community creations and offer inspiration for new ideas. Sora builds on learnings from DALL-E and GPT models, and is designed to give people expanded tools for storytelling and creative expression.
Sora is a diffusion model, which generates a video by starting from a base video that looks like static noise and gradually transforming it by removing the noise over many steps. By giving the model foresight of many frames at a time, we've solved the challenging problem of making sure a subject stays the same even when it goes out of view temporarily. Similar to GPT models, Sora uses a transformer architecture, unlocking superior scaling performance.
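The sketch below illustrates the general shape of diffusion-based video generation described above; it is a minimal, simplified example, not Sora's actual implementation. The `denoiser` callable, the tensor dimensions, and the update rule are all illustrative assumptions; real samplers use proper noise schedules.

```python
# Minimal sketch of diffusion-based video generation.
# `denoiser` is a hypothetical spatio-temporal transformer that predicts the
# noise present in a noisy latent video; it stands in for the real model.
import torch

def generate_video(denoiser, num_frames=60, channels=4, height=64, width=64, steps=50):
    # Start from pure Gaussian noise for every frame at once, so the model
    # attends over the whole clip and can keep subjects consistent across frames.
    x = torch.randn(num_frames, channels, height, width)
    for t in reversed(range(steps)):
        # Predict the noise to remove at this step, conditioning on all frames jointly.
        predicted_noise = denoiser(x, timestep=t)
        # Remove a fraction of the predicted noise (simplified update rule;
        # real samplers such as DDPM/DDIM use learned or scheduled coefficients).
        x = x - predicted_noise / steps
    return x  # denoised latent video, later decoded to pixels
```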
Sora uses the recaptioning technique from DALL-E 3, which involves generating highly descriptive captions for the visual training data. As a result, the model is able to understand the content of the images and videos it is trained on at a deep level.
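As a rough illustration of the recaptioning idea, the snippet below shows how detailed captions might be paired with training videos; the `captioner` object and its `describe` method are hypothetical placeholders, not OpenAI's pipeline.

```python
# Illustrative sketch of recaptioning training data, assuming a hypothetical
# `captioner` with a `describe(video)` method that returns a detailed caption.
def recaption_dataset(captioner, videos):
    """Replace short or missing captions with highly descriptive ones."""
    recaptioned = []
    for video in videos:
        # The captioning model produces a long, detailed description of the
        # visual content, which becomes the training caption for this video.
        detailed_caption = captioner.describe(video)
        recaptioned.append({"video": video, "caption": detailed_caption})
    return recaptioned
```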