New “Stable Video Diffusion” AI model can animate any still image

On Tuesday, Stability AI released Stable Video Diffusion, a new free AI research tool that can turn any still image into a short video—with mixed results. It’s an open-weights preview of two AI models that use a technique called image-to-video, and it can run locally on a machine with an Nvidia GPU.

Last year, Stability AI made waves with the release of Stable Diffusion, an “open weights” image synthesis model that kick-started a wave of open image synthesis and inspired a large community of hobbyists who have built on the technology with their own custom fine-tunings. Now Stability wants to do the same with AI video synthesis, although the tech is still in its infancy.

Right now, Stable Video Diffusion consists of two models: one that performs image-to-video synthesis at a length of 14 frames (called “SVD”) and another that generates 25 frames (called “SVD-XT”). Both can operate at speeds ranging from 3 to 30 frames per second, and they output short (typically 2-4 second) MP4 video clips at 576×1024 resolution.
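Those clip lengths follow directly from the frame counts and playback rates above; a quick back-of-the-envelope check in plain Python (using the numbers from this article) shows why the output lands in the 2-4 second range at typical frame rates:

```python
def clip_duration(frames: int, fps: int) -> float:
    """Return clip length in seconds for a given frame count and frame rate."""
    return frames / fps

# SVD outputs 14 frames, SVD-XT outputs 25; 7 fps is a mid-range playback speed.
print(clip_duration(14, 7))  # → 2.0 seconds
print(clip_duration(25, 7))  # roughly 3.6 seconds
```

At the extremes of the supported 3-30 fps range, the same 25 frames would stretch to over 8 seconds or shrink to under a second of playback.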

In our local testing, a 14-frame generation took about 30 minutes to create on an Nvidia RTX 3060 graphics card, but users can experiment with running the models much faster in the cloud through services like Hugging Face and Replicate (some of which you may need to pay for). In our experiments, the generated animation typically keeps a portion of the scene static and adds panning and zooming effects or animates smoke or fire. People depicted in photos often do not move, although we did get one Getty image of Steve Wozniak to come slightly to life.

(Note: Aside from the Steve Wozniak Getty Images photo, the images in this article were generated with DALL-E 3 and animated using Stable Video Diffusion.)

Given these limitations, Stability emphasizes that the model is still early and is intended for research only. “While we eagerly update our models with the latest advancements and work to incorporate your feedback,” the company writes on its website, “this model is not intended for real-world or commercial applications at this stage. Your insights and feedback on safety and quality are important to refining this model for its eventual release.”

Notably, but perhaps unsurprisingly, the Stable Video Diffusion research paper does not reveal the source of the models’ training datasets, saying only that the research team used “a large video dataset comprising roughly 600 million samples” that they curated into the Large Video Dataset (LVD), which consists of 580 million annotated video clips spanning 212 years of content in total duration.

Stable Video Diffusion is far from the first AI model to offer this kind of functionality. We’ve previously covered other AI video synthesis methods, including those from Meta, Google, and Adobe. We’ve also covered the open source ModelScope and what many consider the best AI video model at the moment, Runway’s Gen-2 model (Pika Labs is another AI video provider). Stability AI says it is also working on a text-to-video model, which will allow the creation of short video clips using written prompts instead of images.

The Stable Video Diffusion source code and weights are available on GitHub, and another easy way to test it locally is by running it through the Pinokio platform, which handles installation dependencies automatically and runs the model in its own environment.