Stability AI introduces Stable Video Diffusion models in research preview

While OpenAI is dealing with the return of Sam Altman, its rivals are moving to raise the bar in the AI race. Just after Anthropic’s release of Claude 2.1 and Adobe’s reported acquisition of Rephrase.ai, Stability AI has announced the release of Stable Video Diffusion, marking its entry into the in-demand video generation space.

Available for research purposes only, Stable Video Diffusion (SVD) includes two state-of-the-art AI models – SVD and SVD-XT – that generate short video clips from still images. The company says both produce high-quality output that matches or even exceeds the performance of other AI video generators.

Stability AI has open-sourced the image-to-video models as part of its research preview and plans to use user feedback to refine them further, ultimately paving the way for their commercial application.

Understanding Stable Video Diffusion

According to a blog post from the company, SVD and SVD-XT are latent diffusion models that take a still image as a conditioning frame and generate a 576 x 1024 video from it. Both models produce content at frame rates between three and thirty frames per second, but the output is quite short: a maximum of about four seconds. The SVD model is trained to produce 14 frames from a photo, while SVD-XT goes up to 25, Stability AI noted.
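
To make those numbers concrete, the arithmetic below relates frame count and frame rate to clip length. Only the 14/25 frame counts and the 3–30 fps range come from the announcement; the specific frame-rate pairings are illustrative assumptions.

# Illustrative arithmetic only: clip length in seconds = frame count / frame rate.
# The 14/25 frame counts and 3-30 fps range come from Stability AI's stated specs;
# the particular fps values chosen below are assumptions for the example.
def clip_seconds(num_frames: int, fps: float) -> float:
    return num_frames / fps

print(clip_seconds(14, 7))   # SVD at 7 fps      -> 2.0 s
print(clip_seconds(25, 6))   # SVD-XT at 6 fps   -> ~4.2 s, near the ~4 s ceiling
print(clip_seconds(25, 30))  # SVD-XT at 30 fps  -> ~0.8 s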

To create Stable Video Diffusion, the company first trained a base model on a large, systematically curated video dataset of approximately 600 million samples. This base model was then fine-tuned on a smaller, high-quality dataset (of up to one million clips) for downstream tasks such as text-to-video and image-to-video, where a sequence of frames is predicted from a single conditioning image.

Stability AI said the data for training and refining the model came from publicly available research datasets, although the exact source remains unclear.


More importantly, in a white paper detailing SVD, the authors write that the model can also serve as a basis for fine-tuning a diffusion model capable of multi-view synthesis. This would make it possible to generate multiple consistent views of an object from just one still image.

All this could eventually culminate in a wide range of applications in sectors such as advertising, education and entertainment, the company added in its blog post.

High-quality output, but limitations remain

An external evaluation by human raters found that SVD’s output was of high quality and easily outperformed the leading closed text-to-video models from Runway and Pika Labs. However, the company notes that this is just the beginning of its work and that the models are far from perfect at this stage. In many cases, they fail to achieve photorealism, generate videos with little or no motion or with very slow camera pans, and do not render faces and people as users would expect.

Ultimately, the company plans to use this research preview to refine both models, close current gaps, and introduce new features, such as support for text prompts or text rendering in videos, ahead of commercial applications. It emphasized that the current release is mainly intended to invite open research into the models, which could surface more problems (such as biases) and help with safe deployment later.

“We plan a variety of models that build on and extend this foundation, similar to the ecosystem built around Stable Diffusion,” the company wrote. It has also begun inviting users to sign up for an upcoming web experience that will allow them to generate videos from text.

That said, it remains unclear when exactly the experience will be available.

A glimpse into Stable Video Diffusion’s text-to-video experience

How to use the models?

To get started with the new open-source Stable Video Diffusion models, users can find the code in the company’s GitHub repository and the weights required to run the models locally on its Hugging Face page. The company notes that use is permitted only upon acceptance of its terms, which outline both permitted and excluded uses.
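
As a rough sketch of what running the weights locally might look like, the snippet below uses the Hugging Face diffusers integration and the stabilityai/stable-video-diffusion-img2vid-xt checkpoint. The article itself only points to the GitHub repository and Hugging Face page, so treat the package versions and workflow here as assumptions rather than the official instructions; "input.jpg" is a placeholder for your own conditioning image.

# Minimal image-to-video sketch via the diffusers library (assumed workflow,
# not the official GitHub instructions). Requires a CUDA GPU, diffusers >= 0.24,
# and acceptance of the model terms on Hugging Face.
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",  # SVD-XT (25-frame) weights
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.enable_model_cpu_offload()  # offload idle sub-models to CPU to save VRAM

# SVD expects a 1024 x 576 conditioning frame; "input.jpg" is a placeholder path.
image = load_image("input.jpg").resize((1024, 576))

generator = torch.manual_seed(42)  # fixed seed for reproducibility
frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]
export_to_video(frames, "generated.mp4", fps=7)

Running the shorter base model instead would simply mean swapping in the stabilityai/stable-video-diffusion-img2vid model ID.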

For now, the permitted usage scenarios include, in addition to examining and researching the models, generating artworks for design and other artistic processes, and applications in educational or creative tools.

Generating factual or “true representations of people or events” is out of scope, according to Stability AI.
