Hugging Face now supports generating video from text with DAMO Academy's text-to-video model

Model page: https://modelscope.cn/models/damo/text-to-video-synthesis/summary

Model description

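This text-to-video diffusion model consists of three sub-networks: text feature extraction, a diffusion model that maps text features to the video latent space, and a network that maps the video latent space to the video visual space. The full model has about 1.7 billion parameters and supports English input. The diffusion model uses a Unet3D architecture and generates video by iteratively denoising a video of pure Gaussian noise.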
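
To make the "iterative denoising from pure Gaussian noise" step concrete, here is a minimal toy sketch of a reverse-diffusion sampling loop over a video-shaped latent. It is not the model's actual sampler: the DDPM-style update rule, the linear beta schedule, the step count, the latent shape, and the `dummy_noise_predictor` function (a stand-in for the Unet3D) are all assumptions made for illustration.

```python
import numpy as np


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumed choice, not taken from the model card)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def dummy_noise_predictor(x, t):
    """Hypothetical stand-in for the Unet3D noise predictor.

    A real model would also take the text features as conditioning input.
    """
    return 0.1 * x


def sample_video_latent(shape, num_steps=50, seed=0,
                        predictor=dummy_noise_predictor):
    """Start from pure Gaussian noise and denoise step by step (DDPM-style)."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)

    # x_T: a "video" of pure Gaussian noise, e.g. (frames, channels, H, W)
    x = rng.standard_normal(shape)

    for t in reversed(range(num_steps)):
        eps = predictor(x, t)
        # Posterior mean of x_{t-1} given x_t and the predicted noise
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            # Re-inject a smaller amount of noise on every step but the last
            x = mean + np.sqrt(betas[t]) * rng.standard_normal(shape)
        else:
            x = mean
    return x


if __name__ == "__main__":
    latent = sample_video_latent((16, 4, 32, 32))
    print(latent.shape)
```

In the real pipeline, the resulting latent would then pass through the third sub-network, which maps the video latent space to the visual space.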

Code example

import pathlib

from huggingface_hub import snapshot_download

from modelscope.pipelines import pipeline
from modelscope.outputs import OutputKeys

# Download the model weights from the Hugging Face Hub into ./weights
model_dir = pathlib.Path('weights')
snapshot_download('damo-vilab/modelscope-damo-text-to-video-synthesis',
                  repo_type='model', local_dir=model_dir)

# Build the text-to-video pipeline from the local weights
pipe = pipeline('text-to-video-synthesis', model_dir.as_posix())
test_text = {
    'text': 'A panda eating bamboo on a rock.',
}
# Run inference; the result holds the path of the generated video
output_video_path = pipe(test_text,)[OutputKeys.OUTPUT_VIDEO]
print('output_video_path:', output_video_path)

Sample prompts

A goldendoodle playing in a park by a lake.

A panda bear driving a car.
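
Several prompts like the two above could be run through one pipeline with a small helper. The sketch below is illustrative only: `generate_videos` and `stub_pipe` are hypothetical names, and the stub only mimics the call shape used in the code example (a dict with a `'text'` key in, a dict keyed by the output key out), so the snippet runs without the model installed.

```python
def generate_videos(pipe, prompts, output_key):
    """Run each prompt through `pipe` and collect the generated video paths.

    `pipe` is expected to behave like the pipeline in the code example:
    it takes {'text': prompt} and returns a mapping indexed by `output_key`.
    """
    results = {}
    for prompt in prompts:
        results[prompt] = pipe({'text': prompt})[output_key]
    return results


def stub_pipe(inputs):
    """Hypothetical stand-in for the real pipeline; returns a fake path."""
    return {'output_video': 'video_for_' + str(len(inputs['text'])) + '_chars.mp4'}


if __name__ == "__main__":
    prompts = [
        'A goldendoodle playing in a park by a lake.',
        'A panda bear driving a car.',
    ]
    for prompt, path in generate_videos(stub_pipe, prompts, 'output_video').items():
        print(prompt, '->', path)
```

With the real pipeline, one would pass `pipe` and `OutputKeys.OUTPUT_VIDEO` from the code example in place of the stub and the string key.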

Related papers

@inproceedings{rombach2022high,
title={High-resolution image synthesis with latent diffusion models},
author={Rombach, Robin and Blattmann, Andreas and Lorenz, Dominik and Esser, Patrick and Ommer, Bj{\"o}rn},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={10684--10695},
year={2022}
}

@inproceedings{luo2023videofusion,
title={VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation},
author={Luo, Zhengxiong and Chen, Dayou and Zhang, Yingya and Huang, Yan and Wang, Liang and Shen, Yujun and Zhao, Deli and Zhou, Jingren and Tan, Tieniu},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2023}
}
