AIGC专栏10——EasyAnimate 一个新的类SORA文生视频模型轻松文生视频-编程知识

AIGC专栏10——EasyAnimate 一个新的类SORA文生视频模型 📺轻松文生视频

学习前言
源码下载地址
技术原理储备（DIT/Lora/Motion Module）
- 什么是Diffusion Transformer (DiT)
- Lora
- Motion Module
EasyAnimate简介
EasyAnimate原理界面展示
快速启动
- 云使用: AliyunDSW/Docker
- 本地安装: 环境检查/下载/安装
如何使用
- 生成
- - 运行python文件
  - 通过ui界面
- 模型训练
- - 训练视频生成模型
  - - i、基于webvid数据集
    - ii、基于自建数据集
  - 训练基础文生图模型
  - - i、基于diffusers格式
    - ii、基于自建数据集
  - 训练Lora模型
  - - i、基于diffusers格式
    - ii、基于自建数据集
算法细节

学习前言

在过年期间，OpenAI放出了SORA文生视频的预览效果，一瞬间各大媒体争相报道，又引爆了一次科技圈，可惜的是，SORA依然没选择开源。

在这个契机下，本来我也对文生视频的工作非常感兴趣，所以也研究了一些与SORA相关的技术，虽然我们没有像OpenAI那么大的算力，但做一些基础研究还是足够的。

最近我参与了一个EasyAnimate的项目，可以根据文本生成视频，并且借鉴了Animatediff的IDEA，将MotionModule网格化后引入到DIT中，借助DIT的强大生成能力，生成视频效果也还不错，并且由于基于一个可插入结构，EasyAnimate有良好的拓展性，近期也开源了出来。
在这里插入图片描述

源码下载地址

https://github.com/aigc-apps/EasyAnimate

感谢大家的关注。

技术原理储备（DIT/Lora/Motion Module）

什么是Diffusion Transformer (DiT)

DiT基于扩散模型，所以不免包含不断去噪的过程，如果是图生图的话，还有不断加噪的过程，此时离不开DDPM那张老图，如下：
在这里插入图片描述
DiT相比于DDPM，使用了更快的采样器，也使用了更大的分辨率，与Stable Diffusion一样使用了隐空间的扩散，但可能更偏研究性质一些，没有使用非常大的数据集进行预训练，只使用了imagenet进行预训练。

与Stable Diffusion不同的是，DiT的网络结构完全由Transformer组成，没有Unet中大量的上下采样，结构更为简单清晰。

在EasyAnimate中，我们将Motion Module网格化后引入到DIT中，借助DIT的强大生成能力，生成视频效果也还不错。

Lora

由《LoRA: Low-Rank Adaptation of Large Language Models》提出的一种基于低秩矩阵的对大参数模型进行少量参数微调训练的方法，广泛引用在各种大模型的下游使用中。

由于我们是基于一个可插入的结构设计了EasyAnimate，所以EasyAnimate有良好的拓展性，我们可以对文生图模型训练Lora后应用到文生视频模型中。

Motion Module

AnimateDiff是一个可以对文生图模型进行动画处理的实用框架，其内部设计的Motion Module无需进行特定模型调整，即可一次性为大多数现有的个性化文本转图像模型提供动画化能力。

EasyAnimate参考AnimateDiff使用Motion Module保证动画的连续性，同时作为一个可插入的结构，Motion Module有良好的拓展性

EasyAnimate简介

EasyAnimate是一个基于transformer结构的pipeline，可用于生成AI动画、训练Diffusion Transformer的基线模型与Lora模型，我们支持从已经训练好的EasyAnimate模型直接进行预测，生成不同分辨率，6秒左右、fps12的视频（40 ~ 80帧, 未来会支持更长的视频），也支持用户训练自己的基线模型与Lora模型，进行一定的风格变换。

这些是pipeline的生成结果，从生成结果来看，它的生成效果还是非常不错的，Resolution 的顺序是width、height、frames：

首先是使用原始的pixart checkpoint进行预测。

Base Models	Sampler	Seed	Resolution (h x w x f)	Prompt	Download
PixArt	DPM++	43	512x512x80	A soaring drone footage captures the majestic beauty of a coastal cliff, its red and yellow stratified rock faces rich in color and against the vibrant turquoise of the sea. Seabirds can be seen taking flight around the cliff’s precipices.	Download GIF
PixArt	DPM++	43	448x640x80	The video captures the majestic beauty of a waterfall cascading down a cliff into a serene lake. The waterfall, with its powerful flow, is the central focus of the video. The surrounding landscape is lush and green, with trees and foliage adding to the natural beauty of the scene.	Download GIF
PixArt	DPM++	43	704x384x80	A vibrant scene of a snowy mountain landscape. The sky is filled with a multitude of colorful hot air balloons, each floating at different heights, creating a dynamic and lively atmosphere. The balloons are scattered across the sky, some closer to the viewer, others further away, adding depth to the scene.	Download GIF
PixArt	DPM++	43	448x640x64	The vibrant beauty of a sunflower field. The sunflowers, with their bright yellow petals and dark brown centers, are in full bloom, creating a stunning contrast against the green leaves and stems. The sunflowers are arranged in neat rows, creating a sense of order and symmetry.	Download GIF
PixArt	DPM++	43	384x704x48	A tranquil Vermont autumn, with leaves in vibrant colors of orange and red fluttering down a mountain stream.	Download GIF
PixArt	DPM++	43	704x384x48	A vibrant underwater scene. A group of blue fish, with yellow fins, are swimming around a coral reef. The coral reef is a mix of brown and green, providing a natural habitat for the fish. The water is a deep blue, indicating a depth of around 30 feet. The fish are swimming in a circular pattern around the coral reef, indicating a sense of motion and activity. The overall scene is a beautiful representation of marine life.	Download GIF
PixArt	DPM++	43	576x448x48	Pacific coast, carmel by the blue sea ocean and peaceful waves	Download GIF
PixArt	DPM++	43	576x448x80	A snowy forest landscape with a dirt road running through it. The road is flanked by trees covered in snow, and the ground is also covered in snow. The sun is shining, creating a bright and serene atmosphere. The road appears to be empty, and there are no people or animals visible in the video. The style of the video is a natural landscape shot, with a focus on the beauty of the snowy forest and the peacefulness of the road.	Download GIF
PixArt	DPM++	43	640x448x64	The dynamic movement of tall, wispy grasses swaying in the wind. The sky above is filled with clouds, creating a dramatic backdrop. The sunlight pierces through the clouds, casting a warm glow on the scene. The grasses are a mix of green and brown, indicating a change in seasons. The overall style of the video is naturalistic, capturing the beauty of the landscape in a realistic manner. The focus is on the grasses and their movement, with the sky serving as a secondary element. The video does not contain any human or animal elements.	Download GIF
PixArt	DPM++	43	704x384x80	A serene night scene in a forested area. The first frame shows a tranquil lake reflecting the star-filled sky above. The second frame reveals a beautiful sunset, casting a warm glow over the landscape. The third frame showcases the night sky, filled with stars and a vibrant Milky Way galaxy. The video is a time-lapse, capturing the transition from day to night, with the lake and forest serving as a constant backdrop. The style of the video is naturalistic, emphasizing the beauty of the night sky and the peacefulness of the forest.	Download GIF
PixArt	DPM++	43	640x448x80	Sunset over the sea.	Download GIF

使用人像checkpoint进行预测。

Base Models	Sampler	Seed	Resolution (h x w x f)	Prompt	Download
Portrait	Euler A	43	448x576x80	1girl, 3d, black hair, brown eyes, earrings, grey background, jewelry, lips, long hair, looking at viewer, photo \(medium\), realistic, red lips, solo	Download GIF
Portrait	Euler A	43	448x576x80	1girl, bare shoulders, blurry, brown eyes, dirty, dirty face, freckles, lips, long hair, looking at viewer, realistic, sleeveless, solo, upper body	Download GIF
Portrait	Euler A	43	512x512x64	1girl, black hair, brown eyes, earrings, grey background, jewelry, lips, looking at viewer, mole, mole under eye, neck tattoo, nose, ponytail, realistic, shirt, simple background, solo, tattoo	Download GIF
Portrait	Euler A	43	576x448x64	1girl, black hair, lips, looking at viewer, mole, mole under eye, mole under mouth, realistic, solo	Download GIF

使用人像Lora进行预测。

Base Models	Sampler	Seed	Resolution (h x w x f)	Prompt	Download
Pixart + Lora	Euler A	43	512x512x64	1girl, 3d, black hair, brown eyes, earrings, grey background, jewelry, lips, long hair, looking at viewer, photo \(medium\), realistic, red lips, solo	Download GIF
Pixart + Lora	Euler A	43	512x512x64	1girl, bare shoulders, blurry, brown eyes, dirty, dirty face, freckles, lips, long hair, looking at viewer, mole, mole on breast, mole on neck, mole under eye, mole under mouth, realistic, sleeveless, solo, upper body	Download GIF
Pixart + Lora	Euler A	43	512x512x64	1girl, black hair, lips, looking at viewer, mole, mole under eye, mole under mouth, realistic, solo	Download GIF
Pixart + Lora	Euler A	43	512x512x80	1girl, bare shoulders, blurry, blurry background, blurry foreground, bokeh, brown eyes, christmas tree, closed mouth, collarbone, depth of field, earrings, jewelry, lips, long hair, looking at viewer, photo \(medium\), realistic, smile, solo	Download GIF

可以看出，EasyAnimate具有良好的可拓展性，无论是训练Checkpoint还是Lora都可以应用到模型当中，另外，我们设计了分桶策略与自适应视频裁剪，模型既可以预测512x512的视频，也可以预测如384x768的视频。

EasyAnimate原理界面展示

参考Animatediff，我们为EasyAnimate也提供了对应的界面，在界面上，我们可以选择基础模型、motion module版本、基础checkpoint和lora模型。

在填入prompt和neg prompt后，就可以在下面点击generate进行生成了。
在这里插入图片描述

快速启动

云使用: AliyunDSW/Docker

a. 通过阿里云 DSW
我们暂时还没有快速启动资源，等配置完成后再做更新。

b. 通过docker
使用docker的情况下，请保证机器中已经正确安装显卡驱动与CUDA环境，然后以此执行以下命令：

# 拉取镜像
docker pull mybigpai-public-registry.cn-beijing.cr.aliyuncs.com/easycv/torch_cuda:easyanimate# 进入镜像
docker run -it -p 7860:7860 --network host --gpus all --security-opt seccomp:unconfined --shm-size 200g mybigpai-public-registry.cn-beijing.cr.aliyuncs.com/easycv/torch_cuda:easyanimate# clone 代码
git clone https://github.com/aigc-apps/EasyAnimate.git# 进入EasyAnimate文件夹
cd EasyAnimate# 下载权重
mkdir models/Diffusion_Transformer
mkdir models/Motion_Module
mkdir models/Personalized_Modelwget https://pai-aigc-photog.oss-cn-hangzhou.aliyuncs.com/easyanimate/Motion_Module/easyanimate_v1_mm.safetensors -O models/Motion_Module/easyanimate_v1_mm.safetensors
wget https://pai-aigc-photog.oss-cn-hangzhou.aliyuncs.com/easyanimate/Personalized_Model/easyanimate_portrait.safetensors -O models/Personalized_Model/easyanimate_portrait.safetensors
wget https://pai-aigc-photog.oss-cn-hangzhou.aliyuncs.com/easyanimate/Personalized_Model/easyanimate_portrait_lora.safetensors -O models/Personalized_Model/easyanimate_portrait_lora.safetensors
wget https://pai-aigc-photog.oss-cn-hangzhou.aliyuncs.com/easyanimate/Diffusion_Transformer/PixArt-XL-2-512x512.tar -O models/Diffusion_Transformer/PixArt-XL-2-512x512.tarcd models/Diffusion_Transformer/
tar -xvf PixArt-XL-2-512x512.tar
cd ../../

本地安装: 环境检查/下载/安装

我们已验证EasyAnimate可在以下环境中执行：

Linux 的详细信息：

操作系统 Ubuntu 20.04, CentOS
python: python3.10 & python3.11
pytorch: torch2.2.0
CUDA: 11.8
CUDNN: 8+
GPU： Nvidia-A10 24G & Nvidia-A100 40G & Nvidia-A100 80G

我们需要大约 60GB 的可用磁盘空间，请检查！

b. 权重放置
我们最好将权重按照指定路径进行放置：

📦 models/
├── 📂 Diffusion_Transformer/
│   └── 📂 PixArt-XL-2-512x512/
├── 📂 Motion_Module/
│   └── 📄 easyanimate_v1_mm.safetensors
├── 📂 Motion_Module/
│   ├── 📄 easyanimate_portrait.safetensors
│   └── 📄 easyanimate_portrait_lora.safetensors

如何使用

生成

运行python文件

步骤1：下载对应权重放入models文件夹。
步骤2：在predict_t2v.py文件中修改prompt、neg_prompt、guidance_scale和seed。
步骤3：运行predict_t2v.py文件，等待生成结果，结果保存在samples/easyanimate-videos文件夹中。
步骤4：如果想结合自己训练的其他backbone与Lora，则看情况修改predict_t2v.py中的predict_t2v.py和lora_path。

通过ui界面

步骤1：下载对应权重放入models文件夹。
步骤2：运行app.py文件，进入gradio页面。
步骤3：根据页面选择生成模型，填入prompt、neg_prompt、guidance_scale和seed等，点击生成，等待生成结果，结果保存在sample文件夹中。

模型训练

训练视频生成模型

i、基于webvid数据集

如果使用webvid数据集进行训练，则需要首先下载webvid的数据集。

您需要以这种格式排列webvid数据集。

📦 project/
├── 📂 datasets/
│   ├── 📂 webvid/
│       ├── 📂 videos/
│       │   ├── 📄 00000001.mp4
│       │   ├── 📄 00000002.mp4
│       │   └── 📄 .....
│       └── 📄 csv_of_webvid.csv

然后，进入scripts/train_t2v.sh进行设置。

export DATASET_NAME="datasets/webvid/videos/"
export DATASET_META_NAME="datasets/webvid/csv_of_webvid.csv"...train_data_format="webvid"

最后运行scripts/train_t2v.sh。

sh scripts/train_t2v.sh

ii、基于自建数据集

如果使用内部数据集进行训练，则需要首先格式化数据集。

您需要以这种格式排列数据集。

📦 project/
├── 📂 datasets/
│   ├── 📂 internal_datasets/
│       ├── 📂 videos/
│       │   ├── 📄 00000001.mp4
│       │   ├── 📄 00000002.mp4
│       │   └── 📄 .....
│       └── 📄 json_of_internal_datasets.json

json_of_internal_datasets.json是一个标准的json文件，如下所示：

[{"file_path": "videos/00000001.mp4","text": "A group of young men in suits and sunglasses are walking down a city street.","type": "video"},{"file_path": "videos/00000002.mp4","text": "A notepad with a drawing of a woman on it.","type": "video"}.....
]

json中的file_path需要设置为相对路径。

然后，进入scripts/train_t2v.sh进行设置。

export DATASET_NAME="datasets/internal_datasets/"
export DATASET_META_NAME="datasets/internal_datasets/json_of_internal_datasets.json"...train_data_format="normal"

最后运行scripts/train_t2v.sh。

sh scripts/train_t2v.sh

训练基础文生图模型

i、基于diffusers格式

数据集的格式可以设置为diffusers格式。

📦 project/
├── 📂 datasets/
│   ├── 📂 diffusers_datasets/
│       ├── 📂 train/
│       │   ├── 📄 00000001.jpg
│       │   ├── 📄 00000002.jpg
│       │   └── 📄 .....
│       └── 📄 metadata.jsonl

然后，进入scripts/train_t2i.sh进行设置。

export DATASET_NAME="datasets/diffusers_datasets/"...train_data_format="diffusers"

最后运行scripts/train_t2i.sh。

sh scripts/train_t2i.sh

ii、基于自建数据集

如果使用自建数据集进行训练，则需要首先格式化数据集。

您需要以这种格式排列数据集。

📦 project/
├── 📂 datasets/
│   ├── 📂 internal_datasets/
│       ├── 📂 train/
│       │   ├── 📄 00000001.jpg
│       │   ├── 📄 00000002.jpg
│       │   └── 📄 .....
│       └── 📄 json_of_internal_datasets.json

json_of_internal_datasets.json是一个标准的json文件，如下所示：

[{"file_path": "train/00000001.jpg","text": "A group of young men in suits and sunglasses are walking down a city street.","type": "image"},{"file_path": "train/00000002.jpg","text": "A notepad with a drawing of a woman on it.","type": "image"}.....
]

json中的file_path需要设置为相对路径。

然后，进入scripts/train_t2i.sh进行设置。

export DATASET_NAME="datasets/internal_datasets/"
export DATASET_META_NAME="datasets/internal_datasets/json_of_internal_datasets.json"...train_data_format="normal"

最后运行scripts/train_t2i.sh。

sh scripts/train_t2i.sh

训练Lora模型

i、基于diffusers格式

数据集的格式可以设置为diffusers格式。

📦 project/
├── 📂 datasets/
│   ├── 📂 diffusers_datasets/
│       ├── 📂 train/
│       │   ├── 📄 00000001.jpg
│       │   ├── 📄 00000002.jpg
│       │   └── 📄 .....
│       └── 📄 metadata.jsonl

然后，进入scripts/train_lora.sh进行设置。

export DATASET_NAME="datasets/diffusers_datasets/"...train_data_format="diffusers"

最后运行scripts/train_lora.sh。

sh scripts/train_lora.sh

ii、基于自建数据集

如果使用自建数据集进行训练，则需要首先格式化数据集。

您需要以这种格式排列数据集。

📦 project/
├── 📂 datasets/
│   ├── 📂 internal_datasets/
│       ├── 📂 train/
│       │   ├── 📄 00000001.jpg
│       │   ├── 📄 00000002.jpg
│       │   └── 📄 .....
│       └── 📄 json_of_internal_datasets.json

json_of_internal_datasets.json是一个标准的json文件，如下所示：

[{"file_path": "train/00000001.jpg","text": "A group of young men in suits and sunglasses are walking down a city street.","type": "image"},{"file_path": "train/00000002.jpg","text": "A notepad with a drawing of a woman on it.","type": "image"}.....
]

json中的file_path需要设置为相对路径。

然后，进入scripts/train_lora.sh进行设置。

export DATASET_NAME="datasets/internal_datasets/"
export DATASET_META_NAME="datasets/internal_datasets/json_of_internal_datasets.json"...train_data_format="normal"

最后运行scripts/train_lora.sh。

sh scripts/train_lora.sh

算法细节

我们使用了PixArt-alpha作为基础模型，并在此基础上引入额外的运动模块（motion module）来将DiT模型从2D图像生成扩展到3D视频生成上来。其框架图如下：
请添加图片描述
其中，Motion Module 用于捕捉时序维度的帧间关系，其结构如下：

我们在时序维度上引入注意力机制来让模型学习时序信息，以进行连续视频帧的生成。同时，我们利用额外的网格计算（Grid Reshape），来扩大注意力机制的input token数目，从而更多地利用图像的空间信息以达到更好的生成效果。Motion Module 作为一个单独的模块，在推理时可以用在不同的DiT基线模型上。此外，EasyAnimate不仅支持了motion-module模块的训练，也支持了DiT基模型/LoRA模型的训练，以方便用户根据自身需要来完成自定义风格的模型训练，进而生成任意风格的视频。