xFormers - Programming Notes

Contents

- I. About xFormers
- II. Installing xFormers
- III. Benchmarks
  - (Optional) Testing the installation
- IV. Using xFormers
  - 1. Transformers key concepts
  - 2. Repo map
    - Attention mechanisms
    - Feed forward mechanisms
    - Positional embedding
    - Residual paths
    - Initializations
  - 3. Key features
  - 4. Installation troubleshooting

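I. About xFormers

xFormers is a PyTorch-based library that hosts flexible Transformer parts.

These parts are interoperable, optimized building blocks that can optionally be combined to create some state-of-the-art models.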

GitHub: https://github.com/facebookresearch/xformers
Official site: https://facebookresearch.github.io/xformers/

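xFormers is:

Customizable building blocks: independent, customizable building blocks that can be used without boilerplate code. The components are domain-agnostic, and xFormers is used by researchers in vision, NLP, and other fields.
Research first: xFormers contains bleeding-edge components that are not yet available in mainstream libraries such as PyTorch.
Built with efficiency in mind: because iteration speed matters, components are as fast and memory-efficient as possible. xFormers contains its own CUDA kernels, but dispatches to other libraries when relevant.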

II. Installing xFormers

(Recommended, Linux) Install the latest stable release with conda. Requires PyTorch 2.3.0 installed with conda:

conda install xformers -c xformers

(Recommended, Linux and Windows) Install the latest stable release with pip. Requires PyTorch 2.3.0:

# cuda 11.8 version
pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu118
# cuda 12.1 version
pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu121
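
Since the prebuilt binaries are tied to a specific PyTorch release, it can help to check the installed version before running the install commands. The following is a minimal sketch using only the Python standard library. The helper names `parse_version` and `torch_matches_requirement` are made up for this example, and the required version is simply the 2.3.0 stated above.

```python
import re
from importlib import metadata

REQUIRED_TORCH = (2, 3, 0)  # the PyTorch release named in the install notes above


def parse_version(text):
    """Turn a version string such as '2.3.0+cu121' into (2, 3, 0)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def torch_matches_requirement():
    """Return True or False, or None when torch is not installed at all."""
    try:
        installed = metadata.version("torch")
    except metadata.PackageNotFoundError:
        return None
    return parse_version(installed) == REQUIRED_TORCH


print("torch matches requirement:", torch_matches_requirement())
```

A result of `None` means PyTorch is missing from the current environment, and `False` suggests that building from source may be the safer route.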

Development binaries:

# Use either conda or pip, same requirements as for the stable version above
conda install xformers -c xformers/label/dev
pip install --pre -U xformers

Install from source, for example if you want to use xFormers with another version of PyTorch (including nightly releases):

# (Optional) Makes the build much faster
pip install ninja
# Set TORCH_CUDA_ARCH_LIST if running and building on different GPU types
pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers
# (this can take dozens of minutes)

III. Benchmarks

Memory-efficient MHA. Setup: A100 on f16, measuring the total time for a forward + backward pass.

[Figure not available: memory-efficient MHA benchmark plot]

Note that this is exact attention, not an approximation, obtained simply by calling xformers.ops.memory_efficient_attention.
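
To see why an attention computation can save memory and still be exact, here is a pure-Python sketch for a single head. It is not the xFormers kernel (which runs on the GPU); it only illustrates the general idea of visiting keys and values in chunks while maintaining a running softmax. The function names and the `chunk` parameter are assumptions made for this example.

```python
import math


def naive_attention(q, k, v):
    """Reference softmax(q k^T / sqrt(d)) v. Shapes: q [M][d], k [N][d], v [N][dv]."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) / total
                    for c in range(len(v[0]))])
    return out


def chunked_attention(q, k, v, chunk=2):
    """Same result, but only `chunk` scores per query are held at any time."""
    d, dv = len(q[0]), len(v[0])
    out = []
    for qi in q:
        running_max, denom, acc = -math.inf, 0.0, [0.0] * dv
        for start in range(0, len(k), chunk):
            ks, vs = k[start:start + chunk], v[start:start + chunk]
            scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in ks]
            new_max = max(running_max, max(scores))
            # Rescale the partial sums accumulated under the previous maximum.
            rescale = math.exp(running_max - new_max)
            denom *= rescale
            acc = [a * rescale for a in acc]
            for s, vj in zip(scores, vs):
                w = math.exp(s - new_max)
                denom += w
                acc = [a + w * x for a, x in zip(acc, vj)]
            running_max = new_max
        out.append([a / denom for a in acc])
    return out


q = [[0.1, 0.2], [0.3, -0.4]]
k = [[0.5, 0.1], [-0.2, 0.7], [0.9, 0.9], [0.0, -1.0], [0.3, 0.3]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5], [0.5, 0.5]]
diff = max(abs(a - b)
           for ra, rb in zip(naive_attention(q, k, v), chunked_attention(q, k, v))
           for a, b in zip(ra, rb))
print("max difference:", diff)

# With xFormers installed on a CUDA machine, the corresponding call is
# (tensors shaped [batch, seq_len, heads, head_dim]):
#   import xformers.ops as xops
#   out = xops.memory_efficient_attention(q, k, v)
```

The two functions agree up to floating-point rounding, which is the sense in which the chunked form is exact rather than approximate.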

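More benchmarks

xFormers provides many components, and more benchmarks are available in BENCHMARKS.md.

(Optional) Testing the installation

This command prints information about the xFormers installation, including which kernels are built and available: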

python -m xformers.info
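
The same check can be scripted, for example in a setup or CI step. This is a minimal sketch using only the standard library; the wrapper name `xformers_info` is made up for this example, and it simply runs the command above when the package can be found.

```python
import importlib.util
import subprocess
import sys


def xformers_info():
    """Return the output of `python -m xformers.info`, or None if xformers is absent."""
    if importlib.util.find_spec("xformers") is None:
        return None
    result = subprocess.run(
        [sys.executable, "-m", "xformers.info"],
        capture_output=True,
        text=True,
    )
    return result.stdout


report = xformers_info()
print(report if report is not None else "xformers is not installed in this environment")
```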

IV. Using xFormers

1. Transformers key concepts

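Let's start with a classical overview of the Transformer architecture (illustration from "A Survey of Transformers" by Lin et al.).

[Figure not available: Transformer architecture overview from Lin et al.]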

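You will find the key repository boundaries in this illustration: a Transformer is generally made of a collection of attention mechanisms, embeddings that encode some positional information, feed-forward blocks, and a residual path (typically referred to as pre- or post-layer norm). These boundaries do not work for every model, but we found in practice that, given some accommodations, they capture most of the state of the art.

Models are therefore not implemented in monolithic files, which are typically complicated to handle and modify. Most of the concepts in the illustration above correspond to an abstraction level, and when variants exist for a given sub-block it should always be possible to select any of them. You can focus on a given encapsulation level and modify it as needed.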

2. Repo map

├── ops                         # Functional operators
│    └ ...
├── components                  # Parts zoo, any of which can be used directly
│   ├── attention
│   │    └ ...                  # all the supported attentions
│   ├── feedforward             #
│   │    └ ...                  # all the supported feedforwards
│   ├── positional_embedding    #
│   │    └ ...                  # all the supported positional embeddings
│   ├── activations.py          #
│   └── multi_head_dispatch.py  # (optional) multihead wrap
|
├── benchmarks
│     └ ...                     # A lot of benchmarks that you can use to test some parts
└── triton
      └ ...                     # (optional) all the triton parts, requires triton + CUDA gpu

Attention mechanisms

Scaled dot product
- Attention is all you need, Vaswani et al., 2017
Sparse
- whenever a sparse enough mask is passed
BlockSparse
- courtesy of Triton
Linformer
- Linformer, self-attention with linear complexity, Wang et al., 2020
Nystrom
- Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention, Xiong et al., 2021
Local. Notably used in (and many others)
- Longformer: The Long-Document Transformer, Beltagy et al., 2020
- BigBird, Transformer for longer sequences, Zaheer et al., 2020
Favor/Performer
- Rethinking Attention with Performers, Choromanski et al., 2020
Orthoformer
- Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers, Patrick et al., 2021
Random
- See BigBird, Longformers,…
Global
- See BigBird, Longformers,…
FourierMix
- FNet: Mixing Tokens with Fourier Transforms, Lee-Thorp et al.
CompositionalAttention
- Compositional Attention: Disentangling search and retrieval, S. Mittal et al.
2D Pooling
- Metaformer is actually what you need for vision, Yu et al.
Visual Attention
- Visual Attention Network, Guo et al.
… add a new one see Contribution.md
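
Several of the entries above (Sparse, Local, Random, Global) come down to which query/key pairs are allowed to interact. As a rough illustration of a "Local" pattern, the following pure-Python sketch builds a banded mask and measures how sparse it is. The helper names `local_mask` and `density` are made up for this example and are not xFormers APIs.

```python
def local_mask(seq_len, window):
    """Boolean mask: position i may attend to j only if |i - j| <= window // 2."""
    half = window // 2
    return [[abs(i - j) <= half for j in range(seq_len)] for i in range(seq_len)]


def density(mask):
    """Fraction of allowed (True) entries; lower values mean a sparser pattern."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


mask = local_mask(seq_len=8, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
print(f"density: {density(mask):.3f}")
```

The density shrinks as the sequence grows while the window stays fixed, which is the regime where a sparse implementation becomes worthwhile.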

Feed forward mechanisms

MLP
Fused
Mixture of Experts
Conv2DFeedforward

Positional embedding

Sine
Vocabulary
Rotary
Simplicial
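
For reference, the "Sine" entry refers to the sinusoidal encoding from "Attention is all you need". The sketch below follows the formula from that paper in plain Python; it is an assumption that the xFormers implementation matches it in every detail, and the function name is made up for this example.

```python
import math


def sine_positional_embedding(seq_len, dim, base=10000.0):
    """Sinusoidal table: even columns use sin, odd columns use cos."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(dim):
            # Columns 2j and 2j + 1 share the same frequency.
            angle = pos / (base ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


for row in sine_positional_embedding(seq_len=3, dim=4):
    print([round(value, 4) for value in row])
```

Each row is added to the token embedding at the same position, so the model can tell positions apart without any learned parameters.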

Residual paths

Pre
Post
DeepNorm
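
The three residual styles differ only in where the layer normalization sits relative to the residual sum. The following plain-Python sketch makes that concrete for a single vector. It is a simplification under stated assumptions: the layer norm has no learnable scale or bias, the weight-initialization rescaling that DeepNorm also prescribes is left out, and all function names are made up for this example.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learnable parameters)."""
    mean = sum(x) / len(x)
    var = sum((value - mean) ** 2 for value in x) / len(x)
    return [(value - mean) / math.sqrt(var + eps) for value in x]


def pre_norm(x, sublayer):
    """Pre: x + sublayer(LN(x))"""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def post_norm(x, sublayer):
    """Post: LN(x + sublayer(x))"""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def deep_norm(x, sublayer, alpha):
    """DeepNorm: LN(alpha * x + sublayer(x)), with alpha > 1 for deep stacks."""
    return layer_norm([alpha * a + b for a, b in zip(x, sublayer(x))])


def toy_sublayer(values):
    """Stand-in for an attention or feed-forward block."""
    return [0.5 * value + 1.0 for value in values]


x = [1.0, 2.0, 4.0]
print("pre :", pre_norm(x, toy_sublayer))
print("post:", post_norm(x, toy_sublayer))
print("deep:", deep_norm(x, toy_sublayer, alpha=2.0))
```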

Initializations

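This is completely optional, and only happens when generating full models through xFormers, not when picking parts individually.

Two initialization mechanisms are exposed, but users are free to initialize weights as they see fit after the fact.

Parts can expose an init_weights() method, which defines sane defaults
xFormers supports specific init schemes, which can take precedence over init_weights()

If the second code path is used (constructing the model through the model factory), we check that all the weights have been initialized, and may raise an error if that is not the case (if you set xformers.factory.weight_init.__assert_if_not_initialized = True).

Supported initialization schemes are: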

Small init
Timm defaults
ViT defaults
Moco v3 defaults

One way to specify the init scheme is to set the config.weight_init field to the matching enum value. This can easily be extended, so feel free to submit a PR!

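3. Key features

Many attention mechanisms, interchangeable
Optimized building blocks, beyond PyTorch primitives
1. Memory-efficient exact attention - up to 10x faster
2. Sparse attention
3. Block-sparse attention
4. Fused softmax
5. Fused linear layer
6. Fused layer norm
7. Fused dropout(activation(x+bias))
8. Fused SwiGLU
Benchmarking and testing tools
1. Micro benchmarks
2. Transformer block benchmark
3. LRA, with SLURM support
Programmatic and sweep-friendly layer and model construction
1. Compatible with hierarchical Transformers, like Swin or Metaformer
Hackable
1. Not using monolithic CUDA kernels, composable building blocks
2. Using Triton for some optimized parts, explicit, Pythonic and user-accessible
3. Native support for SquaredReLU (on top of ReLU, LeakyReLU, GeLU), extensible activations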
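
To make two of the items above concrete, here is an unfused plain-Python reference for dropout(activation(x+bias)), using a squared ReLU as the activation. A fused kernel would compute the same values in a single pass over memory; this sketch only pins down the math. The function names and the `rng` parameter are assumptions made for this example.

```python
import random


def squared_relu(x):
    """SquaredReLU: max(x, 0) squared."""
    return max(x, 0.0) ** 2


def dropout_activation_bias(x, bias, p, activation=squared_relu, rng=None):
    """Unfused reference for dropout(activation(x + bias)) on flat lists of floats."""
    rng = rng or random.Random(0)
    keep = 1.0 - p
    out = []
    for xi, bi in zip(x, bias):
        y = activation(xi + bi)
        # Inverted dropout: zero with probability p, rescale survivors by 1 / (1 - p).
        out.append(y / keep if rng.random() < keep else 0.0)
    return out


print(dropout_activation_bias([-1.0, 2.0, 0.5], [0.5, 1.0, 0.5], p=0.0))
print(dropout_activation_bias([-1.0, 2.0, 0.5], [0.5, 1.0, 0.5], p=0.5))
```

With `p=0.0` the result is just the activation applied to `x + bias`, which makes the function easy to compare against any fused implementation.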

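4. Installation troubleshooting

Make sure NVCC and the current CUDA runtime match. Depending on your setup, you may be able to change the CUDA runtime with module unload cuda; module load cuda/xx.x, and possibly also nvcc
Make sure the version of GCC you are using matches the current NVCC capabilities
Make sure the TORCH_CUDA_ARCH_LIST env variable is set to the architectures you want to support. A suggested setup (slow to build but comprehensive) is export TORCH_CUDA_ARCH_LIST="6.0;6.1;6.2;7.0;7.2;7.5;8.0;8.6"
If the build from source runs out of memory (OOM), you can reduce ninja's parallelism with MAX_JOBS (for example MAX_JOBS=2)
If you encounter UnsatisfiableError when installing with conda, make sure PyTorch is installed in your conda environment, and that your setup (PyTorch version, CUDA version, Python version, OS) matches an existing xFormers binary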
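
The two environment variables mentioned above can also be set from a script that launches the source build. The sketch below only prepares the environment and the command; the helper name `source_build_env` is made up for this example, and the architecture list and job count shown are arbitrary choices, not recommendations.

```python
import os


def source_build_env(arch_list, max_jobs=None, base_env=None):
    """Return a copy of the environment with the build-related variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["TORCH_CUDA_ARCH_LIST"] = ";".join(arch_list)
    if max_jobs is not None:
        env["MAX_JOBS"] = str(max_jobs)  # limits ninja's parallelism
    return env


# Same command as in the "install from source" section.
PIP_COMMAND = [
    "pip", "install", "-v", "-U",
    "git+https://github.com/facebookresearch/xformers.git@main#egg=xformers",
]

env = source_build_env(["8.0", "8.6"], max_jobs=2, base_env={})
print(env)

# To actually start the build (this can take a long time):
#   import subprocess
#   subprocess.run(PIP_COMMAND, env=source_build_env(["8.0", "8.6"], max_jobs=2), check=True)
```

Restricting the list to the GPUs you actually use shortens the build, at the cost of a binary that will not run on other architectures.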

2024-05-14 (Tue)