[Paper Close Reading] Do Transformers Really Perform Bad for Graph Representation?

Paper URL: [2106.05234] Do Transformers Really Perform Bad for Graph Representation? (arxiv.org)

Paper code: https://github.com/Microsoft/Graphormer

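The English is typed entirely by hand! It is a summary and paraphrase of the original paper. Spelling and grammar mistakes may be hard to avoid; if you spot any, corrections in the comments are welcome! This post leans toward personal notes, so read with caution!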

1. TL;DR

1.1. Thoughts

1.2. Paper summary figure

2. Section-by-section close reading

2.1. Abstract

①Transformers have not achieved competitive performance compared with mainstream GNN variants

②The authors put forward Graphormer to change this situation

leaderboard n. a ranking board; a banner advertisement

2.2. Introduction

①Graphormer performs outstandingly on the Open Graph Benchmark Large-Scale Challenge (OGB-LSC) and on several popular leaderboards such as OGB and Benchmarking-GNN

②The Transformer only takes node similarity into consideration and does not account for structural relationships. Ergo, Graphormer adds structural encodings

③They capture node importance with Centrality Encoding (using degree centrality) and represent structural relationships between nodes with Spatial Encoding

④Graphormer occupies the top spot on the OGB-LSC, MolHIV, MolPCBA, ZINC and other rankings

de-facto: in practice; holding some status or power in reality rather than legally or formally

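canonical adj. according to canon law; belonging to the accepted scripture; standard, exemplary; accurate, authoritative; generally accepted, following scientific rules; (of a mathematical expression) in its simplest form; relating to an axiom (or standard formula); relating to the church (or clergy)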

2.3. Preliminary

（1）Graph Neural Network (GNN)

①Denote a graph as $G=\left ( V,E \right )$ , where $V=\{v_{1},v_{2},\cdots,v_{n}\}$ is the node set and $n=\left | V \right |$ is the number of nodes. The feature vector of $v_i$ is $x_i$ , the representation of $v_i$ at the $l$ -th layer is $h_i^{\left ( l \right )}$ , and $h_i^{\left ( 0 \right )}=x_i$

②A typical GNN is represented as:

$\begin{aligned}a_i^{(l)}&=\text{AGGREGATE}^{(l)}\left(\left\{h_j^{(l-1)}:j\in\mathcal{N}(v_i)\right\}\right)\\h_i^{(l)}&=\text{COMBINE}^{(l)}\left(h_i^{(l-1)},a_i^{(l)}\right)\\h_G&=\text{READOUT}\left(\left\{h_i^{(L)}\mid v_i\in G\right\}\right)\end{aligned}$

where $\mathcal{N}(v_i)$ denotes the neighbors of $v_i$ (first-order or higher-order; the number of hops is not fixed)

（2）Transformer

①Each layer in Transformer contains a self-attention module and a position-wise feed-forward network (FFN)

②The input of self-attention module is ${H}=\left[h_{1}^{\top},\cdots,h_{n}^{\top}\right]^{\top}\in\mathbb{R}^{n\times d}$ , where $d$ represents the hidden dimension, $h_{i}\in\mathbb{R}^{1\times d}$ denotes the hidden representation at position $i$

③The function of attention mechanism:

$\begin{aligned}Q&=HW_Q,W_{Q}\in\mathbb{R}^{d\times d_{K}} \quad K=HW_K,W_{K}\in\mathbb{R}^{d\times d_{K}} \quad V=HW_V,W_{V}\in\mathbb{R}^{d\times d_{V}} \\A&=\frac{QK^\top}{\sqrt{d_K}},\quad\mathrm{Attn}\left(H\right)=\mathrm{softmax}\left(A\right)V\end{aligned}$

where $A$ is a similarity matrix of queries and keys

④For simplicity of illustration, they consider single-head self-attention and assume $d_K=d_V=d$ . Bias terms are omitted, and the extension to multi-head attention is standard
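The single-head attention formula above maps directly onto matrix operations. The following is a minimal NumPy sketch under the same simplifications ($d_K=d_V=d$, no bias terms); the random inputs and the helper names `softmax` and `self_attention` are illustrative only.

```python
import numpy as np

def softmax(a, axis=-1):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(H, W_Q, W_K, W_V):
    """Single-head self-attention without bias terms."""
    Q, K, V = H @ W_Q, H @ W_K, H @ W_V
    d_K = K.shape[-1]
    A = Q @ K.T / np.sqrt(d_K)  # similarity matrix of queries and keys
    return softmax(A) @ V, A

rng = np.random.default_rng(0)
n, d = 4, 8                     # 4 positions, hidden dimension 8
H = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

out, A = self_attention(H, W_Q, W_K, W_V)
```

Each row of `softmax(A)` sums to 1, so every output row is a weighted average of the rows of `V`.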

2.4. Graphormer

2.4.1. Structural Encodings in Graphormer

The overall framework of Graphormer contains three modules:

（1）Centrality Encoding

①For a directed graph, the centrality encoding is added to the input features:

$h_{i}^{(0)}=x_{i}+z_{\deg^{-}(v_{i})}^{-}+z_{\deg^{+}(v_{i})}^{+}$

where $z_{\deg^{-}(v_{i})}^{-}\in \mathbb{R}^d$ is the learnable embedding vector selected by the indegree $\deg^{-}(v_{i})$ , and $z_{\deg^{+}(v_{i})}^{+}\in \mathbb{R}^d$ is the learnable embedding vector selected by the outdegree $\deg^{+}(v_{i})$ (hmm, I can't quite picture what z actually looks like yet)

②For an undirected graph, a single $\deg(v_{i})$ replaces both $\deg^{+}(v_{i})$ and $\deg^{-}(v_{i})$
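One way to picture $z$ is as a row of a lookup table: one table for indegrees and one for outdegrees, where row $k$ is the embedding for degree $k$. The sketch below illustrates this reading with NumPy. The tables are random here (in training they would be learned), and `max_degree` and the function name are assumptions made for the example.

```python
import numpy as np

def centrality_encoding(x, edges, z_in, z_out):
    """Return h^(0) = x + z_in[indegree] + z_out[outdegree] for a directed graph."""
    n = x.shape[0]
    deg_in = np.zeros(n, dtype=int)
    deg_out = np.zeros(n, dtype=int)
    for src, dst in edges:
        deg_out[src] += 1
        deg_in[dst] += 1
    # Integer-array indexing picks one embedding row per node
    return x + z_in[deg_in] + z_out[deg_out], deg_in, deg_out

rng = np.random.default_rng(0)
n, d, max_degree = 3, 4, 5                   # max_degree is a hypothetical cap
x = rng.normal(size=(n, d))
z_in = rng.normal(size=(max_degree + 1, d))  # row k: embedding for indegree k
z_out = rng.normal(size=(max_degree + 1, d)) # row k: embedding for outdegree k

edges = [(0, 1), (0, 2), (1, 2)]             # directed edges (src, dst)
h0, deg_in, deg_out = centrality_encoding(x, edges, z_in, z_out)
```

Nodes with the same in- and outdegree receive the same offset, so the encoding depends only on degree and not on node identity.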

（2）Spatial Encoding

①Nodes in a graph are not arranged in a sequence. To this end, they provide a new spatial encoding method to represent the spatial relation between $v_i$ and $v_j$ :

${\phi\left(v_{i},v_{j}\right):V\times V\rightarrow\mathbb{R}}$

where they choose $\phi\left(v_{i},v_{j}\right)$ to be the distance of the shortest path (SPD) between $v_i$ and $v_j$ . If the two nodes are not connected, the value is set to -1.

②⭐A learnable scalar is assigned to each feasible output value of $\phi$ and serves as a bias term in the self-attention module (my own grasp of attention is too weak, sigh)

③The Q-K product matrix $A$ can be calculated by:

$A_{ij}=\frac{(h_iW_Q)(h_jW_K)^T}{\sqrt{d}}+b_{\phi(v_i,v_j)}$

where $b_{\phi(v_i,v_j)}$ denotes a learnable scalar indexed by $\phi(v_i,v_j)$

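④I ran out of steam paraphrasing here, so roughly: compared with a conventional GNN, the formula above gives a global view, since every node can attend to every other node. It is also adaptive: the authors give the example that if $b_{\phi(v_i,v_j)}$ is learned to decrease as $\phi\left(v_{i},v_{j}\right)$ grows, each node will likely pay more attention to nearby nodes.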
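The spatial encoding steps above can be sketched as follows: compute $\phi$ as the shortest path distance with a breadth-first search, then add the scalar $b_{\phi(v_i,v_j)}$ to each attention score. This is a simplified illustration, not the official implementation. The undirected toy graph, the fixed values in `b`, and the trick of shifting indices by one so that $\phi=-1$ maps to `b[0]` are all assumptions made for the example.

```python
from collections import deque
import numpy as np

def shortest_path_distances(n, edges):
    """All-pairs SPD on an unweighted, undirected graph; -1 marks unreachable pairs."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    spd = np.full((n, n), -1, dtype=int)
    for s in range(n):
        spd[s, s] = 0
        queue = deque([s])
        while queue:                       # breadth-first search from source s
            u = queue.popleft()
            for v in adj[u]:
                if spd[s, v] == -1:
                    spd[s, v] = spd[s, u] + 1
                    queue.append(v)
    return spd

def biased_attention_scores(H, W_Q, W_K, spd, b):
    """A_ij = (h_i W_Q)(h_j W_K)^T / sqrt(d) + b[phi(v_i, v_j)]."""
    d = W_K.shape[1]
    A = (H @ W_Q) @ (H @ W_K).T / np.sqrt(d)
    # Shift indices by one so that phi = -1 (unreachable) maps to b[0]
    return A + b[spd + 1]

# Toy graph: path 0 - 1 - 2, node 3 is isolated
spd = shortest_path_distances(4, [(0, 1), (1, 2)])

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))
W_Q, W_K = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))

# One scalar per feasible phi value (-1, 0, 1, 2); fixed here, learned in practice
b = np.array([-5.0, 0.0, -1.0, -2.0])

scores = biased_attention_scores(H, W_Q, W_K, spd, b)
plain = biased_attention_scores(H, W_Q, W_K, spd, np.zeros_like(b))
```

Because `b` decreases with distance in this toy setup, the bias pushes attention toward nearby nodes, which matches the example the authors give.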

（3）Edge Encoding in the Attention

①Previous works encode edges in two traditional ways: adding edge features to the associated node features, or using the associated edge features together with node features during aggregation. However, both are limited, because they only express relationships between adjacent nodes rather than global relationships.

②The shortest path (SP) between each node pair can be written as $SP_{ij}=\left ( e_1,e_2,...,e_N \right )$

③They calculate the average, along the path, of the dot-products between the edge feature $x_{e_{n}}$ of the $n$ -th edge $e_n$ and the $n$ -th learnable weight embedding $w_{n}^{E}\in\mathbb{R}^{d_{E}}$ :