[논문리뷰] Graph Transformer Networks

Yun, S., Jeong, M., Kim, R., Kang, J., & Kim, H. J. (2019). Graph transformer networks. Advances in Neural Information Processing Systems, 32, 11983-11993.

Abstract

Graph neural networks (GNNs) have been widely used in representation learning on graphs and achieved state-of-the-art performance in tasks such as node classification and link prediction. However, most existing GNNs are designed to learn node representations on the fixed and homogeneous graphs. The limitations especially become problematic when learning representations on a misspecified graph or a heterogeneous graph that consists of various types of nodes and edges. In this paper, we propose Graph Transformer Networks (GTNs) that are capable of generating new graph structures, which involve identifying useful connections between unconnected nodes on the original graph, while learning effective node representation on the new graphs in an end-to-end fashion. Graph Transformer layer, a core layer of GTNs, learns a soft selection of edge types and composite relations for generating useful multi-hop connections so-called meta-paths. Our experiments show that GTNs learn new graph structures, based on data and tasks without domain knowledge, and yield powerful node representation via convolution on the new graphs. Without domain-specific graph preprocessing, GTNs achieved the best performance in all three benchmark node classification tasks against the state-of-the-art methods that require pre-defined meta-paths from domain knowledge.

1. Introduction

1) 배경

Graph Neural Network (GNN)은 graph 구조의 데이터의 representation을 학습하기 위한 중요한 도구가 됨
한계
- GNN은 graph가 고정되어있고 homogeneous함을 가정함
- 없거나 거짓의 연결이 존재하는 noisy graph는 graph에 잘못된 neighbor을 기반으로한 ineffective convolution으로 이어진다.
- 다양한 형태의 nodes나 relation을 기반으로 하는 heterogeneous graph에는 GNN을 적용하기가 어렵다.
  - heterogeneous graph의 경우 node type이나 edge type의 중요도는 task에 따라서 달라지고 어떤 경우에는 아예 쓸모가 없을 수 있다
  - heterogeneous graph를 다룰 때 homogeneous graph처럼 하나의 node와 edge로만 구성되었다고 볼 수도 있지만 모든 information에 대한 정보를 다 처리하는 것이 아님
  - 따라서 이러한 그래프를 다루기 위해서 meta-path(paths connected with heterogeneous edge types)를 직접 디자인을 하기 시작하였고, heterogeneous 그래프를 meta-path로 구성된 homogeneous graph로 transform하였음
  - Meta-path를 수동으로(?) 그려주는 것은 비효율적이다

2) Proposal 1 - Graph transformer networks (GTNs)

기존의 그래프에서 유용한 multi-hop connection (e.g. meta-path)을 포함하고 noisy connection을 배제하는 새로운 graph로 기존의 그래프를 transform하도록 학습
새로운 그래프의 node representation을 end-to-end로 학습
Graph Transformer Layer 은 edge type의 인접행렬의 soft selection을 학습하고, 유용한 meta-path가 생성될 수 있게 2개의 선택된 인접행렬을 곱한다.
- soft selection :
Identity matrix를 leveraging함으로서, heterogeneous graph의 선택된 edge type과 연결된 arbitrary-length composite 관계들을 기반으로 한 새로운 그래프 구조를 생성할 수 있음

3) GTN의 issue 해결방안

FastGTN
- 큰 크기의 adjacency matrix를 곱해야 하기 때문에 computation costs와 large memory를 고려한 모델 제안
- FastGTN의 경우 2개의 인접행렬의 곱 없이 graph를 implicitly transform한다.
- 기존 GTN에 비해서 230배 빠르고 100배 적게 메모리를 사용하지만, GTN과 동일한 graph transformation이 가능케 하도록 한다.
Non-local operations
- GTN의 edge generation이 input graph의 meta-path의 nodes 연결에만 의존하였는데, 이는 node의 semantic proximity를 고려하지 못한 것이다.
- non-local operation을 허용함으로서, node의 semantic proximity를 고려한 graph transformation을 가능하게 함.

4) Contribution

GTN을 통해서 유용한 meta-path와 multi-hop connection을 판별하여 new graph structure을 학습할 수 있다.
Fast-GTNs를 통해서 인접행렬의 곱 없이 graph를 implicitly하게 transform이 가능하다
meta-path외의 node 간의 semantic proximity를 활용할 수 있도록 non-local operations 활용
6개의 heterogenous와 homogeneous graph에서 node classification에서 SOTA 달성

heterogeneous graph

GNN with relation-specific parameters
- R-GCN: employed GCN with relation-specific convolutions
- Heterogeneous Graph Transformer (HGT): parameterize the meta relation triplet of each edge type and transformer 아키텍처의 self-attention을 활용한 구조를 통해 다양한 관계의 구체적인 패턴을 학습함
GNN with relation-based graph transformations (meta-path를 주로 활용하는 방법)
- Heterogeneous graph attention network (HAN): 수동으로 선택한 meta-path를 통해 heterogeneous 그래프를 homogeneous graph로 transform시킨 뒤 그래프에 attention 기반 GNN 적용
  - multi-stage 접근법이고, 각 데이터 별로 meta-path를 지정해야 한다는 한계
  - 어떤 meta-path를 선택하느냐에 따라서 성능이 달라짐.

3. Method

더 효율적인 graph convolution을 만들고 강력한 node representation을 학습하기 위해서 여러 후보 인접행렬을 찾음
새로운 그래프 구조를 학습한다는 것은 유용한 meta-path와 multi-hop connections를 인식하는 것을 포함함

1) Preliminaries

Heterogeneous Graph
- Deach node v and each edge e가 각각의 type mapping function ( Tv(v): V → Tv, Te(e): ε→Te)과 연관되어 있는 Directed graph
  - G = (V, E, Tv, Te)
- 해당 그래프는 인접행렬의 set이나 tensor 으로 나타날 수 있다.
  - 이떄 At는 t 번째 edge type의 인접행렬이고, V =N
  - At[i,j]는 node j로부터 node i까지의 t번째 edge type의 weight을 나타냄.
    - 그래프가 single type의 node와 edge를 가지면, Tv = 1, Te = 1인 homogeneour graph이다.
Metha-path
- Heterogeneous 그래프에서의 multi-hop connection이며, heterogeneous edge type을 연결시켜주는 path
Graph Convolutional Network
- 해당 모델에서 end-toend로 node를 classify할 때 사용이 됨.
- H(l)을 l번째 layer의 feature representation이라고 할 때, forward propagation은 다음과 같다.
  - 이때, Ã = A+I로, A 인접행렬에 self-connetion이 더해진 것이며, D~의 경우 Ã의 degree matrix이다. in-degree diagonal matrix인 D의 역행렬을 통해서 A를 정규화할 수 있다.
  - W(l)은 학습가능한 weight matrix
  - 이 때, 주어진 그래프 구조에 따라 convolution operation이 이루어짐을 알 수 있고, node-wise 선형 transform인 H(l)W(l)을 제외하고는 학습이 불가능하다. 따라서, convolution layer은 node-wise linear transformation에 activation function σ를 취한 fixed convolution으로 구성된다.
  - 본 연구에서의 framework에서는 다양한 인접행렬로부터 추출한 convolution을 통해, 다양한 convolution이 가능하다.

2) Learning meta-graphs

핵심 아이디어
- meta-path 그래프를 학습할 때, 유용한 meta-path P가 특정한 순서의 edge type (t1,t2,…,tl)과 연결된 새로운 adjency matrix A_P를 edge type별로 인접행렬의 곱을 통해서 얻는다.
- GTN의 k번째 Graph transformer layer은 softmax 함수에서의 weight으로 1x1 convolution의 인접행렬(edge type)을 선택할 수 있게 된다.
  - 는 1x1 convolution의 parameter
  - 는 인접행렬의 convex combination
- Meta-path 인접행렬은 output과 이전 (k-1)번째의 GT layer의 output 행렬의 곱으로 계산되어진다.
- numerical stability를 위해 matrix는 normalized된다.
  - 이 때 D는 이전 단계의 2개의 행렬의 곱의 output의 degree matrix이다.
GTN이 임의의 meta-path를 edge type과 path-length 기반으로 학습할 수 있는지 확인
- 임의의 길이 k+1를 가지는 meta-path의 인접행렬 계산식
  - : set of edge types
  - : k번째 GT layer의 edge type tk에 대한 weight
- α가 one-hot vector일 때, Ap는 모든 (k+1) meta-path 인접 행렬들의 weighted sum임
Issue
- GT layer을 추가한다는 것은 meta-path의 길이를 늘리는 것이기 때문에, original edge를 고려하지 않게 된다.
  - 길이가 짧더라도 중요한 path일 수 있기 때문에, identity matrix I (A0=I)를 포함한다. 이를 통해 GTN은 최대 k+1의 길이인 모든 길이의 meta-path를 학습할 수 있게 된다.

3) Graph Transformer Networks

다양한 multi-path를 동시에 고려하기 위해
- Output channels를 1x1 filter로 설정하면 k번째 GT layer인 A(k)가 output tensor 가 되고, weight vector 가 weight matrix Φ(k)가 된다.
- Eq(5)는 tensor equation으로 나타날 수도 있다.
  - 이때, 이며 는 두 tensor의 합의 output의 degree tensor임.
- K의 GT layer을 쌓은 후, multi-layer GNN이 각 채널의 output tensor에 적용이 되고, node representation Z를 업데이트 한다.
  - 는 concatenation operator, C=number of channels, Z(l)은 l번째 GNN layer에서의 node representation을 의미한다.
  - 에서 A(k)의 c번째 channel에서의 self-loop을 가지는 인접행렬이다.
  - W(l)은 channel 모두 공유되는 학슴가능한 weight matrix이며 Z(0)은 X의 feature matrix, f_agg는 channel aggregation 함수이다.
- Final node representation Z(l)은 downstream task에 사용될 수 있다.
  - 본 연구에서 softmax layer에 dense layer를 적용하여 node representation에 사용한다. Node의 ground truth label을 사용하여 back propagation과 gradient desent의 cross-entropy를 최소화하도록 weight을 최적화한다.

4) Fast Graph Transformer Networks

GTN의 scalability issue를 해결하기 위한 것.

GTN의 scalability issue
- GTNs는 meta-path의 새로운 인접행렬을 2개의 인접행렬의 곱으로 explicit하게 계산하고, 각 layer 별로 새로운 인접행렬을 저장한다.
- 따라서 huge computational costs와 large memory가 필요하며, GTN을 large graph는 적용하기에 어렵게 만든다.
- 이러한 문제를 해결하기 위해서 GTN의 enhanced version인 FastGTN을 통해 meta-path의 새로운 인접행렬을 저장하지 않는 implicit한 그래프 structure transform을 진행한다.

Derivation
- C=1로 가정
- (문제)
- 하나의 GCN 레이어는 새로운 그래프 구조의 top에 얹어지고, channel aggregation function은 identity function과 동일하다. 이 때, GTN의 node representation Z는 다음과 같다.
  - A(k)는 K개의 GT 레이어를 가지는 GTN의 새로운 인접행렬이고, X는 input feature이며, W는 GCN layer의 선형 변환, D의 역행렬은 (A(k)+I)의 degree matrix의 역행렬이다.
  - GT layer에서 선택된 2개의 인접행렬을 곱한 후에 output 인접행렬을 정규화하는 것을 알 수 있다.
- Z는 다음과 같이 재구성이 가능하다.
  - huge adjacency matrix의 곱을 통해 computational mbottleneck이 사용됨을 알 수 있다.
- (해결)
- Matrix mutiplication의 associative property를 다음과 나타낼 수 있고, 이는 각 레이어에서 인접행렬의 행렬 곱 없이 differently constructed 인접행렬을 사용하여 feature transformation의 sequence를 통해 identical feature을 구할 수 있음을 의미한다.
  - 하지만 2개의 인접행렬의 곱을 구할 수 없기 때문에 degree matrix 를 구할 수 없다.
- Given condition of input data와 normalized matrices의 proposition을 통해 Eq. 11의 degree matrices를 identity matrices로 만들어 Eq. 11을 구할 수 있다.
- Proposition 1
  - (ii)로 인해, 각 k-th 레이어에서 가 identity matrix I가 되므로, 이에 따라 Eq 11.은 아래와 같이 기술 가능함
  - (iii)을 통해서 Z는 아래와 같이 표현이 가능함
- 각 레이어가 하나의 인접행렬의 convex combination으로 구성되어있기 때문에, K-layers는 K개의 인접행렬을 만들어낸다.
- 이는 수학적으로 FastGTNs는 GTN과 수학적으로 동일함을 의미한다
- 1부터 K까지의 순서를 뒤집고, 1/2를 hyper-parameter γ로 replace하면, FastGTN의 output은 다음과 같이 표현된다.
Multi-Channel setting에서의 Fast GTN
- C(l)은 number of channels, Z(l)은 l번째 FastGTN 레이어의 node-representation, 은 l번째 FastGTN 레이어의 c번째 channel의 선형변환, A는 normalized 인접행렬의 set, α(l;k;c)는 l번째 레이어의 k번째 FastGT 레이어의 c번째 채널의 convolution filter을 의미하고, Z(0)은 feature matrix이다.

5) Non-local operations

GTN의 단점은 transformation이 existing relation의 구성에 한정지어 졌다는 것이다. KGT layer은 (K+1)-hop relation에 대해서만 edge를 만들 수 있고, node의 semantic proximity between nodes를 기반으로 remote relation을 생성해낼 수 없다. 따라서 meta-path를 벗어난 node의 semantic proximity를 활용한 node feature를 포함할 수 있는 non-local operation을 제안함

Fast GCN에만 적용
각 l번째 FastGTN 의 layer에 k번째 FastGT layer에만 non-local 인접행렬 을 이전 FastGT 레이어의 hidden representation Z(l,k-l)을 기반으로 구성하고, 이 non-local 인접행렬을 인접행렬 후보 set A에 붙여, non-local relation을 활용할 수 있도록 한다.
non-local 인접행렬 구축
- Graph affinity matrix를 node similarity기준으로 l번째 FastGT layer마다 계산
- k-1번째 FastGT layer의 multi=channel hidden representation의 유사도를 구하고, 이를 latent space에 non-linear transformation gθ 를 통해 project한다.
- 이후 latent의 similarity를 통해 affinity matrix를 계산한다.
  - GAE를 통해 similarity 계산
- Affinity matrix는 fully connected graph의 weighted 인접행렬로 볼 수 있고, 해당 matrix 계산은 computing cost가 크기 때문에 각 node i 별로 n개의 가장 큰 weight을 추출하여 affinity matrix를 sparsify한다.
- 각 행의 edge weight에 softmax function을 적용하여 non-local adjacency matrix를 row-wise로 normalize한다.
- 이를 transformation에 사용하기 위해서 non-local parameter을 각 FastGT layer의 1x1 convolution filter에 사용한 후, k번째 FastGT layer의 인접행렬 set에 append한다.

6) 다른 GNN 아키텍처와의 비교

1. GCN과의 비교 시 아래 두 가지 사항은 동일하다.
  - GCN의 l번째 GCN layer의 output node representation
  - FastGTN의 FastGT layer의 개수가 1이고, channel의 개수가 1인 경우, node representation,
1. Mixhop은 FastGTN의 special case
1. RGCN 또한 FastGT 레이어가 하나이고, channel의 개수가 basis matrices와 동일한 경우, different linear combination을 적용한다는 것을 빼면 동일하다.

4. Experiments

Q1. How effective are the GTNs and FastGTNs with non-local operations compared to state-of-the-art GNNs on both homogeneous and heterogeneous graphs in node classification?
Q2. Can the FastGTNs efficiently perform the identical graph transformation compared to the GTNs?
Q3. Can GTNs adaptively produce a variable length of meta-paths depending on datasets?
Q4. How can we interpret the importance of each metapath from the adjacency matrix generated by GTNs?

1) Experimental Setting

Dataset
Baselines
- MLP, Node2Vec, GCN, GAT, JK-Net, Mixhop, GCNII, RGCN, HAN, HGT
Implementation with PyTorch
- Dimentionality of hidden representations: 64
- Adam optimizer
- hyper-parameter search
  - learning rate from 1e-3 to 1e-6
  - dropout rate: 0.1 to 0.8
  - epoch: 50 to 200.
- micro-F1 scores are computed
  2) Results
모든 데이터에서 GTN과 FastGTN이 best performance를 보임

3) Efficiency of FastGTN

Predictions (confidence scores) by the FastGTN과 GTN이 동일함
FastGTN이 GTN보다 computational cost와 memory 사용량에 있어서 효율적임. 특히 graph가 더 커질 수록 더 performance gain은 큼

4) Interpretation of GTN

GTN에서도 table 5의 사전 정의된 meta-paths를 잘 예측함.
GTN에서는 사전 정의되지 않은 중요한 meta-path 또한 잘 예측함.
- DBLP에서 CPCPA를 가장 중요한 meta-path라고 정의하는데 이는 사전 정의가 되지 않는다.
- Author의 research 분야가 venue와 연관되어있음을 의미하는데 이는 domain 적으로 의미가 있다.
GTN은 short meta-path를 학습하는데 더 유용하다. 특히 high-attention scores를 identity matrix에 assign하면 할 수록 deeper layer에서 더 길이가 짧은 meta-path를 중요하게 여기는 경향이 있음.

5. Conclusions

Future directions
- Applying GTNs to other tasks such as graph classification and link prediction.