[Graph Data Analysis and Applications] Graph Neural Networks
This post is based on the [Yonsei University 21-1 UT Seminar: Graph Data Analysis and Applications].
1. Graph Neural Networks
1. Deep Graph Encoders
- multiple layers of non-linear transformations based on graph structure
 - Output: node embeddings, subgraph or graph embeddings
 - Difficulties
- No fixed node ordering or reference point
 - Networks are dynamic and have multimodal features
 
 - Local network neighborhoods
 - Stacking multiple layers
 
2. Setup
- V: vertex set
 - A: adjacency matrix (assume binary)
 - X: matrix of node features
 - v: node in V; N(v): the set of neighbors of v
 - Node features
- Social networks: user profiles, user images
 - Biological networks: gene expression profiles, gene function information
 - When there are no node features in the graph dataset
- Indicator vectors (one-hot encoding of a node)
 - Vector of constant 1: [1,1,…,1]
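As a quick illustration, here is a minimal numpy sketch of both featureless options (the variable names are made up for this example):

```python
import numpy as np

num_nodes = 5

# Option 1: indicator vectors (one-hot encoding of each node)
X_onehot = np.eye(num_nodes)       # row v is the one-hot vector of node v

# Option 2: a constant feature of 1 for every node
X_const = np.ones((num_nodes, 1))  # every node gets the feature [1]
```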
 
 
 
3. Naive approach
- Join adjacency matrix and features
 - Feed them into a deep neural net
- Issues with this idea
- O(|V|) parameters
 - Not applicable to graphs of different sizes
 - Even if only the input graph changes, a new neural network has to be trained.
 
- Sensitive to node ordering
 
 
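To see the parameter-count issue concretely, here is a toy numpy sketch (all shapes are hypothetical):

```python
import numpy as np

num_nodes, feat_dim, hidden = 5, 3, 16
A = np.random.randint(0, 2, (num_nodes, num_nodes))  # binary adjacency matrix
X = np.random.randn(num_nodes, feat_dim)             # node features

# Naive input: each node's adjacency row joined with its features
inputs = np.concatenate([A, X], axis=1)              # shape: (num_nodes, num_nodes + feat_dim)

# The first weight matrix needs (num_nodes + feat_dim) input rows, i.e.
# O(|V|) parameters -- a graph of a different size needs a new network.
W1 = np.random.randn(num_nodes + feat_dim, hidden)
h = np.maximum(inputs @ W1, 0)                       # one ReLU layer
```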
 
4. Convolutional networks
- Goal: generalize convolutions beyond simple lattices
 - Leverage node features/attributes
 - From Images to graphs
- Transform information at the neighbors and combine it
- Transform ‘messages’ $h_i$ from neighbors: $W_i h_i$
 - Add them up
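In symbols, this transform-and-sum idea can be sketched as follows ($\sigma$ is a non-linearity; this is the generic form, not a specific model):

$$h_v = \sigma\Big(\sum_{i} W_i h_i\Big)$$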
 
 
 
5. Graph Convolutional networks
- Nodes’ neighborhood defines a computation graph
 - Learn how to propagate information across the graph to compute node features
 - Local network neighborhoods
- Aggregate neighbors
 - generate node embeddings based on local network neighborhoods
 - Every node defines a computation graph based on its neighborhood 

 
 - Many layers
- Models can be of arbitrary depth
- Nodes have embeddings at each layer
 - Layer 0 embedding of node u is its input features, x_u.
 - Layer k embedding gets information from nodes that are k hops away.
 
 
 - Neighborhood Aggregation
- Key distinctions are in how different approaches aggregate information across the layers
 - Average information from neighbors, then apply a neural network (see the update equation below)
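The per-node update can be written out as follows (a standard form of this update; $x_v$ is node v's input feature, and $W_l$, $B_l$ are trainable weight matrices):

$$h_v^{(0)} = x_v, \qquad h_v^{(l+1)} = \sigma\left(W_l \sum_{u \in N(v)} \frac{h_u^{(l)}}{|N(v)|} + B_l\, h_v^{(l)}\right)$$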

 
 - Model parameters: the trainable weight matrices $W_l$ and $B_l$ above, shared by all nodes
 - Matrix Formulation 

- Note: not all GNNs can be expressed in this matrix form, e.g., when the aggregation function is complex
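Stacking the embeddings into a matrix $H^{(l)}$ (one row per node) and letting $D$ be the diagonal degree matrix with $D_{vv} = |N(v)|$, the update above can be written as:

$$H^{(l+1)} = \sigma\left(D^{-1} A\, H^{(l)} W_l^{T} + H^{(l)} B_l^{T}\right)$$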
 
 - How to train a GNN
- Node embedding is a function of input graph
 - Supervised Setting: minimize the loss
- Directly train the model for the supervised task 

 
 - Unsupervised setting
- use the graph structure as the supervision
 - Similar nodes have similar embeddings 

 - node similarity can be anything including random walks or node proximity in graph
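As a concrete illustration of the supervised setting, here is a self-contained PyTorch sketch for node classification (the toy graph, the model, and the hyperparameters are all made up; the layer follows the GCN-style update above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy graph: 6 nodes, 4 features, 2 classes (all hypothetical)
A = torch.tensor([[0,1,1,0,0,0], [1,0,1,0,0,0], [1,1,0,1,0,0],
                  [0,0,1,0,1,1], [0,0,0,1,0,1], [0,0,0,1,1,0]], dtype=torch.float)
X = torch.randn(6, 4)
y = torch.tensor([0, 0, 0, 1, 1, 1])

class SimpleGNN(nn.Module):
    def __init__(self, in_dim, hidden, num_classes):
        super().__init__()
        self.W = nn.Linear(in_dim, hidden)   # neighbor transform
        self.B = nn.Linear(in_dim, hidden)   # self transform
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, X, A):
        deg = A.sum(dim=1, keepdim=True).clamp(min=1)
        h = torch.relu(self.W(A @ X / deg) + self.B(X))  # average neighbors, add self term
        return self.head(h)

model = SimpleGNN(4, 16, 2)
opt = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(100):
    opt.zero_grad()
    logits = model(X, A)
    loss = F.cross_entropy(logits, y)  # supervised node-classification loss
    loss.backward()
    opt.step()
```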
 
 
- The biggest advantage: inductive capability
- Generate embeddings for nodes as needed, even for nodes we never trained on.
 - The same aggregation parameters are shared for all nodes
 - The number of model parameters is sublinear in |V|, so the model can generalize to unseen nodes
 - The number of parameters to train does not depend on the number of nodes.
 
 
6. General perspectives on GNN
- GNN layer = message + aggregation
- GCN, GraphSAGE, GAT: how to design the message and define the function to aggregate the messages
 
 - A single GNN Layer
- Compress a set of vectors into a single vector
 - Two step process
- Message
 - Aggregation
 
 
 - 1) Message Computation
- Message function: $m_u^{(l)} = \text{MSG}^{(l)}\left(h_u^{(l-1)}\right)$
 - Intuition: Each node will create a message, which will be sent to other nodes later
 
 - 2) Aggregation
- Intuition: node v will aggregate the messages from its neighbors: $h_v^{(l)} = \text{AGG}^{(l)}\left(\left\{ m_u^{(l)} : u \in N(v) \right\}\right)$
 
 - Issue
- Information from node v itself could get lost
 - The computation of $h_v^{(l)}$ does not directly depend on $h_v^{(l-1)}$
 - Solution: include $h_v^{(l-1)}$ when computing $h_v^{(l)}$
- Solution in the message step: compute a message from node v itself
 - A different message computation is performed for the node itself
 
 - Solution in the aggregation step: after aggregating from the neighbors, aggregate the message from node v itself via concatenation or summation
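Putting the two fixes together in the notation above (a sketch; the self-message could use its own weight matrix, e.g., $m_v^{(l)} = B^{(l)} h_v^{(l-1)}$):

$$h_v^{(l)} = \text{CONCAT}\left(\text{AGG}^{(l)}\left(\left\{ m_u^{(l)} : u \in N(v) \right\}\right),\ m_v^{(l)}\right)$$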
 
 
7. Classical GNN Layers: GCN
- Message: $m_u^{(l)} = \frac{1}{|N(v)|} W^{(l)} h_u^{(l-1)}$
 - Aggregation: sum over the messages from neighbors, then apply an activation function: $h_v^{(l)} = \sigma\left(\sum_{u \in N(v)} m_u^{(l)}\right)$
8. GraphSAGE
- GraphSAGE (Sample and aggreGatE) 

 - Message is computed within the AGG.
 - Two-stage aggregation
- Stage1: Aggregate from node neighbors 

 - Stage2: Further aggregate over the node itself 
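Written out (a standard form of the GraphSAGE update; AGG stands for any of the aggregators listed below):

$$h_v^{(l)} = \sigma\left(W^{(l)} \cdot \text{CONCAT}\left(h_v^{(l-1)},\ \text{AGG}\left(\left\{ h_u^{(l-1)}, \forall u \in N(v) \right\}\right)\right)\right)$$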

 
 - Neighbor Aggregation
- Mean: take a weighted average of neighbors: $\text{AGG} = \sum_{u \in N(v)} \frac{h_u^{(l-1)}}{|N(v)|}$
 - Pool: transform neighbor vectors, then apply a symmetric vector function $\gamma$ (mean, max, sum, or std): $\text{AGG} = \gamma\left(\left\{ \text{MLP}(h_u^{(l-1)}),\ \forall u \in N(v) \right\}\right)$
 - LSTM: apply an LSTM to a randomly shuffled ordering of the neighbors (an LSTM by itself is not permutation-invariant)
 
 - L2 Normalization
- Apply $\ell_2$ normalization to $h_v^{(l)}$ at every layer (normalize each embedding to a unit-length vector): $h_v^{(l)} \leftarrow h_v^{(l)} / \lVert h_v^{(l)} \rVert_2$
 - Without $\ell_2$ normalization, the embedding vectors have different scales
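A minimal PyTorch sketch of one GraphSAGE layer with mean aggregation (the class and variable names are hypothetical, not the reference implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SAGELayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        # One weight matrix applied to [self ; aggregated neighbors]
        self.W = nn.Linear(2 * in_dim, out_dim)

    def forward(self, H, A):
        # Stage 1: mean-aggregate the neighbors' embeddings
        deg = A.sum(dim=1, keepdim=True).clamp(min=1)
        h_neigh = A @ H / deg
        # Stage 2: concatenate with the node's own embedding, then transform
        h = torch.relu(self.W(torch.cat([H, h_neigh], dim=1)))
        # L2-normalize every embedding to unit length
        return F.normalize(h, p=2, dim=1)
```

Stacking several such layers (with the neighborhood sampling below) gives a GraphSAGE-style encoder.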
 
 - Neighborhood sampling
- Randomly sample a node’s neighborhood for message passing
 - Each time we compute the embeddings, we can sample a different set of neighbors
 - Greatly reduces computational cost and avoids overfitting
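A toy sketch of fixed-size neighbor sampling (a hypothetical helper using Python's random module):

```python
import random

def sample_neighbors(adj_list, v, k):
    """Randomly sample up to k neighbors of node v for message passing."""
    neighbors = adj_list[v]
    if len(neighbors) <= k:
        return list(neighbors)
    return random.sample(list(neighbors), k)

adj_list = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
print(sample_neighbors(adj_list, 0, 2))  # e.g. [3, 1] -- can differ per call
```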
 
- GCN and GraphSAGE assign the same weight to all neighbors
 - All neighbors are treated as almost equally important to node v
 
 
9. Graph Attention Networks (GAT)
- Assign attention weights $\alpha_{vu}$ to the neighbors
- Not all of a node’s neighbors are equally important; attention focuses on the important parts of the input data and fades out the rest
 - Idea: devote more computing power to the small but important part of the data
 - Goal: Specify arbitrary importance to different neighbors of each node in the graph
 - How?
- Compute embedding of each node in the graph following an attention strategy
 
 - Attention mechanism
- Let $a$ compute attention coefficients $e_{vu}$ across pairs of nodes u, v based on their messages: $e_{vu} = a\left(W^{(l)} h_u^{(l-1)},\, W^{(l)} h_v^{(l-1)}\right)$
 - $e_{vu}$ indicates the importance of u’s message to node v
 - Normalize $e_{vu}$ into the final attention weight $\alpha_{vu}$
 - Use the softmax function, so that $\sum_{u \in N(v)} \alpha_{vu} = 1$: $\alpha_{vu} = \frac{\exp(e_{vu})}{\sum_{k \in N(v)} \exp(e_{vk})}$
 - Weighted sum based on the final attention weights: $h_v^{(l)} = \sigma\left(\sum_{u \in N(v)} \alpha_{vu}\, W^{(l)} h_u^{(l-1)}\right)$
 - What is the form of attention mechanism a?
- The approach is agnostic to the choice of $a$
 - E.g., use a simple single-layer neural network as $a$; $a$ has trainable parameters
 - Parameters of a are trained jointly with weight matrices in an end-to-end fashion
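A compact PyTorch sketch of this mechanism, with a single-layer network playing the role of $a$ (names and shapes are hypothetical; this follows the GAT-style recipe rather than a specific library API):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)  # shared linear transform
        self.a = nn.Linear(2 * out_dim, 1, bias=False)   # single-layer NN as 'a'

    def forward(self, H, A):
        Wh = self.W(H)                                   # (N, out_dim)
        N = Wh.size(0)
        # e_vu = a(W h_v, W h_u) for every pair (v, u)
        pairs = torch.cat([Wh.unsqueeze(1).expand(N, N, -1),
                           Wh.unsqueeze(0).expand(N, N, -1)], dim=-1)
        e = F.leaky_relu(self.a(pairs).squeeze(-1))      # (N, N)
        e = e.masked_fill(A == 0, float('-inf'))         # attend only to neighbors
        alpha = F.softmax(e, dim=1)                      # attention weights per node v
        return torch.relu(alpha @ Wh)                    # weighted sum of messages
```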
 
 
 - Multihead attention
- Allows the model to capture different versions of attention
 - Stabilizes the learning process of attention mechanism
 - Create multiple attention scores (each replica with a different set of parameters) 

 - Outputs are aggregated by concatenation or summation.
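In symbols, each head k has its own attention weights and parameters, and the head outputs are combined (a standard form):

$$h_v^{(l)}[k] = \sigma\Big(\sum_{u \in N(v)} \alpha_{vu}^{k}\, W_k^{(l)} h_u^{(l-1)}\Big), \qquad h_v^{(l)} = \text{AGG}\left(h_v^{(l)}[1], \dots, h_v^{(l)}[K]\right)$$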
 
 
10. Oversmoothing problem in GNN
- Overview
- The issue of stacking many GNN layers
 - All node embeddings converge to the same value
 
 - Receptive Field
- the set of nodes that determine the embedding of the node of interest
 - In a k-layer GNN, each node has a receptive field of its k-hop neighborhood
 - The receptive-field overlap between two nodes (their shared neighbors) grows quickly as we increase the number of hops, i.e., the number of GNN layers
 - If two nodes have highly overlapping receptive fields, then their embeddings are highly similar
 - With many GNN layers, node embeddings become highly similar and suffer from the over-smoothing problem
 
 - Design GNN layer connectivity
- Lesson1: Be cautious when adding GNN layers
- Step 1: Analyze the necessary receptive field to solve your problem
 - Step 2: Set the number of GNN layers L to be a bit more than the receptive field we want
 - Do not set L unnecessarily high
 
 
 - Expressive power for shallow GNN
- How to make a shallow GNN more expressive
 - Idea 1
- Increase the expressive power within each GNN layer
 - Make aggregation/transformation become a deep NN.
 
 - Idea 2
- A GNN does not necessarily only contain GNN layers
 - We can add MLP layers before and after the GNN layers, as pre-processing and post-processing layers
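A minimal sketch of Idea 2 (a hypothetical module; the GNN layers could be, e.g., instances of the SAGELayer sketched earlier):

```python
import torch.nn as nn

class GNNWithMLP(nn.Module):
    def __init__(self, in_dim, hidden, out_dim, gnn_layers):
        super().__init__()
        # Pre-processing MLP: e.g. encode raw node features
        self.pre = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        # A shallow stack of GNN layers, each taking (H, A)
        self.gnns = nn.ModuleList(gnn_layers)
        # Post-processing MLP: e.g. a task-specific head on the embeddings
        self.post = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, out_dim))

    def forward(self, X, A):
        h = self.pre(X)
        for gnn in self.gnns:
            h = gnn(h, A)
        return self.post(h)
```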

 
 
 
