[그래프데이터분석및응용] Graph Neural Networks
This post is based on the Yonsei University 21-1 UT seminar [그래프데이터분석과응용] (Graph Data Analysis and Applications).
1. Graph Neural Networks
1. Deep Graph Encoders
- multiple layers of non-linear transformations based on graph structure
- Output: node embeddings, subgraph or graph embeddings
- Difficulties
- No fixed node ordering or reference point
- Dynamic and have multimodal features
- Local network neighborhoods
- Stacking multiple layers
2. Setup
- V: vertex set
- A: adjacency matrix (assume binary)
- X: matrix of node features
- v: node in V; N(v): the set of neighbors of v
- Node features
- Social networks: user profiles, user images
- Biological networks: gene expression profiles, gene function information
- When there are no node features in the graph dataset
- Indicator vectors (one-hot encoding of a node)
- Vector of constant 1: [1,1,…,1]
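A minimal sketch of this setup on a toy graph (the 4-node graph, NumPy, and the 2-dimensional features are illustrative assumptions, not from the seminar):

```python
import numpy as np

# Toy undirected graph on V = {0, 1, 2, 3} (illustrative example)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)  # adjacency matrix (binary)
num_nodes = A.shape[0]

# Node feature matrix X: here a random 2-dimensional feature per node
X = np.random.randn(num_nodes, 2)

# When the dataset has no node features, common fallbacks are:
X_onehot = np.eye(num_nodes)        # indicator vectors (one-hot encoding of a node)
X_const = np.ones((num_nodes, 1))   # vector of constant 1: [1, 1, ..., 1]

def neighbors(v):
    """N(v): the set of neighbors of node v."""
    return np.nonzero(A[v])[0]

print(neighbors(2))  # [0 1 3]
```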
3. Naive approach
- Join adjacency matrix and features
- Feed them into a deep neural net
- Issues with this idea
- O(|V|) parameters
- Not applicable to graphs of different sizes
- Even if only the input graph changes, a new neural network has to be trained
- Sensitive to node ordering
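A hedged sketch of what this naive idea looks like (PyTorch, with made-up sizes): the flattened [A, X] input ties both the input dimension and the parameter count to |V|, so a graph of a different size needs a new network, and permuting the node order changes the input.

```python
import torch
import torch.nn as nn

num_nodes, feat_dim, hidden = 4, 2, 16                    # illustrative sizes

A = torch.randint(0, 2, (num_nodes, num_nodes)).float()   # adjacency matrix
X = torch.randn(num_nodes, feat_dim)                      # node features

# Join the adjacency matrix and the features, then flatten into one input vector
inp = torch.cat([A, X], dim=1).flatten()                  # length = |V| * (|V| + feat_dim)

# The first layer only accepts this exact input size, and its weight count grows with |V|;
# reordering the nodes permutes `inp` and therefore changes the output.
mlp = nn.Sequential(nn.Linear(inp.numel(), hidden), nn.ReLU(), nn.Linear(hidden, 1))
out = mlp(inp)
```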
4. Convolutional networks
- Goal: to generalize convolutions beyond simple lattices
- Leverage node features/attributes
- From Images to graphs
- Transform information at the neighbors and combine it
- Transform ‘messages’ h_i from neighbors: W_i * h_i
- Add them up
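In symbols (notation follows the notes: h_i is the message from neighbor i, W_i a learnable transform):

$$
\text{message from neighbor } i:\; W_i\, h_i
\qquad\Longrightarrow\qquad
\text{combined: } \sum_{i \in N(v)} W_i\, h_i
$$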
5. Graph Convolutional networks
- Nodes’ neighborhood defines a computation graph
- Learn how to propagate information across the graph to compute node features
- Local network neighborhoods
- Aggregate neighbors
- generate node embeddings based on local network neighborhoods
- Every node defines a computation graph based on its neighborhood
- Many layers
- Models can be of arbitrary depth
- Nodes have embeddings at each layer
- Layer 0 embedding of node u is its input features, x_u.
- Layer k embedding gets information from nodes that are k hops away.
- Neighborhood Aggregation
- Key distinctions are in how different approaches aggregate information across the layers
- Average information from neighbors, then apply a neural network (a minimal sketch is given at the end of this section)
- Model parameters
- Matrix Formulation
- Aggregation can often be written as (sparse) matrix multiplication, but not every GNN can be expressed in this simple matrix form (e.g., when the aggregation function is complex)
- How to train GNN
- Node embedding is a function of input graph
- Supervised Setting: minimize the loss
- Directly train the model for the supervised task
- Unsupervised setting
- use the graph structure as the supervision
- Similar nodes have similar embeddings
- Node similarity can be defined in many ways, e.g., by random-walk co-occurrence or node proximity in the graph
- Biggest advantage: inductive capability
- Generate embeddings for nodes as needed, even for nodes we never trained on
- The same aggregation parameters are shared across all nodes
- The number of model parameters is sublinear in |V|, so the model can generalize to unseen nodes; the number of parameters to train does not grow with the number of nodes
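A minimal PyTorch sketch of such a layer, as referenced above (my own illustration, not the seminar's code): neighbor embeddings are averaged and passed through shared weight matrices W (for the neighbors) and B (for the node itself), so the number of parameters does not depend on |V| and the same layer can be applied to nodes or graphs never seen during training.

```python
import torch
import torch.nn as nn

class MeanAggLayer(nn.Module):
    """One neighborhood-aggregation layer: average neighbor embeddings, then apply a NN."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)  # transforms the averaged neighbors
        self.B = nn.Linear(in_dim, out_dim, bias=False)  # transforms the node's own embedding

    def forward(self, A, H):
        # A: (N, N) binary adjacency matrix, H: (N, in_dim) embeddings from the previous layer
        deg = A.sum(dim=1, keepdim=True).clamp(min=1)    # |N(v)|, guarded against isolated nodes
        neigh_mean = (A @ H) / deg                       # average information from neighbors
        return torch.relu(self.W(neigh_mean) + self.B(H))

# Layer-0 embeddings are the input features X; stacking k layers reaches the k-hop neighborhood
A = torch.tensor([[0., 1., 1.],
                  [1., 0., 1.],
                  [1., 1., 0.]])
X = torch.randn(3, 4)
layer = MeanAggLayer(in_dim=4, out_dim=8)
H1 = layer(A, X)    # layer-1 embeddings, shape (3, 8)
```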
6. General perspectives on GNN
- GNN layer=message+aggregation
- GCN, GraphSAGE, GAT: how to design the message and define the function to aggregate the messages
- A single GNN Layer
- Compress a set of vectors into a single vector
- Two step process
- Message
- Aggregation
- 1) Message Computation
- Message function
- Intuition: Each node will create a message, which will be sent to other nodes later
- 2) Aggregation
- Intuition: node v aggregates the messages from its neighbors N(v)
- Issue
- information from node v itself could get lost
- computation of h_v(l) does not directly depend on h_v(l-1)
- solution: include h_v(l-1) when computing h_v(l)
- Solution on the message side: also compute a message from node v itself
- a different message computation will be performed for node v itself
- Solution on the aggregation side: after aggregating from the neighbors, we can also aggregate the message from node v itself, via concatenation or summation
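Putting the two steps together, a single GNN layer can be written generically as (standard notation; MSG and AGG are the layer's message and aggregation functions):

$$
m_u^{(l)} = \text{MSG}^{(l)}\big(h_u^{(l-1)}\big), \quad u \in N(v) \cup \{v\}
\qquad\qquad
h_v^{(l)} = \text{AGG}^{(l)}\Big(\big\{\, m_u^{(l)} : u \in N(v) \,\big\},\; m_v^{(l)}\Big)
$$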
7. Classical GNN Layers: GCN
- GCN
- Message
- Aggregation: sum over the messages from neighbors, then apply the activation function
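Written out (standard GCN formulation, with W^(l) the layer's weight matrix and σ the activation function):

$$
m_u^{(l)} = \frac{1}{|N(v)|}\, W^{(l)} h_u^{(l-1)},
\qquad
h_v^{(l)} = \sigma\Big(\sum_{u \in N(v)} m_u^{(l)}\Big)
$$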
8. GraphSAGE
- GraphSAGE (SAmple and aggreGatE)
- Message is computed within the AGG.
- Two-stage aggregation
- Stage 1: Aggregate from node neighbors
- Stage 2: Further aggregate over the node itself (the full update is written out after this list)
- Neighbor Aggregation
- Mean: take a weighted average of the neighbors
- Pool: transform the neighbor vectors and apply a symmetric vector function (mean, max, sum, or std)
- LSTM: apply an LSTM to a reshuffled ordering of the neighbors
- L2 Normalization
- Apply L2 normalization to h_v(l) at every layer, so each embedding is normalized to a unit-length vector
- Without L2 normalization, the embedding vectors can have very different scales
- Neighborhood sampling
- Randomly sample a node’s neighborhood for message passing
- When we compute the embeddings, we can sample different neighbors
- greatly reduces computational cost and avoids overfitting
- GCN and GraphSAGE assign the same weight to every neighbor
- All neighbors are almost equally important to node v
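Written out (following the standard GraphSAGE formulation), the two-stage update referenced above first aggregates the neighbors, then concatenates with the node's own embedding, and optionally L2-normalizes:

$$
h_{N(v)}^{(l)} = \text{AGG}\big(\{\, h_u^{(l-1)}, \;\forall u \in N(v) \,\}\big)
$$
$$
h_v^{(l)} = \sigma\Big( W^{(l)} \cdot \text{CONCAT}\big(h_v^{(l-1)},\, h_{N(v)}^{(l)}\big) \Big),
\qquad
h_v^{(l)} \leftarrow \frac{h_v^{(l)}}{\lVert h_v^{(l)} \rVert_2}
$$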
9. Graph Attention Networks (GAT)
- Assign attention weights to neighbors
- Not all of a node's neighbors are equally important: attention focuses on the important parts of the input data and fades out the rest
- Idea: more computing power on that small but important part of the data
- Goal: Specify arbitrary importance to different neighbors of each node in the graph
- How?
- Compute embedding of each node in the graph following an attention strategy
- Attention mechanism
- Let an attention mechanism a compute attention coefficients e_vu across pairs of nodes u, v based on their messages
- e_vu indicates the importance of u's message to node v
- Normalize e_vu into the final attention weight α_vu
- Use the softmax function, so that the attention weights over v's neighbors sum to 1 (the formulas are written out after this list)
- Weighted sum based on the final attention weight
- What is the form of attention mechanism a?
- The approach is agnostic to the choice of a
- e.g., use a simple single-layer neural network as a; a has trainable parameters
- Parameters of a are trained jointly with weight matrices in an end-to-end fashion
- Multihead attention
- Allows the model to capture different versions of attention
- Stabilizes the learning process of the attention mechanism
- Create multiple attention scores (each replica with a different set of parameters)
- Outputs are aggregated by concatenation or summation.
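Written out (standard GAT formulation; a is the trainable attention mechanism from the list above):

$$
e_{vu} = a\big(W^{(l)} h_u^{(l-1)},\, W^{(l)} h_v^{(l-1)}\big),
\qquad
\alpha_{vu} = \frac{\exp(e_{vu})}{\sum_{k \in N(v)} \exp(e_{vk})},
\qquad
h_v^{(l)} = \sigma\Big(\sum_{u \in N(v)} \alpha_{vu}\, W^{(l)} h_u^{(l-1)}\Big)
$$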
10. Oversmoothing problem in GNN
- Overview
- The issue of stacking many GNN layers
- All node embeddings converge to the same value
- Receptive Field
- the set of nodes that determine the embedding of the node of interest
- In a k-layer GNN, each node has a receptive field of its k-hop neighborhood
- The receptive-field overlap between two nodes (their shared neighbors) grows quickly as we increase the number of hops (in other words, the number of GNN layers)
- If two nodes have highly overlapping receptive fields, then their embeddings are highly similar
- If a GNN has many layers, node embeddings become highly similar and suffer from the over-smoothing problem
- Design GNN layer connectivity
- Lesson 1: Be cautious when adding GNN layers
- Step 1: Analyze the necessary receptive field to solve your problem
- Step 2: Set the number of GNN layers L to be a bit more than the receptive field we like
- Do not set L unnecessarily high
- Expressive power of shallow GNNs
- How to make a shallow GNN more expressive
- Idea 1
- Increase the expressive power within each GNN layer
- Make the aggregation/transformation a deep neural network
- Idea 2
- A GNN does not necessarily only contain GNN layers
- We can add MLP layers before and after GNN layers, as pre- and post-process layers
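A hedged sketch of Idea 2 in PyTorch (my own illustration; `SimpleGNNLayer` is a toy mean-aggregation stand-in for any GCN/GraphSAGE/GAT layer): an MLP pre-processes the raw features, a small number of GNN layers do the message passing, and an MLP post-processes the node embeddings into predictions.

```python
import torch
import torch.nn as nn

class SimpleGNNLayer(nn.Module):
    """Toy mean-aggregation layer standing in for any GNN layer (GCN, GraphSAGE, GAT)."""
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Linear(dim, dim)

    def forward(self, A, H):
        deg = A.sum(dim=1, keepdim=True).clamp(min=1)
        return torch.relu(self.W((A @ H) / deg + H))

class GNNWithMLPs(nn.Module):
    """Idea 2: MLP pre-processing -> a few GNN layers -> MLP post-processing."""
    def __init__(self, in_dim, hidden, out_dim, num_gnn_layers=2):
        super().__init__()
        self.pre = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())   # pre-process layer
        self.gnn_layers = nn.ModuleList(
            [SimpleGNNLayer(hidden) for _ in range(num_gnn_layers)])     # keep this shallow
        self.post = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, out_dim))            # post-process layers

    def forward(self, A, X):
        H = self.pre(X)
        for layer in self.gnn_layers:   # few layers -> limited receptive field, less over-smoothing
            H = layer(A, H)
        return self.post(H)

A = torch.tensor([[0., 1.], [1., 0.]])
X = torch.randn(2, 5)
model = GNNWithMLPs(in_dim=5, hidden=16, out_dim=3)
print(model(A, X).shape)  # torch.Size([2, 3])
```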