GNN Notes : LEC 03

Lec 03 Node Embedding

1. Overview

Goal : Efficient task-independent feature learning for ML with graphs

Why Embedding ?

  1. similarity in the embedding space reflects similarity in the original network
  2. embeddings encode network information
  3. embeddings can be used for many downstream tasks

2. Encoder and Decoder

```mermaid
graph TB
  subgraph Embedding Space
    a((ZA))
    b((ZB))
    c((ZC))
    d((ZD))
    u((ZU))
    v((ZV))
  end
  subgraph Origin Network
    1((A))
    2((B))
    3((C))
    4((D))
    5((U))
    6((V))
    1---2
    1---3
    1---4
    5---1
    6---1
  end
  ARROW[Embedding To]
```

Learning Node Embedding

  1. the encoder maps each node to an embedding
  2. define a node similarity function in the original network
  3. the decoder maps pairs of embeddings to a similarity score
  4. optimize the encoder parameters so that similarity in the embedding space (e.g. $z_u^T z_v$) approximates similarity in the original network

3. Shallow Encoding : Embedding lookup

$$
\mathrm{Enc}(v_i) = z_{v_i} = Z \cdot v_i \\
(v_i)_j = \left\{ \begin{array}{ll}
0, & i\ne j\\
1, & i=j
\end{array} \right. \\
Z \in \mathbb{R}^{d\times |V|}
$$

Weakness : need to optimize a large number of parameters ($d \times |V|$: one $d$-dimensional vector per node, with no parameter sharing)
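A minimal sketch of the lookup, assuming a NumPy matrix `Z` with one column per node; `encode` and the toy sizes are illustrative:

```python
import numpy as np

# Assumed toy sizes: |V| = 6 nodes, embedding dimension d = 4.
num_nodes, d = 6, 4
Z = np.random.randn(d, num_nodes)    # the parameters to optimize: one column per node

def encode(i: int) -> np.ndarray:
    """Shallow encoding: multiply Z by the one-hot indicator vector of node i."""
    one_hot = np.zeros(num_nodes)
    one_hot[i] = 1.0
    return Z @ one_hot               # identical to the column lookup Z[:, i]

z_u = encode(2)                      # embedding of node 2
```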

4. Random Walk : Overview

Algorithm of Random Walk

  1. Run short fixed-length random walks starting from each node $u$ with walk strategy $R$

  2. Collect $N_R(u)$ for each node $u$, where $N_R(u)$ is the multiset of all nodes visited on the walks started from $u$ in step 1 (see the sketch after this list)

  3. Optimize the loss function below with stochastic gradient descent
    $$
    \arg \min_{\{z_u \mid u\in V\}} \sum_{u \in V} \sum_{v \in N_R(u)} - \log \frac{\exp(z_u^Tz_v)}{\sum_{n\in V} \exp(z_u^Tz_n)}
    $$
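A minimal sketch of steps 1 and 2, assuming the graph is stored as an adjacency-list `dict`; `random_walk` and `collect_neighborhoods` are illustrative names, and the strategy here is the unbiased (DeepWalk-style) one:

```python
import random
from collections import defaultdict

def random_walk(adj, start, length):
    """Unbiased fixed-length random walk: at each step, jump to a uniformly random neighbor."""
    walk = [start]
    for _ in range(length):
        walk.append(random.choice(adj[walk[-1]]))
    return walk

def collect_neighborhoods(adj, num_walks=10, length=5):
    """N_R(u): multiset of nodes visited by walks started from u."""
    N_R = defaultdict(list)
    for u in adj:
        for _ in range(num_walks):
            N_R[u].extend(random_walk(adj, u, length)[1:])   # exclude the start node itself
    return N_R

# Toy undirected graph from the figure above, as an adjacency list.
adj = {"A": ["B", "C", "D", "U", "V"], "B": ["A"], "C": ["A"],
       "D": ["A"], "U": ["A"], "V": ["A"]}
print(collect_neighborhoods(adj)["B"][:5])
```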

Weakness : computationally expensive

Evaluating the loss naively costs $O(|V|^2)$: the softmax denominator sums over all nodes, for every (node, neighbor) term

Solution : Negative Sampling

To avoid computing the full softmax denominator, approximate each term as follows
$$
\log \frac{\exp(z_u^Tz_v)}{\sum_{n\in V} \exp(z_u^Tz_n)} \approx \log (\sigma(z_u^Tz_v)) - \sum_{i=1}^k \log(\sigma(z_u^Tz_{n_i})) \\
\sigma(x) = \frac{1}{1+\exp(-x)}
$$
Here $n_i$ is a negative node sampled from a background distribution over nodes (in practice biased towards higher-degree nodes)

In practice, negative sampling works well with $k$ in the range $[5,20]$
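A minimal sketch of the negative-sampling objective for one $(u,v)$ pair, assuming node embeddings are rows of a NumPy matrix `Z`; the negative distribution is plain uniform here for brevity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(Z, u, v, k=5, rng=np.random.default_rng(0)):
    """-[ log sigma(z_u^T z_v) - sum_i log sigma(z_u^T z_{n_i}) ], following the approximation above."""
    pos = np.log(sigmoid(Z[u] @ Z[v]))
    negatives = rng.integers(0, Z.shape[0], size=k)   # k uniform negative samples (simplification)
    neg = np.log(sigmoid(Z[u] @ Z[negatives].T)).sum()
    return -(pos - neg)

Z = np.random.default_rng(1).normal(size=(6, 8))      # toy values: 6 nodes, d = 8
print(neg_sampling_loss(Z, u=0, v=1))
```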

5. The Random Walk Strategy

5.1 DeepWalk

Simply run fixed-length, unbiased random walks starting from each node

Weakness : the notion of node similarity is too constrained

5.2 Node2Vec

Idea : use flexible, biased random walks that can trade off between local and global views of the network, where the local view is explored by BFS-like walks and the global view by DFS-like walks

Hyperparameters

  1. return parameter $p$ : controls the probability of walking back to the previous node
  2. in-out parameter $q$ : controls the ratio of BFS-like (inward) vs. DFS-like (outward) steps

Example for node2vec

For the graph below, assume the previous node is $A$ and the current node is $B$

```mermaid
graph LR
  1((A))
  subgraph Distance1
    2((B))
    3((C))
  end
  subgraph Distance2
    5((E))
    6((F))
  end
  1---2
  1---3
  2---3
  2---5
  2---6
```

The next node is sampled from the following distribution :
$$
\text{next node} = \left\{\begin{array}{ll}
A, & \text{prob} = 1/p \\
C, & \text{prob} = 1 \\
E, & \text{prob} = 1/q \\
F, & \text{prob} = 1/q
\end{array}\right.
$$
Attention : the probabilities are not normalized. Returning to $A$ gets weight $1/p$; $C$ is at the same distance from the previous node $A$ as the current node $B$, so it gets weight $1$; $E$ and $F$ are farther from $A$, so they get weight $1/q$
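A minimal sketch of computing these unnormalized weights, assuming an adjacency-list graph; `node2vec_weights` is an illustrative name and the example reproduces the figure above:

```python
def node2vec_weights(adj, prev, curr, p, q):
    """Unnormalized node2vec transition weights out of `curr`, given the previous node `prev`."""
    weights = {}
    for x in adj[curr]:
        if x == prev:                 # distance 0 from prev: return
            weights[x] = 1.0 / p
        elif prev in adj[x]:          # distance 1 from prev: stay close (BFS-like)
            weights[x] = 1.0
        else:                         # distance 2 from prev: move away (DFS-like)
            weights[x] = 1.0 / q
    return weights

adj = {"A": ["B", "C"], "B": ["A", "C", "E", "F"], "C": ["A", "B"],
       "E": ["B"], "F": ["B"]}
print(node2vec_weights(adj, prev="A", curr="B", p=2.0, q=0.5))
# -> {'A': 0.5, 'C': 1.0, 'E': 2.0, 'F': 2.0}
```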

6. Entire Graph Embedding

6.1 Sum or average

algorithm :

  1. run a standard node embedding technique on the graph $G$
  2. sum or average all node embeddings in $G$ to obtain $z_G$ (see the sketch below)
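A minimal sketch, assuming node embeddings are rows of a NumPy matrix `Z`:

```python
import numpy as np

Z = np.random.default_rng(0).normal(size=(6, 8))   # toy node embeddings: 6 nodes, d = 8
z_G_sum = Z.sum(axis=0)                            # sum variant
z_G_avg = Z.mean(axis=0)                           # average variant
```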

6.2 Virtual Node

algorithm :

  1. add a virtual node that represents the entire graph (or a subgraph) and connect it by edges to the nodes it stands for (see the sketch below)
  2. run a standard node embedding technique on the augmented graph and use the virtual node’s embedding as the graph embedding
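A minimal sketch of the graph augmentation step, assuming networkx is available; the node name `"__graph__"` is just an illustrative label for the virtual node:

```python
import networkx as nx

G = nx.karate_club_graph()                             # any example graph
virtual = "__graph__"                                  # virtual node standing in for the whole graph
G_aug = G.copy()
G_aug.add_node(virtual)
G_aug.add_edges_from((virtual, v) for v in G.nodes)    # connect it to every original node

# Run any node embedding method on G_aug and read off the embedding of `virtual` as z_G.
```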

6.3 Anonymous Walk Embedding : v1

algorithm :

  1. sample random walks of a fixed length on the graph $G$
  2. anonymization : replace each node in a walk by the order in which it first appears in that walk (the first distinct node becomes 1, the second becomes 2, and so on) — see the sketch below
  3. count how often each distinct anonymous walk occurs among the walks produced in step 2
  4. the embedding $Z_G[i]$ is the count or empirical probability of the $i$-th anonymous walk

Dimension of the embedding $Z_G$ :

The number of distinct anonymous walks depends on the walk length, so the dimension of $Z_G$ depends on the walk length too (for example, there are 5 distinct anonymous walks of length 3, giving a 5-dimensional $Z_G$)
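A minimal sketch of the anonymization and counting steps, assuming walks are given as lists of node names; `anonymize` is an illustrative helper:

```python
from collections import Counter

def anonymize(walk):
    """Map each node to the order of its first occurrence: ['C', 'B', 'C', 'A'] -> (1, 2, 1, 3)."""
    first_seen = {}
    anon = []
    for node in walk:
        if node not in first_seen:
            first_seen[node] = len(first_seen) + 1
        anon.append(first_seen[node])
    return tuple(anon)

walks = [["A", "B", "A"], ["C", "D", "C"], ["A", "B", "C"]]
counts = Counter(anonymize(w) for w in walks)
total = sum(counts.values())
z_G = {walk_type: c / total for walk_type, c in counts.items()}
print(z_G)   # {(1, 2, 1): 2/3, (1, 2, 3): 1/3}
```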

How many random walks $m$ do we need?

We want the estimated distribution to deviate from the true one by more than $\varepsilon$ with probability less than $\delta$; this holds if
$$
m = \left\lceil \frac{2}{\varepsilon^2}\bigl(\log(2^\eta-2)-\log(\delta)\bigr) \right\rceil
$$
where $\eta$ is the number of distinct anonymous walks of the chosen length
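A small worked example under assumed parameter choices (walk length 7, for which $\eta = 877$ distinct anonymous walks, $\varepsilon = 0.1$, $\delta = 0.01$):

```python
import math

def num_walks_needed(eta, eps, delta):
    """m = ceil( (2 / eps^2) * (log(2^eta - 2) - log(delta)) )."""
    # log(2^eta - 2) ~= eta * log(2) for large eta; the -2 is negligible.
    return math.ceil((2.0 / eps**2) * (eta * math.log(2) - math.log(delta)))

# Assumed example values: walk length 7 gives eta = 877 distinct anonymous walks.
print(num_walks_needed(eta=877, eps=0.1, delta=0.01))   # 122500
```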

6.4 Anonymous Walk Embedding : v2

Notation :

$\eta$ : the number of distinct anonymous walk types (the softmax below normalizes over all $\eta$ of them)

$Z$ : the walk embeddings; $z_i$ is the embedding of the $i$-th anonymous walk type

$\Delta$ : the window size

$z_G$ : the embedding of the entire graph $G$

$w_{i,j}$ : the $j$-th anonymous walk in the sequence sampled starting from node $i$

algorithm :

learn the embeddings of anonymous walks $Z = \{z_i \mid i=1,2,\dots,\eta\}$ as well as $z_G$

a) for each node $i$, sample $T$ anonymous walks starting from it, giving a sequence $[w_{i,1},w_{i,2},\dots,w_{i,T}]$

b) for a window of size $\Delta$, learn to predict walks that co-occur within the window (see the sketch below)
$$
\arg \min_{Z,z_G,b,U} \sum_{i \in G }\sum_{t=\Delta}^{T-\Delta} -\log P(w_{i,t}\mid \{w_{i,t-\Delta},\dots,w_{i,t+\Delta}\},z_G) \\
P(w_{i,t}\mid \{w_{i,t-\Delta},\dots,w_{i,t+\Delta}\},z_G) = \frac{\exp(y(w_{i,t}))}{\sum_{j=1}^\eta \exp(y(w_j))} \\
y(w_{i,t}) = b + U \cdot \mathrm{cat}\Bigl(\frac 1 {2\Delta}\sum_{k=-\Delta,\,k\ne 0}^{\Delta} z_{w_{i,t+k}},\ z_G\Bigr)
$$
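A minimal NumPy sketch of the scoring function $y$ and the softmax over walk types, under assumed toy shapes; `scores`, `predict_proba`, and `context_ids` are illustrative names, and the optimization (the $\arg\min$) is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
eta, d = 5, 8                       # toy sizes: 5 anonymous walk types, embedding dimension 8
Z   = rng.normal(size=(eta, d))     # anonymous walk embeddings z_1 .. z_eta
z_G = rng.normal(size=d)            # graph embedding
U   = rng.normal(size=(eta, 2 * d)) # linear layer mapping cat(context, z_G) to eta scores
b   = np.zeros(eta)

def scores(context_ids):
    """y(w) for every walk type w, given the walk-type ids inside the window."""
    context = Z[context_ids].mean(axis=0)          # (1 / 2*Delta) * sum of context embeddings
    return b + U @ np.concatenate([context, z_G])  # cat(...) followed by the linear layer

def predict_proba(context_ids):
    y = scores(context_ids)
    return np.exp(y) / np.exp(y).sum()             # softmax over the eta walk types

print(predict_proba(context_ids=[0, 2, 2, 4]))     # predicted probability of each walk type
```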

7. How to use Embeddings

  • Clustering of the points $z_i$

    • community detection
  • Node classification based on $z_i$

    • predict the label of node $i$ from $z_i$
  • Link prediction : predict edge $(i,j)$ based on $(z_i,z_j)$, combining the pair with one of the operators below (see the sketch after this list)

    • Concatenate : $[z_i,z_j]$
    • Hadamard : $z_i * z_j$ (element-wise product)
    • Sum / Avg : $z_i + z_j$
    • Distance : $||z_i-z_j||_2$
  • Graph classification based on the graph embedding $z_G$
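A minimal sketch of the four pairwise operators for link prediction features, assuming $z_i$ and $z_j$ are NumPy vectors:

```python
import numpy as np

z_i = np.array([1.0, 2.0, 3.0])
z_j = np.array([0.5, 0.0, -1.0])

concat   = np.concatenate([z_i, z_j])      # [z_i, z_j]
hadamard = z_i * z_j                       # element-wise product
average  = (z_i + z_j) / 2                 # sum / avg
distance = np.linalg.norm(z_i - z_j)       # ||z_i - z_j||_2

# Any of these can be fed to a classifier that predicts whether edge (i, j) exists.
```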