2024-12-06

GNN Notes : LEC 04

Lec 04 PageRank and Link Analysis

1. PageRank

main idea

Regard links as votes : Page is more important if it has more links

Different Importance of links : links from important links are more important

Combination : A point is important if it is pointed by other important pages

Formula

Common Formulation

$r_i$ : the $i$-th node’s rank or score

$d_i$ : the out-degree of $i$-th node
$$
r_j = \sum_{i->j}\frac{r_i}{d_i}
$$
Matrix Formulation

$A_{ij}$ : equal to 1 if edge from node $i$ to $j$ exists, else 0

$sum_by_row(A)$ : a vector means the out-degrees of nodes in graph

Stochastic adjacency matrix M : if $j->i$, $M_{ij} = 1 / d_j$

$M = (A / sum_by_row(A))^T$

so that $r = M \cdot r$

Eigenvector Formulation

we could regard $r = M \cdot r$ as $1\cdot r = M \cdot r$, so that $r$ would be the eigenvector with the eigenvalue 1

2. How to solve PageRank ?

Power Iteration

algorithm :

assign each node an initial page rank
repeat until convergence $\sum_i|r_i^{t+1} - r_i^t| <= \varepsilon \Leftrightarrow |r^{t+1}-r^t|_1 <=varepsilon$:
$$
\begin{align}
&r_j^{t+1} = \sum_{i->j} \frac{r_i^t}{d_i}\
\Leftrightarrow &r = M\cdot r
\end{align}
$$

Problem : converge

problem in converge :

dead ends problem : no out-links

Spider traps : all out-links within the group

solution : teleport

At each time step, with the probability $\beta$, follow a link at random; with the probability $1-\beta$, jump to a random node

commonly, $0.8 < \beta <0.9$

Especially, when surfer arrive at a dead end, it has $1.0$ probability to jump to a random node

Formulation as follow (assume that $\sum_{i} r_i = 1$):
$$
\begin{align}
&r_j=\sum_{i\rightarrow j}\beta\frac{r_i}{d_i} + \frac1N(1-\beta)\
\Leftrightarrow &P = \beta \cdot M + \frac1N(1-\beta)
\end{align}
$$

3. Personalized PageRank and Random Walks with Restarts

PageRank, PPR and RWR

difference : the points could be teleported to

PageRank : teleport to anywhere on the graph

PPR : teleport to only a subset of graph

RWR : always teleport to the staring point

Random Walks with Restarts

Idea :

Every node has some importance
Importance gets evenly split among all edges and pushed to the neighbors

algorithm :

Given a set of query_nodes $Q$, randomly select a node $q_i$ and make a random walk
make a step to a random neighbor and record visit or with $1 -\beta$ probability to restart from $q_j \in Q$
the nodes with the highest visit count have highest proximity to $Q$

item = Q.sample_by_weight()
for i in range(N_STEPS):
	item = item.get_random_neighbor()
    item.visited_time += 1
    if random >= beta :
        item = Q.sample_by_weight()

considers :

multiple connections
multiple paths
directed and undirected connection
degree of node

4. Matrix Factorization

Embedding and Matrix Factorization

Given : measure the similarity of nodes with adjacency matrix $A$

then : $Z^TZ \approx A$

so that : $\arg \min_Z ||A-Z^TZ||_2$

Exactly : we make matrix factorization of adjacency matrix $A$