Lec 04 PageRank and Link Analysis
1. PageRank
main idea
Regard links as votes : Page is more important if it has more links
Different Importance of links : links from important links are more important
Combination : A point is important if it is pointed by other important pages
Formula
Common Formulation
$r_i$ : the $i$-th node’s rank or score
$d_i$ : the out-degree of $i$-th node
$$
r_j = \sum_{i->j}\frac{r_i}{d_i}
$$
Matrix Formulation
$A_{ij}$ : equal to 1 if edge from node $i$ to $j$ exists, else 0
$sum_by_row(A)$ : a vector means the out-degrees of nodes in graph
Stochastic adjacency matrix M : if $j->i$, $M_{ij} = 1 / d_j$
$M = (A / sum_by_row(A))^T$
so that $r = M \cdot r$
Eigenvector Formulation
we could regard $r = M \cdot r$ as $1\cdot r = M \cdot r$, so that $r$ would be the eigenvector with the eigenvalue 1
2. How to solve PageRank ?
Power Iteration
algorithm :
assign each node an initial page rank
repeat until convergence $\sum_i|r_i^{t+1} - r_i^t| <= \varepsilon \Leftrightarrow |r^{t+1}-r^t|_1 <=varepsilon$:
$$
\begin{align}
&r_j^{t+1} = \sum_{i->j} \frac{r_i^t}{d_i}\
\Leftrightarrow &r = M\cdot r
\end{align}
$$
Problem : converge
problem in converge :
dead ends problem : no out-links
Spider traps : all out-links within the group
solution : teleport
At each time step, with the probability $\beta$, follow a link at random; with the probability $1-\beta$, jump to a random node
commonly, $0.8 < \beta <0.9$
Especially, when surfer arrive at a dead end, it has $1.0$ probability to jump to a random node
Formulation as follow (assume that $\sum_{i} r_i = 1$):
$$
\begin{align}
&r_j=\sum_{i\rightarrow j}\beta\frac{r_i}{d_i} + \frac1N(1-\beta)\
\Leftrightarrow &P = \beta \cdot M + \frac1N(1-\beta)
\end{align}
$$
3. Personalized PageRank and Random Walks with Restarts
PageRank, PPR and RWR
difference : the points could be teleported to
PageRank : teleport to anywhere on the graph
PPR : teleport to only a subset of graph
RWR : always teleport to the staring point
Random Walks with Restarts
Idea :
- Every node has some importance
- Importance gets evenly split among all edges and pushed to the neighbors
algorithm :
- Given a set of query_nodes $Q$, randomly select a node $q_i$ and make a random walk
- make a step to a random neighbor and record visit or with $1 -\beta$ probability to restart from $q_j \in Q$
- the nodes with the highest visit count have highest proximity to $Q$
1 | item = Q.sample_by_weight() |
considers :
- multiple connections
- multiple paths
- directed and undirected connection
- degree of node
4. Matrix Factorization
Embedding and Matrix Factorization
Given : measure the similarity of nodes with adjacency matrix $A$
then : $Z^TZ \approx A$
so that : $\arg \min_Z ||A-Z^TZ||_2$
Exactly : we make matrix factorization of adjacency matrix $A$