Lec 02 Traditional Feature-based Methods for Graphs
1. Node-Level Features
Node Degree
category : importance-based, structure-based
weakness : treats all neighboring nodes equally
Node Centrality
category : importance-based
core : measure how important a node is in the graph
eigenvector centrality
core : v is important if its neighboring nodes are important
formula :
$$
\begin{align}
&c_v = \frac{1}{\lambda} \sum_{u\in N(v)} c_u \\
\Leftrightarrow\ &\lambda c = A c
\end{align}
$$
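A quick numpy sketch (the 4-node adjacency matrix below is a made-up example): power iteration converges to the leading eigenvector of $A$, which is exactly the fixed point of $\lambda c = Ac$.

```python
import numpy as np

# Power iteration: for a connected graph, repeated multiplication by A
# converges to the leading eigenvector, i.e. the eigenvector centrality.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)  # toy 4-node graph

c = np.ones(A.shape[0])
for _ in range(100):
    c = A @ c
    c /= np.linalg.norm(c)   # renormalize so c does not blow up

lam = c @ A @ c              # Rayleigh quotient estimates lambda
print(np.round(c, 3), round(lam, 3))
```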
betweenness centrality
core : v is important if v lies on many shortest paths between two nodes
formula :
use $p(s,t)$ to denote the number of shortest paths between $s$ and $t$
use $p(s,t,v)$ to denote the number of shortest paths between $s$ and $t$ that pass through $v$
$$
c_v = \sum_{s\ne v\ne t}\frac{p(s,t,v)}{p(s,t)}
$$
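A sketch using networkx (not part of the lecture notes), on a 5-node path graph where the middle node should score highest:

```python
import networkx as nx

# Unnormalized betweenness: for each v, sum p(s, t, v) / p(s, t)
# over all pairs s != v != t.
G = nx.path_graph(5)  # 0-1-2-3-4
print(nx.betweenness_centrality(G, normalized=False))
# node 2 lies on the most shortest paths, so it gets the largest score
```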
closeness centrality
core : v is important if v has small shortest path lengths to all other nodes
formula :
use $d(u,v)$ to denote the shortest path length between nodes $u$ and $v$
$$
c_v = \frac{1}{\sum_{u\ne v} d(u,v)}
$$
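A direct sketch of the unnormalized formula above, using networkx only for BFS distances:

```python
import networkx as nx

# c_v = 1 / (sum of shortest-path lengths from v to all other nodes)
G = nx.path_graph(5)
for v in G:
    dist = nx.single_source_shortest_path_length(G, v)  # BFS from v
    print(v, 1 / sum(d for u, d in dist.items() if u != v))
```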
Clustering coefficient
category : structure-based
formula :
use $e(N(v))$ to denote the number of edges among the neighbors $N(v)$, and $k_v$ the degree of $v$
$$
c_v = \frac{e(N(v))}{\binom{k_v}{2}}
$$
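A sketch that computes the formula by hand and checks it against networkx's built-in clustering coefficient (node 0 of the karate club graph is an arbitrary choice):

```python
import networkx as nx
from itertools import combinations

G = nx.karate_club_graph()
v = 0
nbrs = list(G[v])
k = len(nbrs)                      # k_v: degree of v
e = sum(1 for u, w in combinations(nbrs, 2) if G.has_edge(u, w))
print(e / (k * (k - 1) / 2))       # e(N(v)) / C(k_v, 2)
print(nx.clustering(G, v))         # networkx agrees
```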
Graphlet Degree Vector
category : structure-based
core : count, for each graphlet type, the number of graphlets that node $v$ participates in; this count vector (the GDV) is $v$'s feature
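A toy sketch of the idea, restricted to graphlets on at most 3 nodes (the full GDV uses larger graphlets and many more orbits; the 4 orbits below are an illustration, not that vector):

```python
import networkx as nx

# 4 orbits for graphlets on <= 3 nodes:
#   o0: edges at v (degree)           o1: v is an endpoint of an open 2-path
#   o2: v is the center of a 2-path   o3: v is a corner of a triangle
def mini_gdv(G, v):
    k = G.degree(v)
    tri = nx.triangles(G, v)                            # triangles containing v
    center = k * (k - 1) // 2 - tri                     # open wedges centered at v
    end = sum(G.degree(u) - 1 for u in G[v]) - 2 * tri  # open 2-paths ending at v
    return [k, end, center, tri]

print(mini_gdv(nx.karate_club_graph(), 0))
```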
2. Link-Level Features
Task of Link Prediction
Given : existing links
Aim : predict new links
Key : design features that describe a pair of nodes
Formulations of link prediction
links missing at random
core : self-supervised
methodology :
- remove links randomly
- predict them
links over time
core : predict future links from the past graph (for networks that evolve over time)
Given : $G[t_0,t_0']$, the graph on edges up to time $t_0'$
Output : a ranked list $L$ of links predicted to appear in $G[t_1,t_1']$
Evaluation : take the top $n$ links in $L$ and count how many of them actually appear in $G[t_1,t_1']$
Methodology :
- for each pair $(x,y)$, compute its score $c(x,y)$
- sort pairs in decreasing order of $c(x,y)$
- select the top $n$ pairs as predicted links (a minimal pipeline is sketched below)
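A minimal version of this pipeline (common-neighbor count stands in for $c(x,y)$; any pairwise score plugs in the same way):

```python
import networkx as nx
from itertools import combinations

G = nx.karate_club_graph()

def score(G, x, y):
    return len(set(G[x]) & set(G[y]))   # c(x, y): common-neighbor count

# rank all non-edges by score and keep the top n as predicted links
candidates = [(x, y) for x, y in combinations(G, 2) if not G.has_edge(x, y)]
ranked = sorted(candidates, key=lambda p: score(G, *p), reverse=True)
print(ranked[:5])
```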
Distance-based Feature
mainly includes : the shortest path distance between the two nodes
use $s(u,v)$, the length of the shortest path between nodes $u$ and $v$, as the feature of the pair $(u,v)$
weakness : does not capture the degree of neighborhood overlap between the two nodes
Local Neighborhood Overlap
common neighbors
$|N(v_1)\cap N(v_2)|$
Jaccard’s coefficient
$\frac{|N(v_1)\cap N(v_2)|}{|N(v_1)\cup N(v_2)|}$
Adamic-Adar index
also encodes the importance of each common neighbor: a low-degree common neighbor contributes more
$\sum_{u\in N(v_1)\cap N(v_2)} \frac 1{\log(k_u)}$
Limitation : all of these metrics are 0 when two nodes have no common neighbors, even though such nodes may still become connected in the future
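The three scores written out with plain sets (networkx also ships nx.jaccard_coefficient and nx.adamic_adar_index; the node pair is arbitrary):

```python
import math
import networkx as nx

G = nx.karate_club_graph()
v1, v2 = 0, 33
N1, N2 = set(G[v1]), set(G[v2])
common = N1 & N2

print(len(common))                                     # common neighbors
print(len(common) / len(N1 | N2))                      # Jaccard's coefficient
print(sum(1 / math.log(G.degree(u)) for u in common))  # Adamic-Adar index
```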
Global Neighborhood Overlap
core : count the paths of all lengths between a given pair of nodes
calculation : use powers of the adjacency matrix, since $(A^l)_{uv}$ counts walks of length $l$ between $u$ and $v$
Katz index’s formulation :
$$
\begin{align}
S_{uv} &= \sum_{l=1}^{\infty} \beta^l (A^l)_{uv} \\
\Leftrightarrow\ S &= (I-\beta A)^{-1} - I,\quad 0< \beta < 1
\end{align}
$$
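A sketch of the closed form (the series only converges when $\beta < 1/\lambda_{\max}(A)$, which is why the code picks $\beta$ from the spectrum rather than hard-coding it):

```python
import numpy as np
import networkx as nx

G = nx.karate_club_graph()
A = nx.to_numpy_array(G)
beta = 0.9 / max(abs(np.linalg.eigvals(A)))  # safely inside the convergence radius

# S = (I - beta*A)^{-1} - I sums beta^l * A^l over all lengths l >= 1
I = np.eye(len(A))
S = np.linalg.inv(I - beta * A) - I
print(round(float(S[0, 33]), 4))             # Katz score for the pair (0, 33)
```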
3. Graph-Level Features
Kernel Methods
Idea : design kernels instead of feature vectors
Introduction :
$K(G,G')$ measures the similarity between $G$ and $G'$
the kernel matrix $\mathbf{K}=\left(K(G,G')\right)_{G,G'}$ must be positive semidefinite
there exists a feature representation $\phi(\cdot)$ such that $K(G,G') = \phi(G)^T \phi(G')$
Core Idea of Graph Kernels
treat a graph like a document: the main idea is a bag-of-words style representation, where the "words" are node-level structures
Graphlet Features
Main Idea : Bags of Graphlets
Given : graph $G$, graphlet list $\mathcal{G}_k = (g_1,g_2,\dots,g_{n_k})$, graphlet count vector $f_G \in \mathbb{R}^{n_k}$, where $(f_G)_i = \#(g_i \subseteq G)$
Output : $K(G,G') = f_G^T f_{G'}$
Problem : if $G$ and $G'$ have different sizes, the raw counts greatly skew the kernel value
Solution : normalize with $h_G = \frac{f_G}{\mathrm{sum}(f_G)}$ and set $K(G,G') = h_G^T h_{G'}$
Limitation : counting graphlets is expensive (enumerating size-$k$ graphlets takes $O(n^k)$ time, and subgraph isomorphism is NP-hard in general)
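Given precomputed count vectors, the kernel itself is just an inner product; the numbers below are made up to show how normalization removes the size skew:

```python
import numpy as np

f_G  = np.array([12.0, 5.0, 3.0, 1.0])     # assumed counts for g_1..g_4 in G
f_Gp = np.array([120.0, 48.0, 31.0, 9.0])  # assumed counts for a much larger G'

print(f_G @ f_Gp)                          # raw kernel: dominated by graph size
h_G, h_Gp = f_G / f_G.sum(), f_Gp / f_Gp.sum()
print(h_G @ h_Gp)                          # normalized kernel K(G, G')
```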
Weisfeiler-Lehman Kernel
Idea : use neighborhood structure to iteratively enrich the node vocabulary ("bag of colors")
Methodology :
Given : a graph $G$ with node set $V$
Init : assign an initial color $c^{(0)}(v)$ to every node $v$
for $K$ steps :
$c^{(k+1)}(v) = \mathrm{HASH}\left(\left\{c^{(k)}(v), \left\{c^{(k)}(u)\right\}_{u\in N(v)}\right\}\right)$
after $K$ steps, count the number of nodes of each color; the kernel $K(G,G')$ is the inner product of the two graphs' color-count vectors
Time Complexity : $O(K|E|)$, linear in the number of edges
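A sketch of one refinement scheme: the HASH is implemented by interning each (own color, sorted neighbor colors) signature into a fresh integer. To compare two graphs with the kernel, the interning table would have to be shared between them.

```python
import networkx as nx
from collections import Counter

def wl_color_counts(G, K=3):
    color = {v: 1 for v in G}  # c^(0): every node starts with the same color
    for _ in range(K):
        # signature = own color plus the multiset of neighbor colors
        sig = {v: (color[v], tuple(sorted(color[u] for u in G[v]))) for v in G}
        table = {s: i for i, s in enumerate(sorted(set(sig.values())))}
        color = {v: table[sig[v]] for v in G}  # HASH(...) -> compact new color
    return Counter(color.values())             # the bag of colors

print(wl_color_counts(nx.karate_club_graph()))
```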