Definition PageRank Centrality
PageRank centrality is calculated using the following formula: \[ r_i = \alpha \sum_{j=1}^N \frac{W_{ji}}{\sum_{k=1}^N W_{jk}} r_j + (1 - \alpha)\frac{1}{N} \] where:
- \( r_i \) is the PageRank score of node \( i \),
- \( \alpha \) is the damping factor (\( \alpha = 0.85 \)),
- \( W_{ij} \) is the weight of the edge from node \( i \) to node \( j \),
- \( \sum_{k=1}^N W_{kj} \) is the total outgoing weight from node \( j \),
- \( N \) is the total number of nodes in the network.
Calculating PageRank
The PageRank algorithm, developed by Larry Page and Sergey Brin, is a fundamental method for measuring the importance or authority of nodes within a network, particularly web pages. It interprets links as "votes" where a link from an important page carries more weight. The algorithm uses an iterative process to assign a numerical weight to each element of a hyperlinked set of documents, aiming to establish their relative importance.
1. Data Extraction and Graph Setup
The first step involves preparing the input data, which represents a directed weighted graph. An input DataFrame, TN_t_c, contains columns for reporterISO (source country), partnerISO (destination country), and W_ij (the weight of the link). Let \( N \) be the total number of unique countries (nodes) in our network. Each country is assigned a unique integer index from \( 0 \) to \( N-1 \).
2. Initializing and Populating the Weighted Adjacency Matrix \( \mathbf{W} \)
I construct an \( N \times N \) weighted adjacency matrix, \( \mathbf{W} \).
- Definition: An element \( W_{ij} \) in this matrix represents the total weight of direct links from country \( i \) to country \( j \). If there is no link from \( i \) to \( j \), then \( W_{ij} = 0 \).
- Mathematical Representation: For a set of \( N \) countries, the matrix \( \mathbf{W} \) is: \[ \mathbf{W} = \begin{pmatrix} W_{00} & W_{01} & \cdots & W_{0,N-1} \\ W_{10} & W_{11} & \cdots & W_{1,N-1} \\ \vdots & \vdots & \ddots & \vdots \\ W_{N-1,0} & W_{N-1,1} & \cdots & W_{N-1,N-1} \end{pmatrix} \] Each \( W_{ij} \) is the sum of weights for all links from country \( i \) to country \( j \). With the data structure of different periods this does not matter for my case, but in Capability 1 this leads to me taking the sum of the commodity codes as the weight and not looking at them individually.
- Conceptual Visualization: Imagine \( N \) countries. If country \( i \) links to country \( j \) with a weight \( w \), then \( W_{ij} = w \). For example, a simplified \( \mathbf{W} \) for \( N=4 \) countries might look like: \[ \mathbf{W} = \begin{pmatrix} 0 & W_{01} & W_{02} & W_{03} \\ W_{10} & 0 & W_{12} & W_{13} \\ W_{20} & W_{21} & 0 & W_{23} \\ W_{30} & W_{31} & W_{32} & 0 \end{pmatrix} \] (where \( W_{ii}=0 \) for no self-loops unless otherwise specified by data).
3. Creating the Transition Probability Matrix \( \mathbf{P} \)
The PageRank power iteration formula requires a column-stochastic matrix \( \mathbf{P} \), where \( P_{ji} \) represents the probability of a random surfer moving from page \( i \) to page \( j \).
- a. Calculate Row Sums (Out-Degrees): For each country \( i \), I calculate the total "strength" of its outgoing links. This is the sum of the \( i \)-th row in \( \mathbf{W} \). \[ L(i) = \sum_{k=0}^{N-1} W_{ik} \] This forms a vector \( \mathbf{L} \) of total outgoing link weights for each node. L represents the sum of all outgoing connections. If a country does not export at all it is set to 0. This leaves us with: \[ \mathbf{L} = \begin{pmatrix} L(0) \\ L(1) \\ \vdots \\ L(N-1) \end{pmatrix} \]
-
b. Handle Dangling Nodes:
A "dangling node" is a country (node) that has no outgoing links (i.e., its row sum \( L(i) = 0 \)). If a random surfer lands on such a page, they cannot follow any links. To prevent them from getting "stuck" and to ensure PageRank distributes correctly, dangling nodes are handled specifically.
- If \( L(i) = 0 \), I treat node \( i \) as if it links to every other node with equal probability \( 1/N \). This ensures that PageRank does not get trapped.
-
c. Create Row-Stochastic Matrix \( \mathbf{M} \):
I first construct an intermediate matrix \( \mathbf{M} \) where \( M_{ij} \) is the probability of moving from node \( i \) to node \( j \). This means the rows of \( \mathbf{M} \) must sum to 1.
- For a non-dangling node \( i \): \[M_{ij} = \frac{W_{ij}}{L(i)} = \frac{W_{ij}}{\sum_{k=0}^{N-1} W_{ik}}\]
- For a dangling node \( i \) (\( L(i) = 0 \)): \[M_{ij} = \frac{1}{N} \quad \text{for all } j \in \{0, \ldots, N-1\}\]
- d. Transpose to Get Column-Stochastic Matrix \( \mathbf{P} \): The PageRank power iteration formula relies on a matrix \( \mathbf{P} \) where \( P_{ji} \) represents the probability of moving from node \( i \) to node \( j \), and its columns sum to 1. This is achieved by transposing \( \mathbf{M} \). \[P_{ji} = M_{ij}\] So, \( \mathbf{P} = \mathbf{M}^T \). \[ \mathbf{P} = \begin{pmatrix} M_{00} & M_{10} & \cdots & M_{N-1,0} \\ M_{01} & M_{11} & \cdots & M_{N-1,1} \\ \vdots & \vdots & \ddots & \vdots \\ M_{0,N-1} & M_{1,N-1} & \cdots & M_{N-1,N-1} \end{pmatrix} \] Each column of \( \mathbf{P} \) sums to 1: \( \sum_j P_{ji} = 1 \). Thus, \( \mathbf{P} \) is a column-stochastic matrix, ready for the PageRank formula.
- Conceptual Visualization of \( \mathbf{P} \): In the matrix \( \mathbf{P} \), the columns represent the source nodes, and the rows represent the destination nodes. An element \( P_{ji} \) is the probability of a surfer moving from node \( i \) to node \( j \). \[ \mathbf{P} = \begin{pmatrix} P_{00} & P_{01} & \cdots & P_{0,N-1} \\ P_{10} & P_{11} & \cdots & P_{1,N-1} \\ \vdots & \vdots & \ddots & \vdots \\ P_{N-1,0} & P_{N-1,1} & \cdots & P_{N-1,N-1} \end{pmatrix} \] where \( P_{ji} \) (element in row \( j \), column \( i \)) is the probability of moving from node \( i \) to node \( j \). The sum of probabilities in each column must be 1. For example, the sum of probabilities of moving from node \( i \) to any other node \( j \) is \( \sum_j P_{ji} = 1 \).
4. Damping Factor ( \( \alpha \) ) and Teleportation Vector ( \( \mathbf{v} \) )
These components are crucial for ensuring the algorithm's convergence and reflecting the behavior of a random surfer.
-
Damping Factor ( \( \alpha \) ):
- Definition: Represents the probability that the random surfer will continue to follow a link on the current page. From experience of previous research set to be \( 0.85 \).
-
Teleportation Vector ( \( \mathbf{v} \) ):
- Definition: Represents the probability distribution over all pages that the random surfer jumps to when they get "bored" (i.e., decide not to follow a link, which occurs with probability \( 1-\alpha \)). In the standard PageRank model, this is a uniform distribution across all pages.
- Mathematical Representation: \[\mathbf{v} = \begin{pmatrix} 1/N \\ 1/N \\ \vdots \\ 1/N \end{pmatrix}\] This ensures that the elements of \( \mathbf{v} \) sum to 1: \( \sum_{i=0}^{N-1} v_i = 1 \). If \( (1-\alpha) \) "triggers", the random surfer jumps to the edge with index chosen in v.
5. Power Iteration Initialization
The PageRank algorithm uses an iterative approach (power iteration) to find the steady-state PageRank scores. I start with an initial guess \( r^{(0)} \).
-
Initial PageRank Vector ( \( \mathbf{r}^{(0)} \) ):
- Definition: Each page is initially assigned an equal PageRank score.
- Mathematical Representation: \[\mathbf{r}^{(0)} = \begin{pmatrix} 1/N \\ 1/N \\ \vdots \\ 1/N \end{pmatrix}\] The sum of elements in \( \mathbf{r}^{(0)} \) is 1: \( \sum_{i=0}^{N-1} r_i^{(0)} = 1 \) and \( N \) is the number of countries in the network.
-
Convergence Threshold ( \( \epsilon \) ):
- Definition: A small positive value used to determine when the algorithm has converged. The iteration stops when the change in the PageRank vector between successive steps falls below this threshold.
- Typically: \( \epsilon = 10^{-8} \).
6. Power Iteration Loop
This is the core of the PageRank algorithm, where PageRank scores are iteratively refined until they stabilize.
- Loop Condition: The loop continues as long as the measured change ( \( \Delta \) ) in the PageRank vector is greater than the convergence threshold \( \epsilon \).
- Scoring Vector: \( \mathbf{r}^{(k)} \) represents the PageRank scores of all pages after the \( k \)-th iteration of the algorithm. More precisely, \( \mathbf{r}^{(k)} \) can be interpreted as the probability distribution of the random surfer being on any given page after \( k \) steps of their journey across the web (or our network of countries). The vector \( \mathbf{r}^{(k)} \) is an \( N \times 1 \) column vector: \[ \mathbf{r}^{(k)} = \begin{pmatrix} r_0^{(k)} \\ r_1^{(k)} \\ \vdots \\ r_{N-1}^{(k)} \end{pmatrix} \] where \( r_i^{(k)} \) is the PageRank score (or the probability of the random surfer being on page \( i \) ) after the \( k \)-th iteration. Crucially, because it's a probability distribution, the sum of all elements in \( \mathbf{r}^{(k)} \) must be 1: \( \sum_{i=0}^{N-1} r_i^{(k)} = 1 \).
-
PageRank Update Equation:
For iteration \( k+1 \), the new PageRank vector \( \mathbf{r}^{(k+1)} \) is calculated from the previous vector \( \mathbf{r}^{(k)} \) using the formula:
\[\mathbf{r}^{(k+1)} = \alpha \mathbf{P} \mathbf{r}^{(k)} + (1 - \alpha) \mathbf{v}\]
Let's interpret the terms:
- \( \mathbf{P} \) is our column-stochastic transition matrix, where \( P_{ji} \) is the probability of a random surfer moving from page \( i \) to page \( j \).\( \mathbf{P} \mathbf{r}^{(k)} \) : When you multiply the matrix \( \mathbf{P} \) by the vector \( \mathbf{r}^{(k)} \), you are essentially calculating where the surfer is likely to be after one more step, assuming they always follow a link according to the probabilities defined by \( \mathbf{P} \). The \( j \)-th component of this product, \( (\mathbf{P} \mathbf{r}^{(k)})_j = \sum_{i=0}^{N-1} P_{ji} r_i^{(k)} \), represents the total probability flow into page \( j \) from all other pages \( i \), weighted by their current probabilities of being visited. In simpler terms, it's the sum of the votes page \( j \) receives, where each vote's strength is proportional to the PageRank of the voting page ( \( r_i^{(k)} \) ) and the probability of that page linking to \( j \) ( \( P_{ji} \) ).
- \( \alpha (\mathbf{P} \mathbf{r}^{(k)}) \): This scales the PageRank obtained by following links by the damping factor \( \alpha \).
- \( (1 - \alpha) \mathbf{v} \): This term represents the PageRank added by the "teleportation" mechanism. A small portion of the total PageRank ( \( 1-\alpha \) ) is uniformly distributed among all pages via the vector \( \mathbf{v} \). This is crucial for overcoming issues like "rank sinks" (groups of pages that only link internally) and ensuring that all pages (even those with no incoming links) can eventually gain some PageRank. It ensures that the underlying Markov chain is irreducible and aperiodic, guaranteeing a unique stationary distribution (the PageRank vector).
- Convergence Check (L1 Norm): The L1 norm (Manhattan distance) is used to quantify the overall change between the new and old PageRank vectors: \[\Delta = \| \mathbf{r}^{(k+1)} - \mathbf{r}^{(k)} \|_1 = \sum_{i=0}^{N-1} |r_{i}^{(k+1)} - r_{i}^{(k)}|\] This \( \Delta \) value indicates the stability of the PageRank distribution.
- Update and Increment: The newly computed PageRank vector \( \mathbf{r}^{(k+1)} \) becomes the current vector \( \mathbf{r}^{(k)} \) for the next iteration. The iteration counter is incremented.
7. Final Results
Once the power iteration loop converges (\( \Delta \le \epsilon \)), the final \( \mathbf{r} \) vector contains the stable PageRank scores for each country. These scores represent the calculated importance or authority of each country within the network, based on the link structure and weights.