Description of the algorithm

We shall show that the graph constructed from a training set defines a Markov chain. Consider a random walk on this graph that moves from a node (word), i, to a related word with probability 1/out_degree(i). In addition, from any node the walk may jump to each word in the graph with a small probability 1/n, where n is the number of nodes. This random walk defines an ergodic Markov chain: starting from an arbitrary probability distribution, the chain converges to a unique stationary probability distribution (known as PageRank in the area of information retrieval).

The algorithm can also be used to promote a given set of nodes related to a specific topic; this use of the algorithm is called topic-sensitive PageRank. In general the algorithm depends on a bias vector, p, in which we may distribute the probability mass so as to promote certain pages (nodes). In our case these nodes represent the polarized (positive or negative) words. Thus, starting with a small initial set of polarized words, we run the algorithm and calculate the polarity score of every node. The user may then select words with high positive or negative polarity, enrich the initial set of polarized words, and repeat the process several times.

In the following we give the mathematics of the algorithm and the data structures chosen for applying the algorithm to large data sets.

Random walk with restart algorithm - Mathematics

1. Construct the adjacency matrix of the graph:

A[i,j] = weight of the edge (i, j) = number of times the pair of related terms (i, j) occurs in the training set; A[i,j] = 0 if i and j are not related.

2. Construct the matrix P of transition probabilities, where P[i,j] is the probability of moving from node i to node j. Sum the elements of each row of A and divide the elements of the row by the corresponding sum, i.e. P[i,j] = A[i,j] / (sum of the elements of row i of A).

3. Sink nodes: If a row of P contains only zeros then replace all the elements of the row by 1/n. Let d be a binary vector with 1 in position i if i is a sink node and 0 otherwise.

d[i] = 1 if out_degree(i) = 0 (node i is a sink node), and d[i] = 0 otherwise.


We define a bias vector, p, with equal components (probabilities which relate a node (word) i with all other nodes (words) in the graph).

p = (1/n, 1/n, ..., 1/n)T, i.e. p[i] = 1/n for every node i.

We define the matrix D = d pT. The rows of D corresponding to sink nodes have all their elements equal to 1/n. All other elements of D are equal to 0.

4. Calculate the transition probability matrix. We first calculate the matrix E, whose elements are all equal to 1/n and denote the probability of jumping from a node to any other node in the graph. Matrix E is defined by E = 1 pT, where 1 is a vector with all its components equal to 1. Finally we define the transition probability matrix

Phat = α (P + D) + (1 - α) E

where α is a damping factor set to 0.85.
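To make steps 1-4 concrete, here is a minimal NumPy sketch assuming a small hypothetical 4-node graph (the adjacency weights below are invented for illustration); it builds P, D, E and Phat exactly as defined above.

import numpy as np

n = 4
# Toy weighted adjacency matrix; node 3 is a sink node (its row is all zeros).
A = np.array([[0, 2, 1, 0],
              [1, 0, 0, 1],
              [0, 3, 0, 0],
              [0, 0, 0, 0]], dtype=float)

alpha = 0.85                        # damping factor
p = np.full(n, 1.0 / n)             # bias vector with equal components 1/n

row_sums = A.sum(axis=1)            # sum of each row of A
d = (row_sums == 0).astype(float)   # sink indicator vector d

# P: divide every nonzero row of A by its sum (sink rows stay zero for now).
P = np.divide(A, row_sums[:, None], out=np.zeros_like(A), where=row_sums[:, None] > 0)

D = np.outer(d, p)                  # sink rows become p, all other rows are zero
E = np.outer(np.ones(n), p)         # E = 1 pT, every element equals 1/n

Phat = alpha * (P + D) + (1 - alpha) * E
print(Phat.sum(axis=1))             # every row sums to 1, so Phat is stochastic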

5. Calculate the left eigenvector of matrix Phat (power method)

x(k+1)T = x(k)T Phat,   k = 0, 1, 2, ..., starting from an arbitrary probability distribution x(0).

It is easy to prove that this sequence converges to the left eigenvector of the matrix Phat. If this vector is normalized to 1 with respect to the L1 norm, it defines a probability distribution vector whose elements are the polarity scores of the corresponding nodes.
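A minimal power-iteration sketch in NumPy, assuming Phat is the row-stochastic matrix built in the previous sketch; the tolerance and iteration limit are arbitrary choices for illustration.

import numpy as np

def power_method(Phat, tol=1e-10, max_iter=1000):
    """Return x such that xT Phat = xT (the stationary distribution of Phat)."""
    n = Phat.shape[0]
    x = np.full(n, 1.0 / n)              # start from an arbitrary distribution
    for _ in range(max_iter):
        x_next = x @ Phat                # x(k+1)T = x(k)T Phat
        x_next /= np.abs(x_next).sum()   # normalize with respect to the L1 norm
        if np.abs(x_next - x).sum() < tol:
            break
        x = x_next
    return x_next

# polarity = power_method(Phat)   # polarity[i] is the polarity score of node i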

To exploit the sparseness of matrix P we do not need to calculate the matrix Phat explicitly. The power method may be applied using the relation:

x(k+1)T = α x(k)T (P + D) + (1 - α) pT

Proof

x(k+1)T = x(k)T Phat = x(k)T [ α (P + D) + (1 - α) 1 pT ] = α x(k)T (P + D) + (1 - α) (x(k)T 1) pT

It holds that:
x(k)T 1 = x(k)[1] + x(k)[2] + ... + x(k)[n] = 1, since x(k) is a probability distribution.

Thus

x(k+1)T = α x(k)T (P + D) + (1 - α) pT
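As a quick numerical check of this identity, the following lines (continuing from the NumPy toy example above, so n, P, D, p and alpha are assumed to be already defined) compare one step of the full update with one step of the simplified update for a random probability vector.

rng = np.random.default_rng(0)
x = rng.random(n)
x = x / x.sum()                    # x is a probability distribution, so xT 1 = 1

full_step = x @ (alpha * (P + D) + (1 - alpha) * np.outer(np.ones(n), p))
simple_step = alpha * (x @ (P + D)) + (1 - alpha) * p
print(np.allclose(full_step, simple_step))   # True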

Data Structures

Construction of the adjacency matrix: The adjacency matrix is constructed while parsing the training set. For each pair of related terms within a sentence we store in the file, A, their coordinates defined by their IDs. Only the dictionary of terms is kept in memory. It is clear that a pair of terms (i,j) may occur several times inside the file, A. The number of occurrences of (i,j) defines the weight of the edge (i,j). The adjacency matrix for our example is stored in the file A as shown in the figure below.

[Figure: contents of file A for the example - one (i, j) pair of term IDs per line]
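A minimal Python sketch of this construction, assuming whitespace-tokenized sentences and treating every ordered pair of distinct terms inside a sentence as related (the actual relation may be narrower, e.g. the PoS-based one mentioned at the end); the file name pairs.txt and the toy sentences are invented for the example.

# Build the term dictionary in memory and write the edge coordinates to file A.
sentences = ["good reliable camera", "bad blurry lens", "good lens"]   # toy training set

term_id = {}                          # dictionary of terms kept in memory
with open("pairs.txt", "w") as A:     # file A: one (i, j) pair per line
    for sentence in sentences:
        ids = [term_id.setdefault(t, len(term_id)) for t in sentence.split()]
        for i in ids:
            for j in ids:
                if i != j:            # every occurrence of a related pair is one line;
                    A.write(f"{i} {j}\n")   # repeated pairs encode the edge weight

n = len(term_id)                      # number of nodes in the graph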

Construction of matrix P: Matrix P is a stochastic matrix. To calculate P we parse the file, A, and calculate the out_degree, Ni, for each node i. This is equal to the sum of the elements in row i of the adjacency matrix, i.e. the number of pairs in A whose first coordinate is i.

Initial values: for all nodes i set N[i]=0;
read file A;
for each pair (i,j) in A: N[i]++;
//P[i,j] <- Adj[i,j]/N[i];
//This is not necessary to compute explicitly
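The same counting pass in Python, continuing from the parsing sketch above (so pairs.txt and n are assumed to be already available):

# One pass over file A to obtain the out_degree N[i] of every node.
N = [0] * n
with open("pairs.txt") as A:
    for line in A:
        i, j = map(int, line.split())
        N[i] += 1
# P[i,j] = Adj[i,j] / N[i] is never materialized; N is all we need to keep.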

Sink nodes: Sink nodes have out_degree, Ni=0. If a node i is a sink node then all the elements in row i of matrix P are zero. In this case we set all these elements to 1/n, where n is the size of the graph. To retain the sparsity of matrix P this is not done explicitly. However it is taken into account while we calculate the matrix-vector multiplication in the power method.

Calculate y = xT P

for each (i,j)∈ A set y[j]=y[j]+x[i]/N[i];

Calculate y = xT (P + D) = xT P + xT D

for each (i,j) ∈ A do y[j] = y[j] + x[i]/N[i];
// a sink node i has no pairs (i,j) in A, so its row of D is handled separately:
s = sum of x[i] over all sink nodes i (those with N[i] == 0);
for each node j do y[j] = y[j] + s/n;
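Putting the pieces together, here is a minimal Python sketch of the sparse iteration x(k+1)T = α x(k)T (P + D) + (1 - α) pT, reading the pairs from the file A built earlier (the file name pairs.txt and the fixed iteration count are assumptions of the sketch); matrices P and D are never formed explicitly.

def polarity_scores(pairs_file, n, N, p, alpha=0.85, iters=100):
    """Power method over the edge file; x[i] converges to the polarity score of node i."""
    x = [1.0 / n] * n                        # arbitrary starting distribution
    for _ in range(iters):
        y = [0.0] * n
        with open(pairs_file) as A:
            for line in A:                   # xT P term: one pass over the pairs
                i, j = map(int, line.split())
                y[j] += x[i] / N[i]          # N[i] > 0 whenever (i, j) appears in A
        s = sum(x[i] for i in range(n) if N[i] == 0)   # mass sitting on sink nodes
        x = [alpha * (y[j] + s / n) + (1 - alpha) * p[j] for j in range(n)]
    return x

# scores = polarity_scores("pairs.txt", n, N, p)   # p is the bias vector defined below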

The bias vector p. We assign a probability mass pm (through the interface) to our set of initial positive (negative) words and (1-pm) to all the remaining words.
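A minimal sketch of this assignment; the seed IDs and the value of pm below are invented for illustration, and the mass pm is split evenly over the seed words while 1 - pm is split over all the rest.

def bias_vector(n, seed_ids, pm):
    """Give total mass pm to the seed (polarized) words and 1 - pm to all remaining words."""
    seed = set(seed_ids)
    rest = n - len(seed)
    return [pm / len(seed) if i in seed else (1 - pm) / rest for i in range(n)]

# p = bias_vector(n, seed_ids={0, 5}, pm=0.9)   # e.g. nodes 0 and 5 are the initial positive words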

Use the Stanford PoS tagger to set new relations between words inside a sentence.
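One possible (hypothetical) way to extract such relations is sketched below; it uses NLTK's built-in pos_tag as a stand-in for the Stanford PoS tagger and relates each adjective to the nouns of the same sentence, which is only one plausible choice of relation.

import nltk   # requires the punkt and averaged_perceptron_tagger resources (nltk.download)

def adjective_noun_pairs(sentence):
    """Return the (adjective, noun) pairs found inside one sentence."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    adjectives = [w for w, tag in tagged if tag.startswith("JJ")]
    nouns = [w for w, tag in tagged if tag.startswith("NN")]
    return [(a, nn) for a in adjectives for nn in nouns]

# adjective_noun_pairs("The camera has a sharp and reliable lens")
# -> pairs such as ('sharp', 'camera'), ('reliable', 'lens'), which become edges of the graph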