Recommendation System
Introduction
The users of our software are innovative researchers in synthetic biology with interests spanning many different biological fields. The purpose of our system is to recommend the most relevant genetic parts to users based on the research interests they provide to our system.
The information that we use to make recommendations is a database, built with NLP and Random Walk, that contains a score between each key word and each genetic part. Formally, it is an \(n \times m\) matrix \(A\), where \(n\) is the number of key words and \(m\) is the number of parts. The element at row \(i\), column \(j\) is the score between the \(i\)-th keyword and the \(j\)-th genetic part, reflecting the strength of the connection between them (a higher score means a stronger connection).
The overall strategy of our system is Collaborative Filtering: we first search our database for the key words most similar to the unknown word offered by the user, and then recommend to the user the genetic parts that are highly related to those key words.
A natural question is how to find the key words in our dataset most similar to a given unknown word in an accurate and efficient way. Here is our general solution: to quantify semantic similarities between words accurately, we use deep-learning techniques to convert words into numerical vectors, so that the cosine similarity between two vectors represents the semantic similarity between the corresponding words. To search for similar key words efficiently, we use the KD Tree algorithm, a fast search algorithm based on binary trees, to implement the K Nearest Neighbors (KNN) strategy.
Models used in the system
Word2Vec Algorithm
Word2vec is an algorithm that produces word embeddings, i.e. it maps a corpus of text into a high-dimensional real vector space (in our case, the dimension is 400), assigning each word in the corpus a vector in that space. If two words are semantically similar, their vectors will be close under the cosine distance measure.
In our Recommendation System, we use gensim, an open-source Python module focused on Natural Language Processing, to train our word2vec model, and the corpus we use to train the model is the Wikimedia dump, which can be downloaded from: https://dumps.wikimedia.org/
The reason we use Word2vec is that its deep-learning approach distinguishes the semantic meanings of words accurately, greatly outperforming traditional semantic-analysis methods.
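For illustration, here is a minimal sketch of how such a model can be trained with gensim. The corpus path, the example word, and all hyperparameters except the 400-dimensional vector size are hypothetical placeholders rather than our exact settings:

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# "corpus.txt" is a hypothetical file holding the preprocessed Wikimedia
# dump, one sentence per line.
sentences = LineSentence("corpus.txt")

# Train a 400-dimensional embedding, matching the dimension quoted above
# (gensim >= 4.0 calls this parameter vector_size; older releases use size).
model = Word2Vec(sentences, vector_size=400, window=5, min_count=5, workers=4)

# Look up a word's vector and its nearest neighbors under cosine similarity.
vec = model.wv["promoter"]                         # a 400-dimensional array
print(model.wv.most_similar("promoter", topn=5))
```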
KD Tree Algorithm
The KD Tree algorithm is highly efficient when searching for the most similar items, with an average time complexity of \(O(\log n)\), so users get recommendations almost instantly after entering a new word of interest into our system.
The KD Tree Algorithm consists of the following two parts:
- Construct the KD Tree based on the normalized word vectors of the key words in the database mentioned above. (Offline)
- When the user enters a new word, search for similar key words along the KD Tree constructed in step 1). (Online)
Note that the KD Tree algorithm is based on the Euclidean distance measure, while in our case we need to calculate the cosine distance between word vectors. To solve this problem, we normalize all the word vectors of the key words before constructing the KD Tree, and normalize the word vector of each new word offered by a user before letting it travel along the constructed tree. The reason this works is that, by the Law of Cosines:
$$|x_1-x_2|=\sqrt{|x_1|^2+|x_2|^2-2|x_1||x_2|\cos(x_1,x_2)}\,,\,\,x_1,x_2\in R^n$$
Since \(x_1, x_2\) are normalized, we have:
$$|x_1-x_2| = \sqrt{1+1-2\cos(x_1,x_2)}\,,\,\,x_1,x_2\in R^n, |x_1|=|x_2|=1$$
Now we see that the calculation of the cosine distance between normalized vectors can be replaced by the calculation of the Euclidean distance, since each determines the other once the vectors are normalized.
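A quick numerical check of this identity, sketched with numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=400), rng.normal(size=400)

# Normalize both vectors to unit length.
x1 /= np.linalg.norm(x1)
x2 /= np.linalg.norm(x2)

cos = x1 @ x2                          # cosine similarity of unit vectors
euclid = np.linalg.norm(x1 - x2)       # Euclidean distance

# For unit vectors: |x1 - x2| = sqrt(2 - 2 cos(x1, x2))
assert np.isclose(euclid, np.sqrt(2 - 2 * cos))
```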
In our Recommendation System, we use the KD Tree implementation in scikit-learn, a simple and efficient open-source machine learning module in Python. For more details of the KD Tree algorithm, see: wiki page of K-D Tree
Random Walk with Restart Algorithm
Random Walk with Restart (RWR) is an algorithm adapted from the PageRank algorithm, and it focuses on characterizing the affiliation between items. We treat the relation between key words and genetic parts as an undirected graph in which nodes represent key words or parts and edges represent connections between words and parts. Imagine a walker travelling on this graph who, at each step, faces two choices: 1) randomly travel along an edge connected to the current node, or 2) teleport back to node K. After a long period of random travel, the frequency with which the walker visits each node represents the affiliation between that node and node K, which we use to characterize the relation between key words and parts. For a more detailed mathematical formulation of PageRank, see: wiki page of page rank
Algorithms used in the Recommendation System
Calculating affiliation between key words and parts
We use Random Walk with Restart to calculate the affiliation between each key word and all parts in the dataset. The calculation is done by updating a probability vector iteratively; once the probability vector has converged, it can be viewed as the affiliation between the key word and the various parts. The higher the probability, the closer the key word and the part.
Here is the algorithm that we use in our project:
Define Transition Matrix M
Let \(M\) be the transition matrix of the graph \(G\), of size \((n+m)\times(n+m)\), where \(n\) is the number of keywords and \(m\) is the number of parts. The entry in row \(i\) and column \(j\) of \(M\) is \(1/k\) if node \(j\) of \(G\) has degree \(k\) and node \(i\) is one of its adjacent nodes; otherwise the entry is \(0\). Note that the first \(n\) columns (rows) represent key words and the other \(m\) columns (rows) represent the parts.
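For illustration, a minimal numpy sketch of this definition, assuming the graph \(G\) is supplied as a 0/1 adjacency matrix (the function name is ours):

```python
import numpy as np

def transition_matrix(adj):
    """Column-normalize a 0/1 adjacency matrix into M.

    adj has shape (n+m, n+m); dividing column j by the degree of node j
    makes entry (i, j) equal to 1/k exactly when node j has degree k and
    node i is adjacent to it, and 0 otherwise.
    """
    degrees = adj.sum(axis=0).astype(float)
    degrees[degrees == 0] = 1.0   # guard against isolated nodes
    return adj / degrees          # broadcasting divides column j by degrees[j]
```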
Prespecify β
\(\beta\) is the probability that the walker continues to follow an edge connected to the current node; thus \(1-\beta\) is the probability that the walker teleports back to the initial node.
Iterate over v
for keyword i in 1 to N:
Define and initialize v
Let \(e_i\) be the column vector that has 1 in the row for node \(i\) and 0's elsewhere, i.e.
$$e_{ij}=\delta_{ij}=
\begin{cases}
1, & \text{if $i=j$} \\
0, & \text{if $i\neq j$}
\end{cases}$$
Let \(v\) be the column vector whose entries give the probability that the walker is at each node of the graph; the initial \(v\) is set to \(e_i\).
Update v
Update \(v\) by
$$v'=\beta M v+(1-\beta)e_i$$
until the terminal condition is satisfied, i.e.
$$|v^{(t)}-v^{(t-1)}|<\epsilon\,,\,\,\text{for some $t$}$$
Modify v
Since the first \(n\) components of \(v\) are the affiliations between key word \(i\) and the other key words, which we are not interested in, we keep only the last \(m\) components of \(v\), which reflect the affiliation between key word \(i\) and all the parts.
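Putting these steps together, here is a minimal numpy sketch of the iteration. The function name, the value of \(\beta\), the tolerance, and the iteration cap are illustrative assumptions, not the exact settings of our system:

```python
import numpy as np

def rwr(M, i, beta=0.8, eps=1e-8, max_iter=1000):
    """Random Walk with Restart from keyword node i.

    M is the (n+m) x (n+m) transition matrix defined above; beta is the
    follow-an-edge probability (0.8 is an illustrative choice).
    Returns the converged probability vector v.
    """
    e_i = np.zeros(M.shape[0])
    e_i[i] = 1.0                   # restart vector: all mass on node i
    v = e_i.copy()                 # initial v is e_i
    for _ in range(max_iter):
        v_next = beta * (M @ v) + (1 - beta) * e_i
        done = np.linalg.norm(v_next - v) < eps   # |v(t) - v(t-1)| < eps
        v = v_next
        if done:
            break
    return v

# Keeping only the last m components of v gives the affiliations between
# keyword i and all the parts (n, m as defined above):
# affiliations = rwr(M, i)[n:]
```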
Build KD Tree of Keywords
Once the user enters a new word into our system, we need to search for similar key words efficiently, so we build a KD Tree of the key words to implement this function.
Here is our algorithm:
Train word vector model
We feed our word vector model with the Wikimedia corpus. After the training process is done, we have a model that can convert words into numerical vectors, i.e.
$$f:\Gamma\rightarrow R^n$$
where \(\Gamma\) denotes the corpus space.
Convert keywords into normalized vectors
Let \(T\) denote the set of word vectors.
for keyword i in 1 to N:
    T.append \( \frac{f(i)}{|f(i)|_2} \)
Build KD Tree
We use the standard construction procedure to build the KD Tree of the set \(T\); we omit the details here. For readers who are interested, we provided a link to Wikipedia in section 2.
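As a sketch, the offline construction can be done with scikit-learn's KDTree. The keyword list and the random vectors below are illustrative placeholders for our actual key words and their trained word2vec embeddings:

```python
import numpy as np
from sklearn.neighbors import KDTree

# `keywords` stands in for the key words in our database; the random vectors
# stand in for their word2vec embeddings (illustrative placeholders only).
keywords = ["promoter", "plasmid", "ribosome", "terminator", "operon"]
rng = np.random.default_rng(0)
vectors = rng.normal(size=(len(keywords), 400))

# Normalize each row so that Euclidean distance inside the tree is a proxy
# for cosine distance, as explained in section 2.
T = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

tree = KDTree(T)   # offline construction
```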
Collaborative Filtering
This is the final step in constructing our Recommendation System; overall, we use the collaborative filtering strategy to make predictions.
Here is the algorithm:
Transform the user's input
Given the user's input word \(k\), we convert it into a normalized word vector \(v\), i.e. $$v=\frac{f(k)}{|f(k)|}$$
Search for the K most similar keywords (KNN)
Letting \(v\) travel along the KD Tree in a manner similar to binary search, we end up with the \(K\) (in our case, \(K\) is set to 5) key words most similar to the current word \(k\).
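Continuing the sketch above, the online query step might look as follows; the query vector is a random placeholder for the normalized embedding \(f(k)\) of the user's word:

```python
# Online step: normalize the user's word vector and query the tree for the
# K = 5 nearest neighbors under Euclidean (hence cosine) distance.
query_vec = rng.normal(size=400)        # placeholder for f(k)
query_vec /= np.linalg.norm(query_vec)

dist, idx = tree.query(query_vec.reshape(1, -1), k=5)
similar_keywords = [keywords[j] for j in idx[0]]
```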
Make Recommendation
We have calculated the affiliation between key words and parts, which we now use to guide the final recommendation. Let \(p_i\) denote the connection between the \(i\)-th part and the current word \(k\); we calculate \(p_i\) by the formula $$p_i=\sum^K_{n=1}\frac{\text{Affiliation}(w_n,p_i)}{\text{distance}(w_n,k)+\text{bias}}$$
where \( \{w_n|n=1,2,...,K\} \) denotes the \(K\) most similar key words, \(\text{Affiliation}(w_n, p_i)\) denotes the connection between \(w_n\) and the \(i\)-th part, and \(\text{distance}(w_n, k)\) denotes the Euclidean distance between the normalized word vectors of \(w_n\) and the word \(k\).
After computing \(\{p_i|i=1,2,...,m\}\), we use the heap sort algorithm to select the \(R\) highest \(p_i\) and recommend the corresponding parts to the users. Besides, we use exactly the same strategy to recommend related projects to the users; for simplicity, we omit the details here.