Model



1. Latent Dirichlet Allocation (LDA) model

We wanted our system to “understand” the information of all teams under each track and automatically classify the teams into groups. The conventional LDA model is used to discover the keywords of the themes shared among documents, but here we treat it as an unsupervised classifier; unsupervised means we do not have to provide any manually labelled data. As a result, it gives us clusters of documents, where documents in the same cluster share the same theme. The picture below better explains how LDA works.


(1) α is the Dirichlet prior that governs how themes are generated for each document. (2) β is the per-theme word distribution p(word | theme). (3) θ is the per-document theme distribution p(theme). (4) z is the theme assigned to each word in a document. (5) w is the observed words themselves.
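
As an illustration, the sketch below shows how such an unsupervised LDA clustering could be run with the gensim library; the toy documents, the number of themes, and the training parameters are placeholders rather than the exact values used in our system.

```python
from gensim import corpora, models

# Toy team descriptions, already tokenised; real pre-processing is assumed elsewhere.
team_docs = [
    ["biosensor", "fluorescence", "detection", "water"],
    ["crispr", "gene", "editing", "therapy"],
    ["biosensor", "heavy", "metal", "water", "detection"],
]

dictionary = corpora.Dictionary(team_docs)                 # word <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in team_docs]    # bag-of-words counts

# alpha corresponds to the Dirichlet prior over themes described above;
# num_topics is a guessed number of themes, not the value used by the team.
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      alpha="auto", passes=20, random_state=0)

# theta: the per-document theme distribution p(theme) for each document.
for i, bow in enumerate(corpus):
    print(f"document {i}:", lda.get_document_topics(bow))

# Cluster every document under its most probable theme; no labelled data is needed.
clusters = [max(lda.get_document_topics(bow), key=lambda t: t[1])[0] for bow in corpus]
print(clusters)
```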



2. TF-IDF model

TF-IDF refers to Term Frequency–Inverse Document Frequency. It is used in our system to extract the keywords of a document. It consists of two parts: the TF value and the IDF value. First, we calculate the TF value for each document by simply counting how many times a word appears in the document. The IDF value of a word w_i is calculated according to the following formula:

IDF(w_i) = log(N / n_i)

where N is the total number of documents and n_i is the number of documents that contain w_i. The IDF value represents how general a word is: the higher the IDF, the less common the word.

Finally, we combine the TF and IDF values by multiplying them. By doing this, we filter out the general words, and the keywords are left as expected.
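
A minimal sketch of this computation is shown below; the toy documents are placeholders, and the exact IDF variant (for example, whether smoothing is applied) is an assumption rather than the precise formula used in our system.

```python
import math
from collections import Counter

# Toy corpus: each document is a list of tokens.
docs = [
    ["we", "design", "a", "biosensor", "for", "water"],
    ["we", "edit", "genes", "with", "crispr"],
    ["a", "biosensor", "detects", "heavy", "metals", "in", "water"],
]
N = len(docs)

def idf(word):
    # Document frequency: how many documents contain the word.
    n_w = sum(1 for d in docs if word in d)
    return math.log(N / n_w)          # unsmoothed IDF; smoothed variants also exist

def tfidf(doc):
    tf = Counter(doc)                 # TF: raw counts within one document
    return {w: c * idf(w) for w, c in tf.items()}

# General words such as "we" or "a" score low, so the remaining
# high-scoring words can be read off as keywords of the document.
scores = tfidf(docs[0])
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```
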
3. Word2Vec

Word2Vec plays an important role in our system. Word vectors are an effective and promising substitute for the one-hot encoding conventionally used in NLP (Natural Language Processing). To illustrate one-hot encoding, take the sentence “I love you so much” as an example. We want a vector to represent each word in this sentence. One-hot encoding assigns a “1” to the entry of this vector corresponding to the word’s position in the sentence. For example, “love” in one-hot is “0 1 0 0 0” and “so” is “0 0 0 1 0”. However, this kind of encoding does not capture the semantic meaning of a word: if we try to measure the semantic similarity between two words, the similarity will be zero unless the words are identical. Researchers therefore proposed the “word vector”, a vector that represents a word’s semantic meaning. It takes the context of the word into consideration, and it works remarkably well in practice.
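
The following short sketch reproduces this one-hot encoding for the example sentence and shows why two different words always come out with zero similarity.

```python
import numpy as np

sentence = "I love you so much".split()

def one_hot(word):
    vec = np.zeros(len(sentence), dtype=int)
    vec[sentence.index(word)] = 1     # set the entry at the word's position to 1
    return vec

print(one_hot("love"))                # [0 1 0 0 0]
print(one_hot("so"))                  # [0 0 0 1 0]

# Distinct one-hot vectors are orthogonal, so their similarity is always zero,
# no matter how close the two words are in meaning.
print(int(one_hot("love") @ one_hot("so")))   # 0
```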

Thus, the semantic distance between two words can be easily measured, for example as the L2 norm of the difference between their word vectors. Word vectors are computed with neural networks, and the detailed structure can be found here.
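
As an illustration, the sketch below uses the gensim Word2Vec implementation to train word vectors and compare two words; the toy sentences, vector size, and training settings are placeholders, not the configuration of our system.

```python
import numpy as np
from gensim.models import Word2Vec

# Toy tokenised corpus; a real training corpus would be far larger.
sentences = [
    ["i", "love", "you", "so", "much"],
    ["i", "like", "you", "very", "much"],
    ["we", "love", "synthetic", "biology"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1,
                 epochs=100, seed=0)

v_love = model.wv["love"]
v_like = model.wv["like"]

# Semantic distance as the L2 norm of the difference between two word vectors.
print(np.linalg.norm(v_love - v_like))

# gensim also provides a cosine-based similarity for comparison.
print(model.wv.similarity("love", "like"))
```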

4. LSI (Latent Semantic Indexing)

Word vectors can only capture the semantic meaning of single words; to measure the semantic distance between two documents we need another tool, which is why we introduced the LSI model. LSI uses SVD (Singular Value Decomposition) to find the latent similarity between documents. SVD can be thought of as the matrix version of factorization: just as the number 12 can be decomposed into 2×2×3, SVD decomposes a matrix into a product of simpler matrices. Suppose we have m documents and n distinct words in total, collected in an m×n matrix A. We decompose it as follows:

A = U Σ V^T

where A_(i,j) stands for the feature value, generally the TF-IDF value of word j in document i. We regard U_i, the i-th row vector of the matrix U, as the semantic value of document i. The similarity between documents i and j can then be calculated with cosine similarity:

sim(i, j) = (U_i · U_j) / (‖U_i‖ ‖U_j‖)
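
The sketch below walks through this decomposition with NumPy on a tiny placeholder TF-IDF matrix; in the real system the matrix A would come from the TF-IDF step described earlier.

```python
import numpy as np

# A: m x n feature matrix, A[i, j] = TF-IDF value of word j in document i.
# The numbers here are placeholders for illustration only.
A = np.array([
    [0.9, 0.0, 0.4, 0.0],
    [0.0, 0.8, 0.0, 0.5],
    [0.7, 0.1, 0.3, 0.0],
])

# SVD: A = U * Sigma * V^T; keep only the first k latent dimensions (truncated SVD).
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k = U[:, :k]                        # row i is the semantic vector of document i

def cosine_similarity(i, j):
    return float(U_k[i] @ U_k[j] /
                 (np.linalg.norm(U_k[i]) * np.linalg.norm(U_k[j])))

print(cosine_similarity(0, 2))   # documents 0 and 2 share vocabulary -> high similarity
print(cosine_similarity(0, 1))   # documents 0 and 1 share little -> lower similarity
```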

Reference

1. By Bkkbrad – Own work, GFDL.