Team:HFUT-China

Model



1. Latent Dirichlet Allocation (LDA) model

For the information of all teams under each track, we tried to let our computers “understand” it and classify it into groups automatically. The conventional LDA model is used to explore the keywords of themes across a collection of documents, but here we treat it as an unsupervised classifier; “unsupervised” means we do not have to provide any manually labeled data. As a result, LDA gives us clusters of documents, where documents in the same cluster share the same theme. The picture below better explains how LDA works.

[Figure: the LDA graphical model]

(1) α is the key parameter for generating a theme. (2) β stands for the word-given-theme distribution, p(word|theme). (3) θ is the theme distribution of a document, p(theme). (4) z is the theme assigned to each word in a document. (5) w is the specific word.
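
To make this concrete, here is a minimal sketch of running LDA as an unsupervised classifier. It assumes the gensim library and a toy token-list corpus; the team's actual corpus, preprocessing, and parameter choices are not shown on this page.

    from gensim import corpora, models

    # Toy corpus: each document is a list of tokens (real input would be team wiki text)
    texts = [["promoter", "sensor", "arsenic", "detect"],
             ["sensor", "arsenic", "water", "detect"],
             ["software", "wiki", "design", "tool"],
             ["software", "tool", "interface", "design"]]

    dictionary = corpora.Dictionary(texts)            # word <-> id mapping
    corpus = [dictionary.doc2bow(t) for t in texts]   # bag-of-words counts

    lda = models.LdaModel(corpus, id2word=dictionary,
                          num_topics=2, passes=50, random_state=0)

    # Cluster each document by its most probable theme (the z variable above)
    for i, bow in enumerate(corpus):
        theme, prob = max(lda.get_document_topics(bow), key=lambda tp: tp[1])
        print("document", i, "-> theme", theme, "(p = %.2f)" % prob)

Documents 0 and 1 should end up in one cluster and documents 2 and 3 in the other, without any labels being provided.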



2. TF-IDF model

TF-IDF stands for Term Frequency–Inverse Document Frequency. It is used in our system to extract the keywords of a document, and it consists of two parts: the TF value and the IDF value. First, we calculate the TF value for each document by simply counting how many times each word appears in that document. The IDF value of a word w_i is then calculated according to the following formula:

$$ \mathrm{IDF}(w_i) = \log \frac{|D|}{|\{d \in D : w_i \in d\}|} $$

where |D| is the total number of documents and the denominator is the number of documents that contain w_i. The IDF value represents how general a word is: the higher the IDF, the less commonly the word is seen.

Finally, we combine the TF and IDF values by multiplying them, TF-IDF(w_i, d) = TF(w_i, d) · IDF(w_i). By doing this, we filter out the overly general words, and the keywords are left as expected.
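
A self-contained sketch of this keyword-extraction step in plain Python, using raw counts for TF and the logarithmic IDF above (any smoothing or normalization the system may apply is not shown):

    import math
    from collections import Counter

    def tf_idf(docs):
        """docs: list of token lists; returns one {word: score} dict per document."""
        n = len(docs)
        df = Counter()                      # document frequency of each word
        for doc in docs:
            df.update(set(doc))
        scores = []
        for doc in docs:
            tf = Counter(doc)               # TF: raw count of each word in this document
            scores.append({w: tf[w] * math.log(n / df[w]) for w in tf})
        return scores

    docs = [["arsenic", "sensor", "the", "the"], ["software", "tool", "the"]]
    for s in tf_idf(docs):
        # "the" appears in every document, so its IDF is log(1) = 0 and it drops out
        print(sorted(s.items(), key=lambda kv: -kv[1]))
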
3. Word2Vec

Word2Vec plays an important role in our system. Word vectors are an effective and promising substitute for the one-hot encoding conventionally used in NLP (Natural Language Processing). To see what one-hot encoding is, take the sentence “I love you so much” as an example. We want a vector to represent each word in this sentence, and one-hot encoding assigns a “1” to the entry of the vector given by the word’s position in the sentence. For example, “love” in one-hot is “0 1 0 0 0” and “so” is “0 0 0 1 0”. Nonetheless, this kind of encoding does not capture the semantic meaning of a word: if we measure the semantic similarity between two words, the result is zero unless the words are identical. So researchers proposed the “word vector”, a vector that represents a word’s semantic meaning. It takes the context of the word into consideration, and the effect turns out to be excellent.

Thus, the similarity between two words can be easily measured using the L2 norm. Word vectors are computed with a neural network, whose detailed structure can be found here: https://en.wikipedia.org/wiki/Word2vec

[Figure: Word2Vec network structure]
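
A small numpy illustration of the contrast (the dense vectors below are made-up stand-ins for what Word2Vec would learn, not actual trained embeddings):

    import numpy as np

    # One-hot vectors for "love" and "so" in "I love you so much"
    love = np.array([0, 1, 0, 0, 0])
    so = np.array([0, 0, 0, 1, 0])
    print(love @ so)                  # 0: any two distinct one-hot words look unrelated

    # Toy dense word vectors standing in for learned embeddings
    king = np.array([0.8, 0.1, 0.6])
    queen = np.array([0.7, 0.2, 0.6])
    apple = np.array([0.1, 0.9, 0.2])

    def l2(a, b):
        return np.linalg.norm(a - b)  # L2 norm of the difference

    print(l2(king, queen))            # small distance: semantically close words
    print(l2(king, apple))            # large distance: unrelated words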

4. LSI (Latent Semantic Indexing)

A word vector can only capture the semantic meaning of a single word; if we want to measure the semantic distance between two documents, we have to find another way. That is why we introduced the LSI model. LSI uses SVD (Singular Value Decomposition) to find the latent similarity between documents. SVD can be thought of as the matrix analogue of integer factorization: just as the number 12 can be decomposed into 2 × 2 × 3, SVD decomposes a matrix into factors. Suppose that we have m documents and n total words, arranged as an m × n matrix A. We decompose it as follows:

$$ A_{m \times n} \approx U_{m \times k} \, \Sigma_{k \times k} \, V^{T}_{k \times n} $$

where A_(i,j) stands for the feature value, generally the TF-IDF value of word j in document i. We regard U_i, the i-th row vector of the matrix U, as the semantic vector of document i. The similarity between documents i and j can then be calculated with cosine similarity, as in the following expression:

$$ \mathrm{sim}(i, j) = \frac{U_i \cdot U_j}{\lVert U_i \rVert \, \lVert U_j \rVert} $$
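
A minimal numpy sketch of this document-similarity step, with a toy TF-IDF matrix standing in for the real one:

    import numpy as np

    # A[i, j] = TF-IDF value of word j in document i (toy numbers: m = 3 docs, n = 4 words)
    A = np.array([[0.9, 0.8, 0.0, 0.1],
                  [0.7, 0.9, 0.1, 0.0],
                  [0.0, 0.1, 0.8, 0.9]])

    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2
    U_k = U[:, :k]                     # row i = semantic vector of document i

    def cos_sim(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    print(cos_sim(U_k[0], U_k[1]))     # documents 0 and 1 share a theme -> close to 1
    print(cos_sim(U_k[0], U_k[2]))     # different themes -> much lower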

Reference

     1. By Bkkbrad, own work, GFDL: https://commons.wikimedia.org/w/index.php?curid=3610403 (topic model scheme figure)