Difference between revisions of "Team:USTC-Software/Model"

  
 
<div id="team-content" style="border: solid;border-width: 1px;border-color: #e1e1e1; background-color: white;border-radius: 2px;">
 
<div id="team-content" style="border: solid;border-width: 1px;border-color: #e1e1e1; background-color: white;border-radius: 2px;">
     <h1 style="text-align: center;">Implementation</h1>
     <h1 style="text-align: center;">Model</h1>
 
     <div id="item1" class="item one">
 
     <div id="item1" class="item one">
         <h2 style="text-align: center">Server Part</h2>
<p>The core goal of Biohub 2.0 is to help biologists obtain the best-matched, highest-quality Biobricks more efficiently. Searching for the items that best match a set of query conditions is a task with mature solutions, one that most search engines handle well; in our project we chose ElasticSearch to accomplish it. The only remaining problem is: how do we order the matched Biobricks by quality?</p>
        <div style="padding: 10px 5px 10px 5px;margin-top: 20px">
<p>The first step is to define "high quality". Technically, a Biobrick can be described by multiple properties, including objective ones (part sequence, part status, etc.) and subjective ones (number of stars, rating scores, etc.). Objective properties are usually connected with the physical characteristics of the part, so they genuinely reflect its quality, but some of them are difficult to quantify (for example, the part sequence is direct evidence of whether a part is good or not, but that conclusion can only be drawn after experiments are done). Subjective properties, on the contrary, are easy to quantify, but there is no guarantee of their authenticity. Thus, we combine both kinds to evaluate the quality of a part. Based on early investigations, we define a part as "high-quality" if:</p>
            <p>The server of Biohub 2.0 is written in Python 3.6, an easy-to-write, cross-platform language that lowers the learning curve for developing plugins. We chose Django as the basic framework, partly because it allows fast prototyping and large-scale deployment, and partly because it already contains a mature plugin system (the Django app framework). However, that plugin system is static, meaning plugins cannot be loaded or unloaded at runtime, which would be a serious deficiency for a website that may frequently swap its components. To solve this problem, we built our own plugin system on top of Django's. We carefully monkey-patched Django's underlying implementation to make its core components (URL resolving, data model registration, etc.) dynamically changeable, whilst avoiding possible memory leaks. We also supplemented Django with many new features:</p>
<p>It is in good condition (has available samples, satisfies most of the RFCs, etc.).<br></p>
            <ul>
<p>It is welcomed by the community.<br></p>
                <li><strong>Hot Reload:</strong> Other processes can send the <code>SIGUSR1</code> signal to Biohub 2.0's worker processes to inform them that the installed plugin list has changed and the server needs to be reloaded. In this way the server can be renewed without being stopped.</li>
<p>The next step is to find an appropriate method to measure the quality. The method must be:</p>
                <li><strong>WebSocket Routing:</strong> WebSocket is a protocol providing full-duplex communication channels over a single TCP connection, and it is supported by most modern browsers. Biohub 2.0 implements a customized protocol on top of WebSocket, making it easier to route WebSocket packets. We use this protocol for real-time notification delivery and for reporting completion events in the ABACUS plugin.</li>
<p><strong>fast enough</strong>. The evaluation process should not consume too many computational resources.</p>
                <li><strong>Background Tasks:</strong> On a website designed for synthetic biology there are bound to be many computation-intensive tasks, which will block requests if handled inappropriately. Thus we split this logic out and encapsulate it as an independent module. Background tasks can use WebSocket to send notifications about certain events. Currently we use this module for ABACUS computation.</li>
<p><strong>extensible</strong>. Adding new factors to the evaluation process should be easy.</p>
            </ul>
<p>In our project, we use the data downloaded from <a href="http://parts.igem.org/partsdb/download.cgi?type=parts_sql">the official interface</a> as the main data source, supplemented by meta information crawled from the official iGEM website and daily data generated by the Biohub Forum. The data set contains tens of properties describing each Biobrick, but we only use a few of them, since not all are relevant to quality. Based on the assumptions above, we select the following fields:</p>
            <p>These modules are all available to plugins, so developers can take advantage of them to build amazing applications.</p>
<p><code>part_status</code> (from main data source)</p>
            <div class="subtitle">About the Bricks Data</div>
<p><code>sample_status</code> (from main data source)</p>
            <p>Biohub 2.0 is built on top of Biobricks data, which comes through two channels. We use the data downloaded from the <a href="http://parts.igem.org/partsdb/download.cgi?type=parts_sql">official interface</a> as our initial data. The initial data is in effect a snapshot of iGEM, containing most of the information about the bricks. However, certain fields, such as group names or parameters, are missing from it, so we complement it by crawling iGEM's web pages. This crawling is "lazy": it is not invoked at deploy time, but when a specific brick's data is accessed for the first time. Fetched data is cached in the database and refreshed every 10 days to stay current.</p>
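<p>The lazy fetch-and-cache logic can be sketched as below. The function names and in-memory dictionary are illustrative; in Biohub the cache is the database, and the <code>crawl</code> callback stands in for the crawler that fetches a part's page:</p>

```python
import time

TTL = 10 * 24 * 3600          # cached pages are refreshed every 10 days
_cache = {}                   # part name -> (fetched_at, data)

def get_part_meta(name, crawl, now=None):
    """Return crawled meta info for a part, crawling only on first
    access or after the cached copy has expired."""
    now = time.time() if now is None else now
    hit = _cache.get(name)
    if hit is not None and now - hit[0] < TTL:
        return hit[1]         # fresh enough: no crawling needed
    data = crawl(name)        # first access (or stale copy): crawl the page
    _cache[name] = (now, data)
    return data
```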
<p><code>works</code> (from main data source)</p>
            <div class="subtitle">About Data Organizing</div>
<p><code>uses</code> (from main data source)</p>
            <p>The initial data is imported into a separate database (named <code>igem</code> by default). Biohub links to it by creating virtual tables with MySQL's database-view mechanism. Using database views may slow querying down slightly, but it provides more flexibility for data upgrades: if newer initial data is available, we can upgrade simply by reloading the <code>igem</code> database, rather than dropping and recreating tables in the production environment. It also prevents data redundancy: if multiple instances with different database configurations are deployed on the same machine, only one copy of the initial data needs to exist, saving disk space.</p>
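<p>A toy illustration of the view-based linking. Production Biohub uses MySQL views over the separate <code>igem</code> database; SQLite and the table names below are stand-ins so the sketch is self-contained:</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in for the separate `igem` database holding the initial data.
conn.execute("CREATE TABLE igem_parts (part_name TEXT, sequence TEXT)")
conn.execute("INSERT INTO igem_parts VALUES ('BBa_B0034', 'aaagaggagaaa')")
# Biohub's "table" is just a view over the initial data, so reloading the
# igem data upgrades Biohub without dropping and recreating its tables.
conn.execute("CREATE VIEW biohub_parts AS SELECT * FROM igem_parts")
rows = conn.execute("SELECT part_name FROM biohub_parts").fetchall()
```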
<p><code>has_barcode</code> (from main data source)</p>
            <div class="subtitle">About Bricks Ranking</div>
<p><code>favorite</code> (from main data source)</p>
            <p>Before ranking, we have to filter the initial data, since many apparently useless bricks exist on iGEM. This process runs before deployment: we simply drop the bricks without a DNA sequence and dump the filtered data into a new table (named <code>igem.parts_filtered</code> by default). At the same time we pre-process some fields, for example extracting subpart information from the edit cache and pre-calculating some components of the ranking weight. You may refer to <a href="https://github.com/igemsoftware2017/USTC-Software-2017/blob/master/biohub/biobrick/bin/updateparts.py">updateparts.py</a> to see this process.</p>
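<p>The core of the filtering step amounts to one <code>CREATE TABLE ... AS SELECT</code> statement. A self-contained sketch with SQLite and simplified columns (the real script runs against MySQL and handles many more fields):</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parts (part_name TEXT, sequence TEXT)")
conn.executemany("INSERT INTO parts VALUES (?, ?)", [
    ("BBa_good", "atgcatgc"),   # has a DNA sequence: kept
    ("BBa_empty", ""),          # empty sequence: dropped
    ("BBa_null", None),         # no sequence at all: dropped
])
# Dump the useful bricks into a new table, as updateparts.py does.
conn.execute("""CREATE TABLE parts_filtered AS
                SELECT * FROM parts
                WHERE sequence IS NOT NULL AND sequence <> ''""")
kept = [r[0] for r in conn.execute("SELECT part_name FROM parts_filtered")]
```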
<p><code>ac</code> (from crawling, indicating "Assembly Compatibility")</p>
            <p>Then we rank the bricks for the first time, using several statistical methods; you may refer to <a href="https://github.com/igemsoftware2017/USTC-Software-2017/blob/master/biohub/biobrick/management/commands/refreshweight.py">refreshweight.py</a> to see this process. At this stage, the bricks are ranked purely from the initial data, without any factors from the Forum.</p>
<p><code>rates</code> (from Biohub Forum, indicating the number of users who rated the part)</p>
            <p>After the server starts up, we recalculate the ranking weights every 30 minutes. The reason for not evaluating them in real time is that the task may update the whole table and become time-consuming. From then on, Biohub gradually corrects the deviation in the bricks ranking.</p>
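<p>In Biohub this recalculation runs through the background-task module; a bare-bones equivalent built on the standard library might look like the following. Only the 30-minute interval is taken from the text above; everything else is illustrative:</p>

```python
import threading

REFRESH_INTERVAL = 30 * 60    # recalculate ranking weights every 30 minutes

def schedule_refresh(refresh, interval=REFRESH_INTERVAL):
    """Run `refresh` once, then re-arm a timer so it repeats forever."""
    refresh()
    timer = threading.Timer(interval, schedule_refresh, (refresh, interval))
    timer.daemon = True       # don't block server shutdown on the timer
    timer.start()
    return timer
```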
<p><code>rate_score</code> (from Biohub Forum, indicating the average score)</p>
            <div class="subtitle">About Optimizing ABACUS</div>
<p><code>stars</code> (from Biohub Forum, indicating the number of users who starred the part)</p>
            <p>ABACUS is a plugin inherited from last year's project. It consumes a great amount of memory while executing, which risks bringing down the main server; exactly such an accident happened during the testing phase of USTC-Software 2016. To avoid it, we improved ABACUS so that it can run in two ways: locally or distributed. If no usable executable file is detected, ABACUS connects to remote slave servers and distributes the computation tasks to them. This design greatly reduces the load on the master server and makes ABACUS horizontally scalable.</p>
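<p>The local-versus-distributed decision can be sketched as follows. The binary name and the <code>dispatch_remote</code> callback are hypothetical placeholders for illustration, not ABACUS's actual interface:</p>

```python
import shutil
import subprocess

def run_abacus(args, slaves, dispatch_remote, binary="abacus"):
    """Run ABACUS locally when an executable is available; otherwise
    distribute the task to a remote slave server."""
    exe = shutil.which(binary)
    if exe is not None:
        # Local mode: spawn the executable directly.
        return subprocess.run([exe] + list(args),
                              stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    # Distributed mode: no local binary, so hand the job to a slave.
    return dispatch_remote(slaves[0], args)
```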
<p><code>watches</code> (from Biohub Forum, indicating the number of users who watched the part)</p>
        </div>
<p>Now a Biobrick can be transformed into a vector <code>v</code>, each component of which holds the value of one property. To meet the two expectations above, we model the measurement as <code>Q(v) = Σ<sub>i</sub> w[i](v) · F[i](v[i])</code>,</p>
    </div>
<p>where the <code>F[i](x)</code> are functions mapping each property into the interval <code>[0,1]</code> (called <strong>mappers</strong>) and the <code>w[i](v)</code> are preset weights balancing the components (called <strong>balancers</strong>).</p>
    <div id="item2" class="item two">
<p>Mappers simply normalize the properties into bounded real numbers. Mappers for properties of different types differ considerably; you may refer to the source code listed at the bottom of this page for the detailed implementation of each mapper.</p>
        <h2 style="text-align: center">Frontend Part</h2>
<p>You may notice that balancers are designed as functions instead of constants. This is because some fields have inner connections, and with this design it is easy to control whether a field is skipped, simply by letting its balancer return zero. Take <code>uses</code> as an example. <code>uses</code> is one of the properties describing the popularity of a part, and it is weighted highly in our system. However, during our research we found that for parts of certain types (such as <code>Cell</code> or <code>Measurement</code>), only a small fraction (less than 5%) have meaningful <code>uses</code> values, making their evaluated quality extremely low. To solve this, we changed the balancer of <code>uses</code> from a constant to a function depending on the type of the part, which defused the imbalance to some extent.</p>
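<p>Putting the mappers and balancers together, the measurement can be sketched in a few lines of Python. The specific mapper shapes, weights, and field subset below are invented for illustration; the real implementations are in the source files linked at the bottom of this page:</p>

```python
import math

# Illustrative mappers F[i]: squash each raw property into [0, 1].
mappers = {
    "uses": lambda x: 1.0 - math.exp(-x / 10.0),  # more uses -> nearer 1
    "rate_score": lambda x: x / 5.0,              # assume a 0-5 rating scale
}

def balancer(field, part):
    """w[i](v): may depend on the whole part, not only on one field;
    e.g. `uses` is skipped for types where it is rarely meaningful."""
    if field == "uses" and part["type"] in ("Cell", "Measurement"):
        return 0.0
    return {"uses": 0.6, "rate_score": 0.4}[field]

def quality(part):
    """Sum of w[i](v) * F[i](v[i]) over all selected fields."""
    return sum(balancer(f, part) * fn(part[f]) for f, fn in mappers.items())
```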
        <div style="padding: 10px 5px 10px 5px;margin-top: 20px">
<p>Some properties, such as those from the main data source, are relatively static compared with the others, and it would waste resources to evaluate their mappers repeatedly. Thus, we precalculate these fields and cache the values each time the main data source is updated. With the help of the MyISAM database engine and a series of extra optimizations, the quality measurement of all 39311 parts can be completed in about 2 seconds (tested on the production server).</p>
            <p>To improve the user experience, we built Biohub 2.0 as a single-page application (SPA), using the MVVM framework <a href="https://vuejs.org/">Vue.js</a>. Vue.js relies on webpack (not forcibly, but as the recommended setup) for pre-compiling, which bundles everything into a single file. This is inconvenient for a website with a plugin system, so we analyzed the code generated by webpack and developed an approach to load components dynamically. It is the theoretical basis of Biohub 2.0's plugin system; you can refer to <a href="https://github.com/USTC-Software2017-frontend/Biohub-frontend/blob/master/src/components/plugins/Plugins.vue">Plugins.vue</a> to see this mechanism.</p>
<p>Among the properties we selected, some are subjective, such as <code>watches</code>, <code>stars</code> or <code>rate_score</code>, representing feedback from Biohub Forum users. As mentioned above, such data can easily be counterfeited: a malicious user may register many accounts to rate a specific part, hoping to sharply increase or decrease its score. To discourage such fraud, we added throttles at the interfaces of the relevant actions. This does not prohibit counterfeiting entirely, but it reduces it to some extent.</p>
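<p>The throttling idea is a standard sliding-window rate limit (frameworks such as Django REST Framework ship ready-made throttle classes). The standalone sketch below only shows the principle; the limits are invented, not Biohub's actual ones:</p>

```python
import time

class ActionThrottle:
    """Allow at most `limit` actions per `window` seconds for each user."""

    def __init__(self, limit=5, window=3600):
        self.limit, self.window = limit, window
        self._history = {}            # user id -> recent action timestamps

    def allow(self, user, now=None):
        now = time.time() if now is None else now
        # Keep only the actions still inside the sliding window.
        recent = [t for t in self._history.get(user, []) if now - t < self.window]
        if len(recent) >= self.limit:
            return False              # over the limit: reject this action
        recent.append(now)
        self._history[user] = recent
        return True
```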
            <div class="subtitle">About UI</div>
            <p>We chose Bootstrap 3 as our basic UI framework. Bootstrap, developed by Twitter's team, offers many elegant components and enables rapid prototyping. On top of it, we added many customized styles and components to provide a better experience.</p>
<p>For more details of the implementation, please refer to our source code:</p>
            <p>To visualize the data, we use several data-visualization frameworks: for example, <a href="https://d3js.org/">d3.js</a> for DNA sequence display in the Forum and <a href="http://nglviewer.org/">ngl.js</a> for protein structure illustration. These frameworks turn raw data into graphics, making the content easier for users to grasp.</p>
            <div class="subtitle">About Websocket</div>
<p><a href="https://github.com/igemsoftware2017/USTC-Software-2017/blob/master/biohub/biobrick/management/commands/refreshweight.py">refreshweight.py</a></p>
            <p>On top of the customized WebSocket protocol, we encapsulated a handy library to handle WebSocket packet transfer. You may refer to it at <a href="https://github.com/USTC-Software2017-frontend/Biohub-frontend/blob/master/src/utils/websocket.js">websocket.js</a>.</p>
        </div>
<p><a href="https://github.com/igemsoftware2017/USTC-Software-2017/blob/master/biohub/biobrick/sql/igem/preprocess.sql">preprocess.sql</a></p>
<p><a href="https://github.com/igemsoftware2017/USTC-Software-2017/blob/master/biohub/biobrick/sql/weight/fetch.sql">weight/fetch.sql</a></p>
 
     </div>
 
     </div>
 
</div>
 
</div>

Revision as of 18:37, 1 November 2017
