Model

The core goal of Biohub 2.0 is to help biologists find the best-matched, highest-quality Biobricks more efficiently. Retrieving the items most similar to a set of query conditions is a well-solved problem that most search engines handle well; in our project, we chose ElasticSearch for this task. The remaining problem is: how do we order the matched Biobricks by their quality?
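Once such a quality score exists, a search engine can fold it directly into its ranking. As a minimal illustration (the index name `biobricks` and the numeric field `quality` below are hypothetical, not our actual schema), ElasticSearch can multiply text relevance by a stored quality value:

```python
# Sketch: combine full-text relevance with a precomputed quality score.
# Index and field names are illustrative, not our actual schema.
from elasticsearch import Elasticsearch

es = Elasticsearch()

def search_biobricks(keywords):
    # function_score multiplies the text-relevance score of each hit by
    # its stored quality value, so parts that are both well matched and
    # high quality rise to the top of the results.
    body = {
        "query": {
            "function_score": {
                "query": {"match": {"description": keywords}},
                "field_value_factor": {
                    "field": "quality",
                    "missing": 0.0,  # unscored parts get no boost
                },
                "boost_mode": "multiply",
            }
        }
    }
    return es.search(index="biobricks", body=body)
```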

The first step is to define "high quality". Technically, a Biobrick can be described by multiple properties, both objective ones (part sequence, part status, etc.) and subjective ones (number of stars, rating scores, etc.). Objective properties are usually tied to the physical characteristics of the part and genuinely reflect its quality, but some of them are difficult to quantify (for example, the part sequence is direct evidence of whether a part is good or not, but such a conclusion can only be drawn after experiments are done). Subjective properties, on the contrary, are easy to quantify, but there is no guarantee of their authenticity. Thus, we combine both kinds to evaluate the quality of a part. Based on early investigations, we define a part as "high-quality" if:


It is in good condition (has available samples, satisfies most of the RFCs, etc.).

It is well received by the community.


The next step is to find an appropriate method to measure the quality. The method must be:


fast enough. The evaluation process should not consume excessive computational resources.

extensible. Adding new factors to the evaluation process should be easy.


In our project, we use the data downloaded from here as the main data source, supplemented by meta information crawled from the iGEM official website and daily data generated by the Biohub Forum. The data set contains dozens of properties describing a Biobrick, but we only use a small number of them, since not all are relevant to quality. Based on the assumptions above, we select the following fields:

part_status (from main data source)

sample_status (from main data source)

works (from main data source)

uses (from main data source)

has_barcode (from main data source)

favorite (from main data source)

ac (from crawling, indicating "Accessibility Compatibility")

rates (from Biohub Forum, indicating the number of users who rated the part)

rate_score (from Biohub Forum, indicating the average rating score)

stars (from Biohub Forum, indicating the number of users who starred the part)

watches (from Biohub Forum, indicating the number of users who watched the part)

Now a Biobrick can be transformed into a vector v, each component of which represents the value of one property. To meet the two requirements above, we model the measurement as:

$$Q(\mathbf{v}) = \sum_i w_i(\mathbf{v}) \cdot F_i(v_i)$$

where the F_i(x) are functions mapping each property into the interval [0, 1] (called mappers) and the w_i(v) are preset weighting functions balancing each component (called balancers).
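In code, the measurement is just a balanced sum over the selected fields. A minimal sketch (the real implementation is in refreshweight.py, linked at the bottom of this page):

```python
# Sketch of the measurement: a part vector v is a dict of property
# values; mappers and balancers are dicts of per-field functions.

def quality(part, mappers, balancers):
    """Compute Q(v) = sum_i w_i(v) * F_i(v_i) over the selected fields."""
    return sum(
        balancers[field](part) * mappers[field](part[field])
        for field in mappers
    )
```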

Mappers simply normalize the properties into bounded real numbers. Mappers for properties of different types differ considerably. You may refer to our source code, listed at the bottom of this page, for the detailed implementation of each mapper.
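As an illustration only (the production mappers live in refreshweight.py), a count-like field can be squashed with a saturating curve, while a categorical field goes through a lookup table:

```python
# Hypothetical mappers; the labels, scores and half-saturation constant
# below are illustrative, not the values used in production.

def count_mapper(x, half=10.0):
    """Saturating map for count-like fields (uses, stars, watches):
    0 -> 0.0, `half` -> 0.5, approaching 1.0 as x grows, so one
    extremely popular part cannot dwarf every other score."""
    return x / (x + half)

SAMPLE_STATUS_SCORES = {
    "Sample in stock": 1.0,
    "Not in stock": 0.0,
}

def sample_status_mapper(status):
    # Unknown or missing statuses fall back to the lowest score.
    return SAMPLE_STATUS_SCORES.get(status, 0.0)
```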

You may notice that balancers are designed as functions instead of constants. This is because some fields have inner connections, and with such a design it is easy to skip a field entirely by setting its balancer to zero. Take uses as an example. uses is one of the properties describing the popularity of a part, and it is weighted highly in our system. However, during our research we found that for parts of certain types (such as Cell or Measurement), only a small fraction (less than 5%) have meaningful uses values, making their evaluated quality extremely low. To solve this, we changed the balancer of uses from a constant to a function of the part's type, which mitigated the imbalance to some extent.
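A sketch of such a type-dependent balancer (the type names and weights below are illustrative):

```python
# Returning a zero weight skips the field entirely, so parts of types
# where `uses` is rarely meaningful are not penalized for lacking it.

USES_WEIGHT = 2.0                       # illustrative weight
SPARSE_TYPES = {"Cell", "Measurement"}  # <5% have meaningful `uses`

def uses_balancer(part):
    if part["type"] in SPARSE_TYPES:
        return 0.0  # skip: absence of `uses` says nothing here
    return USES_WEIGHT
```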

Some properties, such as those from the main data source, are relatively static compared with the others, and it would be a waste of resources to evaluate their mappers repeatedly. Thus, we precalculate these fields and cache the values each time the main data source is updated. With the help of the MyISAM database engine and a series of extra optimizations, the quality measurements of all 39311 parts can be completed in about 2 seconds (tested on the production server).
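In sketch form (the field grouping follows the list above; the actual precalculation is done in SQL, see preprocess.sql below):

```python
# Split Q(v) into a static component, recomputed only when the main
# data source is updated, and a cheap dynamic component evaluated at
# query time. Grouping `ac` with the static fields is illustrative.

STATIC_FIELDS = ("part_status", "sample_status", "works", "uses",
                 "has_barcode", "favorite", "ac")
DYNAMIC_FIELDS = ("rates", "rate_score", "stars", "watches")

def static_score(part, mappers, balancers):
    # Run once per data-source update; the result is cached per part.
    return sum(balancers[f](part) * mappers[f](part[f])
               for f in STATIC_FIELDS)

def quality(part, cached_static, mappers, balancers):
    # Query time: cached static component plus fresh dynamic component.
    dynamic = sum(balancers[f](part) * mappers[f](part[f])
                  for f in DYNAMIC_FIELDS)
    return cached_static + dynamic
```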

Among the properties we selected, some are subjective, such as `watches`, `stars` or `rate_score`, representing feedback from Biohub Forum users. As mentioned above, such data can easily be counterfeited: a malicious user may register many accounts to rate a specific part, hoping to sharply increase or decrease its score. To discourage such fraud, we have added throttles to the interfaces of the relevant actions. This does not prevent counterfeiting entirely, but it reduces it to some extent.
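A minimal sketch of the throttling idea (the window and limit below are illustrative, not our production values):

```python
# Allow each (user, action) pair at most LIMIT calls per WINDOW seconds,
# capping how fast a batch of fake accounts can pump ratings in.
import time
from collections import defaultdict, deque

WINDOW = 3600  # seconds
LIMIT = 5      # actions per window

_history = defaultdict(deque)  # (user_id, action) -> recent timestamps

def allow(user_id, action):
    now = time.time()
    q = _history[(user_id, action)]
    while q and now - q[0] > WINDOW:
        q.popleft()        # drop events outside the window
    if len(q) >= LIMIT:
        return False       # throttled
    q.append(now)
    return True
```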


For more implementation details, please refer to our source code:

refreshweight.py

preprocess.sql

weight/fetch.sql