Team:Groningen/carlosinsert

iGEM term analysis: Seeing the big figure

INTRODUCTION

The iGEM competition brings into contact many student groups that worked on a research project for one year. Because this competition started over 10 years ago it would be interesting to investigate the changes in research ideas or lab equipment that have taken place since then. Fortunately, one of the requirements for the competition is to upload to the web a description of the project (also called the wiki). The links to each wiki are neatly organized on the official iGEM page they are a word frequency analysis feasible. Also useful visualization of the results of the analysis would allow more people to get an idea of the evolution of iGEM projects. This idea was concieved by the MIT team and was proposed as a collaboration in the official iGEM page (Colaboration number #26). We, at the Groningen team decided to collaborate with the MIT team by doing this analysis.

METHODS

Web Scraping

To conduct a word frequencty analysis, the text of each iGEM team web description must be easily accessible and organized on a table. To achive this, the iGEM team of MIT web scraped the text of each team using the official archive of wikis available in the iGEM web page (<>). They structured this data into a CSV file where the team name, supervisor name, project title, year of the project and wiki text were columns of the data table.

Data Cleaning

The original csv file could not be read from file due to some conficts in the interpretation of quoted text. The problem is that some regions of text are interpreted as part of the csv structure and commas in this region are interpreted as field separators. This problematic text belonged to the descriptions of the wiki that were the object of this analysis. We removed all lines of text that did not have 6 comma separated fields and then all that did not have an url on the team url field (Team_Info_Page). This may have removed a lot of information from some teams wikis.

Word frequency analysis

To find out how the iGEM projects have evolved over the years we first conducted a word frequency analysis per year (2008-2015) using VOSvisualizer (<>). VOSvisualizer is a software tool designed in Leiden University to visualize bibliometric networks. One of the modes of analysis is word co-ocurrence which can be done on any kind of text. We downloaded VOSviewer 1.6.5 and conducted the analysis using the command line call:

java -jar /path/to/VOSviewer.jar -corpus 2008_abstracts.txt -counting_method 2 -min_n_occurrences 2

This call makes VOS viewer do a word co-ocurrence analysis with full counting of ocurrences (counting_method), for all words that are repeated at least twice. The corpus used was a file that had all abstracts for that specific year. VOSviewer generates a co-ocurrence network based on the frequencies of the single words or larger strings (terms) that are found repeated in the corpus. VOSviewer also generates a tab-delimited “map” file that specifies the link strength, number of links and number of ocurrences for each node (terms) in this network.

Word frequency interpretation

We consolidated all map files into one table (map table) and specified in a new column the year of each entry. Therefore if a term was captured in the analysis on more than one year it would have more than one entry in the map table. We subseted from the table words that had 3 or more entries (were important on more thatn 3 years). Because the popularity of the iGEM competition is increasing, recent years have more teams than older years. Therefore, a term that ocurrs for example, 3 times per abstract may appear to increase in ocurrence if the number of ocurrences is not corrected by the number of teams per year. There fore we preformed a correction were the each ocurrence was divided by the number of abstracts of that year. After this normalization we calculated a score for each term.

Score = O_first - O_last

Where O_last, O_first is the number of nomalized ocurrences in the last or first year available for that term respectively. This score aims to capture the terms that are more frequently used in recent times (positive) and the terms that are falling in use in recent times (negative)

We manually labeled the terms with the most negative and most positive scores based on their likely origin; if they represented a part of the lab equimpent or lab techniques (lab_equipment), or an idea behind the project (idea) or if they had unspecific meaning (term). Some terms did not originate from the description but from the references section (reference) or were found in the headers, titles and subtitles (header). Some terms were frequent because they are part of the default text instructions for the wiki and they were labeled as (artifacts).

## Loading required package: RColorBrewer

RESULTS

Using web-scraping, we obtained the text of the wikis of iGEM teams that participated during the periof between 2008-2015. Some wikis scraped text showed a warning that stated “no text in this page”.

##    year Percent_of_accessible_wikis
## 1: 2008                   0.9333333
## 2: 2009                   0.9361702
## 3: 2010                   0.9367089
## 4: 2011                   0.8924731
## 5: 2012                   0.7127660
## 6: 2013                   0.7606178
## 7: 2014                   0.7828746
## 8: 2015                   0.9966102

Barplots of teams per year and average word lenght of each wiki per year after removing wikis that were not usable as described below.

The majority of the wikis have a menu which is usually the only part of the text that was captured (low word count). because most wikis have this format, this restricts the ability of this analysis to detect representative co-ocurrence word trends. We defined as usable wiki texts that did not contain the template text or provided a warning of no text. This was done so that the analysis could at least detect changes in the short descriptions that sometimes appear in this section and of the headers that are used in the teams across the years.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Histogram of word counts per wiki, wikis with no content as described above are the ones that were removed from analysis. These wikis have the template text or are otherwise not accessible. Note that still a lot of wikis have few words, this is because they contain a menu bar with a short description of the project which is the only text that was scraped for them.

Using VOSviewer we conducted a word co-ocurrence analysis for the usable wiki texts. We consolidated the results of these analysis in one table and created a score to allow the comparison of the per-year analysis results. Using these scores we estimated which terms appear more frequently on recent years and which terms have fallen in use. We manually classified these terms into distict categories for description.

General Terms:

Wordcloud of terms that have not a particular meaning. In red terms that are falling in use and in green terms that are increasing in use. Color intensity is proportional to the magnitude of increase in ocurrence.

plot_term(ranked_table, "term")

##                   label           V1 type
##  1:             picture -0.103482587 term
##  2:             chimera -0.077935727 term
##  3:                core -0.074642464 term
##  4:         instruction -0.072160149 term
##  5:            identity -0.066180827 term
##  6:             comment -0.057370737 term
##  7:              suffix -0.052004582 term
##  8:            effector -0.044196151 term
##  9:               class -0.041309131 term
## 10:              server -0.035285430 term
## 11:             channel -0.029767828 term
## 12:          preference  0.009507446 term
## 13:     promoter region  0.023331618 term
## 14:         target gene  0.023331618 term
## 15:            building  0.023995734 term
## 16:               child  0.027341439 term
## 17:    room temperature  0.027887789 term
## 18:               trend  0.030126002 term
## 19: signal transduction  0.032716882 term
## 20:      essential gene  0.032877416 term
## 21:                note  0.033281867 term
## 22:                name  0.034879725 term
## 23:      identification  0.039199255 term
## 24:    positive control  0.050434513 term
## 25:         supernatant  0.063455386 term
## 26:             student  0.065406644 term
## 27:            template  0.065856063 term
## 28:              liquid  0.071983561 term
## 29:                 fig  0.099595454 term
## 30:         cooperation  0.100620732 term
## 31:            dilution  0.105257969 term
## 32:          absorption  0.155942978 term
## 33:            choosing  0.176512651 term
## 34:              member  0.221177118 term
## 35:                 min  0.229298572 term
## 36:              please  0.282752902 term
## 37:                fig2  0.352017381 term
##                   label           V1 type

In these wordclouds we can observe that images on the wiki were called “picture” in the past and are now called (“fig” and “fig2”). The word “identity” is now used instead in athe form “identification”. We also see that words like (“core” and “effector”) have fallen in use, which may reflect that the engineered systems do not revolve around a single switch and are now composed of many parts.

We see that in more recent wikis the description of the steps has increased because of the increase in use of terms like (“min”, “absorption”,“diluton”,“supernatant”,“positive control”). We also see that the team nature of the project is now more evident on the description due to the increase in terms like (“member”,“student”,“cooperation”). Of note, not all ocurrences of “member” refer people, but sometimes to molecules. The word “class” is used where today we would use “type”, strangely we did not observe a corresponding increase in the use of this last word.

Headers:

Wordcloud of terms found on headers. In red terms that are falling in use and in green terms that are increasing in use. Color intensity is proportional to the magnitude of increase in ocurrence.

plot_term(ranked_table, "header")

## Warning in wordcloud(words = table$label, freq = table$V1, min.freq = 0, :
## project description could not be fit on page. It will not be plotted.

##                   label          V1   type
##  1:            abstract -0.04674724 header
##  2:        lab notebook -0.03098511 header
##  3:             wet lab -0.02992061 header
##  4:        presentation  0.02799503 header
##  5:               video  0.05037244 header
##  6:          background  0.06541059 header
##  7:      human practice  0.10108820 header
##  8:          motivation  0.11092917 header
##  9:              safety  0.13195630 header
## 10:            overview  0.25757294 header
## 11:             summary  0.26840472 header
## 12:         achievement  0.44037758 header
## 13:          conclusion  0.54873998 header
## 14:         attribution  0.61334210 header
## 15:       collaboration  0.64255121 header
## 16:            notebook  0.81083178 header
## 17: project description  0.93870267 header
## 18:                team  1.04760679 header
## 19:           reference  1.05142768 header

In this explaratory analysis we could also observe that in the past Wikis were composed mainly of different sections than today. We observed a descrease in the use of the headers (“abstract”,“lab notebook”, “wet lab”). This may be because iGEM standardized the structure of the Wikis at some point during this period. Now, wikis contain sections like (“summary”,“overview”,“notebook”,“collaboration”), that belong tot his new structure. Of note because of this changes, we observed a lot of new terms appear because of the presence of references in the wikis.

Lab materials

Wordcloud of terms that describe materials and lab techniques. In red terms that are falling in use and in green terms that are increasing in use. Color intensity is proportional to the magnitude of increase in ocurrence.

plot_term(ranked_table, "lab_material")

##               label          V1         type
##  1:            tube -0.09696846 lab_material
##  2:             cfp -0.06369999 lab_material
##  3:         ethanol -0.04845361 lab_material
##  4:            ptet -0.03882682 lab_material
##  5:    nitric oxide  0.02458822 lab_material
##  6:            srna  0.03134258 lab_material
##  7:         agarose  0.03319848 lab_material
##  8:   photoreceptor  0.03358163 lab_material
##  9:        miniprep  0.03361266 lab_material
## 10:    dna fragment  0.03366501 lab_material
## 11:             egg  0.03851462 lab_material
## 12:      well plate  0.03862454 lab_material
## 13:            ompr  0.03905397 lab_material
## 14:      water bath  0.03916822 lab_material
## 15:         alcohol  0.04478585 lab_material
## 16:         plastic  0.04833906 lab_material
## 17:        sds page  0.05034140 lab_material
## 18:       red light  0.05203520 lab_material
## 19:          insert  0.05429553 lab_material
## 20: electrophoresis  0.06556030 lab_material
## 21:            heme  0.06722533 lab_material
## 22:            psti  0.09506518 lab_material
## 23:           ecori  0.11747362 lab_material
## 24:             ice  0.12801002 lab_material
## 25:           mirna  0.13877752 lab_material
## 26:          buffer  0.17349472 lab_material
## 27:           flask  0.20145872 lab_material
## 28:          column  0.26294227 lab_material
##               label          V1         type

Terms reflecting the lab materials show the evident switch of iGEM teams to genome engineering. We can see a clear increase in the use of restriction enzymes, electrophoresis and DNA purification columns, miRNA and sRNA. This change may also explain why “tube” is now less used. Another explanation is that a “tube” has been replaced by more high throughput containers like “well plate”. We also observe that “ethanol” is now mentioned as “alcohol”. We also observe an increase in use of fluerescence techniques (“red ligth”). Surprisingly we did not pick up an increase in the use of RFP.

References

Wordcloud of terms that originate from the reference section. In green terms that have increased in use. Color intensity is proportional to the magnitude of increase in ocurrence.

plot_wordcloud(ranked_table[type=="references"], "Greens")

ranked_table[type=="references"]
##                    label         V1       type
##  1:         microbiology 0.03401614 references
##  2:               august 0.03916822 references
##  3:                 chem 0.03935444 references
##  4:                 chen 0.04405060 references
##  5:              science 0.04405941 references
##  6:              america 0.04421113 references
##  7:                  liu 0.04979773 references
##  8:     national academy 0.04979773 references
##  9:                 wang 0.04995826 references
## 10: biological chemistry 0.05592800 references
## 11:                zhang 0.05841031 references
## 12:                  web 0.06615182 references
## 13:                  vol 0.06753569 references
## 14:                 june 0.07268777 references
## 15:         biochemistry 0.07855369 references
## 16:              journal 0.08951890 references
## 17:               center 0.16769088 references

Terms that are found in the references reflect the origin of the research literature of iGEM teams as can be observed in the terms like (“biochemistry”,“microbiology”,“biological chemistry”,“chem”) that originate from article titles and journal names. The most common journals are also recognizable (“science”, “national academy”). Unexpectedly, because atuomatic bibliographic tools record the date of citation we can observe that most teams conduct the research for the wiki during June and August.

LIMITATIONS

The limitations of this analysis is the filters that VOSviewer may have that prevent some words like “type” from being considered relevant. Also this analysis can pick up only words that are consistently used in the same way. This is because we did not use a dictionary of synonyms for the analysis with VOSviewer which can increase the resolution of the results. This may explain why we do not pick up commonly used fluorescence proteins like RFP but we do pick up restriction enzymes like ecoRI.

Another limitation is that the web scraping could not access the text of wikies that were hidden behind a network of links or that were not stored as plain text. Because most of the wikies have this format this severely restricts the analysis. In a future analysis can improve on this step.