Difference between revisions of "Team:Groningen/Collaborations"

General Terms:

Word cloud of terms that have no particular meaning. Terms that are falling in use are shown in red and terms that are increasing in use are shown in green. Color intensity is proportional to the magnitude of the change in occurrence.
  
 
plot_term(ranked_table, "term")
 
 
In these word clouds we can observe that images on the wiki were called “picture” in the past and are now called “fig” and “fig2”. The word “identity” is now used in the form “identification” instead. We also see that words like “core” and “effector” have fallen in use, which may reflect that engineered systems no longer revolve around a single switch and are now composed of many parts.

We see that more recent wikis describe the experimental steps in more detail, as shown by the increased use of terms like “min”, “absorption”, “dilution”, “supernatant” and “positive control”. The team nature of the project is also more evident in the descriptions, due to the increase in terms like “member”, “student” and “cooperation”. Of note, not all occurrences of “member” refer to people; some refer to molecules. The word “class” was used where today we would use “type”; strangely, we did not observe a corresponding increase in the use of the latter.
  
 
Headers:

Word cloud of terms found in headers. Terms that are falling in use are shown in red and terms that are increasing in use are shown in green. Color intensity is proportional to the magnitude of the change in occurrence.
  
 
plot_term(ranked_table, "header")
 
 
## 18:                team  1.04760679 header
## 19:          reference  1.05142768 header

In this exploratory analysis, we could also observe that in the past wikis were composed mainly of different sections than today. We observed a decrease in the use of the headers “abstract”, “lab notebook” and “wet lab”. This may be because iGEM standardized the structure of the wikis at some point during this period. Wikis now contain sections like “summary”, “overview”, “notebook” and “collaboration”, which belong to this new structure. Of note, along with these changes, many new terms appear because of the presence of references in the wikis.
  
 
Lab materials

Word cloud of terms that describe materials and lab techniques. Terms that are falling in use are shown in red and terms that are increasing in use are shown in green. Color intensity is proportional to the magnitude of the change in occurrence.
  
 
plot_term(ranked_table, "lab_material")
 
 
References

Word cloud of terms that originate from the reference section. Terms that have increased in use are shown in green. Color intensity is proportional to the magnitude of the increase in occurrence.
  
 
plot_wordcloud(ranked_table[type=="references"], "Greens")
 
 
LIMITATIONS

One limitation of this analysis is the filtering that VOSviewer may apply, which prevents some words like “type” from being considered relevant. Also, this analysis can only pick up words that are consistently used in the same way, because we did not use a dictionary of synonyms for the analysis with VOSviewer, which could increase the resolution of the results. This may explain why we do not pick up commonly used fluorescent proteins like RFP but we do pick up restriction enzymes like EcoRI.

Another limitation is that the web scraping could not access the text of wikis that were hidden behind a network of links or that were not stored as plain text. Because most of the wikis have this format, this severely restricts the analysis. A future analysis can improve on this step.

Revision as of 13:57, 17 October 2017


Collaborations

Science is rarely a one (wo)man job and usually requires people with different backgrounds to work together to solve the challenges encountered. iGEM is no different. To embrace this spirit, the iGEM 2017 Groningen team strives to work together with multiple teams on different aspects of our project and hopefully further strengthen connections.

  1. To further develop and establish Lactococcus lactis as a chassis in iGEM, we collaborated with the iGEM team of Sao Paulo. We sent them protocols, since we have a lot of in-house experience working with L. lactis.
  2. To help the Nottingham team, we tested their E. coli RFP fluorescence in our lab to provide an external control.
  3. NAWI-Graz: Our friends from Austria are developing a bioelectronic interface controlled by bacterial GFP expression. They have developed software to validate part of their experiments. iGEM Groningen has worked together with them to design mazes, and thereby to improve the functionality and identify flaws in the design.

  1. Virtual meet-up with Vilnius-Lithuania and Abu Dhabi (with follow-up discussion). We presented our project designs to each other and critically debated their feasibility as well as the implementation of the final product. This helped us gain insight into possible experimental flaws. Abu Dhabi is also working on designing a cartridge, so their engineering advice was appreciated and contributed to improving our cartridge. We held a follow-up discussion to update each other on the progress achieved and get advice on the challenges met. We had a few issues with cloning and got some help that enabled us to get our construct from team Vilnius. We hope our input also improved their design.
  2. Virtual meet-up on ethics with Oslo, Graz, Zurich, Lund and Uppsala. Team Uppsala moderated a discussion together with Oslo, Graz, Zurich, and Lund about the ethical implications of our projects. We tried to respond to the following questions:
    1. Uncontrolled release. Can we anticipate how our Genetically Engineered Machine (GEM) would behave if released? What would be ideal conditions for it to grow, and could they potentially be met? Can we anticipate any interactions with any form of wildlife?
    2. Misuse. Can we think some steps ahead and imagine a potentially harmful use of our open-source GEM?
    The whole conversation was live streamed and can be found here
  3. BENELUX meetup
    1. The BENELUX teams were invited to meet, present their ideas and receive critical feedback from other teams, experts in the field and an iGEM HQ representative. Teams also participated in workshops to immerse themselves in the stakeholders' perspectives and debate safety issues. The meeting was hosted by the Wageningen team.
  4. EUROPEAN meetup. The European iGEM meet-up for the Netherlands was held in Delft this year. The meet-up started with a talk by Cees Dekker, a well-known physicist. It was quite interesting to hear about the common ground between physics and biology in his talk. After the break we had a talk from Denis Murphy, who is highly involved in palm oil. Palm oil is used a lot for cosmetics in richer countries, and for sustenance in poorer countries. He expanded on a specific application of genetic engineering for making palm oil plantations sustainable. The main event of the day was of course the poster presentations of all the iGEM teams themselves. We walked around a lot, talking to pretty much every team at least once. It was very nice to see all the Dutch teams again after we had met them during the Dutch iGEM meet-up in Wageningen. During all this we also handed out our 3D-printed phages to the teams. The last part of the day was a BBQ with some drinks. We left together with some of the other teams and talked more about how our respective projects were going.
  5. Tolerance Photo Challenge – Technion – Israel. To highlight the diversity and tolerance in our team, we participated in the Tolerance photo challenge in conjunction with Technion, Israel (photo)
  6. Little snazzy man (Flat Stanley) – Caroll High school. Caroll High school got inspired for their collaboration by the book “Flat Stanley” by Jeff Brown. A bulletin board falls on Stanley; he survives but is now flat. His altered form comes with some perks though, such as being able to slip under doors or being mailed to California to meet his friends. Caroll HS also made a flat Stanley and mailed it to us, this time in the form of a microbe. We welcomed it to our lab and made it an honorary team member.
  7. Online meet-up about fermentation factories – SCUT-FSE- China

  1. We developed our own card game and shared it with Team Franconia, and in return also evaluated their card game. We will also assist them with the QR stand game. (58., 15.)
  2. We were curious about the history of iGEM and have collaborated with the MIT team to visualize the development of tracks, teams, topics etc. MIT has provided us with a file stating these and Groningen has analyzed and visualized the content. This provides insight into how the competition has changed over the years, with information about shifting interests and emerging trends and technologies.
  3. To further strengthen the bond between different iGEM teams, we also sent and received postcards from different teams, an exchange initiated by team Düsseldorf. The postcards should quickly highlight the main idea of the respective projects. We received a vast variety of postcard designs and had a lot of fun reading through the other teams’ projects and designs. (1.)

We also participated in the surveys of the following teams to help them gather some data on their projects' problems:
  1. Microfluidics survey – Boston University
  2. Survey about Cholera – INSA-UPS France
  3. Methane production - Nebraska- Lincoln
  4. Air pollution – Pasteur Paris
  5. Biological Material transport survey – Team Amazonas, Brazil
  6. Health care & liver cancer – Team Brit
  7. Genetic engineering & medicine – Team Cardiff
  8. Tell us about your chassis
  9. Insulin accessibility – Sydney Australia
  10. GMO perception study - Sup’Biotech, Paris, France
  11. Heavy metal toxicity – DEI Agra, India
  12. Antibiotically resistant bacteria - UNBC- Canada
  13. Diabetics & Psicose - Evry, France
  14. Perspectives on Treatments for Illnesses Survey -Columbia University
  15. Lead Contamination in YOUR Drinking Water? - Team WPI Worcester
  16. Survey on Colorectal Cancer - Team Worldshaper-Wuhan
  17. CRISPR along the iGEM - Team Amazonas Brazil
  18. Directed Evolution and Artificial Intelligence Survey - Team Heidelberg

iGEM term analysis: Seeing the big figure

Carlos Urzua, in collaboration with the iGEM team Groningen and iGEM team MIT

September 27, 2017

INTRODUCTION

The iGEM competition brings together many student groups that each work on a research project for one year. Because this competition started over 10 years ago, it would be interesting to investigate the changes in research ideas or lab equipment that have taken place since then. Fortunately, one of the requirements for the competition is to upload to the web a description of the project (also called the wiki). Because the links to each wiki are neatly organized on the official iGEM page, a word frequency analysis is feasible. A useful visualization of the results would also allow more people to get an idea of the evolution of iGEM projects. This idea was conceived by the MIT team and was proposed as a collaboration on the official iGEM page (Collaboration number #26). We, at the Groningen team, decided to collaborate with the MIT team by doing this analysis.

METHODS

Web Scraping

To conduct a word frequency analysis, the text of each iGEM team's web description must be easily accessible and organized in a table. To achieve this, the iGEM team of MIT web-scraped the text of each team using the official archive of wikis available on the iGEM web page. They structured this data into a CSV file in which the team name, supervisor name, project title, year of the project and wiki text were columns of the data table.

Data Cleaning

The original CSV file could not be read directly because of conflicts in the interpretation of quoted text. Some regions of text were interpreted as part of the CSV structure, so commas in these regions were interpreted as field separators. This problematic text belonged to the wiki descriptions that were the object of this analysis. We removed all lines of text that did not have 6 comma-separated fields, and then all lines that did not have a URL in the team URL field (Team_Info_Page). This may have removed a lot of information from some teams' wikis.
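The two filtering rules above can be sketched as follows. This is a minimal Python illustration, not the script that was actually used (the original analysis appears to have been done in R). The function name `filter_rows`, the position of the Team_Info_Page field (`url_index=1`) and the sample rows are assumptions made only for this example.

```python
import re

# A field counts as a URL if it is a single http(s) address (assumed rule).
URL_PATTERN = re.compile(r"^https?://\S+$")

def filter_rows(lines, n_fields=6, url_index=1):
    """Keep lines with exactly n_fields comma-separated fields
    and a URL in the field at url_index (hypothetical position)."""
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split(",")
        if len(fields) != n_fields:
            continue  # stray commas inside the wiki text broke the row
        if not URL_PATTERN.match(fields[url_index].strip()):
            continue  # the team URL field does not hold a URL
        kept.append(fields)
    return kept

# Made-up rows, for illustration only.
sample = [
    "TeamA,http://example.org/TeamA,Supervisor,Title,2008,short text",
    "TeamB,http://example.org/TeamB,Supervisor,Title,2009,text, with a stray comma",
    "TeamC,not a url,Supervisor,Title,2010,short text",
]
cleaned = filter_rows(sample)
```

Only the first sample row survives: the second has an extra comma inside its text and the third has no URL. This also shows why the filter can discard whole wikis whose text happens to contain commas.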

Word frequency analysis

To find out how the iGEM projects have evolved over the years, we first conducted a word frequency analysis per year (2008-2015) using VOSviewer. VOSviewer is a software tool developed at Leiden University to visualize bibliometric networks. One of its modes of analysis is word co-occurrence, which can be done on any kind of text. We downloaded VOSviewer 1.6.5 and conducted the analysis using the command line call:

java -jar /path/to/VOSviewer.jar -corpus 2008_abstracts.txt -counting_method 2 -min_n_occurrences 2

This call makes VOSviewer perform a word co-occurrence analysis with full counting of occurrences (counting_method) for all terms that occur at least twice (min_n_occurrences). The corpus was a file containing all abstracts for that specific year. VOSviewer generates a co-occurrence network based on the frequencies of the single words or larger strings (terms) that are found repeated in the corpus. VOSviewer also generates a tab-delimited “map” file that specifies the link strength, number of links and number of occurrences for each node (term) in this network.
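Since the analysis was run once per year, the call can be generated for every year of the study period. The Python sketch below only builds the argument lists and does not run Java. It reuses the flags exactly as they appear in the command above. The function name and the assumption that every corpus file follows the `2008_abstracts.txt` naming pattern are ours.

```python
def build_vosviewer_command(year, jar_path="/path/to/VOSviewer.jar"):
    """Return the argument list for one per-year VOSviewer run.
    The corpus file name pattern is assumed from the 2008 example."""
    return [
        "java", "-jar", jar_path,
        "-corpus", f"{year}_abstracts.txt",
        "-counting_method", "2",
        "-min_n_occurrences", "2",
    ]

# One command per year in the 2008-2015 study period.
commands = [build_vosviewer_command(year) for year in range(2008, 2016)]
```

Each list could then be passed to `subprocess.run` on a machine where Java and the VOSviewer jar are available.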

Word frequency interpretation

We consolidated all map files into one table (map table) and specified the year of each entry in a new column. Therefore, if a term was captured in the analysis of more than one year, it has more than one entry in the map table. We subsetted from the table the terms that had 3 or more entries (that is, terms that were captured in at least 3 years). Because the popularity of the iGEM competition is increasing, recent years have more teams than older years. Therefore, a term that occurs, for example, 3 times per abstract may appear to increase in occurrence if the number of occurrences is not corrected for the number of teams per year. We therefore performed a correction in which each occurrence count was divided by the number of abstracts of that year. After this normalization, we calculated a score for each term.
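The consolidation, subsetting and normalization steps can be illustrated with a small Python sketch. The map table is represented here as (term, year, occurrences) tuples. That layout, the function name and the toy numbers are assumptions for illustration and are not taken from the real data.

```python
from collections import defaultdict

def normalize_and_subset(map_table, abstracts_per_year, min_entries=3):
    """Divide each occurrence count by the number of abstracts of its year,
    then keep only terms captured in at least min_entries years."""
    by_term = defaultdict(dict)
    for term, year, occurrences in map_table:
        by_term[term][year] = occurrences / abstracts_per_year[year]
    return {term: years for term, years in by_term.items()
            if len(years) >= min_entries}

# Toy numbers, for illustration only.
map_table = [
    ("plasmid", 2008, 10), ("plasmid", 2009, 30), ("plasmid", 2010, 40),
    ("biosensor", 2009, 6),
]
abstracts_per_year = {2008: 10, 2009: 20, 2010: 40}
normalized = normalize_and_subset(map_table, abstracts_per_year)
```

In this toy case the raw count of "plasmid" rises from 10 to 40, but per abstract it is 1.0 in both 2008 and 2010. This is the spurious increase that the correction is meant to remove. "biosensor" is dropped because it was captured in only one year.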

Score = O_last - O_first

where O_first and O_last are the numbers of normalized occurrences in the first and last year available for that term, respectively. This score aims to capture the terms that are more frequently used in recent times (positive score) and the terms that are falling in use in recent times (negative score).
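A minimal Python sketch of the score follows the sign convention described in the text: positive for terms that became more frequent, so the first-year value is subtracted from the last-year value. The function name and the example values are hypothetical.

```python
def term_score(normalized_by_year):
    """Normalized occurrences in the last available year
    minus those in the first available year."""
    first_year = min(normalized_by_year)
    last_year = max(normalized_by_year)
    return normalized_by_year[last_year] - normalized_by_year[first_year]

# Hypothetical normalized occurrences per year for two terms.
rising = {2008: 0.5, 2012: 1.0, 2015: 2.0}
falling = {2009: 1.25, 2014: 0.25}

rising_score = term_score(rising)    # 2.0 - 0.5
falling_score = term_score(falling)  # 0.25 - 1.25
```

Because only the first and last available years enter the score, intermediate years such as 2012 in the first example have no influence on it.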

We manually labeled the terms with the most negative and most positive scores based on their likely origin: whether they represented lab equipment or lab techniques (lab_equipment), an idea behind the project (idea), or had no specific meaning (term). Some terms did not originate from the description but from the references section (reference), or were found in the headers, titles and subtitles (header). Some terms were frequent because they are part of the default text instructions for the wiki; these were labeled as artifacts (artifacts).


RESULTS

Using web scraping, we obtained the text of the wikis of iGEM teams that participated from 2008 to 2015. For some wikis, the scraped text consisted of a warning that stated “no text in this page”. The table below lists, for each year, the fraction of wikis whose text was accessible.

       year  Percent_of_accessible_wikis
    1: 2008  0.9333333
    2: 2009  0.9361702
    3: 2010  0.9367089
    4: 2011  0.8924731
    5: 2012  0.7127660
    6: 2013  0.7606178
    7: 2014  0.7828746
    8: 2015  0.9966102


Barplots of teams per year and average word count per wiki per year, after removing the wikis that were not usable as described below.

The majority of the wikis have a menu, which is usually the only part of the text that was captured (low word count). Because most wikis have this format, the ability of this analysis to detect representative word co-occurrence trends is restricted. We defined a wiki text as usable if it contained neither the template text nor the warning of no text. This was done so that the analysis could at least detect changes in the short descriptions that sometimes appear in this section and in the headers that teams used across the years.
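The usability filter described above can be sketched as follows. This is a minimal Python illustration, not the code used for the analysis: `is_usable` and `usable_fraction` are hypothetical names, the template marker is an assumed stand-in for the default wiki text, and the example wikis are invented. Only the warning string is taken from the text.

```python
NO_TEXT_WARNING = "no text in this page"      # warning quoted in the text
TEMPLATE_MARKER = "this is a template page"   # assumption: stand-in for the default wiki text

def is_usable(wiki_text):
    """A wiki text is usable if it has neither the template text nor the warning."""
    text = wiki_text.lower()
    return NO_TEXT_WARNING not in text and TEMPLATE_MARKER not in text

def usable_fraction(wikis_by_year):
    """Return, for each year, the fraction of scraped wikis that are usable."""
    return {
        year: sum(is_usable(text) for text in texts) / len(texts)
        for year, texts in wikis_by_year.items()
    }

# Invented scraped texts for illustration only.
wikis_by_year = {
    2014: [
        "Home Team Project: we engineer a biosensor for arsenic",
        "There is currently no text in this page",
        "This is a template page. Read these instructions.",
    ],
    2015: ["Home Project Notebook Safety"],
}
fractions = usable_fraction(wikis_by_year)
print(fractions)
```

Note that a wiki reduced to its menu bar still counts as usable under this definition, which is why many usable wikis have a low word count.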



Histogram of word counts per wiki. Wikis with no content, as described above, were removed from the analysis; these wikis have the template text or are otherwise not accessible. Note that many wikis still have few words, because the only text that was scraped for them is a menu bar with a short description of the project.

Using VOSviewer we conducted a word co-occurrence analysis for the usable wiki texts. We consolidated the results of this analysis in one table and created a score to allow the comparison of the per-year analysis results. Using these scores we estimated which terms appear more frequently in recent years and which terms have fallen in use. We manually classified these terms into distinct categories for description.

General Terms:

Word cloud of terms that do not have a specific meaning. Terms that are falling in use are shown in red and terms that are increasing in use in green. Color intensity is proportional to the magnitude of the change in occurrence.

plot_term(ranked_table, "term")


        label                         V1  type
     1: picture             -0.103482587  term
     2: chimera             -0.077935727  term
     3: core                -0.074642464  term
     4: instruction         -0.072160149  term
     5: identity            -0.066180827  term
     6: comment             -0.057370737  term
     7: suffix              -0.052004582  term
     8: effector            -0.044196151  term
     9: class               -0.041309131  term
    10: server              -0.035285430  term
    11: channel             -0.029767828  term
    12: preference           0.009507446  term
    13: promoter region      0.023331618  term
    14: target gene          0.023331618  term
    15: building             0.023995734  term
    16: child                0.027341439  term
    17: room temperature     0.027887789  term
    18: trend                0.030126002  term
    19: signal transduction  0.032716882  term
    20: essential gene       0.032877416  term
    21: note                 0.033281867  term
    22: name                 0.034879725  term
    23: identification       0.039199255  term
    24: positive control     0.050434513  term
    25: supernatant          0.063455386  term
    26: student              0.065406644  term
    27: template             0.065856063  term
    28: liquid               0.071983561  term
    29: fig                  0.099595454  term
    30: cooperation          0.100620732  term
    31: dilution             0.105257969  term
    32: absorption           0.155942978  term
    33: choosing             0.176512651  term
    34: member               0.221177118  term
    35: min                  0.229298572  term
    36: please               0.282752902  term
    37: fig2                 0.352017381  term

In these word clouds we can observe that images on the wiki were called “picture” in the past and are now called “fig” and “fig2”. The word “identity” is now used instead in the form “identification”. We also see that words like “core” and “effector” have fallen in use, which may reflect that the engineered systems no longer revolve around a single switch and are now composed of many parts.

We see that more recent wikis describe the experimental steps in more detail, as shown by the increased use of terms like “min”, “absorption”, “dilution”, “supernatant” and “positive control”. The team nature of the project is also more evident in the descriptions, given the increase in terms like “member”, “student” and “cooperation”. Of note, not all occurrences of “member” refer to people; some refer to molecules. The word “class” was used where today we would use “type”; strangely, we did not observe a corresponding increase in the use of the latter.

Headers:

Word cloud of terms found in headers. Terms that are falling in use are shown in red and terms that are increasing in use in green. Color intensity is proportional to the magnitude of the change in occurrence.

plot_term(ranked_table, "header")


    Warning in wordcloud(words = table$label, freq = table$V1, min.freq = 0, :
      project description could not be fit on page. It will not be plotted.


        label                        V1  type
     1: abstract            -0.04674724  header
     2: lab notebook        -0.03098511  header
     3: wet lab             -0.02992061  header
     4: presentation         0.02799503  header
     5: video                0.05037244  header
     6: background           0.06541059  header
     7: human practice       0.10108820  header
     8: motivation           0.11092917  header
     9: safety               0.13195630  header
    10: overview             0.25757294  header
    11: summary              0.26840472  header
    12: achievement          0.44037758  header
    13: conclusion           0.54873998  header
    14: attribution          0.61334210  header
    15: collaboration        0.64255121  header
    16: notebook             0.81083178  header
    17: project description  0.93870267  header
    18: team                 1.04760679  header
    19: reference            1.05142768  header

In this exploratory analysis, we could also observe that in the past wikis were composed mainly of different sections than today. We observed a decrease in the use of the headers “abstract”, “lab notebook” and “wet lab”. This may be because iGEM standardized the structure of the wikis at some point during this period. Wikis now contain sections like “summary”, “overview”, “notebook” and “collaboration” that belong to this new structure. Of note, as a result of these changes, many new terms appear because of the presence of references in the wikis.

Lab materials:

Word cloud of terms that describe materials and lab techniques. Terms that are falling in use are shown in red and terms that are increasing in use in green. Color intensity is proportional to the magnitude of the change in occurrence.

plot_term(ranked_table, "lab_material")


        label                    V1  type
     1: tube            -0.09696846  lab_material
     2: cfp             -0.06369999  lab_material
     3: ethanol         -0.04845361  lab_material
     4: ptet            -0.03882682  lab_material
     5: nitric oxide     0.02458822  lab_material
     6: srna             0.03134258  lab_material
     7: agarose          0.03319848  lab_material
     8: photoreceptor    0.03358163  lab_material
     9: miniprep         0.03361266  lab_material
    10: dna fragment     0.03366501  lab_material
    11: egg              0.03851462  lab_material
    12: well plate       0.03862454  lab_material
    13: ompr             0.03905397  lab_material
    14: water bath       0.03916822  lab_material
    15: alcohol          0.04478585  lab_material
    16: plastic          0.04833906  lab_material
    17: sds page         0.05034140  lab_material
    18: red light        0.05203520  lab_material
    19: insert           0.05429553  lab_material
    20: electrophoresis  0.06556030  lab_material
    21: heme             0.06722533  lab_material
    22: psti             0.09506518  lab_material
    23: ecori            0.11747362  lab_material
    24: ice              0.12801002  lab_material
    25: mirna            0.13877752  lab_material
    26: buffer           0.17349472  lab_material
    27: flask            0.20145872  lab_material
    28: column           0.26294227  lab_material

Terms reflecting the lab materials show the evident switch of iGEM teams to genome engineering. We can see a clear increase in the use of restriction enzymes, electrophoresis, DNA purification columns, miRNA and sRNA. This change may also explain why “tube” is now used less. Another explanation is that the “tube” has been replaced by higher-throughput containers like the “well plate”. We also observe that “ethanol” is now mentioned as “alcohol”, and that the use of fluorescence techniques (“red light”) has increased. Surprisingly, we did not pick up an increase in the use of RFP.

References:

Word cloud of terms that originate from the references section. Terms that have increased in use are shown in green. Color intensity is proportional to the magnitude of the increase in occurrence.

plot_wordcloud(ranked_table[type=="references"], "Greens")


ranked_table[type=="references"]

        label                         V1  type
     1: microbiology          0.03401614  references
     2: august                0.03916822  references
     3: chem                  0.03935444  references
     4: chen                  0.04405060  references
     5: science               0.04405941  references
     6: america               0.04421113  references
     7: liu                   0.04979773  references
     8: national academy      0.04979773  references
     9: wang                  0.04995826  references
    10: biological chemistry  0.05592800  references
    11: zhang                 0.05841031  references
    12: web                   0.06615182  references
    13: vol                   0.06753569  references
    14: june                  0.07268777  references
    15: biochemistry          0.07855369  references
    16: journal               0.08951890  references
    17: center                0.16769088  references

Terms found in the references reflect the origin of the research literature used by iGEM teams, as can be observed in terms like “biochemistry”, “microbiology”, “biological chemistry” and “chem” that originate from article titles and journal names. The most common journals are also recognizable (“science”, “national academy”). Unexpectedly, because automatic bibliographic tools record the date of citation, we can also observe that most teams conduct the research for the wiki during June and August.

LIMITATIONS

One limitation of this analysis is that VOSviewer may apply filters that prevent some words, like “type”, from being considered relevant. Also, this analysis can pick up only words that are consistently used in the same way, because we did not use a dictionary of synonyms for the analysis with VOSviewer; such a dictionary could increase the resolution of the results. This may explain why we do not pick up commonly used fluorescent proteins like RFP but we do pick up restriction enzymes like EcoRI.
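As an illustration of what a synonym dictionary would change, the Python sketch below merges the counts of term variants into one canonical term before any threshold is applied. It is a hypothetical example, not a description of a VOSviewer feature or of the analysis performed here: the dictionary entries, the function `merge_synonyms`, and the counts are all assumptions.

```python
from collections import Counter

# Hypothetical synonym dictionary: every variant maps to one canonical term.
SYNONYMS = {
    "red fluorescent protein": "rfp",
    "mrfp": "rfp",
    "mcherry": "rfp",
}

def merge_synonyms(term_counts, synonyms=SYNONYMS):
    """Add up the counts of all variants under their canonical term."""
    merged = Counter()
    for term, n in term_counts.items():
        merged[synonyms.get(term, term)] += n
    return dict(merged)

# Invented counts: each RFP variant occurs once, so none reaches a
# minimum of two occurrences on its own.
term_counts = {"rfp": 1, "mrfp": 1, "mcherry": 1, "ecori": 4}
merged = merge_synonyms(term_counts)
print(merged)
```

After merging, the three variants count as a single term with three occurrences, which would pass a minimum-occurrence threshold of two, while a term with one consistent spelling, such as "ecori", is unaffected.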

Another limitation is that the web scraping could not access the text of wikis that were hidden behind a network of links or that were not stored as plain text. Because most of the wikis have this format, this severely restricts the analysis. A future analysis could improve on this step.

</script> </html>