Difference between revisions of "Team:Bordeaux/Software"

 
(24 intermediate revisions by 3 users not shown)
Line 8: Line 8:
 
<link href="https://fonts.googleapis.com/css?family=Lato" rel="stylesheet">
 
<link href="https://fonts.googleapis.com/css?family=Lato" rel="stylesheet">
 
<style>
 
<style>
.ourWorks * {width : 80%; margin: 2% auto; color: #E0E0E0}
+
.ourWorks * {width : 85%; margin: 2% auto; color: #E0E0E0}
 
.ourWorks p {padding-left: 10px; text-align: justify; font-family: 'Lato'; font-size: 16px;}
 
.ourWorks p {padding-left: 10px; text-align: justify; font-family: 'Lato'; font-size: 16px;}
 
.ourWorks h1 {padding-left: 20px; font-size: 30px}
 
.ourWorks h1 {padding-left: 20px; font-size: 30px}
Line 26: Line 26:
  
 
<p>
 
<p>
   At the beginning of DNA sequencing in the 70’s, two main methods were developed. One of them by Walter Gilbert (USA) and another one by Frederick Sanger (UK), they both obtained the chemistry Nobel prize in 1980. The two approaches are really different and we will quickly sum them up.</p>
+
   In the early days of DNA sequencing in the 1970s, two main methods were developed, one by Walter Gilbert (USA) and the other by Frederick Sanger (UK); both were awarded the Nobel Prize in Chemistry in 1980. From the 1990s onward, several new methods were developed to increase the throughput and decrease the cost of sequencing. These so-called next-generation sequencing (NGS) methods opened new perspectives for biologists. This year, for the iGEM competition, we focused on one NGS-based method, RNA-Seq (Figure 1).
 
+
<ul class="ourWorks">
+
  <li>
+
    <b>Maxam and Gilbert method:</b> This method works in several steps. First, both ends of a double-stranded DNA (dsDNA) molecule are radioactively labeled. The target DNA fragment is then selected by polyacrylamide gel electrophoresis (PAGE). The two strands are separated by heat denaturation and purified by PAGE. Chemical modifications are then performed on these strands under conditions such that each DNA molecule carries zero or one modification. The DNA is then cleaved with piperidine. Finally, electrophoresis makes it possible to reconstruct the initial sequence. The main drawbacks of this method are the use of radioactivity and of toxic chemicals.
+
  </li>
+
  <li>
+
    <b>Sanger method:</b> In this approach, the target DNA is incubated with a polymerase, which needs nucleotides to carry out polymerisation. Sanger had the idea of using deoxyribonucleotides (dNTPs) together with a small quantity of dideoxyribonucleotides (ddNTPs). When the polymerase incorporates a ddNTP, polymerisation stops. Separating the resulting fragments by electrophoresis then makes it possible to determine the initial DNA sequence.
+
  </li>
+
</ul>
+
 
+
<p>
+
  From the 90’s several new methods were developed to increase the performances and decrease the costs of sequencing. These methods so called NGS methods reduced costs and increased speed of sequencing, and thus opened new perspective for biologist. This year, for the IGEM competition we focused on one of the based NGS method, which is RNA-Seq (Figure 1).
+
 
</p>
 
</p>
  
Line 48: Line 36:
  
 
<p>
 
<p>
   RNA-Seq as said previously allows to quantify RNA into a cell at a particular time. With NGS development, a huge amount of data became available to scientists. They actually needed peoples to compute these data and this is when bioinformaticians came up. Computers are actually thought to treat a lot of data faster than humans. Thus, a lot of tools were developed to process NGS outputs. For the competition we used some of these tools to study splicing in C. elegans organism. Lets see how we proceeded !
+
   As mentioned previously, RNA-Seq makes it possible to quantify the RNA present in a cell at a particular time. With the development of NGS, a huge amount of data became available to scientists, who needed people able to process it; this is where bioinformaticians came in. Computers are designed to process large amounts of data much faster than humans, so many tools were developed to handle NGS outputs. For the competition we used some of these tools to study splicing in <i>C. elegans</i>. Let's see how we proceeded!
 
</p>
 
</p>
  
Line 56: Line 44:
  
 
<p>
 
<p>
   In bioinformatics, sequence alignment is a way of arranging RNA sequences in relation to each other, to determine their structure or function similarities. Sequences are stored in a matrix where rows from each sequence are compared. Gaps can be added into sequences so that identical or similar characters are aligned in successive columns. The organism studied here is <i> C.elegans</i>. The purpose here was to align RNAseq reads to its reference genome by using the Hisat algorithm.
+
   In bioinformatics, sequence alignment is a way of arranging sequences (here, RNA sequences) relative to each other in order to identify structural or functional similarities. Sequences are stored as the rows of a matrix and compared column by column. Gaps can be inserted into sequences so that identical or similar characters are aligned in successive columns. The organism studied here is <i>C. elegans</i>. Our goal was to align RNA-Seq reads to its reference genome using the HISAT2 aligner.
  RNA is transcribed from DNA sequences that are composed of alternating coding exons and non-coding introns. A pre-RNA is produced that contains the transcribed Exons and Introns.
+
RNA is transcribed from DNA sequences that are composed of alternating coding exons and non-coding introns. A pre-RNA is produced that contains the transcribed exons and introns.
 
</p>
 
</p>
  
 
<p>
 
<p>
  Out of this pre-RNA, only coding Exons must be kept and the introns removed. This process of removing introns is called splicing. Different combinations of exons can be brought together to produce different variants of the protein to be, in a process called alternative splicing.
+
Out of this pre-RNA, only the coding exons must be kept and the introns removed. This process of removing introns is called splicing. Different combinations of exons can be joined to produce different variants of the resulting protein, in a process called alternative splicing.
  It is those spliced RNA sequences that are then sequenced. To do, so they are retro-transcribed into their complementary DNA, the cDNA. This DNA is sequenced using NGS.
+
It is these spliced RNA sequences that are then sequenced. To do so, they are reverse-transcribed into their complementary DNA (cDNA), which is sequenced using NGS.
 
</p>
 
</p>
  
 
<p>
 
<p>
   Current sequencing technologies methods split the large DNA molecules to be sequenced into small chunks called reads. These reads sequences are mapped to the genome reference using algorithms like bowtie. Because reads are small, some sequences can be redundant, present at different locations in the genome, making them hard to map. To circumvent this, a technique of mapping called paired-end is used. It consists in sequencing a cDNA fragment at its extremities in both directions, 3’ to 5’ and 5’ to 3’ (reverse strand). Because these reads originate from the same fragment the distance between them is know and it is easier to map them. Indeed, if two reads can map at a same location only one will have its pair mapping further at the correct distance.
+
   Current sequencing technologies split the large DNA molecules to be sequenced into small chunks; the sequences obtained from these chunks are called reads. These reads are mapped to the reference genome using tools such as Bowtie. Because reads are short, some sequences are redundant, that is, present at several locations in the genome, which makes them hard to map. To circumvent this, a technique called paired-end sequencing is used. It consists in sequencing a cDNA fragment from both ends, in both directions (forward and reverse strand). Because both reads originate from the same fragment, the distance between them is approximately known, which makes them easier to map. Indeed, if a read can map at two locations, only one of them will have its mate mapping at the correct distance.
 
</p>
 
</p>
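<p>
The paired-end reasoning above can be sketched in a few lines of Python. This is only an illustration, not the logic implemented by Bowtie or HISAT: the function name, the positions, the expected distance and the tolerance are all hypothetical values chosen for the example.
</p>

```python
def compatible_positions(read_candidates, mate_position, expected_distance, tolerance):
    """Keep the candidate positions of a read that are consistent with its mate.

    A candidate is kept when its distance to the mate is within `tolerance`
    of the expected fragment distance. All values here are illustrative.
    """
    return [
        pos
        for pos in read_candidates
        if abs(abs(mate_position - pos) - expected_distance) <= tolerance
    ]


# Hypothetical read that maps equally well at two genomic positions.
candidates = [1_200, 58_400]

# Its mate maps unambiguously; only one candidate lies at the expected distance.
compatible = compatible_positions(
    candidates, mate_position=58_650, expected_distance=250, tolerance=50
)
print(compatible)
```

<p>
Only the candidate located at a plausible distance from the mate is retained, which is the intuition behind using paired-end reads to resolve ambiguous mappings.
</p>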
  
Line 78: Line 66:
  
 
<p>
 
<p>
   These fastq files are the input for the HISAT software, based on bowtie, it performs the mapping of the reads on the genome. HISAT was used with the parameters previously described in the work of Denis Dupuy that produced the reference junctions file (ref). HISAT outputs bam files, they are a binary version of a sam file which contains the mapping informations like localisation of sequences reads sequences.
+
   These FASTQ files are the input for HISAT, a Bowtie-based aligner that maps the reads to the genome. HISAT was used with the default parameters. The resulting alignments are stored as BAM files, the binary version of SAM files, which contain mapping information such as the location of each read on the genome.
 
</p>
 
</p>
  
Line 103: Line 91:
  
 
<p>
 
<p>
  The final output of this step is a CSV file containing the  usage ratio for each junction. From this, we had to clean the data to keep only the junctions, for each gene, with a common acceptor/donor and a ratio equal to one. It was an important step because the CSV file contained all the junctions, even the one which where very rare, and could not be separated from the background noise due to the RNA-Seq method or some errors from the splicing machinery.
+
The final output of this step is a CSV file containing the usage ratio for each junction. Since we wanted to compare junctions between two tissues, and since the graphical representation is based on two ratios (one per tissue) for each gene, we needed to keep only the junctions common to both tissues. This step discards many genes but restricts the selection to genes expressed in both of the tissues studied.
 
</p>
 
</p>
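<p>
As a rough sketch of this step, the Python snippet below computes a usage ratio per junction and keeps only the junctions present in both tissues. It is an assumption-laden illustration, not our actual script: the ratio is assumed to be the read count of a junction divided by the total read count of the alternative junctions of the same gene, the junction labels "j1" and "j2" and the gene "geneX" are hypothetical, and the rsr-1 read counts quoted in section 3.4 are reused purely as example input.
</p>

```python
from collections import defaultdict


def usage_ratios(counts):
    """Map (gene, junction) -> read_count to (gene, junction) -> usage ratio.

    Assumption: ratio = junction count / sum of counts over the gene's junctions.
    """
    totals = defaultdict(int)
    for (gene, _junction), n in counts.items():
        totals[gene] += n
    return {
        key: n / totals[key[0]]
        for key, n in counts.items()
        if totals[key[0]] > 0
    }


def common_junctions(ratios_a, ratios_b):
    """Keep junctions found in both tissues, as (ratio_a, ratio_b) pairs."""
    shared = ratios_a.keys() & ratios_b.keys()
    return {key: (ratios_a[key], ratios_b[key]) for key in sorted(shared)}


# Illustrative input: rsr-1 counts from section 3.4, plus a hypothetical
# gene ("geneX") that is only detected in muscle.
muscle = {("rsr-1", "j1"): 7, ("rsr-1", "j2"): 229, ("geneX", "j1"): 10}
neuron = {("rsr-1", "j1"): 7, ("rsr-1", "j2"): 13}

paired = common_junctions(usage_ratios(muscle), usage_ratios(neuron))
for key, (muscle_ratio, neuron_ratio) in paired.items():
    print(key, round(muscle_ratio, 2), round(neuron_ratio, 2))
```

<p>
The gene detected in a single tissue disappears from the result, which mirrors the loss of genes described above.
</p>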
  
Line 112: Line 100:
 
</p>
 
</p>
  
<p>To produce the different plots, we used RStudio (GUI for R) in combination with ggplo2 and plotly packages allowing the generation of pretty plots. We obtained several graphs which a part of them will be presented in the following lines.</p>
+
<p>To produce the different plots, we used RStudio (an IDE for R) together with the ggplot2 and plotly packages. This allowed us to generate clean and, above all, interactive plots: the user can easily navigate the data and visualize whatever they want. We obtained several graphs, some of which are presented in the following part.</p>
 
+
 
<p>
 
<p>
  The results obtained from the ratio calculation was then computed to extract some extra data. At the beginning we simply plotted f(reference_ratio) = sample_ratio. Denis had the idea to calculate the distance and slope between the points related to be able to generate a new type of plot. Actually we were not really happy on how our plots were. There were a lot of points and no ways to focus on specific gene unless digging into the code itself to make a selection manually. That was not an option so we asked ourselves : how could we represent our data to make them easy to use ? Like a lot of things in informatics, other people thought about it and developed a really nice library called `plotly`. By using this we were able to generate beautiful plots and moreover interactive ones. The user can now easily travel inside the data and visualize what he wants. There is still a step left and it is the interpretation of our data.
+
We first plotted neuron_ratio as a function of muscle_ratio. We then had the idea of plotting the slope of the line between related points as a function of the distance between them, but this type of graph can be tricky to interpret. We therefore kept the initial plots for our analyses, which are the final step of our work.
 
</p>
 
</p>
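<p>
The distance and slope mentioned above can be computed as in the following Python sketch. It assumes that "related points" are the two junctions of one gene, each placed at (muscle_ratio, neuron_ratio); this pairing, the function name and the use of 0.5 for the neuron ratios (the text only says they are close to 0.5) are assumptions made for illustration.
</p>

```python
import math


def distance_and_slope(p, q):
    """Return the Euclidean distance between two points and the slope of the
    line joining them (None when the line is vertical)."""
    dx = q[0] - p[0]
    dy = q[1] - p[1]
    distance = math.hypot(dx, dy)
    slope = dy / dx if dx != 0 else None
    return distance, slope


# Approximate unc-60 values from section 3.2: (muscle_ratio, neuron_ratio).
unc60_a = (0.02, 0.5)
unc60_b = (0.98, 0.5)

distance, slope = distance_and_slope(unc60_a, unc60_b)
print(distance, slope)
```

<p>
A large distance indicates that the two junctions are used very differently in at least one tissue, while the slope indicates in which tissue the difference lies.
</p>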
  
Line 123: Line 110:
 
Since the biology team had not produced any RNA-Seq results, we had to choose a training dataset from Mae et al., composed of stage- and muscle-specific RNA-Seq reads, a very useful asset for detecting tissue-specific splicing patterns.</p>
 
Since the biology team had not produced any RNA-Seq results, we had to choose a training dataset from Mae et al., composed of stage- and muscle-specific RNA-Seq reads, a very useful asset for detecting tissue-specific splicing patterns.</p>
  
<p>If the biology team had produced a modified <i>C.elegans</i> worm, we would have been interested in checking if other gene splicing were impacted by the genetic construct. We therefore compared muscle and neuron alternative splicing patterns in order to identify specific genes which could be responsible for the differentiation in one of the tissue studied.
+
<p>If the biology team had produced a modified <i>C. elegans</i> worm, we would have been interested in checking whether the splicing of other genes was affected by the genetic construct and in verifying whether unc-60 splicing was modified. We therefore compared muscle and neuron alternative splicing patterns in order to identify specific genes that could be responsible for differentiation in one of the tissues studied.
It could also have been possible to compare RNA-Seq samples from our worms to neuron or muscle specific WT patterns and detect modified junction usages.
+
 
</p>
 
</p>
  
Line 136: Line 122:
 
<img style="width:500px; margin-left:auto; margin-right:auto; display:block" src="https://static.igem.org/mediawiki/2017/thumb/0/03/Bdx-all.png/655px-Bdx-all.png">
 
<img style="width:500px; margin-left:auto; margin-right:auto; display:block" src="https://static.igem.org/mediawiki/2017/thumb/0/03/Bdx-all.png/655px-Bdx-all.png">
  
<h3>3.1. Validating the efficiency of the pipeline results</h3>
+
<h3>3.1. Evaluation of pipeline results</h3>
 
<p>First, to check that our workflow behaves as expected, we looked at the behavior of housekeeping genes. Among these genes we chose actin-3. As expected, its junctions are located in the diagonal area, meaning that this gene is not alternatively spliced differently between neuron and muscle. This supports the robustness of our pipeline and allowed us to perform the further analyses discussed below.</p>
 
<p>First, to check that our workflow behaves as expected, we looked at the behavior of housekeeping genes. Among these genes we chose actin-3. As expected, its junctions are located in the diagonal area, meaning that this gene is not alternatively spliced differently between neuron and muscle. This supports the robustness of our pipeline and allowed us to perform the further analyses discussed below.</p>
  
Line 142: Line 128:
  
 
<h3>3.2. unc-60 splicing investigation</h3>
 
<h3>3.2. unc-60 splicing investigation</h3>
<p>Since we knew a priori the behavior of unc60, it was an interesting positive control to investigate. We can see on the plot that muscular isoform B and non-muscular isoform A usages behave as expected. Indeed, in the muscle, the usage ratio for UNC-60B is 0.98 versus 0.02 for UNC-60A, a very dichotomic junction usage reflecting the muscle isoform specificity. In contrast, the usages ratios for both isoforms are neighbouring 0.5, which would indicate that both isoforms are used in neuron.</p>
+
<p>Since the behavior of unc-60 was known a priori, it was an interesting positive control to investigate. The plot shows that usage of the muscular isoform B and the non-muscular isoform A behaves as expected. In muscle, the usage ratio is 0.98 for unc-60B versus 0.02 for unc-60A, a highly dichotomous junction usage reflecting the muscle specificity of isoform B. In contrast, the usage ratios of both isoforms in neuron are close to 0.5, which would indicate that both isoforms are used.</p>
  
 
<img style="width:500px; margin-left:auto; margin-right:auto; display:block" src="https://static.igem.org/mediawiki/2017/thumb/5/5b/Bdx-unc-60.png/612px-Bdx-unc-60.png">
 
<img style="width:500px; margin-left:auto; margin-right:auto; display:block" src="https://static.igem.org/mediawiki/2017/thumb/5/5b/Bdx-unc-60.png/612px-Bdx-unc-60.png">
Line 148: Line 134:
 
<h3>3.3. ric-4 splicing investigation</h3>
 
<h3>3.3. ric-4 splicing investigation</h3>
  
<p>We had no a priori knowledge about ric-4 but it caught our attention since its behavior is very characteristic of an outlier. Actually its two isoforms are located on the opposite of the diagonal meaning an inversion of spliced forms in comparison with the genes located in the central area. We can see one form very used in the neuron whereas the other one is more used in the muscular tissue.We then investigate the role of ric-4. Thus we found that this gene is involved in the structuration of synapses and their functions.  
+
<p>We had no a priori knowledge about ric-4, but it caught our attention because it behaves like a typical outlier. Its two isoforms are located on opposite sides of the diagonal, meaning that the spliced forms are inverted compared with the genes located in the central area: one form is heavily used in neuron, whereas the other is used more in muscle. We then investigated the role of ric-4.
ric-4 is thought to be related to vesicles trafficking including SNARE vesicles. It is tagged as involved in synapses structuration and function. However SNARE vesicles processes are also found in muscle. Therefore muscle and neuron specific isoforms of these vesicular transport related proteins could exist.</p>
+
It is thought to be involved in vesicle trafficking, including SNARE vesicles, and is annotated as involved in synapse structure and function. However, SNARE vesicle processes are also found in muscle. Muscle- and neuron-specific isoforms of these vesicular transport proteins could therefore exist.</p>
  
 
<img style="width:500px; margin-left:auto; margin-right:auto; display:block" src="https://static.igem.org/mediawiki/2017/thumb/8/89/Bdx-ric-4.png/593px-Bdx-ric-4.png">
 
<img style="width:500px; margin-left:auto; margin-right:auto; display:block" src="https://static.igem.org/mediawiki/2017/thumb/8/89/Bdx-ric-4.png/593px-Bdx-ric-4.png">
Line 155: Line 141:
 
<h3>3.4. rsr-1 splicing investigation</h3>
 
<h3>3.4. rsr-1 splicing investigation</h3>
  
<p>rsr-1 was picked up because it presents a splicing pattern very similar to UNC-60. Indeed, rsr-1 isoforms in muscle have poles-apart usage ratios (0.98 vs 0.02) while in neuron this dichotomic usage is quite less pronounced (0.65 vs 0.35). rsr-1 is a homolog of SR160m, a splicing co-activator. It is important for development including normal pharyngeal morphology.
+
<p>rsr-1 was selected because its splicing pattern is very similar to that of unc-60. Indeed, the rsr-1 isoforms have opposite usage ratios in muscle (0.98 vs 0.02), while in neuron this dichotomous usage is much less pronounced (0.65 vs 0.35). rsr-1 is a homolog of SRm160, a splicing co-activator. It is important for development, including normal pharyngeal morphology.
In Ensembl database this gene is featuring only one splice variant. We obtained 7 and 229 read counts for muscular isoforms, and 7 and 13 for the neuron. The few read counts could be due to mapping errors, revealing alternative junctions that are not actually real. This is possible in regions of lower complexity. rsr-1 actually present a low complexity region, long serine and arginine repeats.</p>
+
In the Ensembl database, this gene has only one annotated splice variant. We obtained read counts of 7 and 229 for the muscle isoforms, and 7 and 13 for the neuron ones. Such low read counts could be due to mapping errors producing alternative junctions that do not actually exist. This can happen in low-complexity regions, and rsr-1 does contain one: long serine and arginine repeats.</p>
  
  
Line 167: Line 153:
 
</p>
 
</p>
 
<p>
 
<p>
The area gathering junctions with similar usage ratios is not yet supported by statistical analysis, it is only an arbitrary threshold selected by us. Statistical clustering of junctions is still to be found in order to more robustly separate junctions with similar patterns to those with a significant usage ratio difference.
+
The area grouping junctions with similar usage ratios is not yet supported by statistical analysis; it is only an arbitrary threshold that we selected. A statistical method for clustering junctions still has to be found in order to separate more robustly the junctions with similar patterns from those with a significant difference in usage ratio. We are looking for a method similar to confidence intervals in linear regression analyses.
We are trying to find a method that would be similar to confidence intervals in linear regression analyses.
+
 
</p>
 
</p>
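<p>
To make the notion of an arbitrary threshold concrete, here is a minimal Python sketch that flags junctions lying outside a band around the diagonal. The threshold value of 0.2, the function name and the "example_diagonal" point are hypothetical; the two other points reuse the approximate ratios quoted in sections 3.2 and 3.4. This is not a statistical test, only the kind of cut-off described above.
</p>

```python
def outside_diagonal(points, threshold):
    """Return the names of junctions whose muscle and neuron usage ratios
    differ by more than `threshold` (an arbitrary, user-chosen value)."""
    return sorted(
        name
        for name, (muscle_ratio, neuron_ratio) in points.items()
        if abs(muscle_ratio - neuron_ratio) > threshold
    )


# (muscle_ratio, neuron_ratio); "example_diagonal" is a made-up point.
points = {
    "unc-60B": (0.98, 0.5),
    "rsr-1 major": (0.98, 0.65),
    "example_diagonal": (0.5, 0.52),
}

flagged = outside_diagonal(points, threshold=0.2)
print(flagged)
```

<p>
Changing the threshold changes which junctions are flagged, which is exactly why a statistically grounded criterion would be preferable.
</p>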
 
<p>
 
<p>
We could rationalise the selection of junction alternatives based on the read count. As well as finding a representation involving this read count.
+
We could also rationalise the selection of alternative junctions based on read counts, and find a representation that takes these read counts into account.
 +
</p>
 +
<p>We would also like to highlight another possible use of our pipeline. Recently, a scientific team generated a reference file of spliced isoforms for the whole <i>C. elegans</i> genome. Combining our pipeline with this new information could allow us to detect tissue-specific splicing, and could allow scientists to detect splicing variations in their samples by comparison with “reference” usage values. We are therefore currently upgrading our scripts to take this reference file into account.
 +
</p>
 +
 
 +
<p>
 +
We had the idea of applying the method used to generate the <i>C. elegans</i> reference file to create tissue-specific references. This could refine the analysis, since the current reference file is based on the whole body and does not reflect the situation in a specific tissue.
 +
</p>
 +
 
 +
<p>
 +
In the future, we could even imagine indexing the different splicing patterns in order to find which pattern is closest to a sample submitted by a researcher. Using machine learning, it might be possible to predict the impact of given conditions on the splicing pattern, or which conditions to apply in order to obtain a desired pattern.
 +
</p>
 +
 
 +
<p>
 +
Finally, our main goal for now is to develop a web platform to release our tool to the whole scientific community. Their feedback would be very useful for improving our pipeline.
 
</p>
 
</p>
  
Line 184: Line 183:
 
<h1 style="text-align:center;color:#d8b700">How to find us ?</h1>
 
<h1 style="text-align:center;color:#d8b700">How to find us ?</h1>
  
<p style="font-size:1.5em; text-align:center; color:white">Feel free to email us to provide some feedback on our project, have some information on the team and our work, or to just say hello !</p>
+
<p style="font-size:1.5em; text-align:center; color:#E0E0E0">Feel free to email us to provide some feedback on our project, have some information on the team and our work, or to just say hello !</p>
  
 
<ul style="font-family: 'Arial', cursive;  
 
<ul style="font-family: 'Arial', cursive;  
Line 198: Line 197:
 
      Wordpress</a></li>  
 
      Wordpress</a></li>  
 
</ul>
 
</ul>
<h5 style="float:left ; color:white">Mail:  
+
<h5 style="float:left ; color:#E0E0E0">Mail:  
 
  <a style="font-family: 'Arial', cursive; font-size: 1.5em;" href="mailto:igembdx@gmail.com">igembdx@gmail.com</a></h5>
 
  <a style="font-family: 'Arial', cursive; font-size: 1.5em;" href="mailto:igembdx@gmail.com">igembdx@gmail.com</a></h5>
<h5 style="text-align:right ; color:white"><i>Copyright &#169; iGEM Bordeaux 2017</i></h5>
+
<h5 style="text-align:right ; color:#E0E0E0"><i>Copyright &#169; iGEM Bordeaux 2017</i></h5>
  
 
       </div>
 
       </div>

Latest revision as of 20:40, 1 November 2017
