
Protein Structure Prediction

Proteins consist of amino acid chains that fold in 3-dimensional space in ways we do not yet completely comprehend. Each protein chain spans from a handful of amino acids to more than a thousand residues, rendering the problem of determining the relationship between primary and tertiary structure immensely complex.

In our project, we employed a mutated fimH, modified by adding the RPMrel peptide, in order to achieve selective adhesion of our E. coli strains to cancer cells. In particular, the fimH sequence was altered by substituting 1 residue (Proline to Glutamine) and inserting 28 new residues (RPMrel, SpyTag, HisTag). We would like to explore how these changes affect the 3-dimensional structure of the fimH protein.

To this end, we embarked on a journey to create an artificial neural network model that, given a protein's primary structure, can yield sufficient information to reconstruct the full 3-dimensional structure.

The idea is that our model can provide insight as to how the tertiary structure changes after the modifications, in that we can compare the native structure of the protein with the structure of the protein's modified sequence.

Moreover, such a model would be invaluable as it could predict structures de novo for proteins for which we do not yet have structural information. Such a protein is Apoptin, the toxin we used as our classifier's output in order to induce selective cell death.

Therefore, our motivation is 2-fold:

to study how a protein's structure changes after altering its primary sequence profile (in our case FimH)
to predict tertiary protein structures de novo, based solely on primary structure information (in our case Apoptin)

What we propose is an end-to-end pipeline that solely requires the amino acid sequence of the protein (primary structure) as input, predicts its corresponding secondary structure & contact map and finally recreates and visualizes the full 3-dimensional tertiary structure.

In order to unravel the protein sequence-to-structure mystery one inevitably needs to summon the power of neural networks due to the aforementioned complexity and data volume. To fully exploit the generalization power of neural networks, we need to design a concise yet unabridged representation for our data, the protein sequences and their corresponding tertiary structures.

To make use of the primary and secondary structure sequence information, we map the one-letter amino acid codes to integers in the range 1-20 and use 0 to signify unknown residues. This gives a vocabulary of 21 words, each of which we project to a high-dimensional vector through the use of word embeddings [1], a technique widely used in natural language processing. Similarly, for the secondary structure, we have a vocabulary of 8 words that is mapped to high-dimensional vectors through embedding.
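The integer encoding described above can be sketched as follows. This is a minimal illustration, not the project's code: the alphabetical ordering of the 20 residues and the names `AA_TO_INDEX` and `encode_sequence` are assumptions, since the text only states that codes map to 1-20 and that 0 marks unknown residues.

```python
# Assumption: the 20 standard one-letter codes in alphabetical order.
# The source does not say which residue receives which integer.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS, start=1)}
UNKNOWN_INDEX = 0  # reserved for any residue outside the 20 standard codes


def encode_sequence(sequence):
    """Map a one-letter amino acid string to a list of integer ids (0-20)."""
    return [AA_TO_INDEX.get(residue, UNKNOWN_INDEX) for residue in sequence.upper()]


encoded = encode_sequence("MKRX")
print(encoded)  # [11, 9, 15, 0]
```

The resulting integer ids are what an embedding layer would look up to obtain the high-dimensional vector for each residue.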

A common representation of a protein's tertiary structure is the contact map. This N x N matrix (N the number of residues in the chain) encompasses all the required information to reconstruct the 3-dimensional structure. Every (x,y) point on the map is interpreted as a contact probability between the residues x and y.

The fimH contact map with distance cutoff 9 Å recreated from chain A of PDB id 2VCO

We can see that the majority of contacts span around the diagonal and that is only natural as it means that residues which are at a closer distance have a higher probability of forming a contact than distant ones. In general, secondary structures form patterns on the contact map, e.g. helices are adjacent to the diagonal, and provide a foundation for the prediction of the most difficult long-range contacts. That being said, we first need to obtain information about the secondary structure and subsequently use that to assist in contact prediction.

Since we have access to ground truth for every protein sequence in our dataset, we can employ supervised learning approaches. Secondary structure prediction is a classification problem of 8 classes, where we need to predict the secondary structure class of every residue, while providing context information for residues of the same sequence. Contact map prediction can be interpreted as a regression problem, where one needs to predict the contact probabilities for every pair of residues in the chain.

The first step, given the protein's primary structure, is to make an educated guess about its secondary structure. In 2017, we don't even need to do that. We can have machine learning models do that for us.

Although secondary structure provides valuable information for the local structural entities along the chain, it is not enough on its own to produce the contact map. Therefore, we can interpret this step, in terms of machine learning terminology, as an internal representation of the input that is passed on deeper in the network to assist contact prediction. However, since secondary structure has a semantic meaning of its own in the eyes of biologists, we treat the two models separately as it is standard practice in the literature.

Our Secondary Structure Predictor consists of a wide bidirectional recurrent layer, followed by a number of fully connected layers and the output layer. The model trains on primary sequence samples and predicts the secondary structure class for every Amino acid.

In a nutshell, the sequence passes through the recurrent layer and the model “remembers” what has already happened in the sequence, in terms of secondary structures, to provide context for every subsequent residue that it sees. The larger the recurrent layer, the greater the memory capacity of the network and, in turn, its predictive power.

The additional layers that follow progressively reduce the representation dimension until we are left with a vector of 8 class probabilities (1 x 8), whose largest entry indicates the most probable class. In the end, the full predicted sequence corresponds to all secondary structure segments.
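Reading the per-residue output back into a secondary structure string could look like the sketch below. The class letters and their order in `SS_CLASSES` are assumptions (DSSP-style letters with "C" for coil), and `decode_secondary_structure` is a hypothetical helper, not part of the model.

```python
# Assumption: an illustrative ordering of 8 DSSP-style class letters.
SS_CLASSES = ["H", "G", "I", "E", "B", "T", "S", "C"]


def decode_secondary_structure(probabilities):
    """Pick the most probable of the 8 classes for every residue.

    `probabilities` is a list with one 8-element list per residue.
    """
    labels = []
    for residue_probs in probabilities:
        best = max(range(len(residue_probs)), key=residue_probs.__getitem__)
        labels.append(SS_CLASSES[best])
    return "".join(labels)


# Two residues: the first is most likely "H", the second most likely "E".
example = [
    [0.70, 0.05, 0.05, 0.05, 0.05, 0.05, 0.03, 0.02],
    [0.10, 0.05, 0.05, 0.60, 0.05, 0.05, 0.05, 0.05],
]
print(decode_secondary_structure(example))  # HE
```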

Top Row: primary structure sequence of one-letter Amino acid codes

Bottom Row: secondary structure sequence of one-letter class of the 8 classes

Our end goal is to predict contacts between residues in a pairwise fashion. For a sequence of N amino acids we want to derive an N x N map, where each position shows the probability of the two residues being in contact.

First, we need to build an N x N map, hereafter referred to as “tensor”, that will serve as the input layer to our network and will be updated after each epoch as the network learns to solve the regression task. The tensor captures the semantic relationship between residues and constantly changes as training progresses.

In the previous section, we mentioned how we use embeddings to map our single-letter amino acid codes to high-dimensional vectors. That is, for a primary structure sequence of N residues, we end up with an N x m matrix, where m is the size of the embedding.

In order to exploit the additional information we have for every Amino acid about its secondary structure classification, we employ, once again, word embeddings of size k that we stack to our previous m-dimensional vector. Now, every Amino acid is encoded as a concatenation of the two embeddings with a resulting vector of size m + k.

The secondary structure information may come directly from our Secondary Structure Predictor model or, for proteins with known structures, can be provided as additional input. However, the latter is only of use for model evaluation as, in practice, we are more interested in predicting structures de novo.

The next step is to create a representation for every pair of Amino acids in order to create the full N x N tensor. For a pair (x,y) of Amino acids we concatenate their respective embeddings thus creating a new vector of size 2 * (m + k). As an alternative, we can perform an element-wise operation (e.g. multiplication) between the two distinct residue embeddings and maintain the resulting pair embedding dimension to m + k.
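The two ways of forming pair representations can be illustrated with plain Python lists. This is a toy sketch with made-up embedding values (m = 2, k = 1); `pair_tensor` is a hypothetical helper, and a real implementation would operate on the framework's tensors rather than nested lists.

```python
def pair_tensor(residue_vectors, mode="concat"):
    """Build an N x N grid of pair features from N per-residue vectors.

    mode="concat":   features of (x, y) are [v_x ; v_y], length 2 * (m + k)
    mode="multiply": features of (x, y) are v_x * v_y element-wise, length m + k
    """
    n = len(residue_vectors)
    grid = []
    for x in range(n):
        row = []
        for y in range(n):
            vx, vy = residue_vectors[x], residue_vectors[y]
            if mode == "concat":
                row.append(vx + vy)  # list concatenation
            elif mode == "multiply":
                row.append([a * b for a, b in zip(vx, vy)])
            else:
                raise ValueError("mode must be 'concat' or 'multiply'")
        grid.append(row)
    return grid


# Toy input: N = 3 residues, amino acid embedding of size m = 2,
# secondary structure embedding of size k = 1 (values are made up).
aa_embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
ss_embeddings = [[1.0], [0.0], [1.0]]
residues = [aa + ss for aa, ss in zip(aa_embeddings, ss_embeddings)]  # size m + k

concat = pair_tensor(residues, "concat")
product = pair_tensor(residues, "multiply")
print(len(concat[0][1]), len(product[0][1]))  # 6 3
```

Concatenation keeps the two residues distinguishable (the pair (x, y) differs from (y, x)), whereas element-wise multiplication is symmetric and halves the feature dimension.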

Now that we have constructed our tensor, we add several 2-dimensional convolutional layers to scan every Amino acid pair on our map for contacts. When training is complete, the resulting map is compared to the native contact map by specifying a probability threshold that distinguishes contacts from non-contacts.

The final step of the process is to retrieve the 3D structure from the predicted contact map. To achieve that, we feed the contact map to FT-COMAR [2], a tool that is able to recreate the tertiary structure by reading a contact map with values 1 (contact), 0 (no contact) and -1 (uncertain). In our implementation, a contact is declared uncertain if the predicted probability is between 0.3 and 0.6. The resulting 3-dimensional information is written to a file that we can load into iCn3D to visualize and interact with the reconstructed structure.
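The mapping from predicted probabilities to the three values above might be sketched as follows. Two points are assumptions rather than statements from the text: probabilities at or above 0.6 are treated as contacts and those at or below 0.3 as non-contacts, and the boundary values themselves are assigned to the certain classes. The function name is hypothetical, and the sketch does not attempt to reproduce FT-COMAR's input file format.

```python
def to_three_state_map(prob_map, low=0.3, high=0.6):
    """Convert contact probabilities to 1 (contact), 0 (no contact), -1 (uncertain).

    Assumption: p >= high is a contact, p <= low is a non-contact,
    and anything strictly in between is uncertain.
    """
    result = []
    for row in prob_map:
        out_row = []
        for p in row:
            if p >= high:
                out_row.append(1)
            elif p <= low:
                out_row.append(0)
            else:
                out_row.append(-1)
        result.append(out_row)
    return result


predicted = [[0.90, 0.45],
             [0.10, 0.60]]
print(to_three_state_map(predicted))  # [[1, -1], [0, 1]]
```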

Due to limited access to computational resources, we were not able to train models of adequate performance to assess the changes in FimH tertiary structure or visualize the estimated structure of Apoptin in a reliable way. Although the models' performance was improving monotonically, we could not reach convergence before the wiki freeze, as we were restricted to shallow architectures and small mini-batch sizes.

For example, for the Secondary Structure Predictor we obtained a 27% Q8 accuracy using a relatively shallow 2-layer architecture. Q8 accuracy measures the percent of residues for which 8-state secondary structure is correctly predicted. For the Contact Map Predictor, we set a probability threshold of 0.45 to determine whether two residues are in contact and the model correctly predicted 43% of all contacts.
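For concreteness, the two metrics could be computed as in the sketch below. The function names are hypothetical, and reading "correctly predicted 43% of all contacts" as the fraction of native contacts recovered at the 0.45 threshold is an assumption about how the figure was obtained.

```python
def q8_accuracy(predicted, native):
    """Percentage of residues whose 8-state class is predicted correctly."""
    if len(predicted) != len(native) or not native:
        raise ValueError("sequences must be non-empty and of equal length")
    correct = sum(p == t for p, t in zip(predicted, native))
    return 100.0 * correct / len(native)


def contact_recall(prob_map, native_map, threshold=0.45):
    """Fraction of native contacts whose predicted probability reaches the threshold."""
    found = total = 0
    for prob_row, native_row in zip(prob_map, native_map):
        for p, is_contact in zip(prob_row, native_row):
            if is_contact:
                total += 1
                if p >= threshold:
                    found += 1
    return found / total if total else 0.0


print(q8_accuracy("HHEC", "HHCC"))  # 75.0
print(contact_recall([[0.5, 0.2], [0.2, 0.9]], [[1, 1], [1, 1]]))  # 0.5
```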

We will continue to work towards improving the model performances as well as attempt different approaches to solve the contact map.

However, for the sake of completeness, we employed a publicly available tool that offers similar functionality but takes a different approach. RaptorX-Contact-Predict is a web server tool that predicts the contact map and provides direct visualization of the resulting tertiary structure through JSmol.

Contact Map comparison of fimH wild type and our modified fimH. Black dots signify common contacts that exist in both structures, green dots the unique contacts in wild type fimH and the magenta dots the unique contacts that exist in our modified fimH sequence. The locations where the RPMrel and tags were inserted are evident in the contact map in places where there exist only magenta dots along the diagonal.

The top left structure corresponds to wild type fimH, the bottom left to our modified fimH. To the right, we see the two structures superimposed, with wild type fimH shown in grey.

Data

Our dataset is a set of 10932 proteins from the PDB database, selected with the PISCES server using the following criteria: a percentage identity cutoff of 60%, a resolution cutoff of 1.8 angstroms, and an R-factor cutoff of 0.25.

We observed that 9754 out of 10932 proteins have some missing residues, for which there is no structural information available. In all such cases, the missing residues were discarded and the sequence truncated. 21 proteins were excluded from the dataset as there was no available secondary structure information whatsoever.

We split the dataset of 10911 remaining proteins into 8728 for training and 2183 for validation. In addition, we tested our Secondary Structure Predictor on the benchmark dataset CB513 in order to compare our approach with existing methods.
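The reported sizes are consistent with an 80/20 split rounded down, which the sketch below reproduces. The ratio, the shuffling and the seed are assumptions, since the text gives only the resulting counts, and `split_dataset` is a hypothetical helper.

```python
import random


def split_dataset(items, numerator=4, denominator=5, seed=0):
    """Shuffle and split into training and validation sets.

    Assumption: an 80/20 split (4/5 rounded down) with a fixed seed.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * numerator // denominator
    return shuffled[:n_train], shuffled[n_train:]


# 10932 selected proteins minus the 21 without secondary structure information.
protein_ids = range(10932 - 21)
train, validation = split_dataset(protein_ids)
print(len(train), len(validation))  # 8728 2183
```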

Labels

For the Secondary Structure Predictor the required label for every Amino acid in the sequence is the corresponding secondary structure information. Every Amino acid belongs to one of the 8 classes [3]. Hence, every protein sequence is paired with its label, a sequence of equal length consisting of the class id for every Amino acid.

To derive secondary structure labels for our dataset, we employed STRIDE, an algorithm designed for the assignment of protein secondary structure elements given the atomic coordinates of the protein, as defined by X-ray crystallography, protein NMR, or another protein structure determination method [4].

For the Contact Map Predictor the required label is the native contact map. To obtain the contact map we need the structural information provided in the PDB file of every protein sample, that is, the atomic coordinates of every amino acid. The contact map is constructed by computing the Cb distance for every pair of amino acids in the chain; if that distance falls within a specified threshold [5], the pair is considered to be in contact.

To extract the native contact map for every PDB file, we implemented a group of helper functions that allow for the generation of contact maps of different distance cutoffs and distance types, namely Ca, Cb and Ca + Cb.
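A minimal version of such a helper, for a single distance type, might look like the sketch below. It is not the team's implementation: `contact_map` is a hypothetical name, the input is assumed to be one already-extracted (x, y, z) coordinate per residue (for example the Cb atom), the 9 Å default follows the fimH figure, and treating a distance equal to the cutoff as a contact is an assumption.

```python
import math


def contact_map(coords, cutoff=9.0):
    """Binary N x N contact map from one 3D coordinate per residue.

    Two residues are in contact if their distance is at most `cutoff`
    (in angstroms). The map is symmetric and the diagonal is always 1.
    """
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                cmap[i][j] = cmap[j][i] = 1
    return cmap


# Three made-up residues on a line: 5 A apart, then 15 A further away.
toy_coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
print(contact_map(toy_coords))  # [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
```

Changing `cutoff` gives maps for different distance thresholds; supporting other distance types would only change which atom coordinates are passed in.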

Model Architecture

The Secondary Structure Predictor model consists of an embedding layer that maps the one-hot amino acid representations to 100-dimensional vectors, a bidirectional LSTM layer of 180 cells and a fully connected layer of 140 neurons with a linear activation function. Finally, we have the output softmax layer of 8 neurons that points to the most probable class choice. The training loss is calculated according to the formula of categorical cross entropy.
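To make the output layer and loss concrete, the softmax and the categorical cross entropy for a single residue can be written out in plain Python. This is an illustrative sketch with made-up activations, not the Keras implementation used for training; the function names and the `eps` guard are assumptions.

```python
import math


def softmax(logits):
    """Turn 8 raw output activations into class probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(true_class, probabilities, eps=1e-12):
    """Loss for one residue: minus the log-probability of the true class."""
    return -math.log(max(probabilities[true_class], eps))


# Made-up activations for one residue whose true class has index 3.
logits = [0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0]
probs = softmax(logits)
loss = categorical_cross_entropy(3, probs)
print(round(sum(probs), 6), probs.index(max(probs)))  # 1.0 3
```

The loss is 0 when the true class receives probability 1 and grows as that probability shrinks; during training it would be averaged over all residues in a mini-batch.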

The Contact Map Predictor model has a tensor input that is fed to six 2D convolutional layers. After each convolutional layer we enforce batch normalization to prevent exploding gradients. The output layer that computes the probability of contact has a linear activation function.

Frameworks

The models were developed in Python using the Theano and Keras frameworks.

References

[1] Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119).
[2] Vassura, M., Margara, L., Di Lena, P., Medri, F., Fariselli, P., & Casadio, R. (2008). FT-COMAR: fault tolerant three-dimensional structure reconstruction from protein contact maps. Bioinformatics, 24(10), 1313-1315.
[3] Kabsch, W., & Sander, C. (1983). Dictionary of protein secondary structure: pattern recognition of hydrogen‐bonded and geometrical features. Biopolymers, 22(12), 2577-2637.
[4] Frishman, D., & Argos, P. (1995). Knowledge‐based protein secondary structure assignment. Proteins: Structure, Function, and Bioinformatics, 23(4), 566-579.
[5] Duarte, J. M., Sathyapriya, R., Stehr, H., Filippis, I., & Lappe, M. (2010). Optimal contact definition for reconstruction of contact maps. BMC Bioinformatics, 11(1), 283.