Journal of Proteomics and Genomics
Functional Protein Domains Evolve Very Specifically Over Mutations
Copyright: © 2014 Sabharwal NS. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
A mutation in a single nucleotide of a gene has the potential to change the structure and/or function of its protein. Simple as this sounds, it is not observed to be a general phenomenon. The effect of a mutation is determined primarily by the stereochemical nature of the substituting amino acid and by the residue location it affects. Here we show that despite a change in the frequency of occurrence of a particular amino acid in a particular protein across different types of organisms, the overall function of the protein can remain unaffected, even when the resulting protein conformation is relatively altered. Phylogenetic trees were constructed for proteins belonging to the same family on the basis of sequences extracted from protein structures. The variation in the percentage of every amino acid in each of the considered proteins was then calculated. To complement this sequence-based comparison, structural comparison was also computed in terms of the standard TM_Score and the change in the count of structurally similar residues falling within a 5 Å distance deviation. The functional and structural role of an evolutionary alteration or mutation in a protein sequence, and its concomitant effect on the protein structure, is thus analyzed.
Keywords: Villin; Headpiece; Sub-domain; Evolution; Topology
Villin is one of the major actin-binding proteins of the cytoskeleton. It is localized to specific tissues such as the microvilli of the intestine and kidney. Some of the cell types in the unorganized brush border of the pancreatic and bile ducts also contain villin, although at very low concentration. These cells assist absorption in much the same way as the functional microvilli of the intestine and kidney. Villin is also expressed in the intestinal cells of the embryo. Villin belongs to a large class of actin-regulating proteins. Actin filaments are regulated by actin-binding proteins, which assist in sequestering actin monomers, severing and cross-linking filaments, and capping filament ends. In humans, villin is encoded by two mRNAs of different lengths; surprisingly, the coding regions of their cDNAs show no difference.
In culture conditions, the mRNA expression level is also found to be similar in differentiated intestinal cells and the villin-expressing microvillus tissues. The aim of our study is to identify the amino acids that show evolutionary changes, or evolve more often, in a protein. We also screen for the amino acids that normally remain conserved in a protein and thus define its characteristic conformation. We have studied how evolutionary alterations in the frequency of occurrence of amino acids affect the corresponding protein structure. Because of its small size and the resulting computational ease, we selected the villin Headpiece as our model protein. Furthermore, as it occurs naturally in most organisms, this study can be extended to a wide range of organisms.
The villin protein structure consists of seven domains, six at the N-terminus and one at the C-terminus. The six N-terminal domains make up the N-terminal core and the C-terminal domain constitutes the villin Headpiece [5,6].
A short fragment of amino acid residues is repeated six times in the N-terminal core, with each domain carrying one repeat. These structurally similar repeats are present in all actin-severing proteins, viz. gelsolin, fragmin and severin. The Headpiece, which forms the C-terminal domain, does not show significant sequence similarity with other regions of villin. Both the N-terminal core and the C-terminal domain contain actin and calcium binding sites [5,6]. The Headpiece is believed to have evolved through a recombination event between gelsolin and another actin binding protein gene. This second gene is believed to be synapsin-I (an actin bundling protein present in synaptic vesicles), as it shares significant homology with the Headpiece. Villin structure and function are related to two groups of proteins: one consists of the bundling protein Quail (found in Drosophila) and protovillin (found in Dictyostelium), and the other includes gelsolin and scinderin, found in higher eukaryotes [9,10], whereas fragmin and severin are normally found in lower eukaryotes [11,12]. The first group of proteins encodes seven domains, as in villin, and the second group has a domain arrangement similar to that of the villin core.
The 76-residue villin Headpiece (HP) is also present in many other actin bundling proteins [14,15]. Removing the first 9 residues from the C-terminal domain yields a construct normally denoted HP-67, and this partial truncation does not affect the overall function and stability of the HP protein [15,16]. HP-67 consists of two Sub-Domains (SD), one at the N-terminus and the other at the C-terminus. The latter, a 35-residue SD, consists of 3 helices forming a hydrophobic core and is generally denoted HP-35 [16,17]. HP-35 is one of the shortest naturally occurring proteins to show cooperative and rapid folding, probably due to its small size and high helical content [16,18,19]. Notably, three phenylalanine residues partly stabilize this conformation, and the evolutionarily conserved Trp64 additionally plays a major role in its interaction with F-actin.
Villin is well known to be associated with actin bundling, severing, nucleation and capping, which together regulate actin filaments. Through the N-terminal half of its core, villin severs actin filaments at high calcium concentrations. Actin binding sites are present at both the N-terminus and the C-terminus; the former is regulated in a calcium-dependent manner, while the latter shows calcium-independent activity. When the calcium concentration exceeds 10⁻⁴ M, the severing activity of villin acts on F-actin to produce short filaments.
Conversely, at lower concentrations in the range of 10⁻⁷ to 10⁻⁶ M, villin prevents the elongation of actin filaments by capping, whereas at substantially lower concentrations (less than 10⁻⁷ M) it shows bundling activity. Unlike the severing activity, the bundling activity of villin is normally balanced by other proteins in vivo. Similarly, phosphorylation of a tyrosine residue in villin decreases its affinity for F-actin, decreases its nucleation activity and substantially increases its severing action. Another interesting aspect of villin structure is that it is cleaved by trypsin into two fragments, one encompassing domain 1 to domain 3 and the other domain 4 to domain 7. These fragments, inclusive of the referred SDs, are known as 44T and 51T respectively. The presence of calcium is very important here: in the presence of calcium, the proteolytic cleavage between domain 2 and domain 3 is inhibited, whereas the cleavage proceeds unhindered in the presence of EGTA (ethylene glycol tetraacetic acid). This is probably due to an alteration in the conformation of the N-terminal segment in the presence of calcium.
Protein structures related to the villin HP and SD were downloaded from the Protein Data Bank (PDB). A total of 32 PDB files were downloaded and their structural records were converted to FASTA format through an in-house Perl script. These structures were then divided into three groups according to their number of amino acids. Structures of around 67 residues, along with their FASTA sequences, were categorized as "villin Headpiece" (Group HP) (1QQVA, 1QZPA, 1UJSA, 1YU5X, 1YU7X, 1YU8X, 1ZV6A, 2K6MS, 2K6NA, 2RJVA, 2RJWA, 2RJXA, 2RJYA, 3MYAA, 3MYCA, 3MYEX and 3NKJA). Similarly, the structure and sequence files of around 35 residues were classified as "villin Sub-domain" (Group SD) (1UNCA, 1UNDA, 1VIIA, 1WY3A, 1WY4A, 1YRFA, 1YRIA, 2JM0A, 2PPZA, 3IURA, 3TJWA, 3TRVA, 3TRWA and 3TRYA). The last group comprised a substantially larger villin structure (3FG7) with 398 amino acids, which was therefore excluded from our sequence and structural comparison with the other structures. Similarly, 3IURA was removed from the SD group, as it encodes three chains A, B and C with 684, 6 and 5 amino acids respectively, which is not comparable to the 34-36 amino acids encoded in all the other SD structures. Finally, a total of 30 structures (17 for HP and 13 for SD) were considered for further analysis.
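The conversion and grouping step can be illustrated with a minimal Python sketch. It is not the in-house Perl script, which is not reproduced in this article; the function names, the mapping of unrecognised residues to "X", and the tolerance used for "around 67 residues" are all assumptions made for illustration.

```python
# Illustrative sketch only; not the authors' Perl script.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def chain_sequence(pdb_text, chain_id):
    """One-letter sequence of a chain, read from C-alpha records of the first model."""
    seen = set()
    letters = []
    for line in pdb_text.splitlines():
        if line.startswith("ENDMDL"):
            break  # NMR entries hold many models; keep the first one
        if not line.startswith(("ATOM", "HETATM")) or len(line) < 27:
            continue
        # Fixed-width PDB columns, written as 0-based slices.
        if line[12:16] != " CA " or line[21] != chain_id:
            continue
        residue_key = (line[22:26], line[26])  # residue number + insertion code
        if residue_key in seen:
            continue  # alternate location of a residue already counted
        seen.add(residue_key)
        # Assumption: unrecognised (e.g. modified) residues are written as "X".
        letters.append(THREE_TO_ONE.get(line[17:20], "X"))
    return "".join(letters)


def assign_group(length, hp_length=67, hp_tolerance=5):
    """Assumed grouping rule: SD = 34-36 residues, HP = about 67 residues."""
    if 34 <= length <= 36:
        return "SD"
    if abs(length - hp_length) <= hp_tolerance:
        return "HP"
    return "excluded"
```

Under these assumed limits, a 398-residue or 684-residue chain falls into neither group, in line with the exclusion of 3FG7 and 3IURA described above.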
The FASTA sequences extracted from the selected HP and SD structures were used to construct three phylogenetic trees (one each for HP and SD, and one for both together) using an online tool, which we chose for its custom support and ease of use. These trees are shown in Figures 2, 3 and 4 respectively. From these trees, the mutual evolutionary distances among the selected structures were computed manually, as listed in Tables 1, 3 and 4 respectively.
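One common way to read a pairwise distance from a tree is to sum the branch lengths along the path joining two leaves. The sketch below illustrates that idea only; whether the manual computation followed exactly this rule is an assumption, and the tree representation (each node mapped to its parent and branch length) and the function names are hypothetical.

```python
def distances_to_ancestors(tree, node):
    """Map each ancestor of `node` (and the node itself) to its path length from `node`.

    `tree` maps a node name to a (parent_name, branch_length) pair;
    the root has no entry.
    """
    reached = {node: 0.0}
    total = 0.0
    while node in tree:
        parent, length = tree[node]
        total += length
        reached[parent] = total
        node = parent
    return reached


def tree_distance(tree, leaf_a, leaf_b):
    """Sum of branch lengths on the path between two nodes of a rooted tree."""
    from_a = distances_to_ancestors(tree, leaf_a)
    total = 0.0
    node = leaf_b
    # Climb from leaf_b until an ancestor shared with leaf_a is reached.
    while node not in from_a:
        parent, length = tree[node]  # KeyError if the nodes share no ancestor
        total += length
        node = parent
    return total + from_a[node]
```

Two leaves joined by zero-length branches to a common node receive a distance of zero under this rule, which is how a pair of identical sequences would appear in such a matrix.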
We computed the percentage of each amino acid encoded in each of the selected protein structures through an in-house Perl script, treating modified amino acid entries (heteroatoms) as normal residues. Counting heteroatoms in this way allowed us to track the position and availability of specific amino acids consistently across all the evolutionarily linked protein structures considered. The resulting percentages of every amino acid encoded in the HP and SD structures are presented in Tables 5 and 6.
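The percentage calculation itself is simple and can be sketched as follows. This is an illustrative Python version, not the Perl script used for Tables 5 and 6; the function name is hypothetical, and dividing by the full sequence length (including any unrecognised symbols) is an assumption.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def composition_percent(sequence):
    """Percentage of each standard amino acid in a one-letter sequence."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("empty sequence")
    counts = Counter(sequence)
    # Assumption: the denominator is the full sequence length.
    return {aa: 100.0 * counts[aa] / len(sequence) for aa in AMINO_ACIDS}
```

For a toy sequence such as "LLKA", this gives 50% leucine, 25% lysine, 25% alanine and 0% for every other residue.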
This also allowed us to screen the evolutionarily selected and variant amino acids (shown in boldface in the tables), using an arbitrarily selected threshold of a difference of more than 0.2 percent from the respective average. To relate the percentage of every encoded amino acid to the structural similarity of the compared proteins, we further grouped the proteins by source organism. Through this analysis we obtained the percentage difference of amino acids across different species, as listed in Table 2. To compare our results with the percentage of amino acids generally found in naturally occurring proteins, Table 7 was constructed from published reference data [25-27].
Besides the sequence-based phylogenetic tree distances among the proteins of the HP and SD groups, we also calculated their TM_Score and GDT residues (Global Distance Test residues, i.e. residues within a 5 Å distance deviation of the structurally equivalent residue in the other structure) using the TM-align tool from the Zhang lab. The TM_Score and GDT residue percentage data thus obtained are listed in Tables 8 and 9 for the HP and SD group proteins respectively.
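For orientation, the two measures can be written down for a fixed superposition. The TM-score is commonly defined as (1/L) Σ 1/(1 + (d_i/d0)²) with d0 = 1.24 (L − 15)^(1/3) − 1.8, where d_i is the distance between the i-th pair of aligned residues and L is the length of the reference protein. The Python sketch below evaluates this for a given list of distances; it does not search for the optimal superposition as TM-align does, it does not reproduce the special handling of very short chains, and the denominator used for the residue percentage is an assumption. All function names are hypothetical.

```python
def tm_d0(ref_length):
    """Length-dependent distance scale d0 of the TM-score."""
    if ref_length <= 15:
        raise ValueError("this sketch only handles reference lengths above 15")
    return 1.24 * (ref_length - 15) ** (1.0 / 3.0) - 1.8


def tm_score(distances, ref_length):
    """TM-score of one fixed superposition (no search for the optimum)."""
    d0 = tm_d0(ref_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / ref_length


def percent_within(distances, ref_length, cutoff=5.0):
    """Percentage of reference residues whose aligned partner lies within `cutoff` angstroms.

    Assumption: the percentage is taken over the reference length.
    """
    return 100.0 * sum(1 for d in distances if d <= cutoff) / ref_length
```

Because every aligned pair contributes a weight that decays smoothly with distance, a pair of structures can keep most residues within 5 Å and still lose a considerable share of TM-score when many of those residues sit several ångströms apart.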
Considering all the selected proteins encoding the Headpiece domain of villin, we constructed a phylogenetic tree (Figure 2). The tree illustrates the evolutionary link between the HP structures and was used to calculate the mutual evolutionary distances among them (Table 1). In Figure 2, 1QZPA and 1ZV6A lie close to each other, which is reflected in their distance value in Table 1. In contrast, 1QZPA and 1UJSA lie far apart in the tree, so their evolutionary distance should be greater than that of 1QZPA-1ZV6A, as is observed in Table 1. Similarly, the 2K6MS-2K6NA distance is zero, so these structures are expected to be evolutionarily very close, as seen in the tree in Figure 2. Such evolutionary distance analysis can be applied uniformly to all the structures (2RJVA to 1YU7X) shown in Figure 2. Considering the source organisms of these structures (listed in Table 2), we see that the Homo sapiens sequences are closer to one another than they are to the Gallus gallus sequences.
As in the Headpiece distance analysis, the selected SD structures were examined through their phylogenetic tree (Figure 3), which yielded their mutual distance matrix (Table 3). As shown in Figure 3, 1UNDA is closer to 1UNCA than to 3TJWA, which implies that 1UNDA-3TJWA might have evolved much more than 1UNCA-3TJWA, as Table 3 confirms.
Similarly, 1UNDA and 2JM0A are placed close together in Figure 3 and thus deviate minimally in Table 3. The tree also shows that 3TJWA, 3TRYA, 1YRFA, 3TRVA and 1YRIA have evolved to almost the same extent, so their mutual distances should be small, as shown in Table 3. As in the source organism analysis for the HP structures, sequences from the same species showed lower mutual evolutionary distances than sequences from different species.
Besides analyzing the HP and SD distances individually, we also combined both groups to construct a phylogenetic tree (Figure 4), whose distance data are listed in Table 4. Interestingly, some of the SD entities are evolutionarily closer to the HP structures. Apart from the shifted positions in the tree and minor deviations in the mutual distances of these structures, caused by the inclusion of more files, the overall outcome is similar to that reported above.
Following the mutual distance calculation for the HP proteins, we calculated the percentage of each amino acid in the selected structures (Table 5). The AVG row gives the average occurrence of a specific residue across all the selected structures, and the boldface values are the entries that differ from the average (AVG) value of that amino acid by more than the arbitrary threshold of 0.2. These boldface entries therefore mark the structures in which the percentage of an amino acid differs significantly from its average value. The count of such boldface structures for each amino acid can thus be used to calculate the percentage change of that residue (shown as aa% change), i.e. how often it is significantly altered across the selected HP structures. This analysis is plotted in Figures 5(a) and 5(b), which show the average percentage and the percentage change of each amino acid respectively. It reveals that leucine is the most abundant residue in each HP structure, followed in order by lysine, glutamic acid, alanine, phenylalanine, aspartic acid, proline, valine, asparagine, arginine, glycine, threonine, serine, glutamine, methionine, tryptophan, tyrosine, histidine, isoleucine and cysteine. Figure 5(b) shows, interestingly, that evolutionary variation among the selected structures has preferentially involved certain residues, viz. valine, threonine, proline, isoleucine, lysine, asparagine, aspartic acid, glutamine, phenylalanine, glycine and arginine, as their percentage is significantly altered in each of the selected HP structures. Conversely, the frequency of occurrence of tryptophan is largely unaltered, and the percentages of amino acids mostly follow a similar pattern across the different HP structures.
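The flagging rule described above can be sketched in Python as follows. The sketch takes a table of percentages per structure, computes the per-residue average, marks entries deviating from it by more than the threshold, and reports the share of flagged structures per residue. The function name and input layout are hypothetical, and reading "aa% change" as the share of flagged structures is our interpretation of the description, not a reproduction of the original script.

```python
def flag_variant_residues(table, threshold=0.2):
    """Flag entries that deviate from the per-residue average by more than `threshold`.

    `table` maps a structure id to a {residue: percentage} dict.
    Returns (average, flags, change_percent), where change_percent is the
    share of structures flagged for each residue.
    """
    structures = list(table)
    residues = sorted({aa for row in table.values() for aa in row})
    average = {
        aa: sum(table[s].get(aa, 0.0) for s in structures) / len(structures)
        for aa in residues
    }
    flags = {
        s: {aa: abs(table[s].get(aa, 0.0) - average[aa]) > threshold for aa in residues}
        for s in structures
    }
    change_percent = {
        aa: 100.0 * sum(flags[s][aa] for s in structures) / len(structures)
        for aa in residues
    }
    return average, flags, change_percent
```

A residue whose percentage is identical in every structure is never flagged and receives a change of 0%, which is the behaviour expected of a strictly conserved residue.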
As in the HP analysis above, the SD structures were examined as presented in Table 6 and Figure 6. Here too, the boldface values are those showing a significant change in percentage compared with the average. We observed that leucine is the most abundant residue in all SD structures, followed in order by lysine, phenylalanine, alanine, glutamine, glycine, glutamic acid or serine, aspartic acid, arginine, asparagine, methionine, proline or threonine, valine, tryptophan, histidine, isoleucine, tyrosine and finally cysteine.
Intriguingly, cysteine is absent from every SD structure, whether the natural protein (1VIIA) or any of the synthetic constructs. Furthermore, certain amino acids, viz. isoleucine, histidine, lysine, asparagine, aspartic acid, glutamine, tyrosine, methionine, phenylalanine and arginine, are plausibly evolutionarily selected for variation, as their percentage varies significantly across all the selected SD structures. In contrast, two amino acids, glutamic acid and serine, do not show variant percentages across the SD structures and thus appear to be very important for the SD conformation. As with HP, the amino acid percentages mostly follow a similar pattern across the different structures.
To cluster the amino acid percentage data by native source organism, we constructed Table 2, which shows that the frequency of a specific amino acid follows an almost identical pattern within a particular source organism. Isoleucine is absent from all the selected natural (non-synthetic) proteins from Gallus gallus, whereas it is present in all the selected Homo sapiens structures. Serine is found in all the selected Gallus gallus structures but in only three out of five Homo sapiens proteins. Similarly, tyrosine is present in all the Homo sapiens structures but in only three out of thirteen Gallus gallus proteins. Tryptophan is present in all the natural Homo sapiens and Gallus gallus proteins except one Gallus gallus structure. Cysteine is present in only two Homo sapiens structures, while histidine is restricted to a few Gallus gallus and Homo sapiens structures.
After examining the naturally available structures, a similar analysis was performed for the selected synthetic proteins. Isoleucine follows exactly the same trend here, i.e. it is absent from all the synthetic constructs with the source organism Gallus gallus. Cysteine is likewise absent from all the synthetic constructs. Asparagine, lysine and tryptophan are found in all but one synthetic construct. Conversely, tyrosine is present in only one synthetic construct. This analysis shows that the percentage of amino acids is mostly kept unaltered across synthetically constructed structures as well; in other words, the defined functional attributes are closely tied to the availability and localization of specific amino acids across these structures.
To compare the average amino acid percentages computed for our structural dataset against the average residue percentages naturally encoded in protein structures, we considered several published studies covering datasets ranging from only 118 proteins to 0.55 million proteins (Table 7). Here, AVG-118, AVG-1150 and AVG_0.55_million respectively represent the average amino acid percentages of 118 proteins from different super-families, around 1150 proteins and 549,616 proteins. Our computed average amino acid statistics for the HP and SD structures are listed as AVG-HP and AVG-SD. For most of the amino acids, we observed almost equivalent average percentages between the articles of Doolittle and Carugo [26,27]. The variation is comparatively higher for the Cornish-Bowden figures. While Cornish-Bowden considered only 118 proteins for the calculation of naturally occurring amino acid percentages, Doolittle increased the sample size to 1150 proteins, thereby making the resulting data more reliable. Carugo's work appears even more trustworthy in this regard, as it considered 549,616 proteins to compute the amino acid percentages naturally encoded among diverse proteins.
Comparing our observed statistics for the HP and SD structures with this reference data, we found substantial variation among the residue percentages. The observed difference between the average amino acid percentages of the AVG_0.55_million and AVG-HP data is shown as "Substantially Variant for HP", and the corresponding variation for SD as "Substantially Variant for SD". This observation can be explained, first, by the vastly different numbers of proteins considered in the two studies. Second, the AVG_0.55_million data encompasses an extremely large set of proteins, including both related and unrelated structures, whereas our minimal sample comprises solely structurally and functionally related proteins.
Although the amino acid percentage variation between AVG-HP and AVG_0.55_million is smaller than that between AVG-SD and AVG_0.55_million, the AVG-HP data still differs significantly from the AVG_0.55_million data. This is to be expected, as the structures used to calculate AVG-HP belong solely to the villin HP, so considerable similarity among the HP proteins is unsurprising. Moreover, according to Cornish-Bowden, "Natural peptides and small proteins in general have amino acid compositions that diverge much more from the average composition of all proteins". Given the short length of our SD and HP structures, this statement holds for our data as well, as shown in Table 7.
Following this sequence analysis of all the selected structures, we computed their structural similarity through the TM_Score and GDT residue percentage. As listed in Table 8, every structural comparison showed a remarkably similar structural topology of the selected proteins in terms of TM_Score. The GDT residue percentage was also quite high, which implies a reliable structural similarity among the selected structures. Comparing the Homo sapiens structure pairs 1UJSA-1QZPA and 2K6NA-2K6MS, we found that the former pair showed a TM_Score of 0.61148 against 0.90301 for the latter, and the pairs showed GDT residue percentages of 87.17949 and 98.50746 respectively. Hence, the TM_Score and GDT residue percentage figures each changed drastically even within the same organism.
Interestingly, after considering all such protein pairs, we observed that the TM_Score could differ drastically between two protein structures even when their GDT residue percentage was high, being greater than 90% in almost all cases. Moreover, when the GDT residue percentage was substantially lower, the TM_Score followed an almost similar trend. Because the TM_Score calculation emphasizes the distance deviation between equivalent residues of the compared structures, it discounted their global structural similarity even when they shared a high count of structurally similar and conserved residues within the allowed distance deviation. Such global structural similarity of proteins can be attributed to the evolutionarily unaltered characteristic core and functional domains.
In the same way, the SD structures were examined for mutual structural similarity (Table 9). Among all the selected SD structures (listed in Table 2), 1VIIA was the only naturally occurring conformation. Here we observed several TM_Score values less than 0.5. This implies high structural dissimilarity, so the structural similarity analysis appears uninformative for this group. Synthetic constructs are developed by altering certain amino acids in order to study specific properties of the protein, including nucleation and folding kinetics. The GDT residue percentage and TM_Score analysis, as done for HP, is therefore an insignificant and unreliable means of structurally studying synthetic constructs in reference to natural proteins, because the synthetic constructs encode specific residue alterations that are not attributable to the natural process of evolution.
Nevertheless, comparing these artificial SD structures against the natural 1VIIA, we observed a non-linear relationship between the GDT residue percentage and TM_Score measures, in contrast to the similar amino acid percentages encoded in these proteins. It was therefore reasonable to exclude them from further analysis. However, we had already used all the selected SD sequences, which share similar amino acid percentages among themselves, together with the HP sequences, which follow the same trend, to build the phylogenetic tree covering all the HP and SD structures (Figure 4). With the SD files eliminated from consideration, Figure 4 also becomes unrealistic for the structural comparison of proteins in our study. Leaving this plausible but strange result aside, it is remarkable that researchers have developed specifically variant conformations that retain the conserved backbone topology while encoding variant functional attributes.
Through the mutual structural comparison in Table 8, we screened the highest and lowest TM_Score match for each HP protein, denoted Highest_TM and Lowest_TM in Table 10, to select its most and least structurally similar counterpart. Table 10 also lists the GDT residue percentage (GDT Res %) and the phylogenetic distance (PHYLO_DIST), taken from Table 8 and Table 1 respectively. We also used Table 5 to screen the residue percentages that differ by more than 0.2 between a protein and its Highest_TM and Lowest_TM matches, and marked them with the letter Y (i.e. Yes). Table 10 further lists the average percentage alteration of each residue as "Percent Change". It shows that alanine, tyrosine and phenylalanine are the most frequently changed, or evolutionarily altered, amino acids (79.4%), and that glycine, arginine, aspartate, leucine, serine, asparagine, valine, threonine, methionine and glutamine are also significantly altered. The evolutionary alteration percentage for histidine and glutamate is exactly 50% (count of Y in the respective column), and this fraction is even lower for proline, lysine, cysteine and tryptophan.
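The selection of the Highest_TM and Lowest_TM partners can be sketched as a simple scan over pairwise scores. The Python below is illustrative only: the function name and input layout are hypothetical, and it assumes one symmetric score per unordered pair, whereas TM-align reports a score normalized by each of the two chain lengths.

```python
def extreme_partners(pair_scores):
    """For each structure, find the partner with the highest and the lowest score.

    `pair_scores` maps an unordered pair (id_a, id_b) to a single score.
    """
    by_structure = {}
    for (a, b), score in pair_scores.items():
        by_structure.setdefault(a, {})[b] = score
        by_structure.setdefault(b, {})[a] = score
    result = {}
    for sid, scores in by_structure.items():
        best = max(scores, key=scores.get)
        worst = min(scores, key=scores.get)
        result[sid] = {
            "Highest_TM": (best, scores[best]),
            "Lowest_TM": (worst, scores[worst]),
        }
    return result


# Only the three pair scores quoted in the text; this is not the full Table 8.
quoted = {
    ("1YU7X", "1YU8X"): 0.99097,
    ("1YU8X", "2RJYA"): 0.99691,
    ("3NKJA", "2RJYA"): 0.95413,
}
partners = extreme_partners(quoted)
```

Because only three quoted pairs are supplied here, the partners returned for this toy input need not match those listed in Table 10, which is built from the complete pairwise matrix.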
We noted that both the significantly altered and the less altered residues include several hydrophobic as well as hydrophilic amino acids. Because hydrophobic amino acids are usually located in the protein core, where they maintain the overall functional conformation, such residues are often presumed to be rarely altered during evolution. Here, however, some hydrophobic residue percentages have changed quite significantly, in parallel with those of hydrophilic residues. The 3NKJA-2RJYA structural comparison showed a GDT residue percentage of 100% and a TM_Score of 0.95413, despite a non-zero phylogenetic distance (PHYLO_DIST) of 0.0197. This can be attributed to the alteration of a few specific amino acids, here leucine and glycine (shown as Y in Table 10).
Similarly, the 1YU7X-1YU8X distance was smaller than the 1YU8X-2RJYA distance, which would imply that the former pair has the higher TM_Score; our analysis gave the opposite result. The count of altered amino acids was higher for 1YU7X-1YU8X than for 1YU8X-2RJYA (4/17 = 23.529% and 2/17 = 11.765% respectively), which implies that the former pair should have the lower TM_Score. This agrees with the observed values of 0.99097 and 0.99691 for the former and latter pairs respectively (Table 10). It further implies that the former pair, with its higher amino acid percentage alteration, should show the higher phylogenetic distance; to our surprise, it was exactly the other way round. Phylogenetic distance alone is therefore an incomplete measure of the similarity of two proteins, as a purely sequence-based analysis may prove misleading.
According to Table 10, the 3NKJA-2RJYA phylogenetic distance is greater than that of 1YU8X-2RJYA, and their TM_Scores, 0.95413 and 0.99691 respectively, are consistent with this ordering. In contrast, the 2RJXA-2RJWA and 3MYAA-2RJWA protein pairs showed the converse relationship. In these pairs, tyrosine and histidine show significant evolutionary alteration for 2RJXA-2RJWA, whereas the variant residues for 3MYAA-2RJWA are tyrosine and phenylalanine. As tyrosine is common to both pairs, their TM_Score difference can be attributed mainly to the change in the percentage of histidine and phenylalanine respectively. This further implies that the chemical nature of the variant amino acid, as well as its specific functional locus in the protein, is extremely important.
Throughout this analysis, the TM_Score is comparatively lower among sequences extracted from different source organisms, which shows that protein structures are relatively more conserved within a species, as shown earlier by Doolittle. In the phylogenetic distance data (Table 10), the GDT residue percentage and TM_Score figures are high for structures with zero phylogenetic distance, as expected. Comparing 2K6MS-2K6NA, we see no change in the encoded percentages of amino acids, yet their GDT residue percentage is not equal to 100 and their TM_Score is not equal to 1.
This observation could reasonably be attributed to the interaction of such structures with different or identical ligands in the varying microenvironments of the same or different source organisms. It might also result from minor sequence or structural shifts of certain residues, or from the evolutionary extension or shortening of protein sequences that nonetheless retain the earlier percentages of all the encoded amino acids. Furthermore, the Lowest_TM match is still greater than 0.5 even when the PHYLO_DIST is considerable, which highlights the structural conservation of the proteins. In Table 10, several residue percentages are significantly altered for the protein pairs with a low TM_Score. Intriguingly, pairs such as 3MYCA-2RJVA, with variant residue percentages, also show minimal phylogenetic distance and a considerable TM_Score.
We observe that functional copies of a protein from different organisms, with almost similar overall topology, show a wide range of amino acid percentage variation. To keep the study tractable, we did not consider evolutionary alterations at the DNA level, which could have produced silent mutations that leave the encoded amino acid at a given position unchanged. Thus only evolutionary alterations that change the encoded native amino acid are considered significant here.
Two proteins lying far apart in a phylogenetic tree are normally assumed to be significantly different in both sequence and structure. Our results contradict this assumption: we show several instances where phylogenetically distant proteins have a TM_Score of more than 0.5 and significant structural similarity. Interestingly, the phylogenetic distance among proteins belonging to the same species is smaller than that found in inter-species comparisons. Similarly, amino acid frequencies are similar among proteins belonging to the same species.
Regarding the phylogenetic trees and their subsequent analysis, we find that a sequence-based tree alone is inadequate for extracting the correct level of similarity among protein sequences. We should not categorize proteins as evolutionarily linked or divergent by relying only on a sequence-based phylogenetic tree. Hence, when structural information exists or can be predicted for protein sequences, we should not restrict ourselves to sequence-based phylogenetic analysis, because phylogenetically distant proteins need not be structurally dissimilar, as shown by our villin HP structural comparison study.
Intra-species structural comparison of the villin HP proteins shows that protein conformation is extensively conserved, and that the observed marginal structural shift is generally due to altered amino acids or to the differential availability of specific ligand molecules in the immediate micro-environment. We also observe that sequences with comprehensively altered amino acid percentages normally show low structural similarity in terms of TM_Score. In our residue percentage comparison across functionally similar proteins from different source organisms, tryptophan is found to vary the least, which suggests a vital role for tryptophan in maintaining the structural topology of the villin HP and indicates that it has not been extensively altered during evolution, so that the major structural topology is preserved. In contrast, alanine, tyrosine and phenylalanine are found to be the most variable amino acids across the selected villin HP proteins. It is commonly argued that hydrophobic amino acids are mostly buried in the protein core and do not show extensive evolutionary variation. However, in our observations the percentage change of hydrophobic and hydrophilic amino acids is almost the same. This study thus indicates that the count and physicochemical nature of the altered amino acids in a protein sequence are proportionally reflected in the alteration of its conformation.
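The ranking of residues from least to most variable can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the study: variability is taken here as the spread (maximum minus minimum) of each residue's percentage across a set of sequences, and the function names `composition_percent` and `rank_by_variation` are hypothetical.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_percent(sequence):
    """Return the percentage of each standard amino acid in a sequence."""
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}


def rank_by_variation(sequences):
    """Return (residue, spread) tuples sorted from least to most variable.

    Spread is an assumed measure: max minus min percentage across sequences.
    """
    profiles = [composition_percent(seq) for seq in sequences]
    if not profiles:
        raise ValueError("at least one sequence is required")
    spread = {
        aa: max(p[aa] for p in profiles) - min(p[aa] for p in profiles)
        for aa in AMINO_ACIDS
    }
    return sorted(spread.items(), key=lambda item: item[1])
```

Residues at the start of the returned list are the most conserved in composition, and those at the end are the most variable. A standard deviation could be substituted for the spread if outliers are a concern.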
Hence we can say that the amino acids with significantly altered percentages are the ones that are less important for the protein structure (including the original primary function) and are evolutionarily more readily substituted than other amino acids. The evolutionarily unaltered amino acids should thus be the more conserved ones, being important for functional stability and structure maintenance. It therefore seems that the amino acids which are evolutionarily altered in a protein are the ones which give the conformation an additional functional edge, such as increased functionality or half-life under the constraints of the exposed micro-environment. Hence we can state that protein sequences normally alter to a great extent during evolution, but can still retain their overall structural topology to maintain the native function.
From the comprehensively analyzed data, and considering both the listed and the implied information, we conclude that several residue-substituting mutations can normally occur in protein sequences without altering their function and overall topology. Although this study was done on villin HP, its conclusions can be extrapolated to other proteins as well. Here we find that the percentage of an amino acid can vary within a wide range across similar functional copies available in the same or different species, while the overall structure remains closely similar and retains its native function. Furthermore, nature alters functionally important amino acids to improve their specific role in the evolved conformation, which is still almost similar to the native structure. Likewise, nature also alters structurally insignificant amino acids to provide altered functionality to the evolved conformation, possibly due to the binding of new ligands.
We also conclude that a phylogenetic tree alone cannot extract the detailed similarity among protein sequences. Hence, a better phylogenetic model that incorporates the predicted structure of the considered protein sequences is needed to reliably find significant evolutionary or functional relationships among them. Such a structure-comparison-guided evolutionary tree would be more informative than the routinely used sequence-based phylogenetic trees.
Although the robustness of protein structures is well understood, we often assume that evolved sequences might have altered conformations. More precisely, this study illustrates that protein structures or their functional domains are not evolutionarily robust to every residue substitution. In other words, nature specifically tweaks certain amino acids in a protein domain to attain its desired function in its constrained micro-environment. To predict this structural robustness from the protein sequence alone, we can employ the characteristics of the altered residues along with their proximal sequence and structural context. Normally, we assume that a protein is evolutionarily susceptible at a few residues, which are responsible for the evolved functionalities of its conformation. However, these susceptible residues are not equally likely to evolve into any other residue. We should thus predict the consequence of such an alteration, so that the evolutionary information of every single protein residue can be correctly mapped and even distantly related templates can be reliably selected for modelling a protein sequence. Although this information is used in HMM profile based template search algorithms, they fail to extrapolate it to every single target residue, and they do not efficiently employ the evolutionarily related, functionally significant information of a residue together with that of the other residues lying within a predefined residue boundary cutoff. This matters because all such proximal residues within a defined structural boundary cutoff are responsible for the functionally positive evolutionary alteration of a single residue. Hence, if we properly track the evolutionary nature and structural implications of the altered residues, we can reliably link a protein to distantly related orthologs that are conventionally not considered.
Finally, this study is of particular importance for protein structure prediction methodologies that search for homologous and reliable templates for modelling a protein sequence. As per this study, a phylogenetically distant sequence may still share a similar structure, and we should try to find such structurally solved conformations when modelling a target sequence. In the routinely employed template search methodology, we could improve the substitution scores for same-column template profile residues that retain a reliable TM_Score in their structural comparison, while covering a maximal reliable span of the target sequence with a minimal number of selected templates. This might bring us closer to the ultimate goal of quickly and efficiently finding evolutionarily distant yet reliable hits for modelling an improved near-native conformation of a target sequence.
|Figure 1: Flowchart representing the overview of the complete methodology used in the research.|
|Figure 2: Phylogenetic tree for Headpiece.|
|Figure 3: Phylogenetic tree for Sub-domain.|
|Figure 4: Phylogenetic tree for Headpiece + Sub-domain.|
|Figure 5: (a) Graph for the average occurrence of amino acid residues in HP structures (b) Percentage variation in occurrence of amino acid residues in HP structures.|
|Figure 6: (a) Graph for the average occurrence of amino acid residues in the SD structures (b) Percentage variation in occurrence of amino acid residues in SD structures.|
|Table 1: Phylogenetic distances for Headpiece|
|Synthetic construct (Homo sapiens)||Leu||Ala||Val||Ser||Thr||Pro||Ile||His||Lys||Asn||Asp||Glu||Gln||Tyr||Trp||Met||Phe||Gly||Arg||Cys|
|Synthetic construct (Gallus gallus)||Leu||Ala||Val||Ser||Thr||Pro||Ile||His||Lys||Asn||Asp||Glu||Gln||Tyr||Trp||Met||Phe||Gly||Arg||Cys|
|Table 2: Amino acid percentage distribution according to the source of PDB files|
|Table 3: Phylogenetic distances for Sub-domain|
|Table 4: Phylogenetic distances for Headpiece + Sub-domain|
|HP amino acid%||Leu||Ala||Val||Ser||Thr||Pro||Ile||His||Lys||Asn||Asp||Glu||Gln||Tyr||Trp||Met||Phe||Gly||Arg||Cys|
Table 5: Percentage of amino acid residues for Headpiece
|SD amino acid%||Leu||Ala||Val||Ser||Thr||Pro||Ile||His||Lys||Asn||Asp||Glu||Gln||Tyr||Trp||Met||Phe||Gly||Arg||Cys|
|Table 6: Percentage of amino acid residues for Sub-domain|
|Our Work||(Cornish-Bowden, A., 1983)||(Doolittle, R.F., 1989)||(Carugo, O., 2008)|
|AVG-HP||AVG-SD||AVG-118||AVG-1150||AVG_0.55_million||Substantially Variant for HP||Substantially Variant for SD|
Table 7: Comparison of the amino acid percentage present in our selected proteins with general amino acid percentage found in diverse sets of naturally occurring proteins
Table 8: TM_Score and GDT residue percentage for Headpiece
Table 9: TM_Score and GDT residue percentage for Sub-domain
|TM_Score||GDT Res%||PHYLO_DIST||EVOLUTIONARILY ALTERED AMINO ACIDS|
Table 10: Overall change in TM_Score and amino acid residues