Are Bayesian Networks Apt To Predict Receptor Conformational States?
Part II. Bayesian inference applied to GPCRs in the past: predicting receptor-G-protein coupling
Predicting G-protein coupling is a tricky task for standard sequence alignment methods such as BLAST or ClustalW alone, because GPCRs with low overall similarity can couple to the same G-protein subtypes, whereas closely related receptors can couple to different G-protein subtypes; moreover, some receptors display a tendency to couple promiscuously (reviewed by Weiss 1998).
Nearly a decade ago, the following study successfully employed Bayesian inference to predict the G-protein coupling preferences of GPCRs. The paper is open access; a link to it and a summary of its contents are provided below.
A Naive Bayes Model for Predicting G-protein Coupling. Bioinformatics 19: 234-240
J Cao, R Panetta, S Yue, A Steyaert, M Young-Bellido and S Ahmad. 2003.
AstraZeneca R&D Montreal, Canada.
The supplementary information was supposedly viewable at:
http://www.astrazeneca-montreal.com/AZRDM_info/supporting_info.pdf.
However, the page no longer seems to exist. How disappointing!
Summary
The study constructed a model which predicted G-protein coupling with 72% accuracy for the 55 GPCRs tested, after training on more than 80 GPCRs whose coupling tendencies had already been identified empirically. The model predicted multiple G-protein coupling for the majority of the test set.
Methods: Naive Bayes Model
Naive Bayes performs efficiently in a model with many domains. As G-protein coupling involves the intracellular domains and the C-terminus, the model used a set of random variables {IC1, IC2, IC3, IC4, C}, where each ICn corresponds to an intracellular domain and C is the class variable, which takes a numerical value for each G-protein subtype: Gi/o, Gq/11 or Gs. Other G-protein subtypes were not included in the study because too little data was available in the early 2000s. The relationship between GPCRs and G-proteins was encoded as the joint distribution P(IC1, IC2, IC3, IC4, C), from which the training and test receptors are assumed to be drawn at random; under the naive Bayes assumption, the domain variables are conditionally independent given the G-protein class.
The joint probability for a given class label c:
P(ic1, ic2, ic3, ic4, c) = P(c)P(ic1|c)P(ic2|c)P(ic3|c)P(ic4|c).
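The posterior probability of class c for a given receptor is then scored by the product of the per-domain likelihoods (the class prior P(c) is left out of the score as written here):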
P(c|ic1, ic2, ic3, ic4)∝ P(ic1|c)P(ic2|c)P(ic3|c)P(ic4|c).
To allow for the coupling of multiple G-proteins to a receptor,
Let c’ = argmax {P(ic1|c)P(ic2|c)P(ic3|c)P(ic4|c)} where c ∈ C
Then Bayesian Classifier (BC) = [P(ic1|c’)P(ic2|c’)P(ic3|c’)P(ic4|c’)]/[P(ic1|c)P(ic2|c)P(ic3|c)P(ic4|c)]
BC = 1 when c = c’, and BC ≥ 1 when c ≠ c’.
(The study used BC ≤ 3.0 as the cutoff score for assigning the class labels c.)
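The ratio above can be sketched in a few lines of Python. This is only an illustration of the formula as summarized here, not the authors' Perl program: the likelihood values are invented, and the function names (class_scores, bayesian_classifier, predicted_couplings) are hypothetical.

```python
from math import prod

# Hypothetical per-domain likelihoods P(ic_i | c) for one receptor.
# The numbers are invented for illustration only.
likelihoods = {
    "Gi/o":  [0.40, 0.30, 0.50, 0.20],
    "Gq/11": [0.30, 0.25, 0.40, 0.20],
    "Gs":    [0.05, 0.10, 0.10, 0.05],
}

def class_scores(likelihoods):
    """Product of the four per-domain likelihoods for each class."""
    return {c: prod(p) for c, p in likelihoods.items()}

def bayesian_classifier(likelihoods):
    """BC(c) = score(c') / score(c), where c' is the best-scoring class."""
    scores = class_scores(likelihoods)
    best = max(scores.values())
    # Assumes every score is non-zero, which smoothed estimates guarantee.
    return {c: best / s for c, s in scores.items()}

def predicted_couplings(likelihoods, cutoff=3.0):
    """Assign every class whose BC falls at or below the cutoff."""
    bc = bayesian_classifier(likelihoods)
    return sorted(c for c, value in bc.items() if value <= cutoff)

print(bayesian_classifier(likelihoods))
print(predicted_couplings(likelihoods))
```

With these invented numbers the receptor would be assigned both Gi/o and Gq/11, because the Gq/11 score is within a factor of 3 of the best score, while Gs is far outside the cutoff. Setting the cutoff to 1 would keep only the single best class.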
Dataset:
146 GPCRs were divided into two sets: a training set of 91 GPCRs with a coupling ratio of Gi/o:Gq/11:Gs = 2:1.5:1, and a validation set of 55 GPCRs.
To avoid over-fitting due to uneven distribution, GPCRs whose intracellular domain sequences shared more than 80% sequence identity over more than 80% of the length of the query or the hit, as determined by BLAST, were excluded from the training set. This resulted in a training set of 83 GPCRs with all four intracellular domains, plus two sets of 4 GPCRs with two and three intracellular domains, respectively. About two-thirds of the GPCRs in the validation set contain at least one redundant intracellular domain.
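The redundancy criterion can be illustrated with a small Python sketch. It assumes the BLAST hits have already been parsed into records; the Hit fields, the function names and the greedy keep-the-first strategy are all assumptions made for this example, as the summary does not say how the authors chose which member of a redundant pair to drop.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    """One pairwise BLAST hit (hypothetical, already parsed)."""
    query: str
    subject: str
    pct_identity: float  # percent identity over the aligned region
    align_len: int       # length of the aligned region
    query_len: int
    subject_len: int

def is_redundant(hit, min_identity=80.0, min_coverage=0.80):
    """More than 80% identity over more than 80% of the query or the hit."""
    coverage = max(hit.align_len / hit.query_len,
                   hit.align_len / hit.subject_len)
    return hit.pct_identity > min_identity and coverage > min_coverage

def filter_redundant(ids, hits):
    """Greedy filter: keep a sequence unless it is redundant with a kept one."""
    redundant_pairs = {
        frozenset((h.query, h.subject))
        for h in hits
        if h.query != h.subject and is_redundant(h)
    }
    kept = []
    for seq_id in ids:
        if all(frozenset((seq_id, k)) not in redundant_pairs for k in kept):
            kept.append(seq_id)
    return kept

# Invented example: R2 is nearly identical to R1, while R3 only
# matches R1 over a short stretch.
hits = [
    Hit("R1", "R2", 95.0, 28, 30, 30),
    Hit("R1", "R3", 85.0, 10, 30, 32),
]
print(filter_redundant(["R1", "R2", "R3"], hits))
```

In this invented example R2 is dropped because it is redundant with R1, whereas R3 is kept because its alignment covers too little of either sequence.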
The sequence distribution of each intracellular domain in the training set was estimated and grouped phylogenetically by ClustalW. Based on the clusters, the distribution of the intracellular domain sequences of the underlying receptor population was approximated, so that de novo construction of phylogenetic trees is not required to update the conditional probabilities when data are added. In the study, the phylogenetic trees were partitioned so as to keep the nodes in a cluster within 5 branches of each other: such a tree typically had 8 to 10 clusters.
Learning From the Training Set:
A multinomial distribution was assumed, with the i-th intracellular domains of the n class c receptors distributed among k clusters.
With a conjugate Dirichlet prior, the posterior also follows a Dirichlet distribution. The Bayesian estimate of the conditional probability that the i-th intracellular domain of a class c receptor falls in cluster j, icij|c, is the posterior mean:
E (icij|c) = P(icij|c) = (αij + Nij)/(αi + Ni)
where α designates the hyperparameters (pseudo-counts), Nij is the number of class c receptors whose i-th domain falls in cluster j, and αi and Ni are the sums of αij and Nij over the k clusters.
Each hyperparameter was given the value of 1 in accordance with Laplace's rule, giving a non-informative prior that makes allowance for unknown features of receptor sequences.
The pseudo-counts were tabulated as α(new)ij = α(old)ij + N(old)ij, and the updated conditional probabilities were evaluated as (α(new)ij + N(new)ij)/(α(new)i + N(new)i).
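As a worked illustration of the estimate and its update, here is a minimal Python sketch for a single intracellular domain and a single G-protein class. The cluster counts are invented and the function names are hypothetical; only the arithmetic follows the formulas above.

```python
def posterior_mean(counts, alpha):
    """Posterior mean (alpha_j + N_j) / (alpha_total + N_total) per cluster."""
    total = sum(alpha) + sum(counts)
    return [(a + n) / total for a, n in zip(alpha, counts)]

def update_pseudocounts(alpha, counts):
    """Fold observed counts into the pseudo-counts: alpha_new = alpha_old + N_old."""
    return [a + n for a, n in zip(alpha, counts)]

k = 4                       # number of clusters (invented; the study had 8 to 10)
alpha = [1.0] * k           # Laplace rule: every hyperparameter set to 1
old_counts = [5, 3, 0, 2]   # invented cluster counts from the training set

p_old = posterior_mean(old_counts, alpha)
print(p_old)                # the empty cluster still gets 1/14, not 0

# New receptors arrive: fold the old counts into the prior, then re-estimate.
alpha_new = update_pseudocounts(alpha, old_counts)
new_counts = [1, 0, 1, 0]
p_new = posterior_mean(new_counts, alpha_new)
print(p_new)
```

The updated estimate is the same as the one obtained by pooling the old and new counts under the original prior, which is why the conditional probabilities can be refreshed without refitting from scratch.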
Calculating the likelihood probabilities:
The authors wrote a Perl program to produce the likelihoods from the intracellular domain sequences of a GPCR. The process was performed in the following steps:
1) The four intracellular domain sequences of a receptor were added to the FASTA files of the same domains of the training set;
2) Sequence alignment with ClustalW;
3) Phylogenetic tree files were parsed to find the closest homologue; and
4) Calculation of the likelihood probabilities.
Assessment
The accuracy of the results was measured and presented as confusion matrices.
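For readers unfamiliar with confusion matrices, a minimal Python sketch follows. It assumes one true and one predicted class per receptor, which is a simplification of the multiple-coupling case in the paper, and the labels are invented rather than taken from the study.

```python
from collections import Counter

CLASSES = ["Gi/o", "Gq/11", "Gs"]

def confusion_matrix(true_labels, predicted_labels, classes=CLASSES):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(true_labels, predicted_labels))
    return [[counts[(t, p)] for p in classes] for t in classes]

def accuracy(matrix):
    """Fraction of receptors on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Invented labels for six receptors, for illustration only.
true_labels = ["Gi/o", "Gi/o", "Gq/11", "Gs", "Gs", "Gq/11"]
predicted = ["Gi/o", "Gq/11", "Gq/11", "Gs", "Gi/o", "Gq/11"]

matrix = confusion_matrix(true_labels, predicted)
for name, row in zip(CLASSES, matrix):
    print(name, row)
print(accuracy(matrix))
```

Off-diagonal entries show which G-protein classes are mistaken for which, something a single accuracy figure hides.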
Results
The study validated 55 GPCRs, which have 165 possible couplings in total (55 receptors × 3 G-protein classes).
As the cutoff score of the Bayesian classifier (BC) was raised from BC = 1 to BC ≤ 3, some of the classification accuracies also increased over that range.
Multiple coupling was detected for at least 23 receptors among 146 in the total dataset; 7 of the 23 were in the validation set.
Future Work
The group employed Naive Bayes because it was not understood at the time how the intracellular domains relate to one another in G-protein coupling. In their paper the authors implied that they might have chosen a Tree Augmented Naive Bayes (TAN) model (Friedman et al. 1997) had the dependencies among the variables been clearer.
Now that more experimental data are available on the interactions between receptors and their effectors, not only the various G-proteins but also β-arrestins, and hopefully with other ongoing studies out there, an extended model could well be built in future, allowing a more advanced approximation of receptor-effector coupling tendencies for receptors that display uncertain behaviours. Such tools could serve as an additional verification method for obscure or inconsistent experimental findings when they occur. The models would help experimenters clarify their views with additional insight into cellular events, whenever gloomy shadows are cast in the dense pot of sticky proteins and lipids.
References
Cao J et al. 2003. A Naive Bayes Model for Predicting G-protein Coupling. Bioinformatics 19: 234-240
Friedman N, Geiger D & Goldszmidt M. 1997. Bayesian Network Classifiers. Mach. Learn. 29: 131–163.
Weiss J. 1998. Molecular basis of receptor/G-protein-coupling selectivity. Pharmacol Ther. 80: 231–264.
Also the linked Wikipedia articles (thank you to all the unnamed contributors).