Technical Highlight - April 2010
Short description: A new approach to automated functional subfamily classification works on huge superfamilies and is well suited to structural genomics.
Grouping proteins according to their function is not an easy task. There are several reasons for this: the overriding one is the lack of good experimental data for many proteins, but the lack of traceable author statements and errors from function-prediction software also contribute. The new GeMMA (Genome Modelling and Model Annotation) protocol, developed by Christine Orengo in partnership with PSI MCSG, is nevertheless able to classify very large and diverse superfamilies into functional subfamilies.
Although computational prediction of function has greatly improved over the past few years, most approaches still rely on sequence homology, and it is not clear what level of similarity is needed to transfer an annotation safely. As a result, annotations are often not very specific, and the error rate for the annotation of complete genomes is hard to determine: some workers estimate that it is greater than 40%, others that it is less than 5%.
There are three ways to predict protein function: phylogenomics, pattern recognition and clustering. Phylogenomics relies on evolutionary relationships within a family of proteins and so compares whole protein sequences. Pattern recognition classifies proteins using locally conserved sequence patterns; an example of this approach is Pfam, a comprehensive collection of protein families that is used extensively to guide target selection in structural genomics. Clustering groups together sequences on the basis of their similarity and displays them as a hierarchical tree.
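The clustering approach can be illustrated with a minimal sketch. It is a toy example, not the method of any tool named here: the similarity measure (fraction of identical positions in pre-aligned, equal-length sequences), the average-linkage rule and the function names `identity`, `cluster_similarity` and `agglomerate` are all assumptions made for illustration.

```python
from itertools import combinations


def identity(a: str, b: str) -> float:
    """Toy similarity: fraction of identical positions (assumes equal length)."""
    if len(a) != len(b):
        raise ValueError("toy similarity needs pre-aligned, equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_similarity(c1, c2) -> float:
    """Average-linkage similarity between two clusters (tuples of sequences)."""
    scores = [identity(a, b) for a in c1 for b in c2]
    return sum(scores) / len(scores)


def agglomerate(seqs, threshold):
    """Repeatedly merge the most similar pair of clusters.

    Stops when the best pair scores below `threshold`. Returns the final
    clusters and the merge history, which encodes the hierarchical tree.
    """
    clusters = [(s,) for s in seqs]
    history = []
    while len(clusters) > 1:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_similarity(clusters[p[0]], clusters[p[1]]),
        )
        score = cluster_similarity(clusters[i], clusters[j])
        if score < threshold:
            break
        history.append((clusters[i], clusters[j], score))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, history


clusters, history = agglomerate(["AAAA", "AAAT", "GGGG", "GGGC"], threshold=0.5)
```

With these four toy sequences and a threshold of 0.5, the two A-rich sequences end up in one cluster and the two G-rich sequences in another; the recorded merges give the order in which the tree was built.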
GeMMA combines two of these methods: pattern recognition and clustering. It is not the first hybrid method (SCI-PHY, Subfamily Classification In PHYlogenomics, is also a hybrid), but it is the first that does not require an initial multiple alignment of all sequences. The upshot is that much larger and more diverse superfamilies can be analysed than before. In addition, GeMMA can be 'trained' on well-annotated protein families to establish similarity thresholds that can then be applied to poorly annotated families.
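The idea of training a similarity threshold on annotated families can be sketched as follows. This is not the procedure described in the paper: the input format (pairs of a similarity score and a flag saying whether the two members share a function), the balanced-accuracy criterion and the name `best_threshold` are assumptions chosen to keep the example small.

```python
def best_threshold(scored_pairs):
    """Pick the similarity cutoff that best separates same-function pairs.

    scored_pairs: list of (similarity, same_function) tuples taken from
    well-annotated families (hypothetical input format).
    A pair is predicted "same function" when similarity >= cutoff.
    """
    positives = sum(1 for _, same in scored_pairs if same)
    negatives = len(scored_pairs) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need both same-function and different-function pairs")

    def balanced_accuracy(cutoff):
        tp = sum(1 for s, same in scored_pairs if same and s >= cutoff)
        tn = sum(1 for s, same in scored_pairs if not same and s < cutoff)
        return 0.5 * (tp / positives + tn / negatives)

    # Only observed scores need to be tried as candidate cutoffs.
    candidates = sorted({s for s, _ in scored_pairs})
    return max(candidates, key=balanced_accuracy)


training_pairs = [
    (0.9, True), (0.8, True), (0.7, True),
    (0.6, False), (0.4, False), (0.3, False),
]
cutoff = best_threshold(training_pairs)
```

The cutoff learned on the annotated set could then be reused as the stopping threshold when clustering a family whose annotations are too sparse to tune against directly.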
When GeMMA was compared with SCI-PHY, Orengo's team found that SCI-PHY was optimized for high specificity at the expense of sensitivity. GeMMA, by contrast, achieves a balance between sensitivity and specificity. In future, SCI-PHY and GeMMA might well be used together routinely, combining GeMMA's ability to handle large data sets with SCI-PHY's high specificity. A high-throughput version of GeMMA has also been developed.
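The trade-off between sensitivity and specificity can be made concrete with a small sketch. It uses the standard definitions (sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)) applied to pairs of sequences; this pair-counting scheme and the name `pair_metrics` are assumptions for illustration and are not necessarily the benchmark used in the paper.

```python
from itertools import combinations


def pair_metrics(true_labels, predicted_labels):
    """Sensitivity and specificity of a clustering, counted over sequence pairs.

    TP: same function, same cluster.      FN: same function, split apart.
    FP: different function, same cluster. TN: different function, kept apart.
    """
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = predicted_labels[i] == predicted_labels[j]
        if same_true and same_pred:
            tp += 1
        elif same_true:
            fn += 1
        elif same_pred:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity


functions = ["kinase", "kinase", "kinase", "lyase", "lyase"]
over_split = pair_metrics(functions, [0, 1, 2, 3, 4])   # every sequence alone
over_merged = pair_metrics(functions, [0, 0, 0, 0, 0])  # one big cluster
balanced = pair_metrics(functions, [0, 0, 0, 1, 1])     # matches the functions
```

In this toy case, splitting every sequence into its own subfamily gives a specificity of 1.0 but a sensitivity of 0.0, while merging everything gives the reverse. A method tuned only for specificity drifts toward the first extreme, which is why a balance between the two measures matters.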
David A. Lee, Robert Rentzsch & Christine Orengo. GeMMA: functional subfamily classification within superfamilies of predicted protein structure domains.
Nucleic Acids Res. 38, 720-737 (2010). doi:10.1093/nar/gkp1049