PSI Structural Biology Knowledgebase

Predicting Protein Crystal Candidates

SBKB [doi:10.1038/sbkb.2014.225]
Technical Highlight - October 2014
Short description: An improved algorithm predicts whether a protein will be amenable to crystallization for structural studies.

Bioinformatic analyses can recognize biophysical features favoring well diffracting crystals (blue bars) and those leading to poorly diffracting ones (gray bars). Figure courtesy of Adam Godzik.

Inordinate amounts of time and money are spent on failed attempts to produce diffraction-quality protein crystals. Software such as the XtalPred server helps avoid wasted effort by predicting the likelihood of successful crystal formation based on a protein's physicochemical properties. Homologs can often be selected or variants engineered with better chances of forming high-quality crystals. XtalPred uses an “expert pooling” method to combine probabilities of success based on seven protein properties.

Godzik and colleagues (PSI JCSG and Sanford-Burnham Medical Research Institute) now test a number of machine-learning methods and update XtalPred with a random forest classifier that yields superior performance. Random forest is an ensemble method that trains a large number of decision trees and combines their votes into a single prediction; each tree is built from a bootstrap sample of the training data and random subsets of variables. When trained on the same features, XtalPred-RF (XtalPred with random forest) doubles the performance of its predecessor, as measured by the Matthews correlation coefficient.
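The two ingredients named above, bootstrap sampling with random variable subsets and scoring by the Matthews correlation coefficient, can be illustrated with a toy example. This is a minimal sketch and not the XtalPred-RF implementation: each "tree" is reduced to a single threshold split (a decision stump), and all function names and the synthetic data are assumptions made for illustration.

```python
import math
import random

def fit_stump(rows, feature_ids):
    """Find the one-feature threshold split with the fewest training errors.

    rows: list of (feature_tuple, label), label 1 = crystallized, 0 = failed.
    Returns (feature_id, threshold, label_predicted_above_threshold).
    """
    best = None
    for f in feature_ids:
        values = sorted({x[f] for x, _ in rows})
        # Candidate thresholds: midpoints between consecutive distinct values
        # (fall back to the single value if the feature is constant).
        thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])] or values
        for thr in thresholds:
            for above in (1, 0):
                errors = sum((above if x[f] > thr else 1 - above) != y
                             for x, y in rows)
                if best is None or errors < best[0]:
                    best = (errors, f, thr, above)
    return best[1:]

def predict_stump(stump, x):
    f, thr, above = stump
    return above if x[f] > thr else 1 - above

def fit_forest(rows, n_trees, n_features_per_tree, seed=0):
    """Train each stump on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    n_features = len(rows[0][0])
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]  # draw with replacement
        feats = rng.sample(range(n_features), n_features_per_tree)
        forest.append(fit_stump(sample, feats))
    return forest

def predict_forest(forest, x):
    """Majority vote over all stumps."""
    votes = sum(predict_stump(s, x) for s in forest)
    return 1 if 2 * votes > len(forest) else 0

def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the coefficient is 0 when any marginal total is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0

if __name__ == "__main__":
    # Synthetic, well-separated data with two made-up features.
    data = [((i / 100, 0.30 - i / 100), 0) for i in range(10)]
    data += [((0.70 + i / 100, 1.0 - i / 100), 1) for i in range(10)]
    forest = fit_forest(data, n_trees=25, n_features_per_tree=1)
    predictions = [predict_forest(forest, x) for x, _ in data]
    print(matthews_corrcoef([y for _, y in data], predictions))
```

The coefficient ranges from -1 (every prediction wrong) through 0 (no better than chance) to +1 (every prediction right), which makes it a reasonable single-number summary when successes and failures are unevenly represented.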

XtalPred-RF also exploits a much larger training data set from the PSI TargetTrack database and incorporates additional surface features, including hydrophobicity, surface “ruggedness,” side-chain entropy and amino acid composition of the protein surface. The authors define ruggedness as the ratio of surface area (the sum of solvent accessibilities of individual residues) to the total accessible area estimated for a protein of a given mass.
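The ruggedness definition translates directly into a ratio. The sketch below assumes per-residue solvent accessibilities are already available in Å² and that the expected accessible area for a protein of the given mass is supplied by the caller, since the formula used for that estimate is not given here; the function and parameter names are hypothetical.

```python
def surface_ruggedness(residue_accessibilities, expected_area):
    """Ratio of the summed per-residue solvent accessibility to the
    accessible area expected for a protein of the same mass.

    residue_accessibilities: iterable of per-residue accessible areas.
    expected_area: caller-supplied estimate for the protein's mass,
                   in the same units (assumed here to be square angstroms).
    """
    if expected_area <= 0:
        raise ValueError("expected_area must be positive")
    return sum(residue_accessibilities) / expected_area
```

Under this definition, a value above 1 means the protein exposes more surface than expected for its mass, and a value below 1 means it exposes less.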

To convert binary classifications into a ranking system, XtalPred-RF includes a number of independent classifiers, each trained on a data set with a different proportion of successful and failed attempts at structure determination (the failed proteins are undersampled to differing extents to balance the data).
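One way to picture this scheme is sketched below. It is an illustration under stated assumptions, not the published method: how XtalPred-RF combines its classifiers into a rank is not described here, so the sketch simply counts positive votes, and `toy_train` is a hypothetical stand-in for a real learner whose strictness grows with the share of failures it was trained on.

```python
import random

def undersample_failures(successes, failures, failures_per_success, rng):
    """Keep every success and a random subset of failures at the given ratio."""
    n_keep = min(len(failures), round(failures_per_success * len(successes)))
    return list(successes) + rng.sample(failures, n_keep)

def build_ensemble(successes, failures, ratios, train, seed=0):
    """Train one binary classifier per failure-to-success ratio."""
    rng = random.Random(seed)
    return [train(undersample_failures(successes, failures, r, rng))
            for r in ratios]

def rank_score(classifiers, x):
    """Assumed combination rule: the number of classifiers voting 'success'."""
    return sum(clf(x) for clf in classifiers)

def toy_train(rows):
    """Hypothetical trainer: predicts success when a single score exceeds
    the fraction of failed examples in its training set."""
    failed_fraction = sum(1 for _, y in rows if y == 0) / len(rows)
    return lambda x: 1 if x > failed_fraction else 0

if __name__ == "__main__":
    successes = [(0.8, 1)] * 10   # (score, label) pairs, synthetic
    failures = [(0.2, 0)] * 40
    ensemble = build_ensemble(successes, failures, [1, 2, 3], toy_train)
    targets = {"A": 0.4, "B": 0.6, "C": 0.7, "D": 0.9}
    ranked = sorted(targets, key=lambda t: rank_score(ensemble, targets[t]),
                    reverse=True)
    print(ranked)
```

Because each classifier sees a different class balance, a target that only the most permissive classifiers accept receives a low score, while one accepted by all of them ranks at the top.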

The authors demonstrate XtalPred-RF on target selection from 271 Pfam families studied by the PSI JCSG. They estimate that, using the new software, 30% fewer structures would have been attempted, without affecting the number of families represented by solved structures. Thus, this new software promises to help identify individual targets, as well as facilitate high-throughput structural biology.

Tal Nawy


  1. S. Jahandideh, L. Jaroszewski & A. Godzik. Improving the chances of successful protein structure determination with a random forest classifier.
    Acta Crystallogr. D Biol. Crystallogr. 70, 627-635 (2014). doi:10.1107/S1399004713032070

Structural Biology Knowledgebase ISSN: 1758-1338
Funded by a grant from the National Institute of General Medical Sciences of the National Institutes of Health