Technical Highlight - October 2014
Short description: An improved algorithm predicts whether a protein will be amenable to crystallization for structural studies.
Inordinate amounts of time and money are spent on failed attempts to produce diffraction-quality protein crystals. Software such as the XtalPred server helps avoid wasted effort by predicting the likelihood of successful crystal formation based on a protein's physicochemical properties. Homologs can often be selected or variants engineered with better chances of forming high-quality crystals. XtalPred uses an “expert pooling” method to combine probabilities of success based on seven protein properties.
Godzik and colleagues (PSI JCSG and Sanford-Burnham Medical Research Institute) now test a number of machine-learning methods and update XtalPred with a random forest classifier that yields superior performance. Random forest is an ensemble method that builds a large number of decision trees and combines their predictions; each tree is grown from a bootstrap sample of the training data and considers random subsets of the variables. When trained on the same features, XtalPred-RF (XtalPred with random forest) doubles the performance of its predecessor, as measured by the Matthews correlation coefficient.
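For readers unfamiliar with the metric, the Matthews correlation coefficient (MCC) summarizes a binary confusion matrix as a single value between -1 and +1, which makes it useful when successes and failures are unevenly represented. The sketch below is a generic illustration, not the authors' evaluation code; the function name and the choice to return 0.0 when the coefficient is undefined are assumptions made here.

```python
import math

def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    tp/tn: correctly predicted successes/failures
    fp/fn: incorrectly predicted successes/failures
    Returns 0.0 when a row or column of the matrix is empty
    (a common convention, assumed here).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

# Illustrative counts only, not data from the study.
perfect = matthews_corrcoef(tp=50, tn=50, fp=0, fn=0)
chance = matthews_corrcoef(tp=25, tn=25, fp=25, fn=25)
```

A perfect classifier scores 1, a classifier no better than chance scores about 0, and a consistently wrong classifier scores -1.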
XtalPred-RF also exploits a much larger training data set from the PSI TargetTrack database and incorporates additional surface features, including hydrophobicity, surface “ruggedness,” side-chain entropy and amino acid composition of the protein surface. The authors define ruggedness as the ratio of surface area (the sum of solvent accessibilities of individual residues) to the total accessible area estimated for a protein of a given mass.
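The ruggedness definition above amounts to a simple ratio, sketched here in Python. The highlight does not say how the expected accessible area is estimated from protein mass, so this sketch takes that value as an input rather than guessing a formula; the function and parameter names are hypothetical.

```python
def ruggedness(residue_accessibilities, expected_area):
    """Ratio of observed to expected accessible surface area.

    residue_accessibilities: per-residue solvent accessibilities
        (same area units as expected_area).
    expected_area: total accessible area estimated for a protein of
        the same mass; the estimator itself is not specified here.
    """
    if expected_area <= 0:
        raise ValueError("expected_area must be positive")
    return sum(residue_accessibilities) / expected_area

# Illustrative numbers only: an observed surface twice the expected area.
example = ruggedness([100.0, 50.0, 50.0], expected_area=100.0)
```

Under this definition, a larger value means more exposed surface than expected for a protein of that mass.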
To convert binary classifications into a ranking system, XtalPred-RF includes a number of independent classifiers, each trained on a data set with a different proportion of successful and failed attempts at structure determination (the failed proteins are undersampled to differing extents to balance the data).
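One plausible way to turn several binary classifiers into a ranking is to count how many of them predict success. The sketch below illustrates that idea together with undersampling of the failed class. It is not the XtalPred-RF implementation: the helper names, the `failure_ratio` parameter and the vote-counting rule are all assumptions, and simple threshold functions stand in for trained random forests.

```python
import random

def undersample(successes, failures, failure_ratio, seed=0):
    """Keep all successes plus a random subset of failures.

    failure_ratio: failures kept per success (hypothetical parameter);
    varying it gives training sets with different class proportions.
    """
    rng = random.Random(seed)
    n_keep = min(len(failures), int(round(failure_ratio * len(successes))))
    return list(successes) + rng.sample(list(failures), n_keep)

def rank_score(classifiers, features):
    """Count how many binary classifiers predict success (1)."""
    return sum(1 for clf in classifiers if clf(features) == 1)

# Stand-in classifiers: each is a threshold on a single made-up score.
classifiers = [lambda x, t=t: 1 if x > t else 0 for t in (0.2, 0.4, 0.6)]

training_set = undersample([1, 2], [3, 4, 5, 6, 7], failure_ratio=1.0)
score = rank_score(classifiers, 0.5)  # two of three thresholds are exceeded
```

In this sketch, targets that more classifiers accept receive a higher score, so sorting by score gives an ordered list rather than a yes/no answer.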
The authors demonstrate XtalPred-RF on target selection from 271 Pfam families studied by the PSI JCSG. They estimate that, using the new software, 30% fewer structures would have been attempted, without affecting the number of families represented by solved structures. Thus, this new software promises to help identify individual targets, as well as facilitate high-throughput structural biology.
S. Jahandideh, L. Jaroszewski & A. Godzik. Improving the chances of successful protein structure determination with a random forest classifier.
Acta Crystallogr D Biol Crystallogr. 70, 627-35 (2014). doi:10.1107/S1399004713032070