PSI Milestones Documentation

Protein Structure Initiative (PSI) Steering Subcommittee on Goals and Milestones January 2007 (updated 8.30.07)

Steering Subcommittee on Goals and Milestones:
Chair: Gaetano Montelione
Members: Helen M. Berman, David Eisenberg, Wayne Hendrickson, Andrzej Joachimiak, George Phillips, Janna Wehrle, Stephen Burley, Ian Wilson
Advisor: Steven Brenner

Navigational Note: Links to available metrics are highlighted in red in the text below.

Mission Statement

The long-range goal of the Protein Structure Initiative is to make the three-dimensional atomic-level structures of most proteins easily obtainable from knowledge of their corresponding DNA sequences.

Broad Overall Goals: The National Institutes of Health-National Institute of General Medical Sciences (NIH-NIGMS) Protein Structure Initiative (PSI) was created to expand the impact and value of the Human Genome Project, and other genome sequencing projects, using three-dimensional (3D) protein structure analysis. The primary goals of the PSI include (i) large-scale protein structure determination by X-ray crystallography and nuclear magnetic resonance (NMR) methods, along with broad structural coverage of protein sequences through homology modeling, (ii) development of new technologies and infrastructure that accelerate the process of 3D protein structure analysis, and (iii) community outreach.

The most important goal of the PSI is to maximize the coverage of protein sequences with structural information. Selected gene products are being systematically prioritized with the aim of attaining structural coverage of every major protein domain family found in nature. Within this comprehensive target selection program, emphasis is placed on obtaining 3D structures of human proteins; fundamental and disease-causing proteins from bacterial, fungal, protozoal, and viral pathogens; proteins from model organisms such as M. musculus, C. elegans, D. melanogaster, and S. cerevisiae; and proteins from gram-positive and gram-negative bacteria. These include many proteins thought to represent drug discovery targets.

By 2010, the PSI aims to deliver more than 4,000 new 3D structures of proteins to the biological and biomedical research community, including more than 1,000 structures produced in the initial five-year pilot phase. Through the leverage provided by the Computational Modeling and the Knowledge Base Centers, these experimental 3D structures will be used to generate structure/function information for millions of gene products. In many cases, these PSI structures provide key clues to evolutionary and functional relationships among proteins that are not evident from sequence information alone, creating new opportunities for biological discovery. These novel biological insights can often only be gleaned by elucidation of 3D protein structure. The mission of the PSI also includes development of new technologies and methods aimed at reducing costs of structure production, and at providing 3D structures of particularly challenging proteins, such as membrane proteins, certain classes of eukaryotic proteins, and multi-protein complexes. A key goal of the PSI's community outreach efforts is to make 3D structure an important component of biological research. Finally, the PSI organizes and maintains an extensive database of protein sample production protocols, data, and reagents that are available to the broad scientific community.

Background: The PSI was initiated in 2000 by the National Institute of General Medical Sciences. The initial five-year phase of the program funded eleven pilot projects, aimed at developing core technologies for structural genomics and at creating the infrastructure required for large-scale protein structure production. In this pilot phase, ~1,300 protein structures were deposited into the public domain. The second phase of PSI (PSI-2), initiated on July 1st, 2005, supports four Large-Scale Research Centers, together with six additional Specialized Research Centers, two Computational Modeling Centers, the PSI Materials Repository, and a PSI Knowledge Base.

PSI-2 puts strong emphasis on determining 3D structures from (i) large families of protein domains (with tens to hundreds of members) for which essentially no 3D structural information is presently available, and (ii) very large families (with hundreds to tens-of-thousands of members) for which only limited 3D structure information is available. These include proteins from human and other model organisms of significant biological or biomedical interest. Protein target selection and 3D structure determination are coordinated across the PSI-2 centers to minimize redundancy. The program is also supported by extensive structural bioinformatics efforts that leverage these experimental data by structural and functional annotation, including large-scale homology modeling. The PSI Materials Repository provides infrastructure for distributing tens of thousands of physical reagents generated by the PSI program to the broader biological community. This highly integrated program is designed to enhance the value of the Human Genome Project and other large-scale gene sequencing projects using protein structure/function analyses, and to provide information, reagents, and technologies that will strengthen hypothesis-driven research programs in biology, chemistry, and medicine.

Specific Goals and Measures of Success

The following sections summarize the Specific Goals and Measures of Success in each of three areas: (i) Protein Structures, (ii) New Technologies, and (iii) Outreach to the Scientific Community.

Throughout these sections, an Experimental Structure is defined as one determined by X-ray crystallography or NMR methods having satisfactory structure quality assessment statistics, as defined in the Appendix, and deposited in the Protein Data Bank (PDB). Some of these statistics will be assessed centrally, by the PSI Knowledge Base, and others will be reported on a regular basis by PSI-2 centers.

I. Protein Structures

Goals for Protein Structure Production

The central objective of PSI-2 is to increase the total number of proteins whose structure can be inferred from knowledge of their respective DNA sequences. Toward this aim, the PSI-2 program will determine more than 3,000 high-quality unique experimental protein (or protein domain) structures (Experimental Structures) using X-ray crystallography or NMR spectroscopy. At the time each of these structures is determined, nearly all will have "distinct" non-redundant protein sequences, distinctly different from those previously determined and deposited in the PDB. Although these structures will be produced primarily by the Large-Scale Research Centers, the Specialized Research Centers will also contribute to the overall PSI-2 production of protein structures, particularly for challenging proteins.

Many of the proteins targeted in PSI-2 will be the first structural representatives from large families of protein domains with ten to thousands of members. In addition, there is high scientific value in obtaining structures of multiple members from highly diverse protein domain "Mega families" that include hundreds to tens-of-thousands of members and many subfamilies which cannot presently be modeled. These Mega families will be targeted both to advance our knowledge about the evolution of protein structure and function and to improve our understanding of normal physiology and disease in humans. Accordingly, an additional goal of PSI-2 is to sample extensively across these Mega families so as to provide structural and functional coverage.

In pursuing these central goals, PSI-2 will maximize the impact of experimental structures through computational homology modeling, and leverage the information content of the experimentally-determined protein structures using structural bioinformatics approaches. In particular, the Experimental Structures produced in PSI-2 will provide templates required for modeling the 3D structures of millions of proteins, including tens of thousands of human proteins.

Measures of Success for Experimental Structure Determination and Modeling Leverage

The following sections describe some key Measures of Success for PSI-2. These metrics provide a standardized means of counting Experimental Structures, assessing the impact of these Experimental Structures in the community, measuring the value of Experimental Structures in terms of structural models for related proteins, and estimating structural coverage for specific proteomes. Additional details and definitions of the metrics outlined in this section are presented in the Appendix.

I.1. Numbers of Experimental Structures and Residues

I.1.A. Number of Novel Experimental PSI-2 Structures. This metric enumerates the number of Experimental Structures (or domains within multi-domain Experimental Structures) deposited into the PDB for which, at the time of deposition, no 3D structure was publicly available for a close homolog, defined operationally as one with more than ~30% sequence identity over the length of the relevant segment of the polypeptide chain. These structures may resemble known protein structures, but are novel at the time they are deposited in the PDB in the sense that their structures cannot be predicted reliably by comparative modeling methods. The technical process for defining a Novel Experimental Structure is outlined in the Appendix. The majority of the 3,000 structures determined by PSI-2 would contribute to this metric.
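
The operational test in I.1.A can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the cutoff is fixed at exactly 30% (the text says "~30%"), the `PdbHit` record and the `is_novel` function are hypothetical names, and the alignments are assumed to have been computed already against structures that were public at the time of deposition. The procedure in the Appendix remains the authoritative definition.

```python
from dataclasses import dataclass
from typing import Sequence

# Assumed cutoff, taken from the "~30% sequence identity" wording in I.1.A.
# The Appendix procedure may differ in detail.
NOVELTY_IDENTITY_CUTOFF = 30.0


@dataclass
class PdbHit:
    """Hypothetical record: one alignment to a structure already in the PDB."""
    pdb_id: str
    percent_identity: float  # identity over the relevant segment, 0-100


def is_novel(hits: Sequence[PdbHit],
             cutoff: float = NOVELTY_IDENTITY_CUTOFF) -> bool:
    """Return True when no previously public structure is a close homolog.

    A close homolog is a hit with MORE than `cutoff` percent identity,
    so a structure with no hits at all counts as novel.
    """
    return all(hit.percent_identity <= cutoff for hit in hits)
```

For example, a target whose best prior match has 25% identity would be counted as novel under this sketch, while one with a 42% match would not.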

I.1.B. Number of Distinct Experimental PSI-2 Structures with Nonredundant Sequences. This metric enumerates structures of proteins (or protein domains) with sequences distinctly different (i.e. not identical in sequence, as specifically defined for a Distinct Experimental Structure in the Appendix) from sequences deposited in the PDB prior to completing the targeted PSI-2 structure. This metric counts separately the multiple homologues across a protein domain family that are not 'novel' by criterion I.1.A. Although most proteins are selected using criterion I.1.A, by the time some structures are completed they may no longer be Novel Experimental Structures. The deliverable of more than 3,000 Distinct Experimental Structures in PSI-2 refers specifically to this metric.

I.1.C. Number and Size of Domain Families for which PSI-2 provides the first Experimental Structure Representative. This metric enumerates the numbers and sizes of Domain Families, or Mega Family subclusters, for which PSI-2 provides the first Experimental Structure. As part of the process of target selection, the PSI-2 Production Centers are assigned specific families of protein domains, referred to as "BIG" or "Mega" families, which are selected and organized in a coordinated bioinformatics effort. Each of these domain families, selected on the basis of high novelty and leverage value by the PSI-2 Target Selection Committee, includes ten to thousands of members. Many of the 3,000 structures determined by PSI-2 would contribute to this metric.

I.1.D. Total Number of Experimental PSI-2 Structures. This metric enumerates all PDB depositions, including multiple structures of the same protein sequence determined by different methods (e.g., NMR versus X-ray crystallography), in different crystal forms, different solution conditions, or bound to different ligands. It would also count separately proteins that differ at just a few amino acid sites, which are not distinct by criterion I.1.B. The number of protein structures determined in PSI-2 that would contribute to this metric should significantly exceed the expected 3,000 unique Experimental Structures.

I.1.E. Numbers of Experimentally Determined Residues. Each of the measures above (I.1.A through I.1.D) will also be assessed on a residue basis; i.e., the number of residues for which structural information is provided will also be estimated, reflecting the value and challenge of determining larger protein (or domain) structures.
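
As a rough illustration of how the structure-based and residue-based counts relate, the following Python sketch tallies depositions for metrics I.1.A, I.1.B, and I.1.D. The `Deposition` record, its fields, and the `tally` function are hypothetical; the novelty and distinctness flags are assumed to have been assigned beforehand using the Appendix definitions. Metric I.1.C is omitted because it counts domain families rather than depositions.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Deposition:
    """Hypothetical summary of one PDB deposition."""
    pdb_id: str
    residues: int      # residues with experimentally determined structure
    is_novel: bool     # assumed to satisfy I.1.A at deposition
    is_distinct: bool  # assumed to satisfy I.1.B at deposition


def tally(depositions: Iterable[Deposition]) -> Dict[str, Dict[str, int]]:
    """Count structures and residues for metrics I.1.A, I.1.B, and I.1.D."""
    totals = {
        name: {"structures": 0, "residues": 0}
        for name in ("novel", "distinct", "total")
    }
    for dep in depositions:
        # Every deposition contributes to the total (I.1.D).
        categories = ["total"]
        if dep.is_distinct:
            categories.append("distinct")
        if dep.is_novel:
            categories.append("novel")
        for name in categories:
            totals[name]["structures"] += 1
            totals[name]["residues"] += dep.residues
    return totals
```

Reporting both counts side by side makes it visible when a center contributes fewer but larger structures.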

I.2. Impact and Classification of Experimental Structures

The following measures assess impact of Experimental Structures in expanding our knowledge about specific classes of proteins, and provide statistics on the classes of Experimental Structures determined in PSI-2.

I.2.A. Number of Experimental Structures from Specifically-Targeted BIG and Mega Domain Families. These families of domains, defined by the PSI-2 Target Selection Committee as having high value for extensive coverage, contain hundreds to tens-of-thousands of members and many subfamilies which cannot presently be modeled. They also include representatives in a broad range of proteomes, often including the human proteome.

I.2.B. Number of Experimental Structures from Biomedical Theme Target Lists. Center-specific biomedical themes of PSI-2 projects include (i) widely conserved domain families constituting central processes conserved across all kingdoms of life; (ii) domain families of phosphatases; (iii) proteins and domains involved in biological networks associated with cancers and other human diseases; and (iv) proteins and domains from the proteomes of pathogenic bacteria. In order to provide high-quality homology models of these important proteins, sequences with greater than 30% sequence identity with homologues in the PDB are often targeted for these biomedical targets.

I.2.C. Number of Experimental Structures from Community Outreach Target Lists. Community Outreach targets are defined by nominations from the broad biological community. Protein sequences with greater than 30% sequence identity with homologues in the PDB may be targeted in these community outreach efforts.

I.2.D. Number of Experimental Structures from Specifically-Targeted Organisms or Groups of Organisms. These include enumerations of protein structures from individual organisms, as well as metagenomes or metabiomes, as defined by the PSI-2 Target Selection Committee.

I.2.E. Numbers of Novel Chain Folds, New Multidomain Structures, or New Arrangements of Domains in Multidomain Proteins.

I.2.F. Number of Experimental Structures that Identify Previously Unrecognized Relationships Between Protein Domain Families. This metric enumerates cases where a homologous relationship that was not previously recognized by sequence similarity is discovered by structural similarity.

I.2.G. Number of Experimental Structures that are the First Protein Structural Representatives from Specific Functional Classes.

I.2.H. Number of Experimental Structures that Suggest Previously Unrecognized Biochemical (Molecular) Function(s).

I.2.I. Number of Experimental Structures that Provide Substantially New Biomedical Insights.

I.2.J. Number of Experimental Structures of Human Proteins.

I.2.K. Number of Experimental Structures of Eukaryotic Proteins.

I.2.M. Number of Experimental Structures of Membrane Proteins. Membrane proteins (or membrane protein domains) are defined operationally as proteins (or domains) that require detergent extraction from cellular membrane fractions for structural analysis.

I.2.N. Number of Experimental Structures Determined at the Atomic Level using X-ray Crystallography, Solution State NMR, and Solid State NMR methods, respectively.

I.2.O. Number and List of Publications Describing PSI-2 3D Structures.


I.3. Numbers of Sequences For Which Homology Models Can Be Produced from PSI Structures and Corresponding Coverage of Specific Proteomes.

A second key goal of the PSI is to leverage the information provided by these four thousand Experimental Structures through computational modeling, generating millions of homology models that will be invaluable for advancing many different areas of scientific investigation. These measures attempt to estimate how many such protein models can be constructed using a specific Experimental Structure, as well as assess the coverage of specific proteomes by experimentally-determined and modeled 3D structures.

A critical challenge in reporting such "Modeling Leverage" is assessment of the reliability of the resulting models. This is an area of current active research with no broadly accepted standards or conventions for assessing model accuracy. For the purpose of PSI-2, Modeling Leverage will be operationally defined based on sequence similarity using the conventions outlined in the Appendix. As modeling technologies improve, these conventions may be refined over time by the PSI Target Selection Committee.

The following metrics provide estimates of Modeling Leverage and Structural Coverage of specific proteomes. They will each be assessed in terms of numbers of protein structures and numbers of residues in these protein structures which can, in principle, be modeled from Experimental Structures. The Appendix provides detailed operational definitions and conventions for these measures that will be used to assess modeling leverage for PSI.

I.3.A. Total Modeling Leverage.

I.3.B. Novel Modeling Leverage.

I.3.C. Modeling Leverage and Coverage of the Human Proteome.

I.3.D. Modeling Leverage and Coverage of Proteomes of Model Organisms and Pathogenic Microorganisms.
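
To make the protein-based and residue-based views of coverage concrete, here is a minimal Python sketch. It assumes that the number of modelable residues per protein has already been determined using the sequence-similarity conventions in the Appendix; the `ProteomeEntry` record and the `coverage` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class ProteomeEntry:
    """Hypothetical record for one protein in a proteome."""
    protein_id: str
    length: int              # total residues in the protein
    modelable_residues: int  # residues assumed modelable from a template


def coverage(entries: Sequence[ProteomeEntry]) -> Dict[str, float]:
    """Return protein-level and residue-level coverage as fractions (0-1)."""
    total_residues = sum(e.length for e in entries)
    if not entries or total_residues == 0:
        return {"protein_fraction": 0.0, "residue_fraction": 0.0}
    # A protein counts as covered if any of its residues can be modeled.
    covered_proteins = sum(1 for e in entries if e.modelable_residues > 0)
    # Guard against inputs that report more modelable residues than exist.
    covered_residues = sum(min(e.modelable_residues, e.length) for e in entries)
    return {
        "protein_fraction": covered_proteins / len(entries),
        "residue_fraction": covered_residues / total_residues,
    }
```

The two fractions can differ substantially: a proteome in which many proteins have a single modelable domain would score high on the protein basis but much lower on the residue basis.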

II. New Technologies

Goals for Technology Development

The PSI-2 is committed to developing and making available technological and methodological advances that provide enabling infrastructure for biology, chemistry, and medicine. In addition to the Large-Scale Research Centers, the PSI-2 supports a number of Specialized Research Centers, whose mission is to develop novel technologies for target selection, protein production, and structure determination, particularly for challenging eukaryotic proteins, membrane proteins, and multi-protein complexes. The primary goal of these technology development efforts in both the Large-Scale and Specialized centers is to provide to the scientific community new technologies and protocols that reduce costs and improve the efficiency of protein sample production, and the speed and accuracy of experimental structure determination. This goal includes making accessible to the public the corresponding data, protocols, reagents, hardware, and software associated with PSI-2 supported technology development efforts.

Specific Goals for Technology Development include:

  • Cost reduction of protein sample preparation and experimental 3D structure determination by X-ray crystallography and NMR spectroscopy.
  • Advances providing improved efficiency in gene synthesis and cloning.
  • Advances providing improved efficiency in protein expression.
  • Advances providing improved efficiency in protein purification.
  • Advances providing improved efficiency in protein crystallization.
  • Advances in methods providing improved efficiency and/or accuracy of protein structure determination by NMR and X-ray diffraction.
  • Advances in structural genomics approaches for human and other eukaryotic proteins.
  • Advances in structural genomics approaches for membrane proteins.
  • Advances in structural genomics approaches for protein-protein and protein-ligand complexes.
  • Advances providing improved efficiency and accuracy in computational homology modeling of structures from sequences.
  • Advances in protein structure quality assessment and refinement.
  • Advances in determination of biochemical (molecular) functions from 3D protein structures.
  • Advances in developing laboratory information management systems (LIMS) for organizing and integrating information generated in large-scale structure production.

Measures of Success for Technology Development

The following measures will be reported regularly by each PSI-2 Center.

II.1. Numbers and list of publications on new technologies.

II.2. Numbers of citations of publications on new technologies.

II.3. Numbers and list of workshops organized by PSI-2 centers for the scientific community.

II.4. Numbers and list of intellectual property disclosures, licensing agreements, patents, and patent applications on technologies invented in PSI-2 Centers.

II.5. Adoption of technologies or methodologies by PSI and other structural genomics centers.

II.6. List of accomplishments relative to each of the Goals outlined above.

II.7. Decrease in Total Cost per Structure. Total Cost is defined as the sum of annual indirect plus direct costs awarded to each center. These would be assessed separately for the Large-Scale and Specialized Research Centers, using methods outlined in the Appendix. Of particular interest is the change in cost-per-structure from year to year. This metric will be refined further by the Milestones and Goals Committee.
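
The arithmetic behind this metric can be sketched as follows. This is a simplified illustration, not the method in the Appendix: it assumes that cost-per-structure is the annual Total Cost (direct plus indirect) divided by the number of structures counted for that year, and the function names are hypothetical. Which structure count is used as the denominator is left to the Appendix.

```python
def cost_per_structure(direct_costs: float,
                       indirect_costs: float,
                       structures: int) -> float:
    """Annual Total Cost (direct + indirect) divided by structures produced."""
    if structures <= 0:
        raise ValueError("structures must be a positive count")
    return (direct_costs + indirect_costs) / structures


def year_over_year_change(previous: float, current: float) -> float:
    """Percent change in cost-per-structure; negative values are decreases."""
    if previous <= 0:
        raise ValueError("previous cost-per-structure must be positive")
    return (current - previous) * 100 / previous
```

With purely illustrative figures, a center awarded 10,000,000 in total costs that produces 100 structures has a cost-per-structure of 100,000; if the following year's figure is 80,000, the year-over-year change is -20%.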

III. Outreach to the Scientific Community

Goals for Community Outreach

A key goal of the PSI is to make 3D structure an important component of biological research. This includes propagation of new technologies and structural information to the broad scientific community, and incorporation of project nominations from the community, whether they be specific targets, groups of targets, or methodological or technological advances. Approximately 15% of the effort of PSI-2 centers will be devoted to Community Outreach Targets. Given that these nominated projects are likely to be more difficult than the majority of PSI-2 structures, the PSI-2 plans to deposit 100 to 300 such structures of Community Outreach Targets in the PDB.

The PSI-2 is also providing comprehensive documentation of experimental protocols and interim results for gene cloning, gene expression, protein purification, protein crystallization, and structural characterization, including negative results. All PSI-2 Centers must deposit standardized sets of data on protein sample preparation and structure production in the public-domain PepcDB database, shortly after the data are generated.

A further goal of PSI-2 Community Outreach is to provide expression clones for all program targets, initially directly from the Centers, and eventually from the PSI-2 Materials Repository. Other reagents, such as small quantities of purified proteins, will also be provided when readily available.

Measures of Success for Community Outreach

The following metrics will be reported annually by each of the PSI-2 Centers, except where indicated otherwise.

III.1. Number of protein targets "accepted" from community requests for investigation by PSI-2 Centers.

III.2. Number of PDB depositions of Community Targets.

III.3. Number and lists of publications with joint authorship between PSI Centers and non-PSI investigators.

III.4. Numbers and lists of expression vectors, expression hosts, purified proteins, and other materials distributed to non-PSI investigators.

III.5. Numbers of hits on TargetDB and PepcDB web sites per unit time. It is recognized that this metric underestimates the use of these data, as it does not include offline usage of these databases. This metric will be assessed centrally by TargetDB and PepcDB.

III.6. Number of downloads of PSI-2 Experimental Structures from the PDB. It is recognized that this metric underestimates the use of these data, as it does not include offline usage of these databases. This metric will be assessed centrally by the PDB.

III.7. Number of accesses (hits) to PSI-2 Experimental Structures in PDB, PDBsum, SCOP, or CATH databases.

III.8. Numbers of attendees at Workshops offered by PSI-2 Centers, as self-declared by each center.

III.9. Numbers of seminars at non-PSI institutions/department meetings/national and international conferences on PSI activities given by PSI scientists, as self-declared by each center.

III.10. Numbers and names of student trainees (undergraduate, graduate, postdoctoral) (i) directly supported by and (ii) otherwise involved in PSI-2 sponsored research programs.

III.11. Numbers and names of visiting scientists (i) directly supported by and (ii) otherwise involved in PSI-2 sponsored research programs.

III.12. Numbers and names of underrepresented minority scientists involved in PSI-2 sponsored research programs.

III.13. Numbers of citations of published papers describing PSI-2 structures.