MrGrid: A Portable Grid Based Molecular Replacement Pipeline

Jason W. Schmidberger; Mark A. Bate; Cyril F. Reboul; Steve G. Androulakis; Jennifer M. N. Phan; James C. Whisstock; Wojtek J. Goscinski; David Abramson; Ashley M. Buckle

doi:10.1371/journal.pone.0010049

Abstract

Background

The crystallographic determination of protein structures can be computationally demanding and for difficult cases can benefit from user-friendly interfaces to high-performance computing resources. Molecular replacement (MR) is a popular protein crystallographic technique that exploits the structural similarity between proteins that share some sequence similarity. But the need to trial permutations of search models, space group symmetries and other parameters makes MR time- and labour-intensive. However, MR calculations are embarrassingly parallel and thus ideally suited to distributed computing. In order to address this problem we have developed MrGrid, web-based software that allows multiple MR calculations to be executed across a grid of networked computers, allowing high-throughput MR.

Methodology/Principal Findings

MrGrid is a portable web-based application written in Java/JSP and Ruby that takes advantage of Apple Xgrid technology. Designed to interface with a user-defined Xgrid resource, the package manages the distribution of multiple MR runs to the available nodes on the Xgrid. We evaluated MrGrid using 10 different protein test cases on a network of 13 computers, and achieved an average speed-up factor of 5.69.

Conclusions

MrGrid enables the user to retrieve and manage the results of tens to hundreds of MR calculations quickly and via a single web interface, as well as broadening the range of strategies that can be attempted. This high-throughput approach allows parameter sweeps to be performed in parallel, improving the chances of MR success.

Citation: Schmidberger JW, Bate MA, Reboul CF, Androulakis SG, Phan JMN, Whisstock JC, et al. (2010) MrGrid: A Portable Grid Based Molecular Replacement Pipeline. PLoS ONE 5(4): e10049. https://doi.org/10.1371/journal.pone.0010049

Editor: Haiwei Song, Institute of Molecular and Cell Biology, Singapore

Received: December 19, 2009; Accepted: March 18, 2010; Published: April 6, 2010

Copyright: © 2010 Schmidberger et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: The authors thank the National Health & Medical Research Council, Australian Research Council, Victorian Partnership for Advanced Computing, the Victorian Bioinformatics Consortium, Monash eResearch Centre, and the state government of Victoria (Australia) for funding and support. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

The most common method of protein structure determination is molecular replacement (MR). This involves using the structure of a protein that shares significant sequence similarity with the protein of unknown structure as a starting point in the structure determination. The process involves four steps: (1) using sequence-comparison methods such as PSI-BLAST [1] to identify suitable structures that can be used for MR; (2) modifying these structures (e.g., removing flexible loop regions and non-identical side chains) to yield search models; (3) finding the orientation and position of the search model in the unit cell of the target crystal; (4) refining the model using iterative model-building and maximum likelihood atomic refinement. Although there are other methods of structure determination, molecular replacement is predicted to become an increasingly common technique, for three reasons. First, as the number of new folds reported in the Protein Data Bank (PDB) is decreasing, it is increasingly likely that the unknown target structure will belong to a known fold. Second, the emergence of more sophisticated sequence searching algorithms, such as profile-profile matching [2], improves the probability of finding a suitable search model, even in cases of very low similarity (<20% identity). Third, the MR algorithms themselves are steadily improving.

Where the sequence similarity between the unknown target and the search model is high (sequence identity >40%) the success rate of MR is very good, even without optimisation of the search model. However, in cases of low similarity (identity <30%) MR and subsequent structure refinement become non-trivial, and can require more complicated strategies to effect a solution. Bearing this in mind, there are several criteria that affect the outcome of the MR calculation: 1) structural similarity between search model and target structure (measured by root mean square deviation (RMSD)); 2) percentage of residues missing from the search model (coverage); 3) the number of conserved side chains (those expected to remain structurally conserved, e.g., in the protein interior). These factors, and thus the outcome of the MR calculation, can be influenced by improvement of the search model. The simplest approach is to remove regions of the structure that are predicted to differ between the search model and the target. Within a conserved protein family, the largest structural deviations are typically seen within the loop regions. Therefore, these regions are the first candidates for removal from the search model. However, this process is a subjective one and relies on sequence alignments, which are often incorrect, particularly at low sequence identity. Thus it is often unclear which loops should be removed and, furthermore, how much of each loop should be removed. Each edited model must be tested individually in the lengthy structure determination process, with no indication until the later stages of refinement that the structure should be abandoned.

In addition to edits to the search model, other parameters can greatly influence the outcome of an MR calculation. For instance, the presence/absence and handedness of screw axes will remain unknown until a final structural solution is found, so several alternatives must be tested in the MR calculation. In addition, the estimated RMSD between the search model and unknown structure can affect the outcome of the MR calculation, leading in the worst case to probable solutions being missed. Therefore, the combination of multiple models, space groups and RMSD values makes MR time- and labour-intensive, and puts an emphasis on the availability and power of computational resources. In order to address this problem we have developed MrGrid, web-based software that allows multiple MR calculations, using the program PHASER [3], to be executed across a grid of networked computers, allowing high-throughput MR.

Methods

MrGrid Overview

MrGrid is a portable web-based application written in Java/JSP and Ruby that takes advantage of Apple Xgrid technology. Designed to interface with a user-defined Xgrid resource, the package manages the distribution of multiple MR runs to the available nodes on the grid and reports all returned results. Utilizing the maximum likelihood based molecular replacement program PHASER [3], MrGrid enables the user to retrieve and manage the results of tens to hundreds of MR calculations quickly and via a single web interface. It also broadens the range of strategies that can be attempted, increasing the likelihood of success.

Using MrGrid to perform parallel MR on a local network

MrGrid is distributed as a self-contained software package that is downloaded and executed across a local (user-managed) grid resource. Once set up, MrGrid is accessed through a web portal (Figure 1). Apple Xgrid software is preinstalled on Apple operating systems OS X 10.4–10.6, allowing machines to be configured as Xgrid clients by simply ticking a box in System Preferences. By default MrGrid processes on the client are given low priority, such that the client remains fully responsive. The remaining requirement is a networked machine running Mac OS Server 10.4–10.6, which acts as the Xgrid controller. This ease of Xgrid setup is a distinct advantage when deploying MrGrid.

Figure 1. MrGrid web interface showing user input for (a) MTZ file, sequence and space group(s); and (b) search model(s) and RMSD values.

https://doi.org/10.1371/journal.pone.0010049.g001

MrGrid will first request that the user uploads the processed structure factor data (in MTZ format) for the MR calculation (Figure 1a). The file is uploaded and parsed to extract the space group (SG) & possible “F” & “SIGF” labels contained within the file. In the case that there is more than one possibility for “F” & “SIGF” labels the user is asked to select the appropriate one. The sequence of the unknown protein can be optionally provided and is used to calculate the molecular weight (Figure 1a). The expected number of molecules in the asymmetric unit (ASU) must also be entered. The user is also presented with the expanded point group of the space group that has been extracted from the MTZ file. This allows the user to expand their search, by selecting any number of SG combinations. Alternatively the user can select the “SGALTERNATIVE ALL” check box, which will search all possible point group SGs on a single node for each search model.
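The molecular weight calculation mentioned above can be illustrated with a minimal Python sketch. The paper does not state how MrGrid or PHASER performs this calculation, so the approach shown here (summing standard average residue masses and adding one water for the chain termini) is an assumption, and the function name `molecular_weight` is hypothetical.

```python
# Standard average residue masses in daltons (amino acid minus one water).
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528  # one water is added back for the free N- and C-termini


def molecular_weight(sequence: str) -> float:
    """Estimate the molecular weight (Da) of a protein from its one-letter sequence.

    Hypothetical helper: an assumption about how such a value could be derived,
    not the calculation MrGrid itself performs.
    """
    residues = [c for c in sequence.upper() if not c.isspace()]
    unknown = sorted(set(residues) - set(AVERAGE_RESIDUE_MASS))
    if unknown:
        raise ValueError(f"unsupported residue codes: {', '.join(unknown)}")
    return sum(AVERAGE_RESIDUE_MASS[r] for r in residues) + WATER_MASS
```

Non-standard residue codes raise an error rather than being silently skipped, since an underestimated molecular weight would distort the expected ASU content.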

Having defined the experimental data, the user must now input the search models by uploading a compressed archive file (either zip or tar) containing all search models in PDB format (Figure 1b). The number of copies of the search model to search for is then entered, along with a packing tolerance and the RMSD values to test. Before job submission, the user selects the Xgrid resource to use.
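As an illustration of the archive-handling step, the following Python sketch lists the PDB-format search models contained in an uploaded zip or tar archive. MrGrid itself is written in Java/JSP and Ruby, so this is not its implementation; the function name `list_search_models` and the choice to identify models by a `.pdb` file extension are assumptions.

```python
import tarfile
import zipfile
from pathlib import PurePosixPath


def list_search_models(archive_path):
    """Return the sorted member names of all .pdb files in a zip or tar archive.

    Hypothetical helper: models are recognised purely by file extension.
    """
    if zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path) as archive:
            names = [i.filename for i in archive.infolist() if not i.is_dir()]
    elif tarfile.is_tarfile(archive_path):
        with tarfile.open(archive_path) as archive:
            names = [m.name for m in archive.getmembers() if m.isfile()]
    else:
        raise ValueError(f"{archive_path} is neither a zip nor a tar archive")

    models = []
    for name in names:
        path = PurePosixPath(name)
        # Skip hidden files such as the "._*" resource forks macOS adds to archives.
        if path.suffix.lower() == ".pdb" and not path.name.startswith("."):
            models.append(name)
    return sorted(models)
```

The length of the returned list is the "number of search models" that enters the job count.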

MrGrid will then analyse the user's input, breaking it down into smaller jobs and distributing them to available nodes on the grid resource that the user selected. The number of jobs will be equal to the number of space group options selected (note: SGALTERNATIVE is counted as a space group option), multiplied by the number of search ensembles contained within the compressed archive uploaded, multiplied by the number of RMSD options.
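The job count described above is the size of the Cartesian product of the three parameter lists. A minimal Python sketch of that enumeration follows; the function name `enumerate_jobs`, the dictionary keys and the example file names are hypothetical, and MrGrid's own code (Java/JSP and Ruby) may organise this differently.

```python
from itertools import product


def enumerate_jobs(space_group_options, search_models, rmsd_values):
    """Return one job description per (space group option, model, RMSD) combination.

    "SGALTERNATIVE" counts as a single space group option, as stated in the text.
    """
    return [
        {"space_group": sg, "model": model, "rmsd": rmsd}
        for sg, model, rmsd in product(space_group_options, search_models, rmsd_values)
    ]


# Example: one space group option, 6 search models and 5 RMSD values
# (file names and RMSD values are made up for illustration).
jobs = enumerate_jobs(
    ["SGALTERNATIVE"],
    [f"model_{i}.pdb" for i in range(1, 7)],
    [0.5, 1.0, 1.5, 2.0, 2.5],
)
print(len(jobs))  # 1 * 6 * 5 = 30
```

Because every job is an independent combination, the list can be handed to the grid in any order without coordination between jobs.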

The node that the job runs on is passed a small Ruby script, the MTZ file, one of the search models extracted from the compressed archive, parameters that were entered by the user, as well as the space group derived from the MTZ file. The Ruby script writes a PHASER command script in standard CCP4 [4] format, which contains instructions for PHASER to run the job, and is executed by the node. By default, MrGrid will always run PHASER in MR_AUTO mode.
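To make the script-generation step concrete, the sketch below assembles keyword input for a single MR_AUTO job. It is written in Python rather than the Ruby used by MrGrid, and it is not the authors' script. The keyword spellings (`HKLIN`, `LABIN`, `SPACEGROUP`, `SGALTERNATIVE SELECT ALL`, `ENSEMBLE ... PDBFILE ... RMS`, `COMPOSITION PROTEIN MW ... NUMBER`, `SEARCH ENSEMBLE ... NUMBER`, `ROOT`) are assumptions based on the PHASER keyword documentation and vary between PHASER versions, so they should be checked against the installed release. The packing tolerance is omitted for the same reason, and the function name and arguments are hypothetical.

```python
def build_phaser_script(mtz_file, f_label, sigf_label, model_file, rmsd,
                        mol_weight, n_in_asu, n_copies, root, space_group=None):
    """Return PHASER keyword input for one MR_AUTO job as a string.

    Assumed keyword syntax; verify against the PHASER version in use.
    space_group=None means "test all alternative space groups in this job".
    """
    lines = [
        "MODE MR_AUTO",
        f"HKLIN {mtz_file}",
        f"LABIN F={f_label} SIGF={sigf_label}",
    ]
    if space_group is None:
        lines.append("SGALTERNATIVE SELECT ALL")
    else:
        lines.append(f"SPACEGROUP {space_group}")
    lines += [
        # One search model per job, with the RMSD estimate for this run.
        f"ENSEMBLE model PDBFILE {model_file} RMS {rmsd}",
        f"COMPOSITION PROTEIN MW {mol_weight:.0f} NUMBER {n_in_asu}",
        f"SEARCH ENSEMBLE model NUMBER {n_copies}",
        f"ROOT {root}",
    ]
    return "\n".join(lines) + "\n"
```

Generating the keyword text from the job parameters, rather than editing a template by hand, is what allows each parameter combination to become an independent job.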

Once submitted to a node, each job runs to completion independently of all other jobs, and a URL where the results of the submission can be accessed is returned to the user. The results page presents a brief summary of the jobs; if a job finds a solution, MrGrid will display the Z-score and log-likelihood gain (LLG) in the job summary (Figure 2). The jobs are also hyperlinked, allowing the user to navigate quickly within the page to the full set of results for a job. The full set of results for a job is made up of several expandable and collapsible elements, from which the user can view or download the output PDB and MTZ files, along with the standard PHASER solution file (.sol) and summary file (.sum). The complete log file for the job can also be viewed. The original MTZ file and search model used for the job are also available for download.

Figure 2. Typical results interface showing PHASER jobs running on Xgrid, allowing the user to view the results of completed jobs independently of others running.

https://doi.org/10.1371/journal.pone.0010049.g002

Test Case Selection

A set of 10 proteins were used as test cases, representing 8 different SCOP [5] families (Table 1), and allowing for the parallel execution of 4 to 54 jobs at any one time. PDB entries were selected on the basis of having 3 or more homologous structures in the PDB, with datasets from a range of point group symmetries (Table 1). Both coordinate and structure factor information was retrieved from the PDB for each protein, along with peripheral information necessary for running the MR experiments (e.g., sequence, ASU content). Homologues for each test protein were identified through a BLASTP search of the PDB using the NCBI server (http://www.ncbi.nlm.nih.gov/). MR search models were generally chosen on the basis of a >30% sequence identity across the majority of the monomer of interest (i.e. no partial matches). The exception was the hypothetical protein TTHA0727 test case, which represented cases of lower identity (<30% ID) along with some examples of subdomain insertion within the chosen search models, relative to the test case protein (PDB ID 2CWQ). The purpose of this example was to provide a non-trivial MR example using a divergent group of proteins (from the AhpD-like Superfamily [6]).

Table 1. List of test case proteins extracted from the Protein Data Bank (PDB).

https://doi.org/10.1371/journal.pone.0010049.t001

Results and Discussion

Xgrid-accelerated parallel MR using test cases

The purpose of this phase of testing was not to assess the capacity of PHASER to perform MR. Rather, it was our intention simply to investigate the advantage of using MrGrid when screening multiple PHASER jobs at one time. Experimental data (structure factors) taken from the PDB for the 10 proteins listed in Table 1 were each used in test case experiments on MrGrid in order to demonstrate the utility of the system under standard MR situations. For each protein example, data were screened against each homologue search model (including self), searching all alternative SGs belonging to the reported point group (Table 2). The number of jobs submitted to our local Xgrid (Table 3) varied between 4 and 54, and the corresponding speed-up factors showed a clear linear relationship with a correlation of 0.85 (Figure 3). With an average speed-up value of 5.69 across all the tests, it is clear that MrGrid has the capacity to significantly reduce the time taken to achieve an MR result when screening numerous parameters.

Figure 3. Graph depicting the linear relationship between the numbers of jobs submitted to the Xgrid and the respective speed up values.

Speed-up is calculated by dividing linear run time by MrGrid total run time. Linear run time is defined as the sum of the run times of all jobs (job1_runtime + job2_runtime + … + jobN_runtime). MrGrid total run time is defined as the time difference between the start of the first job and the end of the last job (jobN_finish - job1_start). The linear run time is intended to provide an estimate of how long the jobs would take to run sequentially on one computer. r² represents the ‘goodness of fit’ of the linear regression line to the data points. y is the intercept on the y axis.

https://doi.org/10.1371/journal.pone.0010049.g003
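The speed-up definition given in the Figure 3 legend can be written as a short Python function. This is a sketch of the stated formula, not code from MrGrid. The function name `speed_up` and the representation of each job as a `(start, finish)` pair in seconds are assumptions, and the "first start" and "last finish" are taken as the minimum and maximum over all jobs.

```python
def speed_up(jobs):
    """Speed-up = linear run time / total wall-clock run time.

    jobs: iterable of (start, finish) times in seconds for every job in a submission.
    """
    jobs = list(jobs)
    if not jobs:
        raise ValueError("at least one job is required")
    # Linear run time: what the jobs would cost if run one after another.
    linear_run_time = sum(finish - start for start, finish in jobs)
    # Total run time: from the earliest job start to the latest job finish.
    total_run_time = max(f for _, f in jobs) - min(s for s, _ in jobs)
    return linear_run_time / total_run_time


# Four illustrative jobs (times are made up): 400 s of work in 200 s of wall-clock time.
print(speed_up([(0, 100), (0, 120), (10, 90), (100, 200)]))  # 2.0
```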

Table 2. Summary of MrGrid results for 10 test cases studied.

https://doi.org/10.1371/journal.pone.0010049.t002

Table 3. Specifications of the Xgrid resource utilized during this study.

https://doi.org/10.1371/journal.pone.0010049.t003

Though more exhaustive testing may reveal a levelling off of the speed-up factors as the number of jobs exceeds the capacity of the grid, the results depicted in Figure 3 display a clear advantage up to 54 jobs when run on our local Xgrid (Table 3). It is important to note that any particular test case will always run at least as long as its longest job. Nevertheless, in addition to speeding up MR calculations, MrGrid provides a convenient solution to screening MR input parameters via a simple web page. It is also important to differentiate between making use of spare CPU cycles on desktop computers to form ‘desktop grids’, as we do here, and using dedicated cluster nodes. Performing our experiments on dedicated cluster nodes would clearly increase the efficiency of the calculations.
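The two limits discussed above (grid capacity and the longest single job) can be explored with a small idealised model. The Python sketch below assumes identical nodes that each run one job at a time, no scheduling or transfer overhead, and greedy assignment of each job to the node that becomes free first. None of these assumptions is taken from the paper, and the function name `simulated_speed_up` is hypothetical.

```python
import heapq


def simulated_speed_up(runtimes, n_nodes):
    """Idealised speed-up for jobs with the given runtimes on n_nodes identical nodes."""
    if not runtimes or n_nodes < 1:
        raise ValueError("need at least one job and one node")
    # Each heap entry is the time at which a node next becomes free.
    free_at = [0.0] * n_nodes
    heapq.heapify(free_at)
    for runtime in runtimes:
        earliest = heapq.heappop(free_at)      # node that frees up first
        heapq.heappush(free_at, earliest + runtime)
    makespan = max(free_at)                    # wall-clock time until the last job ends
    return sum(runtimes) / makespan


# 54 equal one-hour jobs on 13 nodes need 5 rounds, so the speed-up is 54 / 5.
print(simulated_speed_up([1.0] * 54, 13))
# One long job dominates: the speed-up cannot exceed total work / longest job.
print(simulated_speed_up([10.0, 1.0, 1.0, 1.0], 13))
```

In this model the speed-up can never exceed the number of nodes or the ratio of total work to the longest job, which is one way to see why the measured factors would be expected to level off.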

Example of an evasive MR solution

We set out to test the utility of the MrGrid approach for a challenging MR case where the sequence similarity of available search models is relatively low (<25%). We chose the peroxidase-related protein yp_604910.1 from Deinococcus geothermalis (PDB ID 2OYO). A globular all-alpha helical protein, it features two monomers in the ASU. Its structure was determined by MAD to 1.52 Å resolution (unpublished). After performing a sequence similarity search using FFAS [2] we identified two potential search models, with sequence identities of 19% (2GMY) and 24% (2O4D). We generated “mixed” models of each (consisting of conserved side chains, with all other non-alanine/glycine residues truncated at the Cγ atom) using the SCWRL server [7], as well as poly-Ala models with and without loop regions. This generated a total of 6 search models, which were input into MrGrid. Further screening against 5 RMSD bins generated 30 separate runs of PHASER, looking for both monomers in the ASU. The majority of calculations took >5 hours to complete. In order to assess whether solutions would refine using standard procedures, we input all solutions having Z-scores greater than 7.0 into the refinement program REFMAC [8] and the automatic building and refinement program ARP/wARP [9]. Of the 7 solutions tested, only one (Z-score = 9.2) produced a substantial decrease in R_free (initial = 56%, final = 49%) and was successfully built to near completion in ARP/wARP.
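The selection step described above (passing only solutions with a Z-score greater than 7.0 on to refinement) amounts to a simple filter over the job summaries. The Python sketch below illustrates it. The record layout, the function name `solutions_for_refinement` and the example values are hypothetical and are not results from this study.

```python
Z_SCORE_CUTOFF = 7.0  # threshold quoted in the text


def solutions_for_refinement(results, cutoff=Z_SCORE_CUTOFF):
    """Return results whose Z-score exceeds the cutoff, best first.

    results: list of dicts with a "job" name and a "z_score" (None if no solution).
    """
    kept = [r for r in results if r.get("z_score") is not None and r["z_score"] > cutoff]
    return sorted(kept, key=lambda r: r["z_score"], reverse=True)


# Made-up summaries for illustration only.
example_results = [
    {"job": "job_01", "z_score": 5.1},
    {"job": "job_02", "z_score": 8.4},
    {"job": "job_03", "z_score": None},   # PHASER found no solution
    {"job": "job_04", "z_score": 7.0},    # not strictly greater than the cutoff
    {"job": "job_05", "z_score": 9.0},
]
print([r["job"] for r in solutions_for_refinement(example_results)])
# ['job_05', 'job_02']
```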

The value of the MrGrid parallel approach is that it offers considerable time savings, such that potential solutions can be tested relatively quickly. In this particular case, performing the MR calculations in parallel allowed all 7 potential solutions to be tested in a standard refinement procedure in a matter of hours. In contrast, this would most likely have taken significantly longer (e.g. days to weeks) using a serial approach, with the sole solution perhaps only being identified by chance after a significant period of time.

This paper reports the development of a new web portal, MrGrid, which allows multiple PHASER MR calculations to be performed in parallel over the networked computers typically available in a protein crystallography laboratory. With a demonstrated capacity to significantly reduce the time taken to screen numerous MR jobs, MrGrid is able to facilitate difficult MR cases. Furthermore, parameter sweeps have the capacity to improve the chances of obtaining MR solutions, thus accelerating the structure elucidation process.

Availability and Future Directions

MrGrid is freely available from http://code.google.com/p/mrgrid/. There are currently efforts to extend MrGrid to non-Apple computing resources, for example using the CONDOR project (http://www.cs.wisc.edu/condor/). In addition, we are also investigating ways of implementing automatic post MR model refinement to provide an automatic method of validation.

Acknowledgments

We thank Ruby Law, Michelle Dunstone, Khalid Mahmood and Randy Read for helpful discussions.

Author Contributions

Conceived and designed the experiments: JWS AMB. Performed the experiments: JWS MAB. Analyzed the data: JWS AMB. Contributed reagents/materials/analysis tools: MAB CFR SGA JMP JCW WJG DA. Wrote the paper: JWS AMB.

References

1. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389–3402.
2. Jaroszewski L, Rychlewski L, Li Z, Li W, Godzik A (2005) FFAS03: a server for profile–profile sequence alignments. Nucleic Acids Res 33: W284–288.
3. McCoy AJ, Grosse-Kunstleve RW, Adams PD, Winn MD, Storoni LC, et al. (2007) Phaser crystallographic software. J Appl Crystallogr 40: 658–674.
4. CCP4 (1994) The CCP4 suite: programs for protein crystallography. Acta Crystallogr D50: 760–763.
5. Murzin AG, Brenner SE, Hubbard T, Chothia C (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 247: 536–540.
6. Ito K, Arai R, Fusatomi E, Kamo-Uchikubo T, Kawaguchi S, et al. (2006) Crystal structure of the conserved protein TTHA0727 from Thermus thermophilus HB8 at 1.9 A resolution: A CMD family member distinct from carboxymuconolactone decarboxylase (CMD) and AhpD. Protein Sci 15: 1187–1192.
7. Canutescu AA, Shelenkov AA, Dunbrack RL Jr (2003) A graph-theory algorithm for rapid protein side-chain prediction. Protein Sci 12: 2001–2014.
8. Murshudov GN, Vagin AA, Dodson EJ (1997) Refinement of macromolecular structures by the maximum-likelihood method. Acta Crystallogr D Biol Crystallogr 53: 240–255.
9. Morris RJ, Perrakis A, Lamzin VS (2003) ARP/wARP and automatic interpretation of protein electron density maps. Methods Enzymol 374: 229–244.
