August 15, 2014

In this post we explore the power of charity to supply the massive computing capacity needed to support our protein design research. As shown in Figure 1, thanks to hundreds of thousands of Rosetta@home volunteers, the Institute for Protein Design (IPD) computing capacity has grown ~5-fold over the last two months. Rosetta@home (R@h) is currently producing an estimated 160 teraFLOPS (trillions of floating-point operations per second) directed at protein folding and design. Amazing!

BoincStats
Figure 1. Growth of Rosetta@home Volunteer Computing. The Institute for Protein Design has been working with Novacore, Charity Engine, and HTC’s Power to Give to increase the number of R@h volunteers. The impact on our R@h computing capacity has been tremendous. In July and August 2014, the number of active R@h volunteers grew 5-fold from 25,000 to 125,000, enabled by the addition of hundreds to tens of thousands of computers to R@h with the help of our partners.

Because of the vast number of possible conformations and sequences for a given protein molecule, the primary bottleneck to robust design of protein structures is the algorithmic Rosetta sampling and prediction of alternate conformations. Searching this space for the lowest free energy conformations and sequences is a formidable computational challenge, and it would be practically impossible to design accurate protein structures without the volunteer computing resources provided by Rosetta@home and BOINC (Berkeley Open Infrastructure for Network Computing). Low cost or free compute cycles contributed towards the massively distributed compute cluster R@h make it possible to efficiently address this challenge and enable the rapid discovery of new protein therapeutics and nanomaterials.

How do Rosetta@home and BOINC work?

Volunteer computing supplies more computing power to science than does any other type of computing—enabled by the huge number of highly networked PCs and mobile devices in the world.  Anyone can join the Rosetta@home project, where like-minded citizen scientists around the world contribute their idle computers to unleash the power of our “Rosetta” software for calculations of new protein structures.

As illustrated in Figure 2, the R@h/BOINC workflow begins with IPD researchers, who submit designed protein sequences to the queue of computing jobs managed by R@h servers; R@h then sends jobs out to volunteer computers. Once a work unit is processed and returned, an operation that is automatically managed by BOINC, IPD researchers download the resulting structures from the R@h servers. Designs with the lowest energy conformations that are structurally identical to the design are subsequently tested in the lab. Once the designs have been tested, a new round of improved designs is calculated and new tasks are sent to the volunteer computers to go through the same cycle.
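The cycle described above can be sketched as a toy simulation. This is only an illustration under stated assumptions: `WorkUnit`, `Result`, `volunteer_compute`, and `run_round` are hypothetical names, the scores and deviations are random placeholders rather than Rosetta output, and none of this is BOINC or R@h code.

```python
from collections import deque
from dataclasses import dataclass
import random


@dataclass
class WorkUnit:
    design_id: str  # hypothetical identifier for a designed sequence
    seed: int       # each work unit samples a different trajectory


@dataclass
class Result:
    design_id: str
    score: float    # stand-in for a calculated energy (lower is better)
    rmsd: float     # stand-in for deviation from the designed structure


def volunteer_compute(unit: WorkUnit) -> Result:
    """Placeholder for the real application: returns made-up numbers."""
    rng = random.Random(f"{unit.design_id}:{unit.seed}")
    return Result(unit.design_id, rng.uniform(-120.0, -60.0), rng.uniform(0.5, 12.0))


def run_round(design_ids, units_per_design=5):
    # Researchers queue several work units per design on the server.
    queue = deque(
        WorkUnit(design_id, seed)
        for design_id in design_ids
        for seed in range(units_per_design)
    )
    returned = []
    # The server hands out units until the queue is empty; a volunteer
    # processes each one and the result is reported back.
    while queue:
        unit = queue.popleft()
        returned.append(volunteer_compute(unit))
    return returned


results = run_round(["design_A", "design_B"])
print(len(results), "results returned")
```

In the real system the loop body runs on many volunteer machines at once; the sequential loop here only shows the order of hand-offs.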

Figure 2. Computational Workflow for Rosetta@home and BOINC. Proteins are first computationally designed locally in the labs of the Institute for Protein Design (IPD). The best designs are then computationally tested on Rosetta@home (R@h). When BOINC is run, a set of tasks is sent from R@h’s scheduling server to a volunteer computer. The computer then downloads executable and input files from R@h’s data server, runs the application programs and produces output files that are uploaded to the data server. (1) The R@h scheduling server submits tasks to the volunteer computer; (2) the volunteer computer running BOINC downloads executable and input files and applications from the data server; (3) volunteer computer runs R@h application program; (4) volunteer computer produces output files; (5) output files are uploaded to R@h data server; (6) completed tasks are reported to scheduling server.  Once the completed tasks are reported to the scheduling server, the volunteer computer gets new tasks; this compute cycle is repeated indefinitely and BOINC manages it all automatically.  Once the work unit is processed and returned by a computer, the volunteer computer is granted credit; credit is used to keep track of how much CPU time has been donated to R@h and supports friendly competition between teams of volunteer computers.  Image adapted from boinc.berkeley.edu

As mentioned above, and as illustrated in Figure 3, a good protein design should have a low energy conformation and be structurally identical to the design. Optimal protein sequences are those that have a big energy gap between the final folded state and alternative, undesired states.

As a quality check, newly designed amino acid sequences are submitted to “forward folding” Rosetta calculations, which assess the propensity of a defined amino acid sequence to fold into the designed low energy 3D protein structure rather than into some other undesired 3D structure. The plotted data are tremendously helpful for prioritizing which protein design sequences are more likely to produce the desired protein structure; the very best designs are subsequently tested at the IPD.
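One way to turn forward folding output into a ranking is to measure the energy gap between predictions that land near the design and those that do not. The sketch below is a minimal illustration, not the IPD's actual selection procedure: the function names, the 2.0 cutoff (assumed to be in angstroms), and the example numbers are all assumptions made for this example.

```python
def energy_gap(decoys, rmsd_cutoff=2.0):
    """Lowest score among off-target predictions minus lowest score among on-target ones.

    decoys: iterable of (score, rmsd) pairs; a lower score means lower energy.
    rmsd_cutoff: assumed threshold for "close to the designed structure".
    Returns None when either group is empty.
    """
    near = [score for score, rmsd in decoys if rmsd <= rmsd_cutoff]
    far = [score for score, rmsd in decoys if rmsd > rmsd_cutoff]
    if not near or not far:
        return None
    return min(far) - min(near)


def rank_designs(decoys_by_design, rmsd_cutoff=2.0):
    """Sort design names by energy gap, largest first; designs with no gap go last."""
    gaps = {name: energy_gap(d, rmsd_cutoff) for name, d in decoys_by_design.items()}
    return sorted(gaps, key=lambda name: (gaps[name] is None, -(gaps[name] or 0.0)))


# Made-up (score, rmsd) pairs for two hypothetical designs.
example = {
    "design_A": [(-110.0, 1.2), (-95.0, 6.0), (-90.0, 8.5)],  # lowest energy is on-target
    "design_B": [(-100.0, 7.0), (-92.0, 1.5), (-88.0, 9.0)],  # lowest energy is off-target
}
print(rank_designs(example))
```

A positive gap means the lowest-energy prediction sits close to the design, which is the pattern a good design should show; a negative gap flags a sequence that prefers some other structure.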

Figure 3. Example Forward Protein Folding Plots from R@h. Each data point represents an R@h volunteer computer calculation of the Rosetta-predicted protein folding pattern for an anti-cancer protein design. The calculated energy (Score) is plotted against the root mean square (RMS) deviation from the designed protein structure (low values for Score and RMS are desirable). A small number of data points show accurate protein folding with low calculated energy (green dots), while the majority of calculations produce predicted structures that deviate from the design and have higher energy (red dots).

About BOINC

BOINC is the acronym for the Berkeley Open Infrastructure for Network Computing. BOINC software is made up of several separate programs: the scheduler and data server programs are installed on the Rosetta@home servers housed at the Institute for Protein Design. The core client, applications, GUI (BOINC manager) and screensaver are installed on the volunteer's home computer. The core client communicates with external servers via HTTP to receive and report work. The core client also runs and controls applications. The installed application, in this case R@h, does the scientific computing. The GUI, or BOINC manager, is a ‘control panel’ for BOINC. It provides a graphical interface to monitor and control the core client, with which it communicates via a TCP connection. The installed screensaver runs when the computer is idle. It also communicates with the core client by TCP.

Diving Deeper into Ab Initio Forward Folding

Previous work suggests that many engineered proteins fail to show the desired activity because they fail to adopt the proper folded state. Other work shows that ab initio folding of engineered proteins based solely on the sequence and basic biochemical principles is a powerful predictor of whether such a protein will indeed adopt the desired folded state. Screening a large number of candidate sequences by ab initio folding allows us to trim out those that will not fold properly, saving time and resources when the sequences are validated experimentally.

The size of such a protein makes forward folding even more important. Even the smallest change, e.g. mutation of a single residue, can change the catalytic activity by several orders of magnitude. The vast information content of every individual ab initio forward folding calculation, however, tells a very detailed story of the impact of mutations on the protein structure, which helps us optimize the design and identify those proteins with the desired geometry. Increased computational capacity in R@h allows us to test larger sets of candidate sequences for proper folding, increasing the success rate of finding the desired proteins.

In addition, the outcome of the forward folding calculation shows very distinct patterns and is different from one structure to another. These distinct changes and patterns in forward folding are a rich source of data that helps us to understand what biochemical properties contribute to the stabilization of protein structures and to identify regions in the proteins that are particularly prone to destabilize the protein. Consequently, increased computational power at R@h vastly increases the number of protein structures that can be submitted to these intensive calculations, allowing us to optimize our designs and to accurately study the impact of mutations on the stability of proteins. This will increase both the number of sequences that can be engineered and tested, and the success rate for novel functional protein discovery.

We also use R@h to sample many potential topologies for a protein intended to bind a therapeutic target. Topologies that fold successfully in R@h calculations are further designed and optimized together with the target of interest to generate models of de novo binders.

Protein Design Projects Using Rosetta@home

Protein design and structure prediction have the potential to revolutionize therapeutic design, nanotechnologies and bioremediation. Every day, researchers at the IPD are designing hundreds of new proteins for a wide variety of basic and applied research problems. These include the building blocks for next-generation vaccines, anti-viral and anti-cancer therapies, and new nanomaterials to detect and neutralize toxins or other disease targets.

Some key examples include:

1. Design of a whole new class of hyper-stable mini-protein scaffolds that adopt a variety of different topologies, presenting surfaces with new shapes and chemical properties for binding to other proteins. See BINDI, a protein designed to block Epstein-Barr virus replication.

2. Selection of the optimal protein designs to neutralize the Ebola or flu viruses. This is part of our ongoing anti-flu research and our “War on Ebola”, the latter of which involves Foldit player team support and has generated strong interest in new puzzles.

3. Design of an Alzheimer’s disease amyloid protein binder as part of our “Three Dreamer” protein design partnership, which also involves Foldit community support.

4. Design of proteins to specifically block the action of Mdmx and Mdm2, proteins which enable nearly all human cancer cells to survive by disrupting the action of p53, a key protein responsible for mediating quality control mechanisms (cell-cycle arrest and apoptotic cell death). Many cancers survive by up-regulating p53 regulators Mdmx and Mdm2.  We are using R@h to engineer novel proteins that specifically bind to either Mdmx or Mdm2 to be used as research agents, and potentially, as therapeutics.

Donating Compute Cycles to Rosetta@home

Novacore has proven to be a viable platform for delivering useful computing cycles to Rosetta@home via volunteer computing methodology using BOINC technology. Since Novacore began donating compute cycles to R@h in November 2013, they have risen in the ranks to the second highest contributor of compute time (99 million credits as of August 2014).  During this period, Novacore was ranked number one in Recent Average Credit (RAC) amongst R@h participants, a calculation that determines the number of credits a user accumulates on an average day and reflects how fast work is being processed. According to Novacore officials, they used less than 1% of their capacity to deliver this amount of work for free to R@h. Thank you Novacore!
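To make the idea of Recent Average Credit concrete, here is a simplified sketch of a decaying per-day average. It assumes an exponentially weighted average with a one-week half-life; the function name `update_rac` and the example credit figures are made up for illustration, and BOINC's actual calculation may differ in detail.

```python
import math

HALF_LIFE_DAYS = 7.0  # assumed half-life for the decaying average


def update_rac(rac, credit_granted, days_since_update, half_life=HALF_LIFE_DAYS):
    """Fold newly granted credit into an exponentially decaying per-day average."""
    if days_since_update <= 0:
        raise ValueError("days_since_update must be positive")
    # Older contributions lose half their weight every `half_life` days.
    weight = math.exp(-days_since_update * math.log(2) / half_life)
    rate = credit_granted / days_since_update  # credits per day over the interval
    return rac * weight + (1.0 - weight) * rate


# A hypothetical host earning 1000 credits per day for 60 days.
rac = 0.0
for _ in range(60):
    rac = update_rac(rac, credit_granted=1000.0, days_since_update=1.0)
print(round(rac, 1))
```

Under these assumptions the average approaches the host's steady daily rate, and it halves after a week with no new credit, which is why it reflects how fast work is currently being processed rather than the lifetime total.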

Charity Engine has also proven to be a viable platform in supporting R@h, and was responsible for adding an average of 20,000 to 30,000 new computers per day during the first week of August 2014 (Amazing, thank you Charity Engine!). As illustrated in Figure 4, Charity Engine works as a for-profit entity, servicing much larger companies with huge compute needs. The Charity Engine “grid is rented like a giant supercomputer, then all the profits shared 1/3-1/3-1/3” between Charity Engine, the charities they serve (e.g. Doctors Without Borders) and the lucky cash prize winners selected weekly in a raffle from the registered volunteers who provide their idle computer time to Charity Engine. Notably, Charity Engine will always donate a minimum of 5% of whatever computing capacity it has available (certain hardware can only run certain computations) to Rosetta@home or other public @home projects hosted by BOINC.

Figure 4. Charity Engine business model

Summary and Acknowledgements

It is clear that increased contributions to Rosetta@home will continue to be crucial to enabling the discovery of new protein therapeutics and nanomaterials. Using Rosetta@home to rigorously evaluate our protein designs before synthesizing them in the lab has dramatically increased our success rate at designing proteins with new functions.

We wish to thank all of the Rosetta@home participants, Novacore, Charity Engine, HTC, and BOINC for their generosity.