ProtClean: The Protein Preparation Tool¶

Overview¶

Many functions of the QSimulate platform (QSP), including QUELO FEP and QuValent, require you to specify a protein receptor. These programs check the uploaded PDB file to ensure it is acceptable for simulation. However, the PDB format is not applied uniformly for computational purposes, and different programs expect slightly different conventions. Furthermore, structural elements needed for simulation are often missing, such as hydrogens, capping groups, and sometimes even backbone atoms, especially in files pulled directly from the PDB. ProtClean lets the user prepare and format a receptor PDB file within the QSimulate platform.

ProtClean has two modes of operation: (i) a “Clean” mode to perform all the steps necessary to prepare a raw and incomplete receptor PDB file for QSP, and (ii) a “Format Only” mode to take a receptor PDB already carefully prepared for simulation and format it for use with QSP.

The primary features of the Clean mode are its abilities to automatically perform the following tasks:

Add any missing hydrogens
Determine the protonation states of residues in the protein where protonation can vary near physiological pH (HIS, LYS, ARG, ASP, GLU)
Add any protein structure (loops, sidechains) that is missing from the PDB file
General cleanup/rationalization of residue IDs and atom numbers to be consistent with the requirements of QUELO/QuValent
Allow the user to select which elements (chains, cofactors) are to be retained for the calculation. In particular, this feature allows the user to remove unnecessary symmetry replicates or binding partners.
Optionally, add an explicit atom model of the membrane, for membrane-bound proteins

Missing protein structure is added using a locally implemented approach based on AlphaFold, which provides reliable structure generation for many proteins. Because it is locally implemented, structural information remains securely behind the AWS-provided security wall. Missing structure (if any) is determined based on SEQRES records in the input PDB.
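As a rough illustration of how SEQRES records reveal missing structure, the sketch below compares the number of residues declared in SEQRES with the number of residues that have ATOM coordinates, chain by chain. It is not ProtClean's implementation; the function names are hypothetical, and it relies only on the fixed column positions of the PDB format. It reports how many residues are missing, not where they are, and residues stored as HETATM (for example modified amino acids) would be counted as missing.

```python
from collections import defaultdict


def declared_residues(pdb_text):
    """Count residues listed per chain in SEQRES records."""
    counts = defaultdict(int)
    for line in pdb_text.splitlines():
        if line.startswith("SEQRES"):
            chain_id = line[11]                            # column 12
            counts[chain_id] += len(line[19:70].split())   # columns 20-70
    return dict(counts)


def modelled_residues(pdb_text):
    """Count distinct residues per chain that have ATOM coordinates."""
    seen = defaultdict(set)
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            chain_id = line[21]                  # column 22
            seen[chain_id].add(line[22:27])      # residue number + insertion code
    return {chain: len(ids) for chain, ids in seen.items()}


def missing_residue_counts(pdb_text):
    """Per chain: residues declared in SEQRES minus residues with coordinates."""
    modelled = modelled_residues(pdb_text)
    return {
        chain: declared - modelled.get(chain, 0)
        for chain, declared in declared_residues(pdb_text).items()
    }
```

A non-zero count for a chain suggests that the Clean mode would have loops or termini to rebuild for that chain.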

The Format Only mode of ProtClean just checks the uploaded structure and assigns the appropriate formatting such that the output PDB can be used by other products within QSP. This gives you total control over your input structure. This is the mode to use if you decide to modify the structure provided by the Clean mode, or you have other protein preparation tools you would prefer to use.

Below, the ProtClean interface is described in more detail.

The Clean mode of ProtClean has been developed to robustly handle PDB-format-compliant structures, such as those downloaded from the Protein Data Bank (RCSB) website or generated by software that adheres to the PDB rules. Note that non-compliant PDB files can trigger errors during processing, which will be reported to the user.

The ProtClean Task List¶

When you enter the platform, you will be presented with the ProtClean Task List, a list of calculations (Tasks) that you have previously set up and/or run, as well as a dialog to create a new Task. Clicking on a Task will bring you to the setup/results page for that Task.

For more details on the Task List, see the section The Task List.

Expert Mode¶

A small number of options (described below) are only shown in Expert Mode. The options shown in the default Standard Mode are sufficient for most users to run a reliable simulation. To access Expert Mode, use the toggle in the User Settings panel.

For more details on enabling Expert Mode, see the section Settings and Expert Mode.

File Input Specification¶

When you click on or make a new Task in the ProtClean Task table, you enter the setup dialogs for that Task. At the top of the setup, you will need to specify the input PDB file that you wish to prepare. This file should contain the protein receptor you will be subsequently using in your calculation (QUELO or QuValent). This file may also contain various cofactors, waters, and other proteins (and nucleic acids) that may interact with the protein of interest. For example, a PDB file obtained from the protein data bank will often contain multiple symmetry-related replicates that appear in the unit cell (but are not critical to understanding binding), cofactors that are included only to help with crystallization, etc.

Click on the Browse button to use your browser’s navigation dialog to select the input PDB-format file to be used. Click on the Upload button to process the PDB file. A blue bar will appear on the next line and indicate when the upload is complete.

Membrane Receptor¶

This toggle box appears below the PDB file input dialog. If the protein you are preparing is membrane bound (e.g., a GPCR) you may wish to create an explicit atomic membrane model for the protein. This can be accomplished by toggling this option on. When you toggle this option on, a new section of the panel will appear below the Options section, where you can specify the type of lipid bilayer to use, and where you want the protein to be located relative to the bilayer (see the Membrane Equilibration description below). If your protein is not membrane-bound, or you wish not to use a membrane model, leave this box unchecked.

Note that using a lipid bilayer will result in a calculation that is more costly to run.

Equilibrating the requested membrane model can take a few hours.

Also note that if your input structure already includes the full atomic lipid bilayer, you should not use ProtClean. ProtClean cannot handle input structures with an atomic lipid bilayer already in place. Those structures should be input directly into QUELO/QuValent. The names of the bilayer atoms in such a case must follow the Lipid17 or Lipid21 convention.

Options¶

Format Only Mode¶

Select this option to enter the Format Only mode of ProtClean. This mode will take the uploaded structure as is and just format it. It will not change the structure at all - no atoms will be added or removed, no coordinates will be changed. In this mode, ProtClean will:

Update residue and atom names to reflect protonation state and conform to QSP standards
Reformat any cofactors so that their topology is properly specified

Selecting this mode turns off all other features in ProtClean: you cannot remove chains or cofactors, or patch missing loops. Because the structure must be complete and ready for simulation, in this mode ProtClean expects:

All backbone, sidechain, and cofactor atoms to be present, including hydrogens
The element column to be properly declared for each atom. For cofactors, this column should include the formal charge of each atom if it is not zero (e.g. O1- for an oxygen in a carboxylate, N1+ for a charged nitrogen in amines, …).
Receptor capping groups to be present
No gaps in chains, or any gaps in chains to be capped

ProtClean has a series of checks in place to ensure that these requirements are met and will notify you if not. This is the mode to use if you decide to modify the structure provided by the Clean mode, or you have other protein preparation tools you would prefer to use.
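Before uploading a structure for Format Only mode, a quick local check can catch some of these problems early. The sketch below is a hypothetical pre-flight script, not ProtClean's own validation. It assumes that the element symbol and the optional formal charge together occupy columns 77-80 of each ATOM/HETATM line (for example ` O1-` or ` N1+`), and it only tests two of the requirements above: that every atom declares an element, and that hydrogens are present.

```python
def parse_element_field(line):
    """Return (element, formal_charge) read from columns 77-80.

    Assumed layout: two characters of element symbol followed by an
    optional charge such as '1-' or '2+'.
    """
    field = line[76:80].ljust(4)
    element = field[:2].strip().capitalize()
    charge_text = field[2:].strip()
    if not charge_text:
        return element, 0
    sign = charge_text[-1]
    if sign not in "+-":
        raise ValueError(f"unrecognised charge field: {charge_text!r}")
    magnitude = int(charge_text[:-1] or "1")
    return element, magnitude if sign == "+" else -magnitude


def preflight_problems(pdb_text):
    """List simple problems that would make a file unfit for Format Only mode."""
    problems = []
    n_atoms = n_hydrogens = 0
    for number, line in enumerate(pdb_text.splitlines(), start=1):
        if not line.startswith(("ATOM", "HETATM")):
            continue
        n_atoms += 1
        element, _charge = parse_element_field(line)
        if not element:
            problems.append(f"line {number}: element column is empty")
        elif element == "H":
            n_hydrogens += 1
    if n_atoms and n_hydrogens == 0:
        problems.append("no hydrogen atoms found")
    return problems
```

An empty list only means these two checks passed; capping groups and chain gaps still need to be verified separately.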

The remainder of this section describes options if you do not opt for the Format Only mode.

The Inclusion Table¶

Once the PDB file is uploaded, the contents of the file will be analyzed and displayed in a table in the Options section. In this table, you will find a list of all the chains in the file and the contents of those chains.

There are two user-adjustable columns in this table:

Select: This toggle appears in the fourth column and includes or excludes an entire chain from the prepared PDB output file. By default, all chains are included. If you deselect a chain, all of its elements (protein residues, waters, cofactors) are excluded from the output, and the chain turns grey in the 3D visualizer (see below). If you include a chain, all of its protein and water residues are included; cofactors are excluded by default, and you can choose which of them to include.
Edit: The edit button appears in the Cofactors (fifth) column. This button is active if any cofactors were identified for that chain. If the Edit button is not selectable, it is because no cofactors were identified for the corresponding chain.

The inclusion table has no effect on the Format Only mode, which will include every chain and every cofactor in its analysis.
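To preview what a chain-level summary of a file might contain before uploading it, a minimal sketch along the following lines can be used. It is an illustration only, not the analysis ProtClean performs: the function name and the set of water residue names are assumptions, ATOM records are counted as polymer residues (protein or nucleic acid), and all non-water HETATM residues are grouped together, so ions and cofactors are not distinguished.

```python
from collections import defaultdict

# Assumption: common residue names used for water in PDB files.
WATER_NAMES = {"HOH", "WAT", "H2O", "DOD"}


def summarize_chains(pdb_text):
    """Count polymer, water, and other hetero residues for each chain."""
    summary = defaultdict(
        lambda: {"polymer": set(), "water": set(), "hetero": set()}
    )
    for line in pdb_text.splitlines():
        record = line[:6].strip()
        if record not in ("ATOM", "HETATM"):
            continue
        res_name = line[17:20].strip()            # columns 18-20
        chain_id = line[21]                       # column 22
        res_id = (res_name, line[22:27].strip())  # number + insertion code
        if res_name in WATER_NAMES:
            kind = "water"
        elif record == "ATOM":
            kind = "polymer"
        else:
            kind = "hetero"
        summary[chain_id][kind].add(res_id)
    return {
        chain: {kind: len(ids) for kind, ids in kinds.items()}
        for chain, kinds in summary.items()
    }
```

Chains with a non-zero `hetero` count are the ones for which you would expect cofactors or ions to be present.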

The Edit Cofactors Dialog¶

Clicking on the Edit button for a chain where cofactors were identified will present a dialog where the user can choose cofactors to include via toggles. By default, all cofactors are excluded. Note that metals and other ions are, like waters, not shown in this dialog (but are retained in the final structure).

The toggle in the third column, Select, can be used to keep the cofactor in the prepared output. For example, in the following, Cofactor P32, residue 400, would be kept for Chain B:

If you click on any of the rows in the Cofactor Inclusion dialog, the visualizer on the left will zoom in on that cofactor. For example, clicking on the P32 row zooms in as shown below:

Mouse actions on the contents of the visualizer in the Cofactors Dialog are the same as described for the 3D Visualizer (below).

The 3D Visualizer¶

On the right-hand side of the Options portion of the panel, you will find a 3D visualizer of the full system. The protein chains are shown using a cartoon representation. Waters and cofactors (if any) are not shown. Each chain is automatically given a different color. If a chain is deselected (using the Inclusion Table toggle), it turns grey in the 3D visualizer.

Below is an example: Four crystallographic replicate chains are in the input PDB file. We have deselected all but the “B” chain, and the resulting view (with all but the “B” chain in grey) is shown at the right:

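[Figure: the Inclusion Table with only chain “B” selected, next to the 3D visualizer showing the other chains in grey (PP_Chains_Selection_Plus_Visualizer.jpg)]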

Mouse actions available in the 3D visualizer:

Left-click Plus Drag: Rotates the contents of the visualizer
Scroll-wheel: Zoom in/out
Double-click: Moves the residue you click on to the center of the box and resets the center of rotation to that residue.
Right-click Plus Drag: Moves the contents of the visualizer in the (x/y) plane.
Hold down scroll wheel and move mouse: Adjust clipping plane

Start Preparation¶

If you have not requested the generation of an explicit membrane, then the Start Preparation button will appear at the bottom right of the Options section of the panel.

Clicking this button will start the protein preparation.

If you have requested an explicit membrane, then this button will instead appear at the bottom right of the next section, Membrane Equilibration.

Membrane Equilibration¶

This section of the panel only appears if you have toggled the Membrane Receptor on in the protein input section. In this case, this section of the panel allows you to specify the type of lipid bilayer to create, and where the protein should be situated in the lipid bilayer.

There are two user adjustments in this section of the panel.

Lipid: Specify the type of lipid bilayer to use. Several options are provided:
- POPC: Phosphatidylcholine phospholipids. Asymmetric carbon tails (one 16-carbon, one 18-carbon)
- POPE: Phosphatidylethanolamine phospholipids. Asymmetric carbon tails (one 16-carbon, one 18-carbon)
- DOPC: Phosphatidylcholine phospholipids. 18-carbon symmetric tails
- DOPE: Phosphatidylethanolamine phospholipids. 18-carbon symmetric tails
- DLPC: Phosphatidylcholine phospholipids. 12-carbon symmetric tails
- DLPE: Phosphatidylethanolamine phospholipids. 12-carbon symmetric tails
Bilayer position (slider): Using the slider, the user can adjust the position of the bilayer via z-translation. If the user has obtained the membrane bound protein from the OPM (Orientations of Proteins in Membranes) database, then it will be aligned so that the default position of the slider (centered on z=0) is optimal. But if the protein structure is coming from another source, it will likely be necessary to adjust the alignment here.

If the protein structure does not come from the OPM, it is critical to align the protein so that its z-axis is perpendicular to the x/y faces of the rectangular “box” defined by the lipid bilayer, that is, parallel to the membrane normal.

Residues of the protein in this viewer are either yellow (hydrophobic) or blue (hydrophilic). If adjusting the slider, attempt to maximize the amount of yellow in the bilayer, and minimize the amount of blue in the bilayer.
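The "maximize yellow, minimize blue" guidance can be approximated numerically when choosing a starting slider position. The sketch below is a hedged illustration, not the method used by the viewer: the hydrophobic residue set, the 15 Å half-thickness of the slab, the scan range, and all function names are assumptions. It scores each candidate bilayer center by counting hydrophobic C-alpha atoms inside the slab and subtracting the hydrophilic ones.

```python
# Assumption: residues treated as hydrophobic; the viewer's own
# classification is not documented here and may differ.
HYDROPHOBIC = {"ALA", "VAL", "LEU", "ILE", "MET", "PHE", "TRP", "PRO"}


def ca_records(pdb_text):
    """Return (residue name, z coordinate) for every C-alpha ATOM record."""
    records = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            records.append((line[17:20].strip(), float(line[46:54])))
    return records


def slab_score(residues, center_z, half_thickness=15.0):
    """Hydrophobic minus hydrophilic residue count inside the slab."""
    score = 0
    for name, z in residues:
        if abs(z - center_z) <= half_thickness:
            score += 1 if name in HYDROPHOBIC else -1
    return score


def best_center(residues, z_min=-20.0, z_max=20.0, step=1.0):
    """Scan candidate slab centers and return the first with the top score."""
    n_steps = int(round((z_max - z_min) / step))
    candidates = [z_min + i * step for i in range(n_steps + 1)]
    return max(candidates, key=lambda z: slab_score(residues, z))
```

The result is only a starting point for the slider. Several centers can tie for the best score, and a visual check in the viewer remains the final arbiter.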

The chosen bilayer will be included at the specified position. The starting model is an equilibrated all-atom bilayer generated using the Amber Lipid17 force field. Atoms of the bilayer that would overlap atoms of the protein are omitted, and equilibration is performed on the resulting model.

Some additional specifics of the bilayer model used will appear below the Lipid selector dialog on the left side of the panel. These are informative, and the user cannot directly change them (they change with modifying the Lipid model or the slider).

For more information about the OPM and about the atomic membrane bilayers, see

https://opm.phar.umich.edu/about#methods_and_definitions

https://pubs.acs.org/doi/epdf/10.1021/ct4010307

Note that membrane equilibration can be somewhat time consuming, possibly taking a few hours to complete.

Start Preparation¶

If you are generating an explicit membrane, then the Start Preparation button will appear at the bottom right of the Membrane Equilibration portion of the panel.

Clicking this button will start the protein preparation.

Simulation Status¶

This portion of the panel provides the status of the calculation once it has been started using the Start Preparation button. It also provides the ability to stop and restart a preparation that has been submitted.

Stop: Stop a calculation that was previously submitted and is in progress. A stopped calculation is saved in the cloud storage associated with your account, and can be restarted later, using the “Run” command.
Resume: Resume a previously Stopped job.

Below, and also to the right of the control buttons, you will find information about the status of your job. The total estimated virtual CPU usage (vCPU) is given, as is an overall progress bar.

Most of the steps of protein preparation are relatively fast. The one step that takes the vast majority of the time is the prediction of missing structural elements. This calculation is performed using a locally implemented approach based on AlphaFold, a machine-learning method trained on a large body of protein structure and sequence-variation data, which in many cases predicts missing structure very accurately. Loop patching can sometimes take as much as an hour, although it is faster for most systems.

Results¶

Below the Simulation Status section, you will find the results of your calculation. The information in this section will be auto-populated when the simulations requested have completed.

3D Results Viewer¶

The 3D viewer in the results portion of the panel allows the user to visualize the resulting prepared structure and to compare it to the input structure.

Three buttons appear in the visualization space:

Original: This is the input structure for the chain(s) that were selected.
Output: This is the output structure, including the protein and (if requested) the membrane.
Overlay: Overlays the input structure with the output structure to help visualize any changes that were made. For example, if missing residues were replaced in the structure during preparation, the overlay will make these apparent. The original structure appears in red while the output appears in silver.

Mouse actions are as described for the 3D Visualizer section (above).

PDB Download¶

This button allows the user to download the resulting prepared protein system, in PDB format. The downloaded file can subsequently be input to QUELO or QuValent. The prepared protein will include the elements chosen in the Inclusion Table, and any missing residues of the protein will have been generated using an approach based on AlphaFold. It will also contain the atomic coordinates for the membrane, if one was generated.