QuaiPa: High-Accuracy Automated Force Field Parameterization¶
Overview¶
QuaiPa is the industry-leading peptide parameterization solution, leveraging a sophisticated AI model to deliver accuracy approaching that of quantum mechanics with the efficiency of a conventional molecular mechanics (MM) potential. QuaiPa models are seamlessly extensible, and can be refined for particular applications through the automated uncertainty quantification and dataset augmentation workflows that are included. Owing to the QuaiPa model architecture, training data can be generated efficiently on the basis of molecular fragments.
QuaiPa models are trained using a very large database of quantum mechanical energy and gradient data (from more than 107 structures and conformers) in combination with conventional non-bonded molecular mechanics models. Performing inference using a QuaiPa model generates optimized bonded force field parameters consistent with the user’s chosen nonbonded model, and the output is of the same format as a conventional MM model.
The power of QuaiPa is delivered through a command-line interface (CLI) to support: parameterization, peptide fragmentation, fragment combination for peptide construction, dataset generation, and QuaiPa model training. Each of the following workflow steps below are described in detail in this User Manual, along with example usage scripts.
Parameterize a peptide using a trained QuaiPa model
Fragment a peptide into fragments suitable for more efficient training
Combine fragments to construct linear and cyclic oligopeptides
Dataset generation to be added to the foundation (provided) QM DFT energy/forces database
Train a new QuaiPa model from the QM DFT energy/forces database
In addition to the main tasks, post-processing utilities are also available:
QuaiPa Workflow¶
QuaiPa Param¶
The most common application for QuaiPa is to perform direct parameterization of peptide systems. This process is managed through the quaipa param CLI. Executing this command performs inference on a trained QuaiPa model and generates a molecular mechanics force field with optimized bonded parameters as output.
The user has several options for performing inference (the available options can be displayed by executing quaipa param -h). The general form of the command is:
quaipa param input
[--no_opt]
[--models, -m models json]
[--format, -f {amber/gromacs}]
[--fragment]
[--charge_method, {bcc/resp}]
[--output_directory, -o directory name]
input (positional argument): A path to the input structure file containing a single molecule. There are two classes of structure files that represent acceptable input for quaipa param, and the steps involved in the parameterization workflow vary depending on which type is provided. The type is inferred from the file suffix (sdf/pdb/prmtop/gro).
Molecular structure-containing files: These structure files can be in either the SDF or PDB format. In this case, the charge generation step is performed, and the output results.json file is amenable to the quaipa fragment and quaipa dataset workflows that are described below.
NOTE: The SDF file format is strongly preferred due to more robust bonding and charge specification. Use PDB files with caution!
Prior parameter files: These files can be in either the AMBER PRMTOP or the GROMACS GRO format. In this case, the charge generation step is not performed, and the charges are instead taken directly from the input file. In this case, the output results.json file is not amenable to the quaipa fragment and quaipa dataset workflows that are described below.
no_opt (–no_opt): If the no_opt flag is provided then no optimization step is performed during the AM1-BCC charge generation step. This flag only impacts parameterization starting from a molecule structure-containing file. Including the no_opt flag makes the parameterization step significantly faster, but the user should confirm that the input coordinates are representative.
models (-m or –models, optional): A path to a json file that contains configuration and checkpoint files for each model. If not provided, QuaiPa will perform inference on the committee of base models that are distributed with the package. If provided, QuaiPa will perform inference on all of the models included in the json file, which must be formatted as follows:
format (-f or –format, options: (amber, gromacs), optional): The format for output writing. This is an optional input, and the default value is amber. If the format is chosen as amber, then the parameters are written in PRMTOP format. If the format is chosen as gromacs, then the parameters are written in GRO format.
fragment (–fragment): If the fragment flag is provided, the input peptide is first fragmented into its individual residues, each of which is capped with ACE/NME and parameterized separately. The charges from each residue are then combined back into the input peptide. This option is intended to be used for large peptides for which the AM1-BCC or RESP charge generation methods are too expensive.
charge_method (–charge_method, options: (bcc, resp), optional): The method used to calculate partial charges for either the entire peptide or individual fragment residues (if the fragment flag is included). The default value is bcc. If the charge method of bcc is chosen, the semiempirical AM1-BCC (Austin Model 1-Bond Charge Correction) method is used. If the charge method of resp is chosen, the full QM RESP (Restrained Electrostatic Potential) method is used. By default, RESP charges are always computed using fragments. For more detail on how the fragment-based charges are implemented, see below.
output_directory (-o or –output_directory, optional): The name of the directory where the parameterization results will be written. If this is not provided, then the directory name will be the prefix of the input file followed by “_$current_time”.
When quaipa param is run, inference is performed on every model that is included in the models.json file. If more than one model is included, then an additional “average” parameter file is generated which contains the force field whose gradients are equal to the average of each force field’s gradients. An example usage script is shown below.
This example will perform inference on each of the 5 default models, will skip optimization during the charge generation step, and will write output in PRMTOP format in a directory such as input_struct_1746550951/. In this case, the output will include: model_0.prmtop, model_1.prmtop, model_2.prmtop, model_3.prmtop, model_4.prmtop, average.prmtop, and results.json.
Fragment-Based Charge Generation¶
The partial charges for an input structure can be calculated with either the semiempirical AM1-BCC or quantum mechanical RESP models for charge generation using the charge_method option. AM1-BCC and RESP are both limited by the size of peptides that can be efficiently computed, however. While AM1-BCC does not become computationally limited until roughly 200+ atoms, RESP becomes prohibitively expensive much sooner, due to its use of a full QM calculation (HF/6-31G* level) with poor scaling. Larger peptides pose a challenge, therefore, if the generally more accurate RESP charges are required. The fragment option is intended for these cases, as this flag will proceed to break up a larger peptide into smaller residues and then calculate the charges of these smaller residues. When using the RESP charge method, fragments are made as small as possible (individual residues or linkers when possible). Conversely, when using AM1-BCC peptides are split up so as to create larger fragments of roughly equal sizes with up to 100 atoms, as shown in the below figure. Note that if the fragment option is included on a peptide with fewer than 100 total atoms, fragment will be turned off, as fragmenting for AM1-BCC of small peptides does not provide a noticeable speedup for the trade-off in accuracy. The partial charges for the atoms in each residue are then combined back into the original peptide. Also note that fragments without N- and C-termini caps are capped with ACE/NME caps with pre-set charges in order to approximate the original neighboring amino acids. The fragments retain the coordinates from the original peptide, so the no_opt option has no effect.
QuaiPa Fragment¶
Generally, training data for QuaiPa should consist of small-to-moderate sized structures. For example, single amino acids or oligopeptides with only a handful of amino acids (<=6) represent good choices. While there is, in principle, nothing inherently wrong with including larger systems, the QM calculations that are necessary to generate training data impose a significant performance dip when larger molecules are included. The quaipa fragment CLI is designed to streamline the process of generating optimal structures for training data by parsing amino acids from the input structure and, optionally, labeling which amino acids that are likely to be the most impactful for future iterations of trained QuaiPa models. At present, quaipa fragment is only applicable to oligopeptide structures.
When using the quaipa fragment CLI, the user must provide a few input arguments (the available options can be displayed by executing quaipa fragment -h). The general form of the command is:
quaipa fragment input
[--options options json]
[--structure structure file]
[--cap_type cap type]
[--output_directory, -o directory name]
input (positional argument): Path to the output results.json file from a quaipa param calculation.
NOTE: If the quaipa param calculation was performed starting from a PRMTOP or TOP file, then additional structure input is required (see below).
options (–options): Path to a json file that contains options that must be provided in order to perform fragmentation. These options are specified in the example options.json file and subsequently described in the bulleted list.
include_uncertain_atoms: If true, the fragmenter will perform the local uncertainty analysis in order to determine substructures that are least likely to be well described by the committee of QuaiPa models that were used to parameterize the molecule.
threshold: The threshold value should be a float between 0 and 1. A value of 0 would include every atom and a value of 1 would include only the atoms with the highest uncertainty. This example deploys a value of 0.8.
user_specified_atoms: A list of zero-based atom indices that guarantees certain atoms are considered high priority, regardless of the outcome of the uncertainty analysis. Provide an empty list to only consider results from the uncertainty analysis.
inclusion_distance: All atoms within a graph (bonding) distance of, e.g., 2 around the user-specified atoms and the atoms identified through the uncertainty analysis will also be considered as high priority.
structure (–structure, optional): Path to an SDF, PDB, or JSON file (with a single molecule) that contains the structure that will be fragmented. This argument must be provided if the input file does not correspond to a results.json file from a quaipa param calculation that started with an SDF or PDB file. At present, only oligopeptide molecules are suitable for fragmentation by this program.
cap_type (–cap_type, options: (OXT/NH2, ACE/NME), optional): The type of N- and C- terminal caps to add to individual residue fragments. The default value is OXT/NH2. If OXT/NH2 is provided, OXT (adds a terminal O atom) and NH2 (adds a second H atom) are used to terminate the N- and C- ends of the peptide, respectively. If ACE/NME is provided, ACE (acetyl) and NME (N-methylamine) groups are added to the N-terminus and C-terminus, respectively.
output_directory (-o or –output_directory, optional): Name of the directory where the fragmentation results will be written. If this is not provided explicitly, then the results will be written to a directory called “fragmentation_results_$current_time”.
When quaipa fragment is run, the results are written in the form of a json file with names like “fragment_$i.json”.
If this fragment is associated with user-specified atoms or those identified by the local uncertainty analysis, then
the corresponding json file includes an is_uncertain: true field. These structures are the ones that are likely to
generate the most impactful training data for newly-trained QuaiPa models. An example usage script is shown below, and
the output “fragment_$i.json” files can be used as input for the quaipa combine and quaipa dataset
workflows described below.
QuaiPa Combine¶
Individual amino acids that are generated by the quaipa fragment tool represent the most finely-grained substructures for generating new training data. However, in some cases, it can be important to include larger structures whose conformational space is more representative of the full molecule’s relevant conformations. To this end, the quaipa combine CLI supports the construction of linear and cyclic oligopeptides.
In order to use quaipa combine, the user must specify a few options (the available options can be displayed by executing quaipa combine -h). The general form of the command is:
quaipa combine input
[--options options json]
[--output_directory, -o directory name]
input (positional argument): Path to the output results.json file from a quaipa fragment job.
options (–options): Path to a json file containing the options that are necessary to carry out the fragment combination. These options are specified in the example json file below, and subsequently described in the bulleted list.
fragment_indices: An ordered list of fragment indices describing the sequence of the oligopeptide. These indices correspond to the indices that are appended to the fragment_$i.json files that are generated by the corresponding fragmentation job.
cyclic: If true, the oligopeptide will be cyclized by forming a bond between the two terminal residues.
output_directory (-o or –output_directory, optional): Name of the directory where the combination results will be written. If this is not provided explicitly, then the results will be written to a directory called “combine_results_$current_time”.
When quaipa combine is run, the results are written in the form of a json file named “combined_fragments.json” inside the generated output directory. The output “combined_fragments.json” files can be used as input for the quaipa dataset workflow described below.
QuaiPa Dataset¶
Quaipa is an extensible force field parameterization solution and the generation of additional training data subsets is automated through the quaipa dataset CLI. Executing this command performs a series of workflow steps consisting of:
Solvation of the molecule and conformer generation through molecular dynamics simulations.
Filtering conformers based on:
The uncertainty in energy predictions among members of a committee of QuaiPa models.
Conformational similarity with respect to previously-selected conformations.
Quantum mechanical calculations at the DFT level with pre-configured settings that are consistent with existing datasets.
Generation of a QuaiPa dataset.
QuaiPa training datasets can contain any number of individual structures. However, since each individual training dataset is independently accessible during QuaiPa model training, smaller datasets support more fine-grained control over which training data to include for future QuaiPa models. The quaipa dataset CLI includes multiple options (the available options can be displayed by executing quaipa dataset -h), and the general form of the command is:
quaipa dataset moldata_dir
[--models, -m models json]
[--n_confs num confs to generate]
[--solvent solvent string]
[--name name for the dataset]
moldata_dir (positional argument): A path to a directory containing json files describing each individual molecule that will be included in the dataset. These files are generated as output from the quaipa fragment and quaipa combine methods. All of the molecules that the user wants to include in the training dataset should be consolidated in this single directory.
NOTE: For QuaiPa, training data should consist of fragments and modestly-sized molecules.
models (-m or –models, optional): A path to a json file that contains configuration and checkpoint files for each model.
NOTE: If not provided explicitly, QuaiPa will perform inference on the committee of base models that are distributed with the package.
The models json file should be formatted in the same way as described in the quaipa param section.
n_confs (–n_confs): An integer number of the maximum conformations to generate for each structure in the training dataset.
solvent (–solvent, options: (water, hexane), optional): Solvent type to use during explicit-solvent conformer generation. The default is water.
name (–name, optional): Name of the dataset to be saved. If not provided, then the name will be chosen to be the name of the moldata_dir input directory.
When quaipa dataset is run, all of the input structures are first validated before the calculations proceed. In general, these structures should correspond to the output files generated by quaipa fragment or quaipa combine. It is worth noting that the ability to generate meaningful training data using only small or intermediate-sized molecules is considered a great benefit of the QuaiPa architecture.
During the conformer generation step, explicit solvent molecular dynamics simulations are performed, and solute conformations are stored periodically along the trajectory. The user has an option to select between water and hexane (apolar) as a solvent, so that the solute conformations that are generated are representative of those that are important under biological conditions. After completion, frames are selected on the basis of the uncertainty in their bonded parameter predictions. This is evaluated proportionally to the standard deviation of bonded energies predicted using each QuaiPa model that is passed into the workflow using the models argument. Again, if this argument is not passed explicitly, then the workflow will deploy the pre-trained committee of base models. Conformations with larger uncertainty are assigned higher priority, and they are furthermore screened against all previously-selected conformations to ensure that there is minimal redundancy in the selected conformations. Next, quantum mechanical calculations are performed on each of the conformations before they are ultimately stored in a QuaiPa dataset as an encoded graph.
An example usage is shown below along with the expected directory structure. The script is executed from the current_working_directory, each of the fragment_$i.json files contains the information necessary to include the molecule in the dataset, the results will be written to a directory called example_name, and each graph will include training data for up to 35 conformations.
QuaiPa Train¶
Training new QuaiPa models is made simple through the quaipa train CLI. User-generated datasets can be included in the training data using json files that include paths to the output that is generated by the quaipa dataset workflow step. The result is a newly-trained QuaiPa model that can then be deployed for inference.
NOTE: Training new models must be performed on a GPU card and requires Nvidia drivers greater than version 550.54 [you can check your driver version using the Linux command nvidia-smi].
When training new models, the base model training data is included by default, and the only input necessary is to specify additional training data that the user would like to include and some basic training options (the available options can be displayed by executing quaipa train -h), and the general form of the command is:
quaipa train datasets
[--options options json]
[--output output directory name]
datasets (positional argument): This is a json file that specifies the user-generated training data that will be included while training the new model. The json file should be formatted as follows:
options (–options, optional): This is a path to a json file that contains options. The only option that the user should specify is the maximum number of epochs to include in the training process. If the options path is not provided, then the training process will run with a maximum number of 1000 epochs. For example, if the user only wants to train for up to 500 epochs, the options json file would look like this:
output (–output, optional): This is a path specifying where the newly-trained model output files will be written. If it is not included, then the trained model will be written to a directory called “model”.
With these, quaipa train can be executed with a simple script such as:
QuaiPa Utilities¶
Rescore¶
The quaipa-utils rescore command returns the bonded energy decomposition for each frame in a provided trajectory. The output is a json file containing the energies from bonds, angles, proper dihedrals, and improper dihedrals, as well as the total energy, for each member of the committee provided by the param json file with the –param option. This allows the user to inspect trajectories for structures with large variations in energies across the committee to judge the model’s accuracy and determine if more training data is required. An option to further deconstruct the peptide into its individual fragments allows for performing the analysis on a residue-by-residue basis.
When using the quaipa-utils rescore CLI, the user must provide a few input arguments (the available options can be displayed by executing quaipa-utils rescore -h). The general form of the command is:
quaipa-utils rescore trajectory
[--param, -p params json]
[--topology, -t topology file]
[--csv]
[--options options json]
[--output_directory, -o directory name]
trajectory (positional argument): Path to the input trajectory file in either DCD or TRR format.
param (-p or –param, optional): Path to a json file containing committee and peptide MolData info, such as the output results.json file from a quaipa param calculation.
topology (-t or –topology, optional): Topology file containing the system from the input trajectory. This may include the peptide of interest plus any other molecules included in the trajectory, such as solvent molecules.
csv If provided, will also return csv files containing fragment decompositions.
options (–options, optional): Path to a json file that contains options for rescoring the trajectory. Possible options are:
subset: The subset of atoms from the peptide to use for rescoring. If “automatic”, the full peptide will be used. If a list of atom indices is provided, only those atoms will be used for bond energy decomposition. Default is “automatic”.
stride: Only every stride-th frame will be read upon loading the trajectory. For example, stride = 1 results in all frames being loaded while stride = 2 will skip every other frame. Default is 1.
output_directory (-o or –output_directory, optional): Name of the directory where the rescore results will be written. If this is not provided explicitly, then the results will be written to a directory called “rescore_results_$current_time”.
Merge¶
The quaipa-utils merge utility combines two or more QuaiPa dataset directories into a single dataset. This allows for organizing datasets, such as ones containing similar structures that were generated from separate quaipa dataset jobs.
When using the quaipa-utils merge CLI, the only arguments the user must provide are the paths to the QuaiPa dataset directories. A name for the merged dataset can also be provided with the –name option. The general form of the command is:
quaipa-utils merge dataset_path dataset_path ... dataset_path
[--name, -n merged dataset name]
dataset (positional argument): Paths to the QuaiPa dataset directories to be merged.
name (–name, optional): Name of the merged dataset. If not provided, the new dataset will be written to a directory called “merged_dataset_$current_time”.
If the names of the dataset directories to be merged contain a common prefix, the standard wildcard character (*) can be used to find all matching directories. For example, if the directories are named dataset_1, dataset_2, and dataset_3, the full command could be either quaipa-utils merge path/to/dataset_1 path/to/dataset_2 path/to/dataset_3 or simply quaipa-utils merge path/to/dataset_*.