MOZYME
Main MOZYME function
Conventional semiempirical methods use matrix algebra methods, most of which scale as the third power of the number of atoms. A consequence of this is that modeling of systems of more than 1,000 atoms is impractical.
By using Localized Molecular Orbitals (LMO), the Self-Consistent Field equations can be solved in a time proportional to the size of the system. Increasing the speed of the SCF moved the slow step to other parts of the calculation, so changes were made to all the remaining time-consuming steps to make them more efficient. Changes were also made that reduced the memory demand. The result of all these changes is that routine modeling operations can now be carried out on over 90% of all the entries in the Protein Data Bank (PDB), previously only about 10% of all entries could be used.
Geometric functions
Most of the common modeling functions have been enhanced by the MOZYME function, these include:
- Single-point calculations.
A general impression of the change in behavior can be obtained by comparing the resources required for various single-point, i.e., single SCF, calculations. Other operations, such as geometry optimization, consist of repeated SCF calculations. All calculations shown here were done using a Windows XP Pro computer with a 4000+ or 2.41 GHz Athlon processor.
The test systems used here were constructed from a single large protein, cut into pieces of the appropriate size.
Comparison of MOPAC and MOZYME computer resources required for a 1SCF calculation
|
No. of atoms |
Time for 1SCF (minutes) |
Memory (megabytes) |
Ratio MOPAC/MOZYME |
|
MOZYME |
MOPAC |
MOZYME |
MOPAC |
time |
memory |
|
400 |
0.2 |
2.3 |
17 |
101 |
12 |
6 |
|
800 |
0.4 |
25.4 |
33 |
391 |
64 |
12 |
|
1000 |
0.9 |
59.0 |
43 |
607 |
65 |
14 |
|
1500 |
2.3 |
222.6 |
78 |
1,424 |
97 |
18 |
|
2000 |
2.9 |
(527)¹ |
102 |
(2,532) |
(182) |
(25) |
|
3000 |
5.6 |
(1,781) |
164 |
(5,696) |
(318) |
(35) |
|
5000 |
15.8 |
(8,244) |
367 |
(15,822) |
(522) |
(43) |
|
10000 |
56.9 |
(65,956) |
1,007 |
(63,288) |
(1,159) |
(63) |
|
15000 |
230.3 |
(222,600) |
1,026 |
(142,400) |
(966.4) |
(139) |
|
18000 |
308.4 |
(384,653) |
1,262 |
(205,056) |
(1,247) |
(162) |
1: Numbers in parentheses are estimated
After the first SCF calculation is done, all subsequent SCF calculations run much faster:
- Geometry optimization:
This includes single point, restricted, and unrestricted, for molecular assemblies and for solids.
- Transition state location.
This operation runs at about half the speed of geometry optimization.
- Transition state refinement.
This operation runs at about 20% of the speed of geometry optimization.
- Transition state characterization.
The
FORCE calculation is a rapid operation now, running in a
small fraction of the time required for geometry optimization.
- Intrinsic Reaction Coordinate.
Unlike the other operations, this is very time consuming,
taking 10 – 50 times longer than a geometry optimization.
- Reaction paths, grids, etc.
These can be run using only the atoms of interest, in which case
it is very fast, or using all atoms at every step, in which case it is slow. The time for each point is
only slightly less than a geometry optimization.
Accuracy
Properties of proteins (structure, energetics, reactions, etc.) are reproduced with good accuracy using PM6. For details of PM6, see:
http://www.springerlink.com/content/ar33482301010477/fulltext.pdf
For accuracy of PM6 compared to other methods, see:
http://openmopac.net/manual/index_accuracy.html
For accuracy of modeling protein properties, see: HTTP<add URL here>
Limitations
The current MOZYME function is restricted to closed-shell systems
for which a Lewis structure can be generated. Only Restricted Hartree Fock methods can be used, ROHF and
UHF are not available. This means that well-known radicals, such as ethyl, C2H5·, and
triphenylmethyl, (C6H5)3C·, cannot be modeled. An exception to this rule
concerns biomolecules that contain transition metals. These are normally treated as open-shell systems, but
if the main focus of interest is on geometries and energetics, they can be modeled using closed shell methods.
- Electronic properties:
Only the ground electronic state can be modeled. This excludes all excited electronic states.
- Structures:
Thus far, no structures have been found that cannot be modeled using MOZYME.
The maximum number of atoms attached to a given atom is 15.
- Upper size limit for geometry optimization:
15,000 atoms. Above 10,000 atoms, optimization becomes slow.
- Upper size limit for single point calculation (usually for generating forces):
18,000 atoms.
- Upper size limit for transition state operations, the IRC and the DRC:
The same as geometry optimization.
Utilities
- Detection of errors in the starting geometry. Of their nature, proteins are large molecules,
and making sure that every atom has the correct valency is tedious. Every program for adding hydrogen
atoms that was tested has faults, and even one fault can ruin a simulation. All faults resulting in incorrect
valences can be identified by keyword CHARGES. This prints out the atom numbers and types of all charged atoms.
These can then be quickly checked using a GUI, and any corrective action then taken.
- Identification of residues.
The residue sequence can be calculated and printed. This allows unrecognized residues
to be identified. If they represent user errors, then corrective action can be taken. If they represent genuine hetero groups,
they can be re-defined as modified residues using keyword XENO. If they are the result of severe distortions generating spurious bonds,
these bonds can be broken using keyword CVB.
- Identification of ionized residues.
During a geometry optimization, hydrogen atoms might migrate to form a salt bridge.
All resulting ionized residues are indicated by a trailing "+" or "" sign. Partial charges on residues can also be printed.
- Ramachandran angles.
All Ramachandran angles are printed when residues are identified.
- Atom and residue gradients.
Total forces acting on atoms and on residues can be printed. This allows the
rapid identification of geometric distortions, particularly in X-ray structures.
- Restart option.
Because calculations involving proteins often take many days, a robust restart
function is provided. During a long run, the restart file(s) can be used to run a zero-SCF, this allows a detailed
examination of the state of the running system to be made, without stopping the run.
Write PDB files. When PDBOUT or RESEQ is used, the geometry is printed out in PDB format.
- Re-define chains and residue numbers
- Resequence. When hydrogen atoms are added, they are usually put at the end of the data set.
This results in a non-standard PDB format. The standard PDB format can be generated using RESEQ. This puts all atoms in the same residue together.
If two geometries, from entirely different PDB files are compared, the atoms in each file must first be put into the same sequence,
so that there is an exact one-to-one correspondence. Unavoidable ambiguities in the PDB format definition require some extra re-arranging
of atoms to ensure maximum coincidence. This is done using keyword GEO_REF.
Getting started using MOZYME
Both because MOZYME is a new function in MOPAC, and because the issues involved in manipulating proteins are very different from those involving small molecules, all users are strongly recommended to spend some time familiarizing themselves with the new functions.
General guidelines
The initial geometry should be uncharged. In many proteins there are charged sites such as carboxylate anions and ammonium cationic sites; often these form internal salt bridges. Nevertheless, in order to avoid any ambiguity about the electronic state of the system, all charges should be neutralized during the preparation of a starting geometry
by adding or removing hydrogen atoms. Even systems, such as bacteriorhodopsin,
that have a definite net charge, should be first represented as the neutral species. The only exceptions are when a strong monovalent
ion such as potassium or tetra-alkylammonium cation, or a halide anion is present, in which case the ion should be correctly represented.
Preconceived ideas of charge are often wildly inaccurate. These include generalizations such as "all acid functions should be ionized" and "all arginine residues should be protonated." If these ideas were implemented, the resulting net charge on a protein would likely be nonsense.
This policy of requiring the initial structure to be uncharged might appear to be restrictive, but during a geometry optimization salt bridges will often form spontaneously, so the initial uncharged species is quickly replaced by a more realistic system. After the geometry is optimized, any ions the user considers important can then be added. If the entire protein has to have a certain net charge, protons can be added or removed as necessary. Thus bacteriorhodopsin would have the retinal Schiff base nitrogen atom protonated to give bR+.
The end result is a well-defined protein with the correct charged sites.
- Check hetero groups. Most utilities that add hydrogen atoms do a good job of handling the common residues, but a poorer job with hetero systems, such as the heme group. Because of this, almost all the errors in adding hydrogen atoms occur with hetero groups. Therefore, these groups should be checked carefully to verify that the degree of saturation is correct.
- Identify and characterize uncommon residues. If a residue is not recognized by MOPAC, check that it is correct.
If it should be one of the 20 common residues, either add the missing atoms, or use
XENO to allow the residue to be recognized.
If it is a modified residue, such as the retinal group in bR, use XENO to re-define it as the parent residue
Do not ignore unrecognized residues – they usually indicate a fault in the data set.
- Trust that the charged sites printed are correct. The method for calculating charged sites is very robust.
If it says that a site is charged, believe it. In older versions of MOPAC, charged sites were not printed,
and about 20 users reported errors in MOPAC, saying that the net charge it wanted was incorrect.
In every case, errors were found in the user’s data set, due to simple errors in assigning hydrogen atoms.
Although the charges reported by MOPAC are always correct, minor problems might be encountered in extended, delocalized,
p systems. Thus if the hydrogen on the phenol oxygen in tyrosine is missing,
the charge might be placed on the Cg rather than on the oxygen,
that is, the Lewis structure MOPAC generated was for the quinone rather than the phenolic resonance structure.
Starting Exercises
To help with this, the following exercises are suggested:
- Verify that MOZYME works:
This is a simple example to show what a MOZYME output file looks like. The
system is Cys-Phe-Glu, but in the data set only the Cartesian coordinates are
given. By adding RESIDUES the various
amino-acid residues in the tripeptide are identified. Information on the
residues is printed shortly after the empirical formula is written in the output
file. More new information on the residues in printed in the output. This
includes the total forces acting on the residues (a measure of how far the
residue is from being optimized), and the net charge on each residues.
Most of these will be small, the exception being ionized residues, where the
charge will be near to +1 or -1.
- Generate a PDB file:
MOPAC can convert a data set into PDB format. The simplest way to do this
is to add PDBOUT and
0SCF to the keyword list.
0SCF prevents any semiempirical calculations from being run, so this operation
is very fast, even for large systems. If information on the secondary structure
(helices and beta sheets, etc.) is available, add that information to the
comments section of the data set (lines at the start of the data set that begin
with an asterisk, "*").
- Read in a PDB file:
MOPAC should automatically recognize and read in PDB files. All the text
in the PDB file up to the first atom (a line starting with the word "ATOM")
should be deleted, and the standard three lines of a MOPAC input data set put
there instead. If the PDB header text is needed, for example if PDBOUT
is present, then convert the header lines into MOPAC comments by adding an
asterisk ("*") to the start of each line.
- Re-sequence a data set to conform to the PDB format:
If atoms are added or removed, so that the sequence of atoms in the data set
does not follow that of the PDB format, the atoms can be re-sequenced, using
keyword RESEQ. This example is artificial, in that all atoms of each type
are together. When RESEQ is used, the job is stopped after resequencing.
-
Recognize an uncommon residue:
In this example, a modified lysine group is correctly recognized by specifying
the modification using
XENO.
-
Identify a fault in a data set:
A data-set will not run because the charge specified (zero) is incorrect.
By using CHARGES, the ionized sites can be
identified. By analyzing these sites, any faults in the data set can be
identified, and corrective action taken. In this case, the only fault was
an incorrect charge.
- Optimize a polypeptide:
In general, optimizing geometries is best done in two stages. First,
optimize the positions of all hydrogen atoms, then optimize the positions of all
atoms. In this example, all atoms are initially marked for optimization,
but by switching off all optimization flags (NOOPT)
then turning on the flags for hydrogen (OPT-H), only the hydrogen
atoms are optimized. Once that's done, all atoms can then be optimized by
switching on all the optimization flags (OPT).