Notes on working with computational models

Introduction

It might sound obvious to both computational and experimental chemists that there are large differences in how the two groups handle chemical systems, in particular biochemical systems.  Most computational chemists are interested in issues such as accuracy, in systems that are interesting because of special properties, and in modeling phenomena such as NMR and circular dichroism.  Experimentalists, on the other hand, need something that can represent their specific system of interest, that can be used for modeling phenomena in that system, and that, hopefully, can provide insights into the processes that occur in their system.  These notes are intended for use by experimentalists, because they are the ones who will be using computational modeling methods as a tool.

Why should experimentalists use computational methods, and what is the downside of doing so?  Computational methods, in particular the methods available in MOPAC, can provide a detailed insight into biochemical processes that cannot be obtained by any other means.  For example, reaction mechanisms, transition states, and the interactions that control binding and substrate specificity can all be explored in great detail.

On the other hand, computational modeling requires a serious commitment of effort on the part of experimentalists before significant useful results are obtained.  This commitment has two components: the investment of time and effort required to obtain the basic skills necessary for working with computational models, and the investment of time and effort required to set up and run the simulations of interest.  Both components are definitely non-trivial.  To help in this work, a lot of effort has been put into ensuring that the program works correctly, and if an error caused by a user's actions is detected, a polite warning or error message is generated, along with advice on how to correct the fault.  In addition, a suite of utilities built into MOPAC allows many of the operations that need to be performed on biomacromolecules, such as adding hydrogen atoms correctly, to be done easily.

What's hard and what's easy about computational modeling

Experimentalists interested in computational modeling might approach the topic with fear and concern about the intellectual difficulties of handling quantum theory.  This is a common, but unjustified, worry.  All the theory is embedded in the software, all of it.  If, after you've used the software for a while, you find that it's useful, you might become interested in learning some of the theory involved.  Great!  There's a whole world of theory to explore.  But it's important to realize that knowing how the program works is much less important than knowing how to use it.  After all, everyone knows how to use a TV set - grab the remote control and press the "On" button.  But very few know how a TV set works.  It either works or it doesn't.  It's the same with computational chemistry software.

But the fact that running a simulation using MOPAC is as easy as using a remote control on a TV set does not mean that doing research using MOPAC is easy.  A large skill-set is needed before a research project can be carried out efficiently.  Work on identifying these skills and finding what worked and what didn't work took about 20 man-years of effort.  Rather than re-inventing the wheel, please read the following sections on how to carry out a computational chemistry project:

Basics: The file system

Of greatest importance is a well-designed folder system on the computer, and a systematic but not rigid file-naming convention.  "Systematic" means that files representing similar systems should have similar names.  Examples would be: steps in a mechanism, transition states in that mechanism, and mutations being used to investigate binding or specificity.  Each of these systems can be related to others in its group, and, if a consistent naming convention is used, going from one system within a group to another is very easy.  The exception, and the reason why a rigid naming convention is sometimes unwanted, is in processes where the "group" consists of a set of different systems.  An example would be a group consisting of the original PDB structure, the hydrogenated structure, the optimized geometry of that system, etc.  Each of these would require a unique name, e.g., "Original PDB structure", "Hydrogenated structure", "Optimized geometry".

A folder convention needs only a two- or, at most, a three-level structure.  The top-most level would refer to the project.  One level down would be the main level, and this would contain folders that refer to entire groups of systems, for example "Starting model", "GS" (for ground-state systems), "TS" (transition states), "FORCE" (tests for a valid transition state), "IRC" (for Intrinsic Reaction Coordinates), "work" (a sandbox for testing ideas), and "Literature" (original journal articles, etc.).  The main level would also hold spreadsheets, research notes, etc.
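
As a purely illustrative sketch, using the folder names mentioned above, the layout of a project might look like this (all names are only suggestions):

  Chymotrypsin project           (top-most level: the project)
      Starting model             (main level: one folder per group of systems)
      GS
      TS
      FORCE
      IRC
      work
      Literature
      Project table.xlsx         (plus research notes, etc.)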

Have a good backup system.  This is priceless when the hard drive crashes, and also when a mistake is made and an important file is damaged, and the damage is not discovered for weeks or months.

Care and handling of computational systems

Filenames, folders, sandboxes, and tables

It's a good idea to spend time deciding what names to use for files and folders.  At the start of a project this might not seem that important, but if the project is poorly organized, severe problems will occur later on.  Eventually, a systematic approach will become essential, so it's better to set up the organization of the project at the start.  Here are some guiding principles.

Filenames should be completely systematic.  This can result in long filenames, so include only essential material.  For example, if the project involves one protein, then the folder name for the project should include the protein's name, and within the project the protein's name should not appear in any filename.  That saves valuable space in folder and file names.  The parent file is likely to be a PDB file, in which case that filename could contain the protein's name, e.g., "Chymotrypsin 8GCH original PDB file.pdb".  For obvious reasons such a file should never, ever, be modified.  Other files could have names such as "Hydrogenated PDB.arc", "Optimized PDB.arc", "Step 1.arc", "Step 1-2.arc", "Compare Steps 1 and 2.mop", "Compare Step 1 and Step 1-2.mop", "FORCE for Step 1-2.mop", and "IRC for Step 1-2.mop".

Each type of calculation should have its own folder.  Thus all files used in the preparation of a chemically-sensible starting model would be in a folder with a name such as "Make starting model".  All the steps in a mechanism could be in the same folder, "GS" or "Ground State", with each step being indicated by a systematic name, e.g., "Step 1.arc", "Step 2.arc", etc.  Similarly for transition states, "TS", vibrational frequencies, "FORCE", Intrinsic Reaction Coordinates, "IRC", comparisons of various systems, "Compare Systems", etc.

At all stages in a project, questions of how to run a specific type of calculation will arise.  After spending a significant amount of effort on carefully building a file structure for the project, it would not be a good idea to mess up that structure with files generated while testing various ideas.  Instead, one or more special folders should be set up for this purpose.  These folders can be thought of as sandboxes where ideas can be tested.  Such folders could have names like Work-1, Work-2, etc.  Any files in these folders older than a few days should be regarded as unimportant, and in principle could be deleted after only a quick examination.  Of course, a better practice would be to delete all files in a sandbox as soon as the issue is resolved, to avoid even the need for the quick examination.

Each project should also be represented by a table; an Excel spreadsheet is ideal for this purpose.  This would be the computational chemistry equivalent of a lab notebook.  The location of this table is of great importance, because as the cells in the table are populated, the hyperlinks inserted will point to locations relative to the table.  If the table is moved, the hyperlinks would need to be updated; this is a tedious task and one to be avoided.  A good location for this table is the main or top-most level of the project.  This emphasizes its importance.  Initially the table would be empty.  Fill in the top horizontal headings and the left vertical headings.  This step will focus attention on what the project is all about.  Then, as data are generated, fill in the appropriate cell in the table and add hyperlinks to the appropriate files.  Typically, these would be an OUT or ARC file and an HTML file, so that both numeric data and graphical data can quickly be examined.  When a cell has an interesting feature, add a comment to that cell.  Eventually all data relating to the project would be represented by a set of tables.
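
A purely illustrative fragment of such a table, using the step names from the naming convention above (the headings and entries are only suggestions), might look like:

                 OUT or ARC file     HTML file         Comments
  Step 1         Step 1.arc          Step 1.html       -
  Step 1-2       Step 1-2.arc        Step 1-2.html     -
  Step 2         Step 2.arc          Step 2.html       -

Each filename in the table would be a hyperlink to the corresponding file in the project.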

Examples of notebooks for projects

Each example includes a ZIP file containing all the files used or mentioned in the project.  This allows the project to be reconstructed.

Chymotrypsin: Entire reaction mechanism consisting of six steps and four transition states.  An additional transition state for a hypothetical unlikely reaction is included.

MTH1: Analysis of binding site and substrate specificity.  Includes construction of the starting model.

Good policy

Delete unnecessary files

A sequence of events that occurs quite often is:  A problem is identified.  Attempts are made to solve it (use a sandbox!).  Eventually the problem is solved.  At that point, and while the issue is still clear in your head, the entire sequence should be examined.  If the problem was one of understanding, then edit one or more files and one or more cells in a table to add appropriate comments, or, if it is an important point, write notes on it in the top-level folder.  In most cases, the next step would be to delete all the files from the sandbox.  If the issue was both important and complicated, create a new folder and put in it the minimal set of files that show what the problem was and how it was solved.  This allows the issue to be re-visited easily at a later date, if necessary, without cluttering up the project.

Do not save files for failed experiments, i.e., for failed attempts to solve the problem.  In practice, most attempts will fail.  All that matters is the one attempt that worked!

Complete one job before starting another

This policy is particularly important in the preparation of a good starting model.  While all the steps in preparing a good starting model might seem obvious, the sequence in which they should be carried out might not be.  Almost always the starting point is a PDB file, and, in the first few steps, the reference point would be that PDB file.  Eventually, using the original PDB file as the reference would become inconvenient, and the reference would be changed to the chemically sensible starting model.  So what makes the original PDB file an inappropriate reference system for modeling?  It is usually not the system of interest.  For example, an inhibitor might be docked in the binding site.  There might be positional or symmetry disorder.  The original structure might be dimeric, and only one system is needed.  There are any number of reasons why the original PDB file should not be the reference point.  Regardless, here is a sequence of steps that can efficiently convert a raw PDB file into a chemically sensible starting model.

First step: preparing a hydrogenated system

All these steps are done in one job.  This job uses the original, unedited PDB file, and takes very little time to run, so if something needs to be changed, make the change and re-run the job. 

Specifying the PDB file: Use the GEO_DAT keyword to define the PDB file to be used.  This both protects the PDB file from being accidentally modified and makes the data-set very small so editing it is very easy.

Preliminary editing of the PDB file:  If the file contains more than one entire system, and only one is wanted, then use CHAINS=(a), where "a" is the chain of interest, typically chain A.  This will select all atoms that have the selected chain letter and exclude all the others.

Hydrogenation:  Keyword ADD-H will hydrogenate the PDB file specified and produce a system in which all ionizable sites are neutral.  Although some sites will almost always need to be ionized, this system makes a good starting point for the next step.

Modifying hydrogenation:  Keyword SITE=(text) allows sites to be ionized and allows the saturation of carbon atoms to be changed.  At this point, a useful strategy is to run the current system and include the keyword HTML.  If salt bridges are wanted, include SITE=(SALT); otherwise do not include the SITE keyword.  Look at the system using a GUI: if HTML was used, double-click on the HTML web page created by the run.  Identify the sites that need to be ionized or de-ionized, or that need a different saturation.  Then write or edit the SITE keyword and run the job again.  Look at the revised system, and, if necessary, make more changes to the SITE keyword.  At this point, the objective is to get the hydrogenation as correct as possible, but achieving a perfect hydrogenation is often difficult.  Aim for as good a hydrogenation as can be achieved without an undue amount of effort.  If, later on, any faults are found in the hydrogenation, they can be corrected at that point.

At this point, the job to convert the PDB file into a hydrogenated system can be run.  Examples of common keywords used in a job of this type are:

GEO_DAT="8GCH.pdb" ADD-H HTML SITE=(SALT,"[HIS]57:F.NE2"(0))
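
Because GEO_DAT supplies the geometry, the data set itself stays very small.  A minimal sketch of such a data set, assuming the usual MOPAC layout of one keyword line followed by two free-text comment lines, might be:

GEO_DAT="8GCH.pdb" ADD-H HTML SITE=(SALT,"[HIS]57:F.NE2"(0))
8GCH with hydrogen atoms added and salt bridges formed
His 57 handled explicitly by the SITE keyword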

Detecting errors using keywords CHARGES and RESIDUES:  A useful test for detecting errors in the PDB file starts with the hydrogenated system from the previous step.  Use keywords GEO_DAT="Hydrogenated 8GCH.pdb", CHARGES, RESIDUES, and, optionally, OUTPUT to minimize the size of the output.  Run the job, and examine the output.
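
Put together, and assuming the hydrogenated structure was saved with the filename shown, the keyword line for this check might simply read:

GEO_DAT="Hydrogenated 8GCH.pdb" CHARGES RESIDUES OUTPUT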

Check the charges on the various sites by searching the output for the text "Ion Atom No."  Charged atoms that take part in salt-bridges are usually correct.  But if an ionized atom is either more than about 4 Ångstroms from a counterion or does not take part in salt bridge formation, then have a closer look at that ion.

If there is a line "Residue names that have changed", look at the next few lines under "Original residue name  Calculated residue name".  If the changes look reasonable (use a GUI to examine the system used in this job), then continue.  If any problems arise, take whatever corrective action is appropriate.

Re-sequencing:  Normally, re-sequencing is not necessary; by default, it is done automatically when hydrogen atoms are added, unless NORESEQ is present.  Modifying this sequence manually is generally not recommended, because if a hydrogen atom is later added or deleted, automatic re-sequencing will be done again and any manual changes will be lost.  But if the default re-sequenced system is definitely not acceptable, the system can be re-sequenced using an editor.

Up to this point, all the steps have been very fast.  The next steps take considerably more time.

Optimizing the positions of the hydrogen atoms:  Starting with the hydrogenated system, run a job that optimizes the positions of all hydrogen atoms, using an appropriate set of keywords, for example: "GNORM=4 NOOPT OPT-H MOZYME CHARGE=n HTML OUTPUT EPS=78.4".  The resulting geometry will be that of the original PDB file with hydrogen atoms added and their positions optimized.

Optimizing the entire system:  Starting with the geometry from the previous step, replace "NOOPT OPT-H" with "OPT".  Keyword "OPT" turns on all optimization flags.
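
Making that substitution in the keyword set given above, the keyword line for the full optimization might read (CHARGE=n still stands for the actual net charge of the system):

GNORM=4 OPT MOZYME CHARGE=n HTML OUTPUT EPS=78.4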

Bad policy

Do not do any work in folders other than the sandboxes, e.g., folders with names such as Work, Work-1, etc.  No work should be done in any of the other folders below the top-most level of the project.  When a useful result is obtained, copy the relevant files to the appropriate folders in the project, and update the table, i.e., the lab notebook.  If, later on, a change has to be made to files in the project, copy the files to a sandbox, make the change, and then copy the files back.

Except at the start of a project, do not use RESIDUES or RESIDUES0 unless there is a compelling reason.  This is particularly important when a project has been underway for a while. If either of these keywords is used, the COMPARE option might not work because some files might have one naming system for the atoms and some files a different naming system.

When something does go wrong, do not try to correct the error straight away.  Instead, find out what went wrong, then, starting with the system immediately before the fault, take corrective action and run the jobs again.  Only if the fault is very minor and can be easily, emphasis on easily, corrected should results files be edited.  If results files are edited, it's very easy to make a typographic error, and that error could cause severe problems later on. 

When a job stops unexpectedly, do not ignore the output.  Very often there will be a warning or error message at the end of the output that will explain what went wrong.  By design, MOPAC is allowed to make some corrections to a system, but if there is more than one possible correction, it will always stop.  Then it's up to the user to decide how best to proceed.

When things go wrong

There are two main reasons why things go wrong.  Most of the time, you've made a mistake.  Examine what's going on, identify the error, correct it, and carry on.  Less often, but more interesting, is the possibility that you have misunderstood the system.  When that happens, re-examine the chemistry that's occurring at that point.  Things get really interesting when you suspect that you're correct and that the output is wrong, i.e., that the method being used is making a mistake.  An example of such a phenomenon would be when you think an acid residue should be ionized, but every time a geometry optimization is run a proton moves back onto the carboxylate group.

In such a situation, i.e., when there is reason to think that the results are wrong because of a fault in the method used, proceed as follows:  Design a small system that illustrates the phenomenon, then model it using MOPAC and also using a higher-level method, e.g., B3LYP with the D3 dispersion correction and a good basis set.  If the same qualitative result is obtained, then there is a high probability that the calculated results are correct, and that you have been using an incorrect assumption.  This is particularly interesting when the incorrect assumption is currently accepted as being correct.

When jobs take an excessive amount of time, and the system isn't simply very big, examine it for faults.  Many types of faults can slow down the SCF calculations, and such slowing can be a red flag that there is a fault in the system.