# Molecular Modeling of Biophysical Systems

## Background

The primary goal of molecular modeling is to be able to predict the structure, dynamics, and interactions of a molecular system in fine detail. In general, this involves using quantum mechanics and classical particle physics to derive the higher-order chemical and structural properties of a system. To this end, molecular dynamics (MD) simulation, has evolved from rather humble beginnings in the 1970s into a powerful, emergent tool in biology for investigating the structure, function, and dynamics of biomolecules and biomolecular complexes. In 2013 the Nobel Prize in chemistry was jointly awarded to three of the early pioneers of MD simulation -Martin Karplus, Ariel Warshel, and Michael Levitt- for “the development of multi-scale models of complex chemical systems.” As the cost of processing power continues to decrease, and efforts to more effectively utilize existing hardware such as graphics processors continues to increase, MD simulation can be expected to become even more widely used and accepted as an important complementary tool for investigating biomolecular systems.

## Quantum mechanics

Quantum mechanics is the first-principle theory in interpreting the characterization of microscopic system. A particle is described by wave function Ψ(x, y, z, t) in quantum mechanics. While Ψ doesn’t have clear physical meaning, |Ψ|^{2}, the module square of Ψ represents probability density of such particle in a point of space-time (x, y, z, t). Thus Ψ determines all the observable of this particle.

For a static system, the time-independent Schrödinger equation can be used to solve time-independent wave function ψ:

\[H\psi = E\psi \]

H is The Hamiltonian operator. E is energy of the system. The Schrödinger equation implies that, a system which is described by a wave function ψ has a specific energy. That is where the energy level comes from.

Schrödinger equation can only be solved exactly for hydrogenic atoms (atom with only one electron and any atomic number). For more complex system, approximation methods will certainly be necessary.

*Ab initio* methods

The Latin term *Ab initio* means “from the beginning”. Such methods derive from quantum mechanics directly without inclusion of experimental data. *Ab initio* solution is not exact solution, but is still the most accurate and reliable result. Different *ab initio* methods apply different form of approximation, which separate them from each other.

#### Hartree-Fock self-consistent field method

Hartree-Fock self-consistent field method is the basis of many *ab initio* methods. It actually includes two ideas. One is Hartree-Fock approximation, the other is self-consistent methods.

The hamitonian operator for a molecule has formulation as below:

\({\rm{H}} = {T_e} + {T_N} + {V_{Ne}} + {V_{ee}} + {V_{NN}}\)

T_{e} and T_{N} represent the kinetic energy term of electrons and nucleus. T_{N} can be seen as zero because movement of whole molecule will not influence the energy.

V_{NN} can be treated as constant by Born-Oppenheimer approximation, which is the first approximation in molecular modeling. The content of this approximation is that the nuclei can be seen as fixed, because electrons move much faster than nuclei and respond to any movement of nuclei instantaneously. Thus we can repeat the computation and get a set of solutions for different distances between nuclei.

Then the hamitonian operator will become (in atomic unit):

The presence of electron-electron repulsion, V_{ee}, makes exact solution unachievable. Such electron correlation problem is in the central position of many *ab initio* methods.

The wavefunction in Hartree-Fock method has the form of slater determinant

Each χ represents a molecular orbital, which is a linear combination of atomic orbital ψ, which is also called basis set.

\(\chi = \mathop \sum \limits_r {c_r}{\psi _r}\)

Thus Ψ actually contains a lot of coefficients. The problem is to determine these coefficients.

According to variation theory, we have such Rayleigh ratio

For any Ψ, E cannot be lower than the true energy. Therefore we can optimize Ψ to get the lowest E. The corresponding Ψ will be the solution from variation theory.

Based on slater determinant wavefunction and variation theory, Hartree-Fock method moves forward to apply central-field approximation to tackle electron correlation. By this approximation, the electron moves in the field of all the other electrons and nucleus. Such field can be expressed as charge centered on nucleus. The problem thus becomes easier and Schrödinger equation also becomes Hartree-Fock equation.

\({\rm{F}}{\chi _i} = {\varepsilon _i}{\chi _i}\)

ε_{i} is orbital energy of χ_{i}. F is Fock operator

\({\rm{F}} = h + \mathop \sum \limits_r \left( {2{J_r} - {K_r}} \right)\)

h is one electron hamiltonian. J_{r} is coulomb operator, which represent coulomb repulsion between electrons. K_{r} is exchange operator, which arise from Pauli principle and has no direct analog in classical theory.

The trick is, F, actually J_{r} and K_{r}, depends on wavefunction χ – the state of all the electrons determines the central field and energy of any electron. To solve Hartree-Fock equation, we need first guess a set of molecular orbitals χ. Then we calculate Fock operator F from χ. The Fock operator directs to a new set of χ and this will cause a new Fock operator. The calculation will be repeated until the set of χ does not change significantly. Then the result is considered as self-consistent. Combined with central field approximation, that’s how self-consistent field comes from.

In HF calculation, basis set selection may be the most common decision one needs to make, and really determines the validity of a calculation. One choice is Slater-Type Orbital (STO), which has form

STO is not feasible for calculation, the commonly used is Guassian-Type Orbital (GTO), which is the modified version of STO and has better character for calculation

Several such kind of primitive Gaussian functions are often combined linearly to form contracted Gaussian functions with fixed coefficients, which can be decided by a least square fit to STO. A STO-3G basis set means each atomic orbital is described by a linear combination of three primitive Gaussian functions. We still consider this combination as one basis function.

A minimal basis set use one basis function for every atomic orbital. A minimal basis set for hydrogen only consist one “1s” basis function; while carbon will need five basis functions for minimal basis set—“1s, 2s, three 2p”.

Minimal basis set often can only give qualitative result. To make improvement, double-zeta basis set can be applied. In a double-zeta basis set, each basis function in minimal basis set is replaced by two. Thus the number of basis set will be doubled and lead to a more accurate result.

Applying double-zeta basis set to all the atomic orbitals is expensive, and not very efficient. Instead, a split-valence basis set is often used. In split-basis set, each valence orbital is represented by two basis functions, while inner orbital is only represented by one. The commonly seen 6-31G basis set means a STO-6G basis function is applied on each inner orbital while a STO-3G and STO-1G basis functions are applied on each valence orbital.

To make the calculation more accurate, polarization functions can be added to describe the distortion of s and p orbitals. 6-31G* is often used to represent adding six d orbitals to heavy atoms, while 6-31G** continue adds three p orbitals to hydrogen.

*Ab initio* methods are often the only choice in predicting process involving electron transfer such as chemical reaction. Besides, some parameters in molecular mechanics force field are also provided by *ab initio*.

### Semiempirical methods

*Ab initio* methods such as Hartree-Fock are expensive and cannot be used in molecules with dozens of atoms. To make calculation feasible for larger system, several simplifications are introduced to Hartree-Fock methods. Parameters based on experiment data are substituted into the Hartree-Fock equations. The combination of *ab initio* methods and experimental data forms the so-called semiempirical methods.

In semiempirical methods, minimal basis set is always used. Meanwhile, the inner electrons are parameterized. Thus the number of basis functions decreases a lot. While still not enough, the simplification on two-electron integrals, which derive from coulomb and exchange operator, differentiate various semiempirical methods. With different level of approximation, the surviving integrals will be parameterized.

The most severe approximation is Complete Neglect of Overlap Differential (CNDO), which has only limited usage. Nowadays the most commonly used semiempirical methods in organic system are Austin Model 1 (AM1) and Parameterization method 3 (PM3), which are improved version of Modified Neglect of Overlap Differential (MNDO).

While computation becomes easier in semiempirical methods, it’s not easy to use them. Different semiempirical methods derive from different database to obtain parameters. If the molecule to be computed is similar to the ones in database, the result will often be reliable. Otherwise the computation may have a large deviation. Therefore, it is necessary to realize the advantage and disadvantage of different semiempirical models before using them.

Semiempirical methods are often used to predict conformation and energy of medium-sized molecules. They are also widely used in first guess in *ab initio* calculation.

## Molecular mechanics

*Ab initio*, even semiempirical methods, are too computation-intensive for biomolecular system. Instead, molecular mechanics is the basis of various methods in simulating behavior of such large system. Building on classical theory, molecular mechanics totally avoid complexity of quantum mechanics.

Molecular mechanics treats molecule as charged particles connected with springs. The energy of molecule is represented by a potential energy function which is the sum of various interaction energy functions. These functions contain parameters obtained from experimental data and *ab initio* calculation. A combination of potential energy function and parameters is called force field, which is the essence of molecular mechanics.

The common form of potential energy is

**E = E _{s} + E_{b} + E_{tor} + E_{ele}+ E_{vdW}**

Each term on the right hand is shown by diagram below in the same order

_{} _{ } _{} _{} _{}

E_{s} and E_{b} are bond stretch energy and bond bend energy, which can be described by harmonic oscillator

E_{tor} is the torsional energy of bond rotation, which often has the form of Fourier series

where A is amplitude, τ is dihedral angel, φ is displacement in τ, n is periodicity factor and represents rotation symmetry.

E_{elec} and E_{vdW} are non-bonded interaction energy. E_{elec} is electrostatic interaction energy and often described by coulomb function. E_{vdW} is van der Waals energy and can be described by Leonard-Jones function

Besides these common interaction energy terms, some force fields may contain energy terms for hydrogen bond or dipole-dipole interaction. Cross terms like stretch-bend and bend-torsion can also be added to increase accuracy.

There is a lot of force field developed; most of them focus on organic and biomolecular system. MM force field, include MM1, MM2, MM3, MM4, may be the best force field for organic molecules. Amber and CHARMM are the most widely used force field for protein and nucleic acid, also the software with the same name. Molecular mechanics is even more sensitive to parameters than semiempirical methods. So it’s quite important to apply appropriate force field for particular system.

The first step in molecular mechanics calculation is giving parameters to all the atoms. Force field often classifies a kind of atom to different atom types to accommodate different chemical environment. For protein, α carbon and carbonyl carbon belong to different atom types. As biomolecules are often made from limited units (like amino acids for protein and nucleotides for nucleic acid), their parameters can be assigned automatically and directly from force field.

However, there are often no directly corresponding parameters for small molecule ligands in force fields designed for biomolecule. The most similar atom type in force field may be assigned to such ligand atoms. Sometimes it is necessary to “transplant” parameters from other force field (easy to transplant is actually an advantage of molecular mechanics). Among all the parameters, partial charge in each atom may be most critical and often be treated separately. Both experience based methods and *ab initio* methods can be used to calculate partial charge.

Molecular mechanics can be used to calculate optimized conformation and predict vibrational spectrum, but cannot treat any process involving electron transfer. Because of its calculation efficiency, molecular mechanics become the basis of large scale simulation like molecular dynamics simulation.

## Molecular dynamics simulation

A generic MD algorithm will be comprised of several discrete steps.

*Calculation of Forces**Time displacement**Calculation of new coordinates *

To begin an MD simulation, initial conditions must be defined, and these include: atomic positions as a set of cartesian coordinates, potential energy as a function of atom positions, and initial velocities for all atoms in the system. From these initial conditions, the forces on each atom are computed as an additive sum of pair-wise non-bonded and bonded interactions, plus constraints and scaling parameters. The computed results are used to update the position and velocity of each atom, and this information is (optionally) stored. All of these calculations are performed for a single time-displacement, or time-step, of the simulated trajectory.

### Force calculation

Force calculation in MD is entirely classical, deriving from Newton’s Second Law of motion. A simulated molecular system is treated as a discrete field of atoms each of which have pair-wise interaction potentials from which forces can be calculated. This is known as the all-atom interaction potential energy, and it is constructed using a parameterized “force field” for bonded interactions, and explicitly calculated for longer-range, non-bonded interaction potentials. Force field parameters are derived either from a fitting against experimental data, or quantum mechanical calculations.

For a classical system of atoms, the Hamiltonian energy function is the sum of the kinetic and potential energies:

\[H(\mathbf{r,p})=\frac{p^2}{2m}+U(\mathbf{r})\]

The kinetic portion depends only the mass and momentum of each atom, while the potential energy depends on pair-wise interactions that are position dependent. In one dimension, the derivative of the potential energy *U*(**r**) is opposite to the force, acting on a particle, or

\[\mathbf{F(\mathbf{r})}= - \frac{dU(\mathbf{r})}{d\mathbf{r}}\]

Recalling that

\[\mathbf{v(\mathbf{r})}= \frac{d\mathbf{r}}{dt} and \mathbf{a(\mathbf{r})}= \frac{d\mathbf{v}}{dt}\]

The connection to between energy calculation and position is through Newton's Second Law: \(\mathbf{F}=m\mathbf{a}\), or

\[m\frac{d^2\mathbf{r}}{dt^2}= - \frac{dU(\mathbf{r})}{d\mathbf{r}}\]

Hence the forces on each atom in a system, and thus the motion of the system can be calculated as a sum of intra-molecular kinetic and pair-wise inter-molecular potential energies. For biomolecules, the bonded potential energies are a parameterized set of semi-empirical functions collectively known as the "force field." There are a wide variety of force fields that are continually under refinement, often incorporating quantum mechanical calculations or empirical data to obtain more accurate parameters. and non-bonded potentials describe the longer range electrostatic and the close range van der Waals attraction and repulsion, collectively described as the Lennard-Jones potential.

#### Bonded potentials

Bonded potentials describe the interactions of covalently bonded atoms, and are of order \(O(N)\) in terms of computation. Some common bonded potentials are:

Vibrational \[U_{bond} = \sum_{bonds}{A_r(r - r_{eq})^2}\]

where *A _{r}_{ }* and

*r* are defined by the force field, and

_{eq}*r*is the bond length between atoms.

Bond-angle \[U_{angle} = \sum_{angles}{A_\theta(\theta - \theta_{eq})^2}\]

where \(A_\theta\) and \(\theta_{eq}\)are definined by the force field, and \(\theta\)is the bond angle between atoms.

Torsion angles \[U_{torsion} = \sum_{torsions}{\frac{V_n}{2}(1+ \cos{n\phi+\gamma}))}\]

#### Non-bonded potentials

Non-bonded interaction potentials involve interations of *all pairs* of atoms in the system, and are modeled as a sum of Coulombic potentials and a modified Lennard-Jones potential. These potentials constitute the bulk of computational complexity in a simulated biomolecular system, usually of order \(O(N^2)\) at worst and \(O(NlogN)\) at best in terms of computation. For each pair of atoms, *i, j*

\[U_{NB}(i,j) = U_{Coulomb}(i,j) + U_{LJ}(i,j)\]

where

\[U_{Coulomb}(i,j) = \frac{1}{4\pi\epsilon_0}\frac{q_iq_j}{r_{ij}}\]

and

\[U_{LJ}(i,j) = 4\epsilon_{ij}\left[\left(\frac{\sigma_{ij}}{r_{ij}}\right)^{12}-\left(\frac{\sigma_{ij}}{r_{ij}}\right)^{6}\right]\]

where *q *is the charge, \(\epsilon\) is the permittivity, and \(\sigma\) is the Lennard-Jones parameter.

### Discrete step-wise motion in time

In order to calculate the new positions of all atoms in simulation MD, an integration algorithm must be implemented. The most common methods are variations on the Verlet algorithm. The standard Position Verlet, which has the advantage of not depending on velocity, is derived from combining Taylor expansions of position vectors.

First three terms of the Taylor expansion, forward and backwards in time:

\[\mathbf{r}(t+\Delta{t})=\mathbf{r}(t)+\mathbf{v}(t)\Delta{t}+\frac{1}{2}\mathbf{a}\Delta{t}^2 +O(\Delta{t}^4)\]

\[\mathbf{r}(t-\Delta{t})=\mathbf{r}(t)-\mathbf{v}(t)\Delta{t}+\frac{1}{2}\mathbf{a}\Delta{t}^2 +O(\Delta{t}^4)\]

Summing and rearranging terms, we get

\[\mathbf{r}(t+\Delta{t})=2\mathbf{r}(t)-\mathbf{r}(t-\Delta{t})+\mathbf{a}\Delta{t}^2 +O(\Delta{t}^4)\]

Another common integrator is a variation on Position Verlet called “Leap-frog” Verlet, and has an advantage over Position Verlet in that the previous position in time is not needed, and relies on velocities instead. In this integrator, velocity going forward in time is expanded in a similar way but half time-steps, whereas position is still expanded going forward in a full time step.

\[\mathbf{v}(t+\frac{1}{2}\Delta{t})=\mathbf{v}(t-\frac{1}{2}\Delta{t})+\mathbf{a}\Delta{t}+O(\Delta{t}^3)\]

Substituting into the forward expansion of position, we have

\[\mathbf{r}(t+\Delta{t})=\mathbf{r}(t)+\mathbf{v}\left(t+\frac{1}{2}\Delta{t}\right)\Delta{t}+O(\Delta{t}^3)\]

With leap-from Verlet, the positions are accurate to an order of \(O(\Delta{t}^3)\) as opposed to \(O(\Delta{t}^4)\) for position Verlet.

### Computational Considerations

In molecular dynamics simulation, computational complexity must be considered, and there is a trade-off to be managed between representing a system in a reasonably accurate way, while maximizing the size and trajectory length scales of the systems being modeled. One standard way of reducing complexity is to apply harmonic constraints to bonded interactions, which oscillate on a time scale of ~1fs, so as to increase simulated time step to about 2fs. Another important technique is to impose periodic boundary conditions and long-range non-bonded cutoffs.

In MD, calculation of non-bonded interactions take up most of the CPU time, and in general, CPU cost is on the order of \(O(N^2)\) where N is number of atoms in the system being simulated and \(O\) is algorithmic notation for worst-case order of growth. In most cases, a modern, multi-core CPU can simulate about 200ns/day with a 3000 atom system.

#### Harmonic constraints

Since valence bonds vibrate at a high frequency, this imposes a small integration time-step ~1fs for simulation. In order to speed up simulations, a harmonic constraint is typically imposed in order to double the minimum time step to about 2fs. The constraint has the form:

\[r_i^2-a_i^2=0\]

where \(a_i\) is constant and \(r_i\) is the atomic bond length.

The modeling trade-off is that with this constraint, the coordinates of two bonded atoms are not truly independent.

#### Periodicity and long range cutoffs

One important technique for reducing the size and hence, computational complexity of a given system to impose periodic boundary conditions. This technique reduces the effective size of a simulated system while avoiding unwanted phase boundaries such as between water and vacuum. Several geometric shapes exhibit desirable triclinic periodicity for space-filling unit cells, the simplest of which is a cube. There are however, others, such as the rhombic dodecahedron and the truncated octahedron which are more desirable for modeling small molecules or globular proteins in particular. Periodicity is established by replicating the unit cell in each dimension, forming a lattice such that when a particle leaves the simulated unit cell, it’s image enters from the correspondingly opposite side. This is important for keeping the density of the unit cell constant, but in order to avoid self-interaction, a long range potential cutoff has to be chosen that is at most half the size of the unit cell, bringing the potential to 0 for \(r>r_{cut}\). Particle Mesh Ewald is the commonly used algorithm for calculating the periodic interactions.

#### Temperature and Pressure

Temperature and pressure are two important thermodynamic variables that are straightforward to calculate in a simulated MD system. Temperature is simply the ensemble average of the kinetic energies of all particles divided by the number of degrees of freedom. For *N* molecules in a system:

\[T=\frac{1}{3Nk_B}\left<\sum_{N}mv^2\right>,\]

where \(k_B\) is the Boltzmann constant.

Pressure is a little more complex in that it involves the virial expansion of the ideal gas equation, and pair-wise potential calculation of force between atoms.

\[P=\frac{1}{3V}\left[\left<\sum_{N}mv^2\right>+\left<\sum_{N}\mathbf{F}\cdot\mathbf{r}\right>\right],\]

where \(\mathbf{F}\) is the force acting on an atom, and \(\mathbf{r}\) is it's position.

Most production MD runs occur after an equilibration of temperature and pressure in which each is held fixed and velocities and positions are rescaled accordingly. The most common temperature rescaling method is the Berendsen thermostat, which rescales velocities by a factor, \(S_Temp\), but coupling to a fixed temperature.

\[S_{Temp}=\left[1+\frac{\Delta{t}}{\tau_T}\left(\frac{T_0}{T}-1 \right ) \right]^\frac{1}{2},\]

where the relaxation time constant, \(\tau_T\) is much larger than the time step \(\Delta{t}\) so as to weakly couple the system to the target temperature, \(T_0\).

Pressure coupling uses a similar scaling factor, but scales positions and unit-cell box size instead of velocities.

#### Treatment of solvent

Biomolecules need to be modeled in an aqueous environment because hydrophobicity is a well known driver of conformational changes and a mediator of binding. However, explicitly modeling solvent molecules is extremely computationally expensive. Efforts have been made to improve what are known as implicit solvents, in which the global effect of water is simulated by treating the surrounding media as a field with a high dielectric constant.

Recall, \[U_{Coulomb}(i,j) = \frac{1}{4\pi\epsilon_0}\frac{q_iq_j}{r_{ij}}\]

**Implicit Solvent**

In the implicit solvent approximation, the biomolecule only interacts with itself, but the electrostatics are modified in order to account for the lack of an atomistic representation of solvent. To achieve this, the dielectric constant in the electrostatic calculation is modified such that vacuum \(\epsilon=1\), protein \(\epsilon = 2-20\), and in water \(\epsilon = 80\).

**Explicit solvent**

This method is computationally more complex but arguably more accurate because water and ions are modeled explicitly and interactions are calculated. However, if hydrogen bonding, or any kind of specific protein-water interaction is of interest, then using explicit water model is a must. Most water molecules used in MD use a rigid, 3-site model of water where each atom is assigned a charge a Lennard-Jones potential. Certain models, such as the commonly used TIP3P are non-polarizable, but other models, such as SPC/E do add a polarization term to the energy function. This adds computational complexity, so it is another aspect to consider when designing the system to be modeled.

### Types of simulated systems

With the advancement of multi-core cpu, and multi-threaded software, as well as the advent of GPU-computing, molecular dynamics simulation can address an ever growing set of possible systems. These can range from single-atom or single-molecule simulations, to simulations of small globular proteins in explicit water, to lipid-bilayers, all the way to the simulation of large, trans-membrane voltage-gated ion channels and ion-conduction.

In general, the system size and system type to be simulated using MD is limited only in terms of useful tractability in terms of CPU time.

### Simulation Software

Many software packages exist for calculating molecular dynamics, each of which has their own devoted adherents, although none are specifically more advantageous than another. Some of the most popular are:

Most can be set up to run on any single modern computer, or scaled up to a cluster. GROMACS in particular is under active development for better utilization of graphics processors, which can greatly speed up computations.

## References

- Atkins, P. W.; Friedman, R. S.,
*Molecular quantum mechanics*. Oxford university press: 2011. - Sherrill, C. D.,
*An introduction to Hartree-Fock molecular orbital theory*. School of Chemistry and Biochemistry, Georgia Institute of Technology 2000. - Ramachandran, K. I.; Deepa, G.; Namboori, K.,
*Computational chemistry and molecular modeling: principles and applications*. Springer Science & Business Media: 2008. - Young, D. C.,
*Computational Chemistry: A Practical Guide for Applying Techniques to Real-World Problems*. John Wiley & Sons: New York, 2004. - Alder, B. J. and Wainwright, T. E. J. Chem. Phys.31, 459 (1959)
- McCammon, J. A., Gelin, B. R., and Karplus, M. Nature (Lond.)267, 585 (1977)
- Berendsen, H. J. C., van Gunsteren, W. F. Practical algorithms for dynamics simulations.
- Practical Considerations for Building GROMOS-Compatible Small-Molecule Topologies, Justin A. Lemkul , William J. Allen , and David R. Bevan * J. Chem. Inf. Model., 2010, 50 (12), pp 2221–2235
- Fatemeh Khalili-Araghi, Emad Tajkhorshid, Benoit Roux, and Klaus Schulten.
*Biophysical Journal*, 102:258-267, 2012.

### Contributors

- Me
- Qinhong Yu