2
Theory
Protein and nucleic acid sequence determination are now routine in molecular biology laboratories. As a result, the rate of publication of sequence information has increased dramatically. Several organizations have compiled databases to centralize the published information and to facilitate its use in research (EMBL database, PIR/NBRF database, GenBank database). On the other hand, structural information from xray crystallographic or NMR results is obtained much more slowly. For instance, if the protein is not a simple mutation of one whose structure is already known, it can take from one to five years to perform a complete structural determination. Because of the disparate rates of sequence and structure determination, there are many proteins for which sequence information is known but the threedimensional structure is not.
It is advantageous to develop a method whereby the conformation of a newly characterized protein can be predicted from its amino acid sequence. Since a priori folding rules for proteins have not yet been developed, it is necessary to base any structural prediction on the conformations of reference proteins. Thus, it is assumed that the structures of the unknown and reference proteins are homologous and that a model of the unknown protein can be built from the reference proteins.
Early work dealing with building a protein by homology used only one known structure (Browne et al. 1969, Shotton and Watson 1970). Amino acid similarities between the known and unknown proteins were used to determine where one protein would resemble the other. Sequence alignment was done, and the coordinates of the reference protein were used to predict those of the unknown protein.
More structural approaches have been suggested by Greer and Blundell (Greer 1980, 1981, 1985; Blundell 1987, 1988). In their methods, more than one reference protein is used, and a greater emphasis is placed on the conformational similarities between them. Less emphasis is placed on sequence alignment alone as a basis for the model. By determining which portions of the molecules do not vary from one member of a protein family to another, there is greater confidence that extrapolation to a new member will be accurate.
While several reference structures are used in the traditional homology model building process, only one set of coordinates can be used in any one peptide segment. Consensus uses a method developed by Havel (Havel and Snow 1991; Havel 1993) to examine all of the reference proteins at once to impose structural constraints that can in turn be used to generate conformations consistent with the data.
Structural similarities among related proteins is not uniform over the entire molecule. Protein chain topology and side chain conformations are more highly conserved in the central core and are less conserved at the surface. The conformations of core regions resemble one another to a great degree. Surface loops can be of varying lengths, and they are more flexible and free to move in the solvent. A modeling method that reflects the physical reality of differential variability will therefore give a better picture of the range of conformations available to a model than one that yields a single static structure.
Consensus analyzes an alignment of amino acids for a set of related reference and proteins, and a single model protein for which the structure is to be predicted. Residues are designated as being equivalent by aligning them vertically and enclosing them in sequence boxes or summary boxes. Equivalent atoms are found in the reference proteins, and interatomic distances are calculated. The range of distances between corresponding atoms are a reflection of the variability seen in the family of related reference proteins. Thus they are tighter within and between conserved regions and looser in the loops.
Distance constraints of this kind are appropriate as input data to a distance geometry program such as DGII. In the distance geometry method (Havel 1991), a set of conformations consistent with a large number of constraints is generated (this is discussed more completely in the next section). These conformations will show a greater similarity to each other and the reference set of proteins in the conserved regions and a wider variation in the loops. In addition, the structures will not be composites of particular reference proteins, but will instead be weighted averages of all the reference proteins.
The constraints are calculated by analyzing the given sequence alignment. For each residue of the model protein that is designated as being in a structurally conserved region (SCR), a list of reference protein residues is compiled. For each of the reference proteins, equivalent atoms are found, and interatomic distances are calculated. For each set of distances between equivalent atoms, the minimum and maximum distance is found. Then the average of the extremes, and the range above and below this average is calculated. For each reference protein, k, the maximum distance between equivalent atoms i and j is and the minimum is . Then the average and range are defined as:
Eq. 1
and
Eq. 2
Then, the upper distance bound, u, and the lower distance bound, l, are defined as:
Eq. 3
and
Eq. 4
where is the precision, is the tolerance and n_{ij }is the number of proteins in which equivalent atoms are found. Note that if the values for the precision and tolerance parameters are increased, it will widen the range of possible values for a given distance constraint, and increase the range of possible conformations available to the predicted structure.
There are many structural features that all proteins have in common. These relate to the standard geometries of the peptide chain and the side chains of the twenty amino acids. Standard bond lengths, angles and preferred dihedral angles can be used as constraints to the distance geometry calculation. Because the predicted conformation of the model protein has more to do with a choice of values for the many dihedral angles, and not so much to do with the fairly rigid bond lengths and threecenter bond angles, exact constraints can be used for these. Thus the range is set to zero in the above equations. Also, all residues are considered, not just those within SCRs. Therefore loop conformations, although less constrained than SCRs, still have the standard constraints applied to them.
The constraints themselves are applied to a template protein molecule with the same amino acid sequence as the model protein that has standard covalent geometry and arbitrary dihedral angles for all rotatable bonds. This scratch molecule is built during the sequence analysis phase of the Consensus run and is left undisplayed.
The overall, or global shape of the model protein can be imposed by setting distance constraints for alpha carbons, both within and between structurally conserved regions. It is desirable to maintain the flexibility of setting the precision and tolerance values for these independently to make it possible, for example, to force less variation in individual SCRs while allowing the relative orientations to be more flexible, or vice versa.
Side chain orientations can be imposed by setting constraints between atoms of the side chains, and backbone atoms of the same amino acid residue and immediately adjacent residues. The constraints are termed local because there is no effect on the side chain conformation by atoms which are distant in sequence, even if they are in close proximity in a threedimensional sense. Inconsistencies between global and local constraints must be taken care of in the later stages of the distance geometry calculation.
Table 1 . Most Distal Corresponding
Atoms for Amino Acid Side Chains

Because of the heterogeneous atomic composition of the various amino acids, it is sometimes not obvious which atoms correspond to one another. In general, atoms of the same element and position along the side chain can be thought of as being equivalent.
Table 1 depicts the complete list of atoms designated as being equivalent. The greek letters in the table represent the last atomic position (most distal to the backbone), taken to be common between the two amino acids of the row and column. Thus, the C_{} and C_{} atoms of the side chains are considered equivalent between histidine and arginine (symbols H and R).
Unlike the global constraints, not all the reference proteins are necessarily used in constructing the local constraints. The sequence alignment is consulted, and if one or more of the reference proteins has an amino acid at the same position along the chain that is identical to the model protein, then it or they alone are used to figure the interatomic distance or range of distances. Atoms from nonequivalent amino acids are only used if none of the aligned residues are identical to that of the model protein.
The construction of molecular models consistent with the geometric information derived from homology data may be likened to assembling a 3D jigsaw puzzle: One knows which pairs of pieces fit together, and yet it is difficult to reconstruct the whole. Unlike jigsaw puzzles, however, the solution structures of molecules are seldom unique, both because the available data is seldom sufficient to fully characterize the structure, and because most molecules of interest are flexible in solution, especially in the loops. Unfortunately, systematic search techniques are too time consuming to be used with molecules containing more than about ten rotatable bonds, and with large molecules the number of possible conformations is usually too large to enumerate completely anyway.
Under these circumstances, the best one can do is to compute a representative sample, or ensemble, from the set of all conformations consistent with the given geometric information. Provided this ensemble is sufficiently large and random, any geometric properties that are uniformly present in all its members are necessary consequences of the geometric information used as input. Thus, these calculations enable us to better understand the geometric implications of the homology data. They also provide a means of catching errors in the data, since most serious errors cause this information to become geometrically inconsistent, so that no conformation can agree with it. Because most of the geometric information that is obtained from a family of reference proteins consists of bounds on the interatomic distances, these calculations are known as distance geometry calculations.
Although conformational ensembles can be computed in a single step by means of high temperature simulated annealing starting from random coordinates or conformations, it is computationally more efficient to break the overall calculation down into several steps. The DGII program package is based on the EMBED algorithm, which consists of the following three steps:
1. Given the incomplete and imprecise set of distance bounds that are experimentally available, the program computes a complete and more precise set of bounds via a process known as bound smoothing.
2. It makes a random guess at the values of the distances from within the bounds obtained by bound smoothing; and finds atomic coordinates such that the distances calculated from these coordinates are a "bestfit" to this guess. This step is called embedding because this bestfit is computed by a projection method.
3. It minimizes the deviations of the coordinates from the distance bounds as well as the chirality constraints (see below) by any of the many available methods; this step is called optimization.
By repeating steps 2 and 3 many times with different random number generator seeds in step 2, the desired conformational ensemble is obtained. Subsequently, selected members of the ensemble may be refined versus potential energy functions (using Discover).
The fundamental premise on which distance geometry is based, is that the set of all possible conformations can be aptly described by means of suitable distance and chirality constraints.^{1} The first of these are simply lower and upper bounds on the interatomic distances; the second include the handedness of the asymmetric centers in the molecule (including prochiral centers) together with the planarity of any sp^{2} hybridized centers. This description, called a distance geometry description, is essentially the computer analogue of a CPK model. A large body of mathematical results exists that can be used to better understand these descriptions, much of which may be found in the recent book by Crippen and Havel (1988). The purpose of this section is to familiarize you with the structural information obtained from homology data in these terms.
The local covalent structure of a molecule is easily defined by distance constraints. Unless one is dealing with a highly strained ring system, it is sufficient to use only exact distance constraints in which the lower and upper bounds are equal. For example, the distances among covalently bonded pairs of atoms, are determined with high precision by the order of the bond and the types of atoms it connects. Similarly, the bond angles can usually be determined from the covalent structure, while for fixed bond lengths there is a onetoone relation between the bond angle and the geminal distance, so that these distances can also be determined. This relation between the geminal distance and the bond angle is given explicitly by the law of cosines:
Eq. 5
where and are called the lower and upper triangle inequality limits, respectively.
Similarly, when the bonded and geminal distances are held fixed, there is a onetoone relation between the absolute value of the torsion angle and the vicinal distance, given by:
Eq. 6
where and are the cis and trans limits on the 1,4 distance, here known as the tetrangle inequality limits. In particular, by setting all the vicinal distances across a double bond to the appropriate cis or trans limits, you can force the double bond to have any desired stereochemistry. You can even use inexact distance constraints to allow a specified amount of wobble in the torsion angle about the double bond.
The chirality of an ordered quadruple of points numbered is given in terms of their Cartesian coordinates by the sign of the following determinant:
Eq. 7
It is wellknown that the absolute value of this determinant is six times the volume of the tetrahedron whose vertices are the given points, and hence the value of the determinant itself is known as the oriented volume. Note also that the sign of oriented volume changes whenever the , or coordinates are multiplied by , (that is, whenever the tetrahedron is reflected in a plane). This shows that the chirality of the quadruple of atoms bonded to an asymmetric carbon atom is capable of distinguishing between a molecule and its mirror image. In the special case of a planar quadruple of atoms, the volume and hence the chirality is zero.
More generally, if we know the chirality of every quadruple of atoms in a molecule that is constant throughout all of the molecule's conformations, then (together with the above mentioned distance constraints) we know everything there is to know about its stereochemistry. Thus, if you are computing an unknown structure and you constrain the oriented volumes in it to have the correct signs, you can be sure that it also has the correct stereochemistry. These sign conditions on the oriented volume are called chirality constraints. Although the chirality of a quadruple changes sign if any pair of atoms in it are swapped, for most purposes we can just number the atoms in any order and use this same order for the atoms in each quadruple.
It should be noted that since the atoms bonded to a prochiral carbon are distinguishable by virtue of their relationship with the rest of the molecule, we usually assume that their chiralities are also known. In fact, since all atoms are distinguishable by virtue of the chosen numbering, the chiralities of all fixed quadruples can be constrained without harm. It should also be noted that a zero chirality constraint does not actually imply that the quadruple is always perfectly planar (which is never the case anyway); it only implies that it has "locally planar symmetry". This means that the minimum and maximum values of the oriented volume over all possible conformations of the molecule are equal in magnitude and opposite in sign.
As shown above, the absolute value of a torsion angle can be constrained to any range of values by means of suitable 1,4 distance constraints, including its cis and trans limits. Moreover, since the chirality of a chain of four bonded atoms A_{1}A_{2}A_{3}A_{4} is equal to the sign of the torsion angle about the 2,3 bond, by a suitable combination of distance and chirality constraints we can obtain any range of values with a given sign. This is sufficient to specify the rotameric state (gauche^{+}, gauche^{} or anti) about single bonds.
Sometimes however, the range of values for a torsion angle in all the reference proteins are wide enough so that an amino acid side chain is not restricted to a single rotamer state, and thus the above conditions are insufficient. Fortunately, there is more than one way to define a torsion angle! In the case of a carboncarbon single bond such as X_{1}Y_{1}Z_{1}C_{1}C_{2}X_{2}Y_{2}Z_{2}, for example, one could define the torsion angle about the C_{1}C_{2} bond with respect to any of the nine pairs of 1,2 substituents, e.g., X_{1}C_{1}C_{2}X_{2}, X_{1}C_{1}C_{2}Y_{2}, X_{1}C_{1}C_{2}Z_{2}, etc. It has been found that if one properly constrains the values of all of the vicinal distances along with all of the corresponding chiralities, one can obtain almost any desired range in the value of the corresponding torsion angle.
It should be noted that in enforcing chirality constraints in general and torsion constraints in particular, it is often more effective to constrain the oriented volume to the exact range of values consistent with the given distance and angle ranges, and not just the sign specified by the chirality constraints. These ranges are called the volume limits.
Since two atoms cannot be in nearly the same place at the same time, in order to obtain reasonable conformations it is also necessary to impose lower bound constraints, on the distances between all pairs of atoms separated by more than three bonds. For the sake of simplicity, these lower bounds are generally set to the sum of suitable hard sphere radii, i.e., .
Most of the preceding distance constraints, of course, merely ensure that the structures which satisfy them are not grossly unreasonable on energetic grounds; thus, they define what constitutes a valid conformation. In order to get the correct conformation, it is necessary to know something more about the conformationdependent distances. (The vicinal distances constitute a special case, which was dealt with above.) Unlike steric and covalent distance constraints, there is no general theoretical method by which these distance constraints can be derived from the covalent structure, and hence it is up to the program, to determine them by examination of the conformations of the reference proteins. Analysis of the portions of conformations within the conserved regions yield distances that are adjacent in space, but distant in sequence. The overall shape of the model protein being built is determined largely by these global distance constraints.
This step of the EMBED algorithm predicts the range of values that the interatomic distances can assume in any conformation of the molecule which satisfies the given distance bounds. Hence these ranges, known as the Euclidean limits, are necessarily at least as tight as the given bounds. The calculation of the Euclidean limits is a difficult and unsolved problem, but it is possible to calculate loose approximations. The term "loose" means that these approximate limits are wider than the actual Euclidean limits, so that no conformations are excluded by them that were not excluded by the given bounds. In addition to its role in the EMBED algorithm, bound smoothing can sometimes detect inconsistencies in the bounds, which provides a valuable tool for identifying erroneous sequence alignment data, such as a spatial gap from a sequence deletion that is too wide to span with the remainder of the peptide chain. The DGII package includes two programs for bound smoothing, one of which does triangle bound smoothing (which is fast), and the other of which does tetrangle bound smoothing (which is much slower but produces a better approximation). Most problems addressed by Consensus will probably be too large for tetrangle bound smoothing.
Triangle inequality bound smoothing is based upon the wellknown triangle inequality among the distances:
Eq. 8
for all triples of atoms . It follows that if and , then
Eq. 9
so if , then , and hence can be replaced by the upper limit on without eliminating any conformations that satisfy the constraints and .
Similarly, since the distances also obey the inverse triangle inequality
Eq. 10
it follows that if then we can replace by its lower limit . Note these two substitutions are valid even if no lower and/or upper bound is available on the distance , since then we effectively have and/or . Moreover, if such a lower limit exceeds the upper bound (or an upper limit) on the same distance, then we have essentially proved by contradiction that there exists no conformation which satisfies the bounds, i.e., that the bounds are geometrically inconsistent (as defined above). We call this a triangle inequality violation.
It has been shown that the exhaustive application of these two substitutions results in a unique set of complete bounds on all the distances, called the triangle inequality limits. These can be efficiently computed by converting the problem to a shortest paths computation in a directed digraph (network), and then applying Floyd's shortestpaths algorithm (Dress and Havel 1988). This algorithm is , meaning that the amount of computer time it requires can increase as at most the cube of the number of atoms . In the event that a triangle inequality violation is found, the shortest paths make it possible to isolate a generally small set of bounds wherein the erroneous constraint must be found.
Unfortunately, the triangle inequality limits are generally a rather poor approximation to the actual Euclidean limits, and (as a consequence) triangle inequality bound smoothing is not a very effective approach to locating errors in the bounds. A somewhat more effective, although also much more timeconsuming, approach looks at four atoms at a time, rather than three. In this case, the algebraic form of the relations among the distances is far more complicated, so that the tetrangle inequality limits are best described pictorially as in Figure 1.
Figure 1
. Tetrangle Inequality Limits

In Figure 1, the solid lines indicate distances at their lower bounds, while the dashed lines indicate those at their upper bounds. The light dotted lines indicate upper tetrangle inequality limits (in the first row) or lower tetrangle inequality limits (in the second).
By exhaustively substituting the bounds by these limits (as well as the triangle inequality limits), one eventually arrives at a complete set of limits, which experience has shown to be unique (independent of the order in which the substitutions are made). Particularly among groups of atoms that are confined to be very nearly planar by the distance bounds, these limits can be a substantial improvement over the triangle inequality limits. Thus tetrangle inequality bound smoothing provides a more powerful test for the consistency of the distance bounds, particularly among planar groups of atoms or if some large lower bounds are present in the input. Furthermore, if the bounds are found to be inconsistent, it is again possible to execute a "traceback" that finds a subset of the given bounds within which the erroneous bound(s) must lie. A complete account may be found in Easthope and Havel (1989).
Unlike triangle inequality bound smoothing, however, no closed algorithm is known for tetrangle bound smoothing, and in addition each "pass" over all quadruples of atoms requires time proportional to the fourth power of their number. Hence exhaustive tetrangle inequality bound smoothing is too time consuming to be used with much more than 100 atoms. By restricting the quadruples scanned in each pass to various subsets of the total, it is still possible to significantly improve the limits on the distances, particularly among those that are separated by a relatively small number of covalent bonds. For example, with proteins the sequential strategy restricts the quadruples to those contained in adjacent pairs of amino acids, thus reducing the total time required for each run through the quadruples to a linear function of the number of atoms.
This step consists of three parts:
1. Choosing a matrix of random "trial distances" which obey the triangle inequality from between their limits by a procedure known as metrization.
2. Converting this random distance matrix into a set of random atomic coordinates by a procedure known as embedding.
3. Improving a weighted least squares fit between the trial distances and the coordinates by a procedure known as majorization.
For reasons to be explained in the next section, it may sometimes be desirable to compute 4D coordinates in this step. Since all the mathematics on which these procedures are based carries over without change to higher dimensions, however, it is sufficient to restrict ourselves to three dimensions in what follows.
With sufficiently complete and precise sets of distance constraints, reasonable coordinates can be obtained simply by choosing the distances independently from between their lower and upper limits with a uniform distribution. In order to obtain the widest possible sampling of conformation space, however, a procedure known as metrization is advisable (Havel 1990). The basic idea behind it is simple enough: Since Euclidean distances obey the triangle inequality, a better initial approximation should be obtained by taking this into account. Moreover, since the triangle inequality induces correlations among the distances, the structures that best fit these distances should exhibit largescale, correlated differences among their parts and hence should be more diverse.
To see how metrization itself works, it is necessary to know several things:
 First, as shown by Dress and Havel (1988), the triangle inequality limits are equal to the extremes that the distances can assume in any metric space^{2} consistent with the bounds.
 Second, since the set of all metric spaces is convex, all values of the distances between their triangle inequality limits actually occur in some metric space consistent with the limits. It follows that if we arbitrarily set any distance to any value between its limits, we may then set the corresponding upper and lower bounds to that value and recalculate the new triangle inequality limits. Repeating this procedure until all the distances have been set thus gives us a metric space whose distances lie within their triangle inequality limits (and hence the given bounds as well).
One problem with this procedure is that if the triangle inequality limits among atoms must be recalculated from scratch each time one of the ~ distances is set, the amount of time required overall increases as . Fortunately, if the distances are chosen in a certain order it is possible to improve this to an algorithm. This order is dictated by the use of a datastructure known as a shortestpaths tree, which enables the limits from one atom to all the others to be efficiently recomputed after each change. Thus, the distances from one atom to all the other atoms must be set first, followed by the distances from the next atom to all the remaining atoms, and so on. With this order, the algorithm is known as prospective metrization. It is, however, also possible to compute the shortest paths tree from each new atom to all the preceding atoms, fix these distances, and proceed to the next atom, and so on; in this case the algorithm is called retrospective metrization. It is possible to restrict the optmization to a given number of atoms, which is referred to as partial metrization.
Because the previously chosen distances affect subsequent choices, in order to obtain good sampling it is necessary to use a different random numbering of the atoms for each new distance matrix chosen. In fact, in DGII the order of the atoms is permuted yet again each time a new shortest paths tree is computed, just to make absolutely sure no bias arises from the use of the same order in every tree. This is all done automatically by the program; the only sign of it you should ever see is a report of the random number seed used to generate these permutations (along with the random distances themselves). This makes it possible to repeat a calculation exactly, should the need arise.
 Finally, under some circumstances it may be necessary to control the overall density of the atoms; this can be accomplished by skewing the distances towards higher or lower values each time one is chosen. This is done by using an exponential probability density, , where the exponent is chosen so that the expected value of the distance is given by:
Eq. 11
and are the current limits on . Thus, as the dilation factor the expected value , while as the expected value approaches the geometric mean of the lower and upper limits, . If , then and we obtain a simple uniform density. (Proofs of these statements are left to the reader.) By averaging multiple independent selections, the distribution can also be biased towards the mean.
The basic idea behind embedding is to compute coordinates whose distances are a close fit to the trial distances. Then, since the trial distances obey the given distance constraints, the distances calculated from these computed coordinates should come close to obeying the distance constraints. Of course, if one had to rely on optimization to compute these coordinates, one might as well just fit the distance constraints directly, but it turns out that these matrix bestfits can often be computed exactly by eigenvalue methods (Crippen and Havel, 1978).
First, suppose we have been given a matrix of the squared distances among points numbered in 3D space:
Eq. 12
Consider the eigenvalue decomposition of the metric matrix:
Eq. 13
where are the eigenvectors and is a diagonal matrix of the eigenvalues in decreasing order. Since can be written as the product of the rank three matrix with its transpose, it itself is positive semidefinite of rank (at most) three, so that and . If we now let be the diagonal matrix of square roots and , then the 3D Cartesian coordinates given by for and have the same metric matrix as the points , i.e.:
Eq. 14
.
As a consequence for , while for we have:
Eq. 15
Thus, the distances obtained from both sets of coordinates are also the same, so the coordinates differ only by translation and rotation. The orthogonality of the eigenvectors, however, implies that the new coordinates are principal axis coordinates.
Of course, the principal axis coordinates can be found more efficiently by diagonalizing the inertial tensor. The advantage of the above method is that it can be applied even if we do not have any coordinates for the points, but only the distances among them. This provides a means of converting those distances into coordinates directly, which (unlike triangulation and related techniques) is very insensitive to errors in the distances. Although the coordinates obtained from arbitrary distances are usually dimensional, it can be shown (Crippen and Havel 1988) that if only the three largest eigenvalues are used, the 3D coordinates obtained in this way constitute an optimum fit to the distances in the sense that the RMS difference between the metric matrix calculated from the random distances and the metric matrix calculated from the resultant coordinates is minimum.
Since the distances to the zeroth atom affect every element of the metric matrix, the zeroth atom contributes more to this fit than the other atoms, and so once again further steps are necessary in order to obtain good sampling. This could be done by randomly renumbering the atoms as was done in metrization, but a more elegant and efficient method of eliminating this preference for any one atom is to invent a new atom at the centroid to serve as the zeroth atom. This is possible because the distances to the centroid can be calculated directly from the distances among the atoms by means of the formula:
Eq. 16
As can be seen, all of the atoms except the appear symmetrically in this formula, and hence no one atom is preferred as a whole. In addition, the use of the centroid as the origin further improves the stability of the algorithm.
While the most efficient way of calculating the largest eigenvalue of a matrix to machine precision is inverse iteration, such precision is not worthwhile in the present context. For this reason a variant on the classical power method is employed, which uses Tchebychev polynomials instead of simple matrix products to accelerate the convergence. Starting from a random vector , the product of these matrix polynomials with may be computed recursively as follows:
Eq. 17
As , the polynomial product converges to an eigenvector whose eigenvalue is maximum in magnitude. The next largest eigenvalue is then found by applying the same procedure to the deflated matrix:
Eq. 18
and so on. The overall time required by this procedure increases as only , on the average.
If the available distance bounds are very loose, the trial distances are often far from Euclidean, in which case one or more of the three eigenvalues of maximum magnitude may be negative. When this occurs, that eigenvalue is skipped, the metric matrix deflated with respect to it, and the next eigenvalue is computed. This continues until three nonnegative eigenvalues have been found.
The fit obtained by embedding, though much better than completely random, is not very intuitive, and probably does not yield the best possible starting coordinates for optimization. Majorization is a powerful method of minimizing a weighted leastsquares fit of coordinates to estimated distances :
Eq. 19
where it is usually assumed that the weights have been normalized so that . This is done by means of successive Guttman transformations:
Eq. 20
Here, is the generalized or MoorePenrose inverse of the singular matrix whose offdiagonal elements are the negative weights and whose diagonal elements are the row sums . The matrix function is defined similarly for the scaled weights:
Eq. 21
Three weighting schemes have been used to obtain more chemically reasonable starting coordinates:
1. Constant weights
2. Inverse square weights
3. Inverse squared range plus squared average weights
The first has the advantage of speed and simplicity, the second has the advantage of enforcing the relative deviations, while the third enforces tight distance constraints more strongly.
With constant weights, the generalized inverse is given analytically by:
Eq. 22
Otherwise, this same formula gives an excellent starting point for an iterative calculation of the generalized inverse via:
Eq. 23
Unless the same weights are going to be used for many majorizations, however, it is more efficient to find a leastsquares solution to the linear system of equations:
Eq. 24
to obtain the new coordinates for each Guttman transform. Because is positive semidefinite, this is efficiently accomplished by a simple linear conjugate gradient algorithm.
In order to reduce the violations of the constraints by the embedded coordinates to an acceptable level, further optimization is necessary. This involves minimizing a function which measures the total violation of the constraints by the coordinates, which is called an error function. These minimizations can be done using any gradient method, e.g., conjugate gradients, but with large problems (as in most Consensus applications) the inevitable presence of many local minima necessitates the use of global optimization techniques such as simulated annealing. The coordinates obtained by embedding are much better than random. Therefore, high temperatures, prolonged equilibration and multiple stages involving different error functions are not necessary to obtain good convergence.
The error function used by DGII is given by:
Eq. 25
where
Eq. 26
Eq. 27
Eq. 28
Here, , and enforce the hard sphere lower bounds, the remaining lower and upper bounds, and the chirality constraints, respectively. The parameters are the weights of the various terms relative to the upper distance bounds, while (typically 1.0) prevents the upper bound errors from getting too large when small upper bounds are present. The function is the oriented volume, as defined above, and are the lower and upper limits on its value derived from the distance, chirality and torsion angle constraints.
The facts that this error function is relatively smooth and that its gradient can be evaluated very fast, are both important advantages in global optimization. Although a number of other features (e.g., inverse sixth power averaging) could be added to the function to enable it to represent the homology data more precisely, they would detract from these advantages and are therefore best included in the subsequent local refinements. In small problems (< 100 atoms), a full matrix version of the above function can be used to evaluate the error with respect to the full set of distance limits obtained by bound smoothing; in large problems, the sparse matrix version which takes account of only those constraints given explicitly as input (possibly tightened by smoothing) is preferred, especially since it uses a fast check for hard sphere overlaps. Other variations on this function exist for 3D and 4D, as described below.
The simulated annealing procedure employed by DGII differs from most others in several important respects. Most previous applications of this methodology have either attempted to eliminate the residual violations in a structure that has already been minimized (for example, using conjugate gradients), or else have attempted the de novo computation of a structure from completely random coordinates. These calculations typically use a rigidly controlled temperature of 1000K or more. In contrast, the version utilized by DGII merely imposes an upper bound on the temperature, which is gradually reduced to zero according to the schedule:
Eq. 29
Here, is the initial upper bound on the temperature (typically 200K), and is the extent of the annealing on the th step given by , where is the total number of steps taken by the annealing. Thus the entire annealing schedule is specified by the single parameter.^{3}
To start the annealing, the error function is scaled to an equivalent energy (in kcal) that is sufficient to cause the system to heat up naturally to the initial upper bound . The rate of heating is limited by a second parameter, , which is usually a factor of two at each step.
The only other really important parameter involved in the annealing is the step size (in seconds). Generally speaking, it is desirable to make this as large as possible without risking instability. Since the temperature is controlled by scaling the velocities as necessary to keep the temperature below its bound, instability is manifested by a very small value for this scale factor over many iterations. Once again, the actual step size can only be determined by previewing, but unlike the initial energy the same step size appears to work fairly well for all problems (provided that the same atomic masses are used; see below).
One minor nuisance is that only one mirror image of the embedded coordinates generally converges well, but there appears to be no good way to automatically determine which one before performing the annealing. If one fails, the other mirror image can easily be tried, though this increases the computer time required for annealing by an average of 50%. Alternatively, if one knows the largescale handedness of some structural elements (for example, helices), it is often possible to identify the correct mirror image by looking at the embedded coordinates prior to optimization.
In addition to the above protocols, two specialized methods of avoiding local minima during the annealing are available with DGII:
1. The computation of the distances in 4D.
2. The use of very heavy atoms in calculating the accelerations.
In the former, one embeds and majorizes 4D coordinates, and then applies the 4D version of the error function to them. This error function includes an additional dimensionality error term to drive the fourth coordinates, , towards zero, and is given by:
Eq. 30
where is a dimensionality weight. The initial values of the fourth coordinates should be scaled so that drastic fluctuations in the dimensionality error are not observed. Though it may seem strange, this artifice fixes many problems with the local chirality of the molecule at lower temperatures and with shorter annealing times.
The use of very heavy atoms is also nonphysical, but can be justified as follows: Since the contribution of the negative gradient to the velocities at any given step size is inversely proportional to the masses, when large masses are used the velocities at each time step represent a running average of the negative gradient over a much larger region of conformation space. As such, the annealing tends to follow the largescale contours of the energy surface, rather than its local fluctuations. In addition, the use of heavy atoms improves the stability of the annealing, so that much larger step sizes can be used. With masses of 1000 Daltons for all the atoms, in fact, a step size of 0.2 psec or more can be used. This in turn can give maximum atomic displacements of 0.5Å or more on each step at 200K. Finally, by making the atoms in some parts of the molecule much heavier than others it is possible to hold those parts very nearly rigid during the annealing, should this be desired.
The coordinates obtained from the above annealing procedure are generally close to a minimum (hopefully near the global one), and in the event that a fourth dimension is used the coordinates should also be quite close to being 3D. To make them perfectly 3D while also making the residual violations of the constraints as small as possible, the coordinates are usually projected down into 3D (if necessary) and the error function minimized to a precise minimum via a conjugate gradient procedure. The number of parameters to play with here is quite large, but the defaults generally work well enough for most problems so that a detailed discussion is not needed. The recommended number of iterations is 250, with early termination only if the RMS gradient norm falls below 0.001 or so.
Finally, it is very important to look carefully at the convergence analysis that is printed at the end of each calculation. This analysis lists those violations of the hard sphere, distance and chirality constraints that contribute the most to the final error. If the same constraints are consistently violated in all structures of the ensemble, the evidence for an error in one or more of them is strong. Even if this is not the case, when the maximum violations of the constraints consistently exceed 0.5Å, a review of the data is called for. With approximately 1000 atoms and simulated constraints that are known to be geometrically consistent, maximum violations below 0.1Å are not unusual.
^{1}
The term restraints is used almost interchangeably with constraints; strictly speaking, constraints are geometric conditions, while restraints are the pseudopotentials used during dynamics and minimization to enforce consistency with the constraints.
^{2}
A metric space is essentially a distance matrix whose elements obey the triangle inequality.
^{3}
Some other schedules are available, but they appear to be less effective and hence are not discussed here.
Last updated May 18, 2001.
Copyright © 2001, Accelrys, Inc. All rights
reserved.