EPMR
A PROGRAM FOR
CRYSTALLOGRAPHIC MOLECULAR
REPLACEMENT
BY EVOLUTIONARY SEARCH
Version 2.4
Authors: Charles R. Kissinger &
Daniel K. Gehlhaar
Agouron Pharmaceuticals, Inc.
Copyright (C) 1997, 1998, 1999
Agouron Pharmaceuticals, Inc.,
11099 North Torrey Pines Road, La
Jolla, CA 92037-1020
I.
INTRODUCTION
EPMR is a
program that finds crystallographic molecular replacement solutions using an
evolutionary search algorithm. The program directly optimizes the three
rotational and three positional parameters for the search model with respect to
the correlation coefficient between Fo and Fc. The program operates as follows:
- An
initial set of random solutions (random orientations and positions for the
search model) is generated.
- The
correlation coefficient is calculated for each orientation.
- A
fraction of the highest scoring orientations are retained and used to regenerate a complete set of new trial
orientations. This is done by applying
random alterations to the orientation angles and translations for each 'surviving' solution.
- The
correlation coefficients for the new population are calculated, the population
is again regenerated from the top scoring solutions, and this procedure is
repeated for a number of cycles.
- At the
end of this evolutionary optimization, a traditional conjugate gradient
minimization is performed. This is simply a local, rigid-body refinement of the
search molecule.
The use of
an evolutionary algorithm allows for efficient searching of the six-dimensional
space. It is several orders of magnitude faster than a brute-force, systematic
search.
The
program calculates structure factors very quickly using the method of Huber and
Schneider [J. Appl. Cryst. (1985) 18, 165-169]. A traditional structure factor
calculation is done only once - for the search model set at the origin of a P1
box - and subsequent structure factor calculations are done by transforming
reflection indices according to the rotations and translations applied to the
model, interpolating into the grid of P1 structure factors and summing over the
symmetry operators of the crystal. This is much faster than an FFT calculation.
The
program finds a single 'solution' for the search model in each run. In cases
where you are looking for multiple molecules in the asymmetric unit, you must
look for each solution sequentially. This can be done in two different ways.
You can have the program find the first solution and write it out. You then
enter it as partial structure in subsequent runs. Or you can direct the program
to find the first solution, keep it as partial structure, and keep going to
look for as many solutions as you like. In this way, multi-body searches can be
totally automatic.
Because of
the stochastic nature of the optimization process, you will not get the correct
solution on every run, even with a very good search model. The success rate is
USUALLY very high assuming an adequate model, but you probably want to try at
least 10 runs if you have a difficult problem. For search models that are poor
and at the limit of detectability, the search efficiency can be quite low. If
you have a molecular replacement problem that has not yielded a solution by any
other means, a reasonable last resort is to set up EPMR to do as many runs as
your patience and computing resources will allow. As long as the true solution
represents the global maximum in the correlation coefficient between Fo and Fc,
even if by the slimmest of margins, EPMR will eventually find it.
EPMR
includes the following features:
- an
option to enter fixed partial structure
- a
translation search mode
- an
option to bypass the evolutionary search and do only local (conjugate-gradient)
optimization of the model
-
independent refinement of different segments of the search model in the final conjugate-gradient
minimization procedure
II. USAGE
The
program requires three input files. The first is a file containing the cell constants
and space group number in the order:
a b c alpha beta gamma space_group_number
These are
free-format and can be divided between any number of lines. The program
currently contains information only for the 65 enantiomorphic space groups. For
rhombohedral space groups, the hexagonal (obverse), not rhombohedral setting is
used and your data needs to be indexed accordingly. An example of appropriate
contents for this file is:
40.76
18.49 22.33 90 90.61 90 4
The next
input file is a standard PDB format file containing the search model (orthogonal
Angstrom coordinates). Any lines in this input file that are not ATOM or HETATM
records (e.g. REMARK lines, etc.) are ignored. Currently only scattering
factors for C, N, O, S, P and FE are incorporated into the program. Atoms of
any other type are treated as carbons, except for hydrogens, which are ignored.
If the
input PDB file contains multiple segment identifiers (columns 73-76, as in
X-PLOR), each segment will be refined separately in the final conjugate-gradient
minimization procedure. This will not affect the evolutionary search itself,
where the entire contents of the file are treated as a single rigid body.
The final
input file contains the observed structure factors. The only requirement is
that the file have H K L Fo as the first four items on each line, separated by
spaces. "fin" format files work.
These
three files are all that are necessary to run the program. The command line:
epmr example.cel example.pdb example.hkl
will run
the program in its default mode. In this mode, the program will search for a
single molecule in the asymmetric unit. It will run the evolutionary search
procedure up to ten times, or until a solution with a correlation coefficient
of 0.45 is obtained. Data in the resolution range 15 to
4
Angstroms will be used in the search. The top solution found will be written to
a file called 'epmr.1.best.pdb'.
The
operation of the program can be controlled by a set of command line options. (If
the program is run without any command line arguments, the available options
will be listed.) The possible options are:
-m integer
The number of identical molecules in the
asymmetric unit for which to search.
The
default value is 1. The flag -m2 on the command line would cause the program to search for one solution, save it
as partial structure and continue searching for a second solution.
-h
real_number
High resolution limit for diffraction data
used in the search (in Angstroms).
The default value is 4.0 Angstroms. We do
not generally recommend that this value be set to less than 5.0.
-l
real_number
Low resolution limit for diffraction data
(Angstroms).
The default value is 15.0 Angstroms. The
efficiency of the search appears to be aided slightly by the inclusion of low
resolution data. If you have accurately measured low-resolution data, you might
even try a value of 25 or 30.
-p integer
The population size (number of trial
solutions generated in each cycle).
The default value is 300. Increasing this
value beyond 300 seldom causes a dramatic improvement in search efficiency, but
it never has a negative effect. If you have a fast machine you might want to
try -p600.
-g integer
The number of 'generations' (cycles of
optimization).
The default value is 50. Increasing this
value has much the same effect as increasing the population size, but the
degree of improvement is even less predictable. We do not generally recommend
that you change this value.
Setting this value to zero (-g0) will
bypass the evolutionary search and feed your input model directly to the
conjugate gradient optimizer. This allows you to use EPMR as a convenient
rigid-body refinement program.
-n integer
The number of runs.
The default value is 10. The program will
stop before the completion of the number of runs specified here if a solution
is obtained that has a correlation coefficient that exceeds a specified
threshold ( flag -t, below).
-t
real_number
The threshold value of the correlation
coefficient that indicates an acceptable solution (which will stop the run).
The default value is 0.45. If the -m flag
(discussed above) has a value greater than 1, the default threshold is changed
to 0.3. These values generally work well for a good search model. The program
will stop when a solution correlation coefficient exceeds this threshold, so if
you want the program to continue for
a specified number of runs no matter what, set this value to 1.0. It is best to
set this value relatively high to avoid having the program stop on an
incompletely converged solution. In searches for multiple molecules in the
asymmetric unit, the correlation coefficient for the top solution for the first molecule is scaled by 1.5 to
obtain the threshold for the next molecule (unless that threshold would be
lower than the originally specified threshold, in which case the original value
is kept).
-T
Translation only mode.
This will cause the program to search only
translation space, keeping the orientation of the search model unchanged. This
could be useful, for instance, when you have a search model that has been
pre-oriented by another program or through knowledge of non-crystallographic
symmetry. Note that the orientation WILL be optimized during the final
rigid-body optimization after the evolutionary search, so the orientation is
likely to change slightly and could change significantly during this step. This
is a feature and not a bug. If you think otherwise, let us know.
-b
real_number
The minimum 'bump' distance - the smallest
unpenalized distance between the center of mass of a solution and that of any
symmetry mates.
The default value is 0.0 (no packing
restrictions). This is applied to all trial solutions that are generated during
the course of the search. When a solution violates this minimum distance, the
correlation coefficient calculated for that solution is scaled down by the
ratio of the shortest observed
distance over the minimum allowed distance.
In the case of searches for multiple
molecules in the asymmetric unit, this also sets the minimum distance between a
solution and any previously found solutions. (This applies both to previous
solutions found during the run and to partial structure entered with the -s
option. See the description of the -s option below for instructions on entering
multiple fragments of partial structure for use with this option.)
This option imposes a simple penalty on
solutions that pack poorly without slowing the program significantly. This
option can be helpful in some searches. but decreases the efficiency of others,
particularly if a largevalue is used. This option appears to be less useful in
single molecule searches than in
searches for multiple molecules in the asymmetric unit, but definitely should
be tried if you are having trouble getting a solution that packs well. It is up
to you to decide what an appropriate minimum intermolecular distance should be.
(Remember, it is the distance between centers of mass.) It is best to be conservative - the program
will not search efficiently without some room to move solutions around through positions
that pack poorly.
In contrast to the packing restriction used
in previous versions of EPMR, the packing penalty is in effect during both the
evolutionary search and the final conjugate gradient minimization.
-w integer
The quantity of solutions you want written
out to PDB files.
The default value is 1. A zero here means
no coordinates will be written out. A '1' means only the top solution from all
of the runs will be written out. A '2' means the solution obtained from each
run will be written out. The name of the output PDB files is controlled by the
-o flag below.
-o name
The file name prefix for the output
coordinate files.
The default is 'epmr'. If you specify -w1
or -w2 above, the program will name the output files for each run as 'prefix'.'molecule_number'.'run_number'.pdb
(e.g., epmr.1.1.pdb). The molecule number depends on how many molecules you are
looking for in the asymmetric unit (option -m). The top solution from all of
the runs will be 'prefix'.'molecule_number'.best.pdb. If you run multiple jobs
in the same directory, you will have to use this flag to avoid writing over
other solutions.
-s name
The static partial structure flag.
If you have partial structure to input,
include this flag and follow it with the name of the PDB file containing the
correctly positioned partial structure. You can separate the partial structure
into as many files as you wish and use this flag multiple times on the command
line. It is only necessary to divide the partial structure up this way,
however, if you are using the -b flag (see above) and are inputing multiple
"pieces" of partial structure (e.g., multiple monomers). The minimum
packing distance calculation will treat the partial structure within each
separate file as a separate fragment.
-S
Do the initial structure factor calculation
by direct summation rather than FFT.
This will be MUCH slower, but if you have a
very large search molecule and a workstation with little available memory, the
FFT calculation may cause excessive swapping. Summation uses little memory.
This only affects the single, initial FFT calculation and has no effect on the
speed of the subsequent evolutionary
search.
-e integer
Set the seed value for the random number
generator to a specific value.
This option is not necessary for normal
operation of the program, but can be useful for testing purposes. If the seed
is not set here, or is set to a value of zero, the seed is obtained from the
system clock at run time.
The
program will print some information about the settings, do the initial FFT
structure factor calculation, and then start the "evolution". It will
report the best score for each generation. At the end of the evolutionary search,
a conjugate-gradient minimization is performed on the best scoring
solution
from the final generation. If there are multiple segment names in the input PDB
file, the search model will first be optimized as a single rigid body, and then
the separate segments will be optimized.
The final
orientation for the run will be reported on a line that begins with 'Soln',
followed by the run number, theta1, theta2, theta3 (Eulerian angles, defined as
in X-PLOR), x translation, y translation, z translation in orthogonal Angstroms
(If you intend to make use of these values outside of the program, they must be
applied to the search model after it has been centered at the
origin),
and then the correlation coefficient and R factor. You can use the UNIX command
'grep Soln log_file_name' to print the solution generated by each run. The
command 'grep Soln log_file_name | sort +11 -rnb +13 -nb' will list the
solutions sorted by correlation coefficient, then R-factor.
For a
single molecule in the asymmetric unit, expect a correlation coefficient of 0.5
or more for a correct solution with a good search model. For the first of two
in the asymmetric unit, expect a cc near 0.30. Even poly-Ala or C-alpha models
will have quite high correlation coefficients if they are
accurate
models. Depending on the quality of your model, however, correct solutions can
have correlations well below 0.3. The best indicator of a correct solution is
that it (or a symmetry-related solution) comes up repeatedly in multiple runs.
If you
publish results obtained using EPMR, please cite Charles R. Kissinger, Daniel
K. Gehlhaar & David B. Fogel, "Rapid automated molecular replacement by
evolutionary search", Acta Crystallographica, D55, 484-491 (1999).
NOTE:
We
frequently release new versions of the program. The newest version of EPMR can
be obtained by anonymous ftp from ftp.agouron.com (directory pub/epmr/).
We are
interested in any and all questions, comments, criticisms, suggestions, and
complaints about the program. Please direct these by e-mail to Chuck Kissinger
at crk@agouron.com.
III.
ACKNOWLEDGEMENTS
David
Fogel (Natural Selection, Inc.) contributed numerous ideas that were essential
to the successful development of this program. The structure factor calculation
routines in the program were adapted from those in the XtalView software, which
make use of Lynn Ten Eyck's FFT routines and were kindly made available by
Duncan McRee (Scripps Research Institute).
IV.
EXAMPLES
----------------------------------
EXAMPLE 1:
Search for one molecule in the asymmetric unit, perform up to 10 runs of the
evolutionary search procedure (or until a cc above 0.45) is obtained, write out the best solution obtained.
epmr
example1.cell example1.pdb example1.hkl > example1.log
__________________________________
EXAMPLE 2:
Search for two identical molecules in the asymmetric unit, write out all
solutions, do up to 10 runs (default) for each molecule.
epmr -m2
-w2 example2.cel example2.pdb example2.hkl > example2.log
-----------------------------------
EXAMPLE 3:
Search for one molecule in the asymmetric unit (default), use data from 8 to 3.5 Anstroms, do up to
twenty runs, write out only the top solution (default) to a file with the
prefix 'example3_solution', and use static partial structure.
epmr -l8.0
-h3.5 -s example3_partial.pdb -o example3_solution -n 20 \
example3.cel example3.pdb example3.hkl >
example3.log
___________________________________
EXAMPLE 4:
Search for two identical molecules in the asymmetric unit, write out all
solutions, up to 10 runs (default) for each molecule. Input two fragments of
partial structure. Disallow solutions that pack within 15.0 angstroms (center-to-center
distance) of any symmetry mates or partial structure that was input or
generated during the run.
epmr -m2
-w2 -s example4_partial1.pdb -s example4_partial2.pdb -b15.0 \
example4.cel example4.pdb example4.hkl >
example4.log
__________________________________
EXAMPLE 5:
Do a translational search for one molecule, use partial structure, write out
all solutions, do up to 10 runs (default) for each molecule.
epmr -T -s
example5_a.pdb \
example5.cel example5_b.pdb example5.hkl > example5.log