Universal Archive

A Universal Archive file for Crystallography

The IUCr has long been keenly aware of the need for a universal data transfer protocol. In 1978, a standard file structure for crystallographic data was commissioned, and a candidate protocol was created in 1981. The resulting file structure, known as SCFS (Standard Crystallographic File Structure), was intended to fulfil nine criteria:

1. It should be extendable to include all types of crystallographic data.
2. It should be compatible with current and future methods of data transmission.
3. It should be easy to program for both reading and writing.
4. It should not require re-read facilities, since these are not supported by all computers.
5. An output listing should be easy to read visually.
6. The only records that must be included are those required for data management (e.g. END). All other data are optional.
7. Provision should be made for the inclusion of derived data where required.
8. Comments may be included.
9. It is not primarily intended for manual editing (although it should be capable of being edited by hand).

The new file structure was described by Brown (1983), in an important paper that clearly explains the rationale behind the design.

This initiative was important not least for its specification of data types that should be recognised across all applications.

However, some of the design choices made in this file structure have not proven popular over the years. In particular, criterion 3, that the file should be easy for a program to read and write, was taken to imply that fixed-format data records should be used. This meant that a precise format had to be specified for each type of data record expected, which added to the overall complexity of the standard and made extensions harder to implement. Further, criteria 5 (human readability) and 9 (facility for manual editing) were compromised by the rigid layout thus imposed.

A new study was therefore commissioned by the IUCr in 1987, and this resulted in the definition of the Crystallographic Information File (CIF), which has now been adopted by the Union as its official universal exchange file. The CIF format satisfies the same criteria addressed by the SCFS proposals, but with some changes in emphasis, and with significant changes in implementation.

Most notably, data items are not presented in fixed format, but are simply character strings delimited by white space and identified by a preceding keyword (the data name), following the convention that we saw to be advantageous in our earlier discussion. For specific items of data that are not considered as members of an array, the data name immediately precedes the value. For data arrays (which may be multi-dimensional), the data names are first declared after a 'loop_' keyword, and the array elements are identified from their position in the list of values that follows.

Here are some examples of these ideas in practice. The following CIF contains information about the crystal unit cell.

 
     _cell_length_a                  8.709(2)
     _cell_length_b                  8.934(1)
     _cell_length_c                 12.011(2)
     _cell_angle_beta               96.23(1)
     _cell_angle_alpha              92.14(2)
     _cell_angle_gamma             113.06(1)
     _cell_volume                  851.515
     _cell_space_diagonal_longest   14.431

Note that, although the data are laid out in a fairly neat fashion, this is not required by the CIF rules. So long as each entry is separated from the others by any white space (space, tab or end-of-line characters), the integrity of the file is maintained. Nevertheless, neat formatting aids visual inspection of the file contents, and is a style to be encouraged. The order of entries is irrelevant. Thus, while one might expect the beta value to appear between the values of alpha and gamma, this is not essential.

The data names, or keys, are chosen to be fairly self-explanatory. However, their precise meaning is recorded in an external document, known as the CIF Core Dictionary (Hall, Allen and Brown, 1991), and the units and other conventions that apply to the particular data name are listed in the Dictionary.

The last entry, referring to the space diagonal of the cell, is not a recognised data name in the CIF Core Dictionary. Nevertheless, it is legitimate to include it in the file as a local data item, which only specific applications will recognise. Note, though, that the construction of the data name is intended to indicate its nature.
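To make the point about white space and ordering concrete, here is a minimal Python sketch of how a program might read simple name-value pairs of the kind shown above. It is an illustration under stated assumptions, not a complete CIF reader: the function name parse_simple_items is our own invention, values are kept as plain strings, and quoted strings, comments, multi-line text fields and loops are all ignored.

```python
def parse_simple_items(text):
    """Pair each data name (a token starting with '_') with the token after it.

    Hypothetical helper for illustration only: it ignores quoted strings,
    comments, text fields and loop_ structures.
    """
    tokens = text.split()  # any run of spaces, tabs or newlines separates tokens
    items = {}
    i = 0
    while i < len(tokens) - 1:
        if tokens[i].startswith("_"):
            items[tokens[i]] = tokens[i + 1]  # value kept as a string
            i += 2
        else:
            i += 1
    return items


# The same three items, once neatly laid out and once reordered and run together.
neat = """
     _cell_length_a                  8.709(2)
     _cell_length_b                  8.934(1)
     _cell_angle_beta               96.23(1)
"""
ragged = "_cell_angle_beta 96.23(1) _cell_length_b\n8.934(1)\t_cell_length_a 8.709(2)"

# Layout and order make no difference to what is read.
assert parse_simple_items(neat) == parse_simple_items(ragged)
```

Because the result is keyed by data name, an application can simply look up the names it recognises and skip any local items it does not.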

The second example shows a looped data structure. It is convenient to collect together repetitive and related data into a multi-dimensional array. The example illustrates how the position coordinates and equivalent isotropic U values of atoms in a cell might be represented in a CIF.

loop_
     _atom_site_label
     _atom_site_fract_x
     _atom_site_fract_y
     _atom_site_fract_z
     _atom_site_U_iso_or_equiv
 
#    Atom label     x          y          z         U_eq
 
          C(1)   .3559(3)   .9938(3)   .0315(2)    .051(1)
          C(2)   .5131(2)  1.0377(2)   .1012(2)    .042(1)
          C(11)  .2095(4)  1.0428(5)  -.1319(3)    .098(2)
          C(12)  .5039(4)  1.2471(5)  -.0916(3)    .113(2)
          O(1)   .2294(2)   .8641(2)   .0590(2)    .074(1)
          O(2)   .5957(2)   .9710(2)   .0337(1)    .055(1)
          N(1)   .3602(3)  1.0927(3)  -.0591(2)    .066(1)
          H2     .590(2)   1.158(2)    .110(2)       ?
          H(O2)  .717(5)   1.066(5)    .037(3)       ?
          H111   .212       .982      -.217          ?
          H112   .123       .985      -.082          ?
          H113   .208      1.144      -.158          ?
          H121   .600      1.217      -.099          ?
          H122   .491      1.284      -.178          ?
          H123   .547      1.343      -.019          ?

Note how, again, the layout has been chosen to facilitate readability, and how this is further enhanced by a comment line containing headings for the columns (see point 7 below). This is for the convenience of a human reader (permitting 'visual browsing' to be a legitimate 'application' that uses CIF), but does not hinder the machine-readability of the file. It is, however, very important to note that missing values (in this case, certain U values) are indicated by dummy placeholder strings. Since a CIF reading program will identify data within a loop by counting and referring back to the order of the data names listed, it is essential that such a mechanism be employed.
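The counting mechanism can be sketched in a few lines of Python. The function parse_loop below is a hypothetical helper written for this illustration, and it makes simplifying assumptions: the text holds exactly one loop, every value is a single unquoted token, and a '#' always begins a comment that runs to the end of the line. Values are again kept as strings.

```python
def parse_loop(text):
    """Return the rows of a single loop_ structure as a list of dictionaries.

    Hypothetical helper for illustration only; quoted strings and
    multi-line text fields are not handled.
    """
    tokens = []
    for line in text.splitlines():
        line = line.split("#", 1)[0]  # simplification: '#' starts a comment
        tokens.extend(line.split())
    if not tokens or tokens[0] != "loop_":
        raise ValueError("expected a loop_ keyword")

    # The data names are declared first; everything after them is a value.
    n = 1
    while n < len(tokens) and tokens[n].startswith("_"):
        n += 1
    names, values = tokens[1:n], tokens[n:]

    # Values are assigned to names purely by counting.
    width = len(names)
    if width == 0 or len(values) % width != 0:
        raise ValueError("number of values is not a multiple of the number of data names")
    return [dict(zip(names, values[i:i + width]))
            for i in range(0, len(values), width)]


sample = """
loop_
     _atom_site_label
     _atom_site_fract_x
     _atom_site_fract_y
     _atom_site_fract_z
     _atom_site_U_iso_or_equiv
#    Atom label     x          y          z         U_eq
          C(1)   .3559(3)   .9938(3)   .0315(2)    .051(1)
          H2     .590(2)   1.158(2)    .110(2)       ?
          H111   .212       .982      -.217          ?
"""

rows = parse_loop(sample)
assert rows[1]["_atom_site_U_iso_or_equiv"] == "?"  # placeholder keeps the row aligned
```

If the '?' placeholders were simply left out, the values could no longer be divided evenly among the five data names, and this sketch would raise an error. In less fortunate cases the count could still come out even, and values would silently be assigned to the wrong data names, which is why the placeholders matter.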