# Complete Cross Validation in Crystallography

## History of Cross Validation: R_{free}

In the 1990s it became clear that the R1-value is not a good quality measure for crystal structures, in particular for macromolecular structures, which suffer from a low data-to-parameter ratio [1, and references therein].

Axel Brünger introduced the concept of cross-validation into protein
crystallography, and nowadays the use and importance of R_{free} are
generally accepted for macromolecular structures [2].

As A. Brünger already pointed out, R_{free} is actually a
compromise between complete cross-validation and what was feasible given
the computer power at that time.

## Complete Cross Validation: R_{complete}

Refinement is subject to bias: errors in the data result in overfitting,
*i.e.* an artificially low R1-value. It has been shown that
R_{complete} shows bias in the opposite direction [3], so that the true
R-value equals the mean of R1 and R_{complete}. The gap between R1
and R_{complete} is therefore a measure of the amount of overfitting of
the model.
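
The relation between R1, R_{complete} and the true R-value stated above can be written down in a few lines. This is only a minimal sketch: the function name, the dictionary keys and the example numbers are made up for illustration and do not come from any program mentioned on this page.

```python
def overfitting_summary(r1: float, r_complete: float) -> dict:
    """Summarise R1 and R_complete as described in the text.

    Both values are assumed to be given as fractions (0.05 means 5 %).
    """
    return {
        # Gap between R_complete and R1: a measure of overfitting.
        "gap": r_complete - r1,
        # The true R-value is taken as the mean of R1 and R_complete.
        "estimated_true_r": (r1 + r_complete) / 2.0,
    }


# Illustrative numbers only, not taken from a real refinement.
summary = overfitting_summary(r1=0.040, r_complete=0.050)
print(summary)
```

A small gap suggests little overfitting; a large gap suggests that the model contains more parameters than the data support.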

The speed of modern computers allows one to carry out proper, complete
cross validation and therefore to obtain a fully valid, unbiased R-value
for crystal structures from *all* reflections (hence the term '*complete*
cross validation').
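
The idea of using *all* reflections can be sketched as follows. This is an illustration under several assumptions, not the implementation used by any program named on this page: every reflection is assigned to exactly one free set, the model is refined once per free set against the remaining reflections (always starting from the same model), and the residuals of the left-out reflections are pooled into a single R-value using the usual R1 formula. The names `assign_free_sets`, `r_complete`, `refine_and_predict` and `toy_refine`, as well as the `(h, k, l, f_obs)` layout, are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Assumed layout of a reflection: (h, k, l, f_obs).
Reflection = Tuple[int, int, int, float]


def assign_free_sets(n_reflections: int, n_sets: int, seed: int = 0) -> List[int]:
    """Assign every reflection to exactly one of n_sets free sets."""
    labels = [i % n_sets for i in range(n_reflections)]
    random.Random(seed).shuffle(labels)
    return labels


def r_complete(
    reflections: Sequence[Reflection],
    labels: Sequence[int],
    n_sets: int,
    refine_and_predict: Callable[[List[Reflection], List[Reflection]], List[float]],
) -> float:
    """Pool the residuals of all left-out reflections into one R-value.

    refine_and_predict(work, free) is a hypothetical callback: it should
    refine the model against `work` only and return |F_calc| for each
    reflection in `free`.
    """
    numerator = 0.0
    denominator = 0.0
    for s in range(n_sets):
        work = [r for r, lab in zip(reflections, labels) if lab != s]
        free = [r for r, lab in zip(reflections, labels) if lab == s]
        f_calc = refine_and_predict(work, free)
        for (_, _, _, f_obs), fc in zip(free, f_calc):
            numerator += abs(f_obs - fc)
            denominator += abs(f_obs)
    return numerator / denominator


def toy_refine(work: List[Reflection], free: List[Reflection]) -> List[float]:
    """Stand-in for a real refinement: ignores `work` and returns 90 % of
    each observed amplitude, so the pooled R-value is close to 0.1."""
    return [0.9 * f_obs for (_, _, _, f_obs) in free]


reflections = [(1, 0, 0, 10.0), (0, 1, 0, 20.0), (0, 0, 1, 30.0), (1, 1, 0, 40.0)]
labels = assign_free_sets(len(reflections), n_sets=2)
r_value = r_complete(reflections, labels, 2, toy_refine)
print(round(r_value, 3))
```

In a real calculation the callback would wrap an actual refinement run, and the number of free sets would be chosen so that each refinement still sees nearly all of the data.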

The latest version of SHELXL prints the numbers required for the
calculation of R_{complete}. Together with my program `crossflaghkl`,
this makes it easy to calculate R_{complete} at any stage during model
refinement.

Jens Lübben's GUI R_complete makes the calculation of
R_{complete} even more convenient.

For more information and uses, please see Luebben & Gruene, PNAS (2015), Vol. 112, 8999-9003. Note that the description of the experiments themselves is available in the Supplement to this publication.

The benefit of R_{complete} for charge density studies was
demonstrated by Krause
*et al.*, IUCrJ (2017), Vol. 4, 420-430.

**Nota bene and Caveat** R_{cross}
is mathematically identical to R_{complete} except for a random
disturbance of the parameters that the authors apply in R_{cross}. However, this disturbance
has a detrimental effect and can destroy the validity of R_{complete}:
In the best case, the parameters return to the same minimum during refinement,
*i.e.* the disturbance has no effect. In other cases, the
parameters of different sets of calculations fall into different minima, so that
the R-value is calculated from *different* structures. In this case, the
calculated R-value has no meaning. See Fig. S4 in Luebben & Gruene.

It is a common pitfall to believe that randomization is necessary to make the parameters independent of the bias of previous refinement. The fact that this is not the case is known as Tickle's conjecture. Luebben & Gruene describe experiments that demonstrate the validity of Tickle's conjecture.

## References

- G. J. Kleywegt, T. A. Jones, *Where freedom is given, liberties are taken*, Structure (1995), volume 3, 535-540.
- A. T. Brünger, *Cross-Validation in Crystallography*, Meth. Enzymol. (1990), volume 343, 366-396.
- I. J. Tickle, R. A. Laskowski, D. S. Moss, *Rfree and the Rfree ratio. II*, Acta Crystallogr. (2000) D56, 442-450.

Last modified: Jul 31, 2019 16:53