Complete Cross Validation in Crystallography

History of Cross Validation: Rfree

In the 1990s it became clear that the R1-value is not a good quality measure for crystal structures, in particular for macromolecular structures which suffer from a low data to parameter ratio [1, and references therein].

Axel Brünger introduced the concept of cross-validation into protein crystallography, and nowadays the use and importance of Rfree are generally accepted for macromolecular structures [2].

As A. Brünger already pointed out, Rfree is actually a compromise between complete cross-validation and what was feasible with the computing power available at that time.

Complete Cross Validation: Rcomplete

Refinement is subject to bias: errors in the data result in overfitting, i.e. an artificially low R1-value. It has been shown that Rcomplete is biased in the opposite direction [3], so that the true R-value equals the mean of R1 and Rcomplete. The gap between R1 and Rcomplete is therefore a measure of the amount of overfitting of the model.
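The relation between R1, Rcomplete and the true R-value stated above can be written down in a few lines. The following is only a minimal sketch: the function name and the numerical values are made up for illustration and are not taken from any real refinement.

```python
from typing import Tuple


def overfitting_summary(r1: float, r_complete: float) -> Tuple[float, float]:
    """Return (estimated true R-value, overfitting gap).

    Assumes, as stated in the text, that R1 and Rcomplete are biased in
    opposite directions, so the true R-value is taken as their mean.
    """
    estimated_true_r = 0.5 * (r1 + r_complete)
    gap = r_complete - r1  # a larger gap indicates more overfitting
    return estimated_true_r, gap


# Hypothetical example values, chosen only to illustrate the arithmetic.
estimate, gap = overfitting_summary(r1=0.040, r_complete=0.050)
print(f"estimated true R = {estimate:.3f}, gap = {gap:.3f}")
```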

The speed of modern computers allows one to carry out proper, complete cross validation and therefore to obtain a fully valid, unbiased R-value for crystal structures from all reflections (hence the term 'complete cross validation').
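The idea of complete cross validation can be sketched as follows: every reflection is assigned to exactly one free set, the model is refined once per free set against the remaining reflections, and the residuals of the left-out reflections are accumulated into a single R-value. This is only an illustrative sketch and not the implementation used by SHELXL, crossflaghkl or R_complete. All names are hypothetical, the conventional residual sum(| |Fo| - |Fc| |) / sum(|Fo|) is assumed, and the toy "refinement" (fitting a single scale factor) merely stands in for a real refinement program. Every run starts from the same model, without random perturbation of the parameters.

```python
import random
from typing import Callable, List, Sequence


def split_into_free_sets(n_reflections: int, n_sets: int, seed: int = 0) -> List[List[int]]:
    """Assign every reflection index to exactly one of n_sets disjoint free sets."""
    indices = list(range(n_reflections))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_sets] for i in range(n_sets)]


def r_complete(
    f_obs: Sequence[float],
    free_sets: Sequence[Sequence[int]],
    refine_and_predict: Callable[[Sequence[int], Sequence[int]], List[float]],
) -> float:
    """Accumulate the residuals of the left-out reflections over all free sets.

    refine_and_predict(work, free) is a hypothetical callback: it refines the
    model against the work reflections only and returns |Fc| for the free ones.
    """
    numerator = 0.0
    for free in free_sets:
        left_out = set(free)
        work = [i for i in range(len(f_obs)) if i not in left_out]
        f_calc_free = refine_and_predict(work, free)
        numerator += sum(abs(abs(f_obs[i]) - abs(fc)) for i, fc in zip(free, f_calc_free))
    # Each reflection was left out exactly once, so normalise by all |Fo|.
    return numerator / sum(abs(f) for f in f_obs)


def make_toy_refinement(f_obs: Sequence[float], f_model: Sequence[float]):
    """Toy stand-in for a refinement program: least-squares fit of one scale factor."""
    def refine_and_predict(work: Sequence[int], free: Sequence[int]) -> List[float]:
        scale = sum(f_obs[i] * f_model[i] for i in work) / sum(f_model[i] ** 2 for i in work)
        return [scale * f_model[i] for i in free]
    return refine_and_predict


# Made-up amplitudes: a model plus noise, for illustration only.
rng = random.Random(1)
f_model = [rng.uniform(10.0, 100.0) for _ in range(200)]
f_obs = [2.0 * f + rng.gauss(0.0, 1.0) for f in f_model]

free_sets = split_into_free_sets(len(f_obs), n_sets=10)
print(f"Rcomplete = {r_complete(f_obs, free_sets, make_toy_refinement(f_obs, f_model)):.4f}")
```

Because each reflection contributes to the numerator exactly once while having been excluded from the refinement that predicted it, the resulting value uses all reflections and none of them is "seen" by the model it is compared with.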

The latest version of SHELXL prints the numbers required for the calculation of Rcomplete. Together with my program crossflaghkl, it is easy to calculate Rcomplete at any stage during model refinement.

Jens Lübben's GUI R_complete makes the calculation of Rcomplete even more convenient.

For more information and uses, please see Luebben & Gruene, PNAS (2015), Vol. 112, 8999-9003. Note that the description of the experiments themselves is available in the Supplement to this publication.

The benefit of Rcomplete for charge density studies was demonstrated by Krause et al., IUCrJ (2017), Vol. 4, 420-430.

Nota bene and Caveat

Rcross is mathematically identical to Rcomplete except for a random disturbance of the parameters in Rcross applied by the authors. However, this disturbance has a detrimental effect and can destroy the validity of Rcomplete: in the best case, the parameters return to the same minimum during refinement, i.e. the disturbance has no effect. In other cases, the parameters of different sets of calculations fall into different minima, so that the R-value is calculated from different structures. In this case, the calculated R-value has no meaning. See Fig. S4 in Luebben & Gruene.

It is a common pitfall to believe that randomization is necessary to make the parameters independent of the bias from previous refinement. The fact that this is not the case is known as Tickle's conjecture. Luebben & Gruene contains a description of experiments that demonstrate the validity of Tickle's conjecture.

References

  1. G. J. Kleywegt, T. A. Jones, Where freedom is given, liberties are taken, Structure (1995), volume 3, 535-540
  2. A. T. Brünger, Free R value: Cross-validation in crystallography, Meth. Enzymol. (1997), volume 277, 366-396
  3. I. J. Tickle, R. A. Laskowski, D. S. Moss, Rfree and the Rfree ratio. II, Acta Crystallogr. (2000) D56, 442-450

Last modified: Jul 31, 2019 16:53