# Data Analysis

## Multidimensional Data

### PCA - Principal Component Analysis

PCA (in general, and here especially in R) is a tool that projects a cloud of data points into a new coordinate system which emphasizes correlations and similarities; the shape of the cloud thereby becomes easier to interpret. It uses the eigenvalues and eigenvectors of the variance-covariance matrix: the eigenvector with the largest eigenvalue becomes the first axis, the one with the second-largest eigenvalue the second, and so on. The points then have their largest standard deviation along the first axis. PCA also translates the data's center of gravity to the origin and can scale the axes. It is no problem to project the points onto fewer than the original number of dimensions, because the dimensions with the smallest eigenvalues generally don't reveal much information.
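What `prcomp` does can be sketched by hand in a few lines of base R. The synthetic cloud below stands in for the seminar data, which is not available here; the idea is to center the data, take the eigenvectors of the covariance matrix (sorted by eigenvalue), and project the points onto them:

```
set.seed(1)
raw <- matrix(rnorm(600), ncol = 3)
# build a correlated cloud: the columns are linear mixes of the raw ones
x <- cbind(raw[,1] + raw[,2], raw[,1] - 2*raw[,2], 0.1 * raw[,3])

xc <- scale(x, center = TRUE, scale = FALSE)  # move centre of gravity to origin
e  <- eigen(cov(xc))                          # eigenvalues come back in decreasing order
scores <- xc %*% e$vectors                    # coordinates in the new space

# prcomp finds the same axes (up to sign) and the same standard deviations
p <- prcomp(x)
all.equal(p$sdev, sqrt(e$values))             # TRUE
```

The only differences to `prcomp` are cosmetic: `prcomp` names the axes PC1, PC2, ... and may flip the sign of individual eigenvectors.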
In the following example I use the data from the seminar: I take the first 6 dimensions and a subsample, then I visualize it using rpanel (see the graphics chapter).
```
> load("protein_all.R")
> head(blist)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 1395  426  819  787  605  848 1443 1509 1430  1499
[2,]  782  819 1423  605  848  900 1509 1430 1499  1442
[3,]  642 1423 1401  848  900  645 1430 1499 1442  1449
[4,] 1069 1401 1266  900  645  726 1499 1442 1449  1389
[5,]  488 1266 1422  645  726  554 1442 1449 1389  1420
[6,] 1224 1422  730  726  554 1184 1449 1389 1420  1416
> pca <- blist[,1:6]
> library(rpanel)
> rp.plot3d(pca[,1],pca[,2], pca[,3])
```
This 3D plot doesn't divulge much information. Now I use `prcomp` for the PCA. I take a subsample and draw a pairwise plot with `pairs`; the command `biplot` would be possible too, but only for the first two dimensions. Subsampling with `sample` has already been explained. As an enhancement, one could also draw density estimates for each panel, or add colours or additional information about each pair to the plot - see the examples in `?pairs` for more.
```
> pca.prcomp <- prcomp(pca)
> plot(pca.prcomp)
> summary(pca.prcomp)
Importance of components:
PC1     PC2     PC3     PC4      PC5      PC6
Standard deviation     1286.350 735.324 664.664 308.646 286.1123 265.1802
Proportion of Variance    0.574   0.187   0.153   0.033   0.0284   0.0244
Cumulative Proportion     0.574   0.761   0.914   0.947   0.9756   1.0000

> pca.prcomp.sam <- pca.prcomp$x[sample(length(pca.prcomp$x[,1]), 1000),]

> head(pca.prcomp.sam)
PC1       PC2        PC3        PC4        PC5        PC6
[1,]  1537.986 -445.7424   81.65688   53.53643   51.99708  119.11222
[2,] -2035.648 -118.8843  -29.96359 -370.29551  190.12007 -232.20885
[3,]  1566.890 -476.0046  110.56378   35.99743  -62.13603  -68.30757
[4,] -2023.303 -354.9968 -110.04953 -316.06372 -203.38162 -114.12394
[5,] -1364.616 -483.2570   52.74121  631.05140  219.95764  283.68527
[6,]  1408.336 -603.0983   91.03420   69.88446 -168.54962   -9.46777

> pairs(pca.prcomp.sam)
```
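The enhancement mentioned above, density estimates in the panels, can be sketched with a custom diagonal panel function (the name `panel.dens` is my own choice, and the data is synthetic; the construction follows the `panel.hist` example in `?pairs`):

```
# draw a kernel density estimate on the diagonal of a pairs plot
panel.dens <- function(x, ...) {
  usr <- par("usr"); on.exit(par(usr = usr))     # restore coordinates afterwards
  d <- density(x)
  par(usr = c(usr[1:2], 0, max(d$y) * 1.3))      # rescale the y-axis to the density
  lines(d$x, d$y)
}

pdf(NULL)                                        # null device: draw without writing a file
pairs(matrix(rnorm(900), ncol = 3), diag.panel = panel.dens, cex = 0.3)
dev.off()
```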
Instead of `pairs`, I use the `splom` routine from the lattice package:
```
> library(lattice)
> splom(as.data.frame(pca.prcomp.sam),cex=0.3)
```
(Output of `pairs` shown here.)
The return object of `prcomp` has more than just the transformed coordinates - just have a look (the rotation matrix is shown below):
```
> names(pca.prcomp)
[1] "sdev"     "rotation" "center"   "scale"    "x"

> pca.prcomp$rotation
PC1          PC2         PC3         PC4          PC5         PC6
[1,] -0.05468035 -0.991925665  0.10466361 -0.01896061  0.005546653 -0.04181857
[2,] -0.66138822  0.103918118  0.69029667  0.17042054 -0.213852149  0.02193744
[3,] -0.66207559 -0.046949467 -0.69381753  0.24718781 -0.004433389  0.13018229
[4,] -0.19749134  0.021074967  0.13632346 -0.31363272  0.847461748  0.35413561
[5,] -0.21801303  0.050829522 -0.05201498 -0.23623690  0.249386732 -0.91059294
[6,] -0.18627692  0.007242444 -0.09931579 -0.86909272 -0.416931267  0.16195983
```
By default, scaling is disabled. You can enable it with the `scale.` argument, `prcomp(..., scale. = TRUE)`, and the data will be scaled to unit variance.
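A small sketch of what scaling changes, on synthetic data with two columns of very different variance (the column names are my own):

```
set.seed(2)
m <- cbind(small = rnorm(100, sd = 1), big = rnorm(100, sd = 1000))

unscaled <- prcomp(m)                  # dominated by the large-variance column
scaled   <- prcomp(m, scale. = TRUE)   # each column is scaled to unit variance first

round(unscaled$sdev)   # the first component carries almost all the variance
round(scaled$sdev, 2)  # both components contribute about equally
```

Without scaling, a column measured in large units dictates the first principal component; with `scale. = TRUE` the analysis effectively works on the correlation matrix instead of the covariance matrix.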
For more, have a look at a more comprehensive overview of related methods such as PCO, MDS, SOM, CA and MST.
But wait, we want a 3D plot of the first three principal components, too!
```
> library(rpanel)
Loading required package: tcltk
Loading Tcl/Tk interface ... done
Package `rpanel', version 1.0-3
type help(rpanel) for summary information

> rp.plot3d(pca.prcomp.sam[,1],pca.prcomp.sam[,2],pca.prcomp.sam[,3])
```
When you rotate the result, you can see a dent on one side and two hotspots; with more points or pre-filtered data there are many possibilities to play around. Once again, there are also tools for density estimation in 3D which show cuts through 3D contour plots. But more about R's graphics techniques in the graphics chapter.
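As a two-dimensional taste of that density-estimation idea, `kde2d` from the MASS package (which ships with R) estimates a density on a grid, and `contour` draws cuts through the density surface; 3D analogues exist in add-on packages, which I don't assume to be installed here:

```
library(MASS)                            # provides kde2d
set.seed(6)
xy <- cbind(rnorm(500), rnorm(500))      # synthetic stand-in for real data
dens <- kde2d(xy[,1], xy[,2], n = 50)    # density estimated on a 50 x 50 grid

pdf(NULL)                                # null device: draw without writing a file
contour(dens$x, dens$y, dens$z)          # contour lines = cuts through the surface
dev.off()
```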

## Clustering

Here, clustering is used as a method to divide a huge amount of data into smaller blocks or samples, each of which has something in common: all points of a block are close to each other under a certain distance measure or criterion, and each block may carry a specific semantic meaning.
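As a minimal illustration of that idea (not yet the EM algorithm), base R's `kmeans` splits a synthetic point cloud into blocks whose members are close under Euclidean distance:

```
set.seed(3)
# two well-separated blobs of 50 points each
pts <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 5), ncol = 2))

cl <- kmeans(pts, centers = 2, nstart = 10)
table(cl$cluster)   # two blocks of about 50 points each
cl$centers          # close to the true blob centres
```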

### Expectation-Maximization

I focus on the EM algorithm (expectation-maximization) because we used it in the seminar. It is a tool to automatically classify data into clusters. In general, it starts with a randomly selected classification and refines it in two steps, which are then iterated. This loop continues until the quality of the classification is good enough (judged by the BIC) or the maximum number of iterations is reached. The number of components can be given, or the algorithm tests several different numbers of components and selects the best one. Different model parameterizations are available as well. The implementation of the EM algorithm in R is provided by the `mclust` package, which has some tools built around the algorithm. The main command to start the algorithm is `Mclust` (see `?Mclust`); there are routines to evaluate the algorithm step by step, and methods like `hc` provide agglomerative hierarchical clustering.
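To make the two iterated steps concrete, here is a hand-rolled EM for a one-dimensional mixture of two Gaussians in base R. mclust does all of this for you, with many covariance models and BIC-based model selection on top; this sketch only shows the mechanics:

```
set.seed(4)
x <- c(rnorm(300, mean = 0), rnorm(300, mean = 6))   # two hidden groups

mu <- c(-1, 1); s <- c(1, 1); p <- c(0.5, 0.5)       # crude starting values
for (i in 1:50) {
  # E-step: posterior probability that each point belongs to component 2
  d1 <- p[1] * dnorm(x, mu[1], s[1])
  d2 <- p[2] * dnorm(x, mu[2], s[2])
  r  <- d2 / (d1 + d2)
  # M-step: re-estimate weights, means and standard deviations
  # from the points, weighted by the posteriors
  p  <- c(mean(1 - r), mean(r))
  mu <- c(weighted.mean(x, 1 - r), weighted.mean(x, r))
  s  <- c(sqrt(weighted.mean((x - mu[1])^2, 1 - r)),
          sqrt(weighted.mean((x - mu[2])^2, r)))
}
round(mu, 1)   # close to the true means 0 and 6
```

A fixed number of iterations stands in for a proper convergence check on the log-likelihood, which a real implementation would use.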
The following example is just a sneak preview of what could be done with our data (blist) from the seminar using the EM algorithm:
```
> library(mclust)
Attaching package: 'mclust'
The following object(s) are masked from package:stats :
density

> blist[sample(length(blist[,1]),1000),1:6] -> bsam

> head(bsam)
[,1]  [,2] [,3] [,4] [,5] [,6]
[1,]   680 -1155 -628  104 -112   80
[2,]  1039  1333 1400  750  664  897
[3,]   376  -990  770  137  168  660
[4,] -1250   116   26  213  986  540
[5,]   843  -902 -907  158  -94   23
[6,]   775  -927 -966   22   42    7

> length(bsam[,1])
[1] 1000

> EMclust(bsam) -> bsam.em
> summary(bsam.em)
Error in summary.EMclust(bsam.em) : argument "data" is missing, with no default
> summary(bsam.em,data=bsa)
Error in summary.EMclust(bsam.em, data = bsa) :
object "bsa" not found
> summary(bsam.em,data=bsam)

classification table:

1   2   3   4   5   6   7   8   9
185 315  10 102  27 224  14 105  18

uncertainty (quartiles):
0%          25%          50%          75%         100%
2.855812e-09 5.186490e-04 6.030938e-03 5.275997e-02 5.530643e-01

best BIC values:
VEI,9     VEI,8     VEV,6
-87802.60 -87812.57 -87838.95

best model: diagonal, equal shape

Warning message:
BIC maximized at upper limit on G in: summary.EMclust(bsam.em, data = bsam)

> plot(bsam.em)
EII VII EEI VEI EVI VVI EEE EEV VEV VVV
"A" "B" "C" "D" "E" "F" "G" "H" "I" "J"

```
The output of the last plot command shows all models and the BIC for each number of components. The warning tells you what the graph makes obvious - we need more components, because the default search only covers 1 to 9.
Now the clustering in 3D:
```
> bsam.pca <- prcomp(bsam)

> library(rpanel)
Loading required package: tcltk
Loading Tcl/Tk interface ... done
Package `rpanel', version 1.0-3
type help(rpanel) for summary information

> summary(bsam.em,data=bsam,G=9)

classification table:

1   2   3   4   5   6   7   8   9
185 315  10 102  27 224  14 105  18

uncertainty (quartiles):
0%          25%          50%          75%         100%
2.855812e-09 5.186490e-04 6.030938e-03 5.275997e-02 5.530643e-01

best BIC values:
VEI,9    VEV,9
-87802.6 -87969.0

best model: diagonal, equal shape

> names(summary(bsam.em,data=bsam,G=9))
[1] "bic"            "options"        "classification" "uncertainty"
[5] "n"              "d"              "G"              "z"
[9] "mu"             "sigma"          "decomp"         "pro"
[13] "loglik"         "Vinv"           "modelName"
> summary(bsam.em,data=bsam,G=9)$classification
[1] 4 2 6 6 1 1 4 6 2 6 6 8 6 8 5 1 2 1 9 2 1 4 4 6 1 8 4 2 8 2 6 6 2 6 2 6 3
[38] 2 6 2 6 2 1 2 8 2 4 1 6 4 6 2 2 1 1 1 8 8 2 6 6 1 6 2 2 2 1 2 2 2 1 8 1 8
[75] 1 1 2 8 2 6 6 8 4 1 8 2 4 6 2 4 6 8 2 2 1 4 4 6 1 2 6 2 1 4 1 8 2 6 1 4 2
[112] 4 1 2 4 1 6 2 2 4 2 6 6 4 4 2 2 5 1 2 2 4 2 2 2 2 1 5 6 6 8 2 6 2 5 6 2 2
[149] 8 1 4 2 2 6 6 6 1 4 8 6 2 8 2 1 2 6 2 2 6 3 6 4 6 1 2 2 8 4 8 2 1 4 2 4 4
[186] 4 8 8 1 1 2 1 2 6 4 1 6 8 6 2 6 2 1 1 2 8 1 2 8 6 1 8 6 6 1 6 6 1 1 2 2 6
[223] 2 2 1 6 2 6 2 1 1 2 9 2 2 1 8 2 8 8 4 1 2 2 6 6 4 2 8 1 2 6 2 2 6 7 1 6 4
[260] 8 6 2 2 2 2 4 8 4 2 8 2 4 6 1 1 2 7 2 6 2 8 1 2 6 6 1 6 8 2 4 6 1 6 8 8 6
[297] 6 6 4 4 2 8 1 4 6 8 5 8 2 2 1 6 2 6 2 6 2 2 2 2 9 8 6 1 5 2 8 7 6 6 8 6 8
[334] 2 2 4 2 5 1 6 2 1 6 2 1 6 4 2 8 1 2 1 8 1 2 8 5 2 8 2 6 1 2 2 1 1 4 8 2 8
[371] 2 2 6 2 2 4 1 8 8 6 6 2 1 8 1 8 2 9 8 2 6 6 6 1 6 1 6 1 2 6 6 2 6 8 2 6 4
[408] 6 4 1 2 2 4 2 2 2 1 2 6 6 1 2 1 1 1 1 8 6 1 1 6 4 1 8 6 8 2 4 2 6 8 2 6 6
[445] 6 5 1 6 2 1 4 2 2 1 6 6 1 9 3 1 4 2 2 4 2 5 4 2 6 5 1 2 2 1 8 2 9 6 6 2 2
[482] 2 5 4 2 2 2 1 2 4 6 6 6 6 2 1 2 6 2 6 9 6 9 6 2 2 2 6 7 7 6 6 1 6 2 1 2 6
[519] 2 2 5 5 2 6 1 4 5 8 4 2 5 1 2 6 8 6 2 8 4 3 8 1 4 2 6 6 2 2 1 2 8 2 6 2 9
[556] 6 6 6 7 6 2 1 4 2 2 5 6 2 1 4 6 2 9 6 2 2 2 4 2 1 1 2 4 1 2 6 2 1 6 4 6 4
[593] 4 6 4 2 1 6 2 6 1 6 1 2 6 8 2 1 6 2 2 2 2 2 2 2 6 1 2 5 2 6 1 2 4 6 1 8 6
[630] 1 6 1 1 1 6 6 2 8 1 8 6 2 1 8 1 6 7 1 2 4 2 1 8 2 6 2 2 4 4 6 1 2 8 6 2 6
[667] 6 6 1 2 6 4 1 6 8 6 4 6 1 2 8 2 2 2 6 1 2 2 2 6 2 1 1 4 7 1 2 1 1 2 8 2 4
[704] 6 6 7 4 4 4 1 2 2 2 6 1 6 1 2 6 6 6 2 8 1 2 1 2 6 1 2 1 5 1 2 6 6 2 4 5 6
[741] 7 2 6 2 4 2 2 1 2 6 1 2 2 1 1 2 2 3 8 5 8 2 2 1 2 8 6 2 4 6 2 5 6 6 1 1 8
[778] 9 8 6 6 5 8 2 9 6 2 2 2 1 4 2 1 2 2 2 1 2 3 2 2 2 6 2 2 4 1 6 1 1 8 9 2 1
[815] 2 1 1 1 4 2 2 6 2 2 2 4 4 8 4 2 2 9 1 4 8 8 4 6 4 6 8 2 2 4 8 6 2 5 6 2 6
[852] 2 1 4 1 1 8 1 6 1 6 2 6 6 2 1 2 4 7 2 8 9 2 1 2 1 2 1 6 2 9 6 6 2 2 7 6 1
[889] 3 8 1 6 8 6 1 4 7 6 8 2 2 2 2 2 6 6 4 6 2 2 2 4 6 1 4 2 1 2 2 2 2 2 2 2 6
[926] 1 2 8 6 1 2 1 6 8 2 6 6 6 4 4 6 2 8 6 6 6 2 4 2 8 1 1 8 6 8 2 1 1 4 1 2 2
[963] 2 3 8 8 2 2 6 6 4 4 2 3 5 5 1 8 2 1 8 2 1 2 9 2 8 9 6 7 1 2 1 5 6 1 3 1 6
[1000] 6

> rp.plot3d(bsam[,1],bsam[,2],bsam[,3],col=summary(bsam.em,data=bsam,G=9)$classification)
```
The last command plots a 3D view which can be freely rotated; the colours show the classification. It is also possible to refine the plot using PCA as shown above - even clustering the data after PCA can produce better clusters. There are also routines to apply the fitted model to new and different datasets. This is useful if the clustering is only done on a subsample, as here, and you want to classify the whole dataset without repeating the time-consuming calculation (O(n^2)).
```
> bsam.pca <- prcomp(bsam, scale. = TRUE)
> rp.plot3d(bsam.pca$x[,1],bsam.pca$x[,2],bsam.pca$x[,3],col=summary(bsam.em,data=bsam,G=9)$classification)
```
```
> library(lattice)

> splom(as.data.frame(bsam.pca$x), col=summary(bsam.em,data=bsam,G=9)$classification, cex=0.3)
```
More about lattice and other visualizations in the graphics chapter. Here is just an example of dividing the data into equal overlapping sets and plotting them separately.
```
> bsam.depth <- equal.count(bsam[,3], number=6, overlap=0.1)  # shingle over one column
> xyplot(bsam.pca$x[,1] ~ bsam.pca$x[,2] | bsam.depth, col=summary(bsam.em,data=bsam,G=9)$classification)
```
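The idea of classifying the whole dataset from a model fitted on a subsample can be sketched in base R. I use `kmeans` plus a nearest-centre assignment here instead of mclust's model-based classification, and the data is synthetic; only the principle carries over:

```
set.seed(5)
# a "full" dataset of 2000 points in two blobs
full <- rbind(matrix(rnorm(2000, mean = 0), ncol = 2),
              matrix(rnorm(2000, mean = 6), ncol = 2))

# fit the model on a small subsample only
sam <- full[sample(nrow(full), 200), ]
fit <- kmeans(sam, centers = 2, nstart = 10)

# classify every point of the full dataset by its nearest fitted centre
d   <- sapply(1:2, function(k) colSums((t(full) - fit$centers[k, ])^2))
cls <- max.col(-d)
table(cls)   # both clusters recover roughly half of the full dataset
```

Assigning points to fitted centres is cheap, so the expensive clustering step only ever sees the subsample.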

## Markov Chains

Markov chains ...