Data Analysis
multidimensional data
PCA - principal component analysis
PCA (in general and now especially in R) is a tool, which tries to project a cloud of data
points into a new space, which emphasizes correlations and similarities. This means, the shape
of a cloud is improved for understanding the nature of the data. It uses the eigenvalues and
eigenvectors of the variance-covariance matrix and takes the eigenvector with the larges
eigenvalues for the first axis, the second largest for the next and so on. Then the points
around the first axis have the largest standard deviations. It also translates the data's
center of gravity to origin and scales the axis. It is no problem to project the points to less
then the number of dimensitions, because in general the dimensitions with the smallest
eigenvalues don't reveal much information.
In the following example I use the data from the seminar, I take the first 6 dimensitions and a
subsample. Then I visualize it using
rpanel (see graphics).
> load("protein_all.R")
> head(blist)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 1395 426 819 787 605 848 1443 1509 1430 1499
[2,] 782 819 1423 605 848 900 1509 1430 1499 1442
[3,] 642 1423 1401 848 900 645 1430 1499 1442 1449
[4,] 1069 1401 1266 900 645 726 1499 1442 1449 1389
[5,] 488 1266 1422 645 726 554 1442 1449 1389 1420
[6,] 1224 1422 730 726 554 1184 1449 1389 1420 1416
> pca <- blist[,1:6]
> library(rpanel)
> rp.plot3d(pca[,1],pca[,2], pca[,3])
This 3D plot doesn't divulge much information. Now I use prcomp for the PCA. I
take a subsample and draw a pairwise plot with pairs, the command
biplot would be possible, too, but just for the first two dimensitions. Subsample
with sample, I've already explained that. As an enhancement somebody could also
draw density estimations for each plot, just colours or additional information about each pair
into the plot - see ?pairs-examples for more.
> pca.prcomp <- prcomp(pca)
> plot(pca.prcomp)
> summary(pca.prcomp)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6
Standard deviation 1286.350 735.324 664.664 308.646 286.1123 265.1802
Proportion of Variance 0.574 0.187 0.153 0.033 0.0284 0.0244
Cumulative Proportion 0.574 0.761 0.914 0.947 0.9756 1.0000
> pca.prcomp.sam <- pca.prcomp$x[sample(length(pca.prcomp$x[,1]), 1000),]
> head(pca.prcomp.sam)
PC1 PC2 PC3 PC4 PC5 PC6
[1,] 1537.986 -445.7424 81.65688 53.53643 51.99708 119.11222
[2,] -2035.648 -118.8843 -29.96359 -370.29551 190.12007 -232.20885
[3,] 1566.890 -476.0046 110.56378 35.99743 -62.13603 -68.30757
[4,] -2023.303 -354.9968 -110.04953 -316.06372 -203.38162 -114.12394
[5,] -1364.616 -483.2570 52.74121 631.05140 219.95764 283.68527
[6,] 1408.336 -603.0983 91.03420 69.88446 -168.54962 -9.46777
> pairs(pca.prcomp.sam)
Instead of pairs, i use the splom routine from the lattice package:
> library(lattice)
> splom(as.data.frame(pca.prcomp.sam),cex=0.3)
output of pairs:
The return object of prcomp has more than just the transformed coordinates. Just
have a look: (rotation matrix shown)
> names(pca.prcomp)
[1] "sdev" "rotation" "center" "scale" "x"
> pca.prcomp$rotation
PC1 PC2 PC3 PC4 PC5 PC6
[1,] -0.05468035 -0.991925665 0.10466361 -0.01896061 0.005546653 -0.04181857
[2,] -0.66138822 0.103918118 0.69029667 0.17042054 -0.213852149 0.02193744
[3,] -0.66207559 -0.046949467 -0.69381753 0.24718781 -0.004433389 0.13018229
[4,] -0.19749134 0.021074967 0.13632346 -0.31363272 0.847461748 0.35413561
[5,] -0.21801303 0.050829522 -0.05201498 -0.23623690 0.249386732 -0.91059294
[6,] -0.18627692 0.007242444 -0.09931579 -0.86909272 -0.416931267 0.16195983
But wait, we want a 3D plot of the first three principal components, too!
> library(rpanel)
Lade nötiges Paket: tcltk
Loading Tcl/Tk interface ... done
Package `rpanel', version 1.0-3
type help(rpanel) for summary information
> rp.plot3d(pca.prcomp.sam[,1],pca.prcomp.sam[,2],pca.prcomp.sam[,3])
When you rotate the result, you can see a dint on one side, two hotspots and with more points or
pre-filtered data, there are many possibilities to play around. Once again, there are also tools
for density estimations in 3D which show cuts through 3D countour plots. But more about R's
graphic technics in
graphics chapter.
Clustering
Here, clustering is used as a method to divide a huge amount of data into smaller blocks or
samples, where each of them has something in common. This means, all points of each part are
near to each other using a certain distance or criteria and each block could have a special
semantic meaning.
estimation-maximization
I focus on the
EM-Algorithm, because
we used it in the seminar. It is a tool to automatically classify data into clusters. In
general, it starts with a randomly selected classification and refines the selection in two
steps, which are iterated then. This loop is continued until the quality of the selection is
good enough (BIC) or the maximum number of iterations is reached. The number of components can
be given or the algorithm tests for several different numbers of components and selects the best
one. There are also different models available. The implementation of the EM-Algorithm in R is
provided by the
mclust package, which has some tools built around the algorithm.
The main command to start the algorithm is
?Mclust, there are routines to evaluate
the algorithm step by step and methods like
?hc provide an agglomerative
hierarchical clustering.
The following example is just a sneak preview what could be done using our data (blist) from the
seminar with the EM-algorithm:
> library(mclust)
Attache Paket: 'mclust'
The following object(s) are masked from package:stats :
density
> blist[sample(length(blist[,1]),1000),1:6] -> bsam
> head(bsam)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 680 -1155 -628 104 -112 80
[2,] 1039 1333 1400 750 664 897
[3,] 376 -990 770 137 168 660
[4,] -1250 116 26 213 986 540
[5,] 843 -902 -907 158 -94 23
[6,] 775 -927 -966 22 42 7
> length(bsam[,1])
[1] 1000
> EMclust(bsam) -> bsam.em
> summary(bsam.em)
Fehler in summary.EMclust(bsam.em) : Argument "data" fehlt (ohne Standardwert)
> summary(bsam.em,data=bsa)
Fehler in summary.EMclust(bsam.em, data = bsa) :
objekt "bsa" nicht gefunden
> summary(bsam.em,data=bsam)
classification table:
1 2 3 4 5 6 7 8 9
185 315 10 102 27 224 14 105 18
uncertainty (quartiles):
0% 25% 50% 75% 100%
2.855812e-09 5.186490e-04 6.030938e-03 5.275997e-02 5.530643e-01
best BIC values:
VEI,9 VEI,8 VEV,6
-87802.60 -87812.57 -87838.95
best model: diagonal, equal shape
Warning message:
BIC maximized at upper limit on G in: summary.EMclust(bsam.em, data = bsam)
> plot(bsam.em)
EII VII EEI VEI EVI VVI EEE EEV VEV VVV
"A" "B" "C" "D" "E" "F" "G" "H" "I" "J"
output of last plot command, shows all models and the BIC for each number of components. The
warning tells you what the graph makes obvious - we need more components, because the default
value searches only from 1 to 9.
now the clustering in 3D
> bsam.pca <- prcomp(bsam)
> library(rpanel)
Lade nötiges Paket: tcltk
Loading Tcl/Tk interface ... done
Package `rpanel', version 1.0-3
type help(rpanel) for summary information
> summary(bsam.em,data=bsam,G=9)
classification table:
1 2 3 4 5 6 7 8 9
185 315 10 102 27 224 14 105 18
uncertainty (quartiles):
0% 25% 50% 75% 100%
2.855812e-09 5.186490e-04 6.030938e-03 5.275997e-02 5.530643e-01
best BIC values:
VEI,9 VEV,9
-87802.6 -87969.0
best model: diagonal, equal shape
> names(summary(bsam.em,data=bsam,G=9))
[1] "bic" "options" "classification" "uncertainty"
[5] "n" "d" "G" "z"
[9] "mu" "sigma" "decomp" "pro"
[13] "loglik" "Vinv" "modelName"
> summary(bsam.em,data=bsam,G=9)$classification
[1] 4 2 6 6 1 1 4 6 2 6 6 8 6 8 5 1 2 1 9 2 1 4 4 6 1 8 4 2 8 2 6 6 2 6 2 6 3
[38] 2 6 2 6 2 1 2 8 2 4 1 6 4 6 2 2 1 1 1 8 8 2 6 6 1 6 2 2 2 1 2 2 2 1 8 1 8
[75] 1 1 2 8 2 6 6 8 4 1 8 2 4 6 2 4 6 8 2 2 1 4 4 6 1 2 6 2 1 4 1 8 2 6 1 4 2
[112] 4 1 2 4 1 6 2 2 4 2 6 6 4 4 2 2 5 1 2 2 4 2 2 2 2 1 5 6 6 8 2 6 2 5 6 2 2
[149] 8 1 4 2 2 6 6 6 1 4 8 6 2 8 2 1 2 6 2 2 6 3 6 4 6 1 2 2 8 4 8 2 1 4 2 4 4
[186] 4 8 8 1 1 2 1 2 6 4 1 6 8 6 2 6 2 1 1 2 8 1 2 8 6 1 8 6 6 1 6 6 1 1 2 2 6
[223] 2 2 1 6 2 6 2 1 1 2 9 2 2 1 8 2 8 8 4 1 2 2 6 6 4 2 8 1 2 6 2 2 6 7 1 6 4
[260] 8 6 2 2 2 2 4 8 4 2 8 2 4 6 1 1 2 7 2 6 2 8 1 2 6 6 1 6 8 2 4 6 1 6 8 8 6
[297] 6 6 4 4 2 8 1 4 6 8 5 8 2 2 1 6 2 6 2 6 2 2 2 2 9 8 6 1 5 2 8 7 6 6 8 6 8
[334] 2 2 4 2 5 1 6 2 1 6 2 1 6 4 2 8 1 2 1 8 1 2 8 5 2 8 2 6 1 2 2 1 1 4 8 2 8
[371] 2 2 6 2 2 4 1 8 8 6 6 2 1 8 1 8 2 9 8 2 6 6 6 1 6 1 6 1 2 6 6 2 6 8 2 6 4
[408] 6 4 1 2 2 4 2 2 2 1 2 6 6 1 2 1 1 1 1 8 6 1 1 6 4 1 8 6 8 2 4 2 6 8 2 6 6
[445] 6 5 1 6 2 1 4 2 2 1 6 6 1 9 3 1 4 2 2 4 2 5 4 2 6 5 1 2 2 1 8 2 9 6 6 2 2
[482] 2 5 4 2 2 2 1 2 4 6 6 6 6 2 1 2 6 2 6 9 6 9 6 2 2 2 6 7 7 6 6 1 6 2 1 2 6
[519] 2 2 5 5 2 6 1 4 5 8 4 2 5 1 2 6 8 6 2 8 4 3 8 1 4 2 6 6 2 2 1 2 8 2 6 2 9
[556] 6 6 6 7 6 2 1 4 2 2 5 6 2 1 4 6 2 9 6 2 2 2 4 2 1 1 2 4 1 2 6 2 1 6 4 6 4
[593] 4 6 4 2 1 6 2 6 1 6 1 2 6 8 2 1 6 2 2 2 2 2 2 2 6 1 2 5 2 6 1 2 4 6 1 8 6
[630] 1 6 1 1 1 6 6 2 8 1 8 6 2 1 8 1 6 7 1 2 4 2 1 8 2 6 2 2 4 4 6 1 2 8 6 2 6
[667] 6 6 1 2 6 4 1 6 8 6 4 6 1 2 8 2 2 2 6 1 2 2 2 6 2 1 1 4 7 1 2 1 1 2 8 2 4
[704] 6 6 7 4 4 4 1 2 2 2 6 1 6 1 2 6 6 6 2 8 1 2 1 2 6 1 2 1 5 1 2 6 6 2 4 5 6
[741] 7 2 6 2 4 2 2 1 2 6 1 2 2 1 1 2 2 3 8 5 8 2 2 1 2 8 6 2 4 6 2 5 6 6 1 1 8
[778] 9 8 6 6 5 8 2 9 6 2 2 2 1 4 2 1 2 2 2 1 2 3 2 2 2 6 2 2 4 1 6 1 1 8 9 2 1
[815] 2 1 1 1 4 2 2 6 2 2 2 4 4 8 4 2 2 9 1 4 8 8 4 6 4 6 8 2 2 4 8 6 2 5 6 2 6
[852] 2 1 4 1 1 8 1 6 1 6 2 6 6 2 1 2 4 7 2 8 9 2 1 2 1 2 1 6 2 9 6 6 2 2 7 6 1
[889] 3 8 1 6 8 6 1 4 7 6 8 2 2 2 2 2 6 6 4 6 2 2 2 4 6 1 4 2 1 2 2 2 2 2 2 2 6
[926] 1 2 8 6 1 2 1 6 8 2 6 6 6 4 4 6 2 8 6 6 6 2 4 2 8 1 1 8 6 8 2 1 1 4 1 2 2
[963] 2 3 8 8 2 2 6 6 4 4 2 3 5 5 1 8 2 1 8 2 1 2 9 2 8 9 6 7 1 2 1 5 6 1 3 1 6
[1000] 6
> rp.plot3d(bsam[,1],bsam[,2],bsam[,3],col=summary(bsam.em,data=bsam,G=9)$classification)
The last command plots a 3D view which could be freely rotated. The colors are the
classifications. It is also possible, to refine the plot using PCA as shown above, even
clustering the data after PCA produces better clusters. There are also routines to apply the
model to new and different datasets. This is useful if the clustering is only done on a
subsample as shown here and you want to classify the whole dataset without doing a the
timeconsuming calculation (O(n^2)).
> bsam.pca <- prcomp(bsam,scaling=TRUE)
> rp.plot3d(bsam.pca$x[,1],bsam.pca$x[,2],bsam.pca$x[,3],col=summary(bsam.em,data=bsam,G=9)$classification)
> library(lattice)
> splom(as.data.frame(bsam.pca$x), col=summary(bsam.em,data=bsam,G=9)$classification, cex=0.3)
more about lattice and other visualizations in the
graphics
chapter. Here just an example of dividing the data into equal sets and plotting them
separately.
> bsam.depth <- equal.count(bsam, number=6, overlap=0.1)
> xyplot(bsam.pca$x[,1] ~ bsam.pca$x[, 2] | bsam.depth, col=summary(bsam.em,data=bsam,G=9)$classification)
markov chains
markov chains ...
links
links