Statistical functions
R was written for statistics, so it offers a very large number of methods for statistical work. This
section summarizes only some of the most important functions; many others are necessarily left out.
basic information
This first part covers some basic methods for getting information about the data.
summary returns descriptive statistics for the data (for a data frame, one set per column):
> c(23,154,22,64,33,41) -> d
> summary(d)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  22.00   25.50   37.00   56.17   58.25  154.00
The exact behavior of this function is determined by the class of the particular object. A numeric
vector produces the output shown above, the result of a model fit produces something else, and so on.
The help for summary shows this for a logical vector, where things like mean and median
don't make any sense:
> unclass(attenu$station) < 25 -> d
> summary(d)
   Mode   FALSE    TRUE    NA's
logical     124      42      16
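As a minimal sketch of this class-dependent behavior, the following calls summary on three small objects of different classes. The factor and the logical vector are made-up illustration data, not taken from any data set used in this text.

```r
# summary() is generic: what it returns depends on the class of its argument.
v <- c(23, 154, 22, 64, 33, 41)          # numeric vector (same data as above)
f <- factor(c("a", "b", "a", "c", "a"))  # factor with made-up levels
l <- c(TRUE, FALSE, NA, TRUE)            # logical vector with one missing value

class(v); class(f); class(l)

summary(v)   # minimum, quartiles, median, mean and maximum
summary(f)   # number of observations per level
summary(l)   # counts of FALSE, TRUE and NA
```

Calling class on an unfamiliar object first is usually a quick way to find out which kind of summary to expect.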
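To calculate quantiles, R has one general function, quantile, which can compute them using different
algorithms, selected with the type argument. The following example computes the default quartiles,
then quantiles for custom probabilities, and finally the same quantiles with a different algorithm (type=8).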
> d <- c(10,11,12,15,19,25,29,44,81,99)
> quantile(d)
   0%   25%   50%   75%  100%
10.00 12.75 22.00 40.25 99.00
> quantile(d,probs=c(0.1,0.2,0.5))
 10%  20%  50%
10.9 11.8 22.0
> quantile(d,probs=c(0.1,0.2,0.5),type=8)
     10%      20%      50%
10.36667 11.40000 22.00000
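To make the difference between the algorithms more concrete, here is a rough sketch of what the default algorithm (type=7) does: it places the probability on a fractional position in the sorted data and interpolates linearly between the two neighbouring values. The helper name quantile7 is hypothetical and only for illustration; the real quantile function handles more cases (missing values, rounding fuzz, the other types).

```r
# Hypothetical helper: the default (type 7) quantile computed by hand.
quantile7 <- function(x, p) {
  x  <- sort(x)
  n  <- length(x)
  h  <- (n - 1) * p + 1                 # fractional position in the sorted data
  lo <- floor(h)                        # index of the value just below
  hi <- ceiling(h)                      # index of the value just above
  x[lo] + (h - lo) * (x[hi] - x[lo])    # linear interpolation between the two
}

d <- c(10, 11, 12, 15, 19, 25, 29, 44, 81, 99)
p <- c(0.1, 0.2, 0.5)

quantile7(d, p)                          # by hand
quantile(d, probs = p, names = FALSE)    # built-in, default type = 7
```

For this data both calls should give the same three values as the type 7 output shown above. Type 8 uses a slightly different position formula, which is why its 10% and 20% quantiles differ.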
models
R supports linear, generalized linear and nonlinear models. R's online manual shows some of these
capabilities. The data sets used for fitting are data frames. A data frame is a special kind of
object that consists mainly of named vectors of equal length. Here is a short example:
> dy <- c(0.1, 0.2, 0.5, 0.6, 0.67, 0.9)
> dx <- c( 0, 0.1, 0.4, 0.6, 0.8, 1 )
> data.frame(x=dx, y=dy) -> df
> df
    x    y
1 0.0 0.10
2 0.1 0.20
3 0.4 0.50
4 0.6 0.60
5 0.8 0.67
6 1.0 0.90
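Since a data frame consists mainly of named vectors, its columns can be accessed by name. The following sketch rebuilds the same data frame and shows a few common ways to inspect it; it is meant as an illustration, not a complete overview.

```r
# Rebuild the data frame from the example above.
df <- data.frame(x = c(0, 0.1, 0.4, 0.6, 0.8, 1),
                 y = c(0.1, 0.2, 0.5, 0.6, 0.67, 0.9))

names(df)            # the column names
df$y                 # one column, returned as a plain numeric vector
df[df$x > 0.5, ]     # only the rows where x is larger than 0.5
nrow(df)             # number of rows
str(df)              # compact overview of the structure
```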
> fit <- lm(y ~ x, data=df)
> fit
Call:
lm(formula = y ~ x, data = df)
Coefficients:
(Intercept)            x
     0.1298       0.7555
> summary(fit)
Call:
lm(formula = y ~ x, data = df)
Residuals:
       1        2        3        4        5        6
-0.02983 -0.00538  0.06796  0.01685 -0.06425  0.01464
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.12983    0.03458   3.754 0.019880 *
x            0.75553    0.05751  13.138 0.000194 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.05041 on 4 degrees of freedom
Multiple R-Squared: 0.9774, Adjusted R-squared: 0.9717
F-statistic: 172.6 on 1 and 4 DF, p-value: 0.0001938
> newdat <- data.frame(x = seq(0,1,0.1))
> predict(fit,newdat,interval="confidence") -> pred.dy
> pred.dy
         fit        lwr       upr
1  0.1298265 0.03380449 0.2258484
2  0.2053796 0.12164929 0.2891099
3  0.2809328 0.20805487 0.3538106
4  0.3564859 0.29228714 0.4206847
5  0.4320390 0.37337348 0.4907046
6  0.5075922 0.45039349 0.5647909
7  0.5831453 0.52304869 0.6432420
8  0.6586985 0.59190479 0.7254922
9  0.7342516 0.65795575 0.8105475
10 0.8098048 0.72210872 0.8975008
11 0.8853579 0.78500848 0.9857074
> matplot(newdat$x, pred.dy, lty = c(1,2,2), type="l", ylab="predicted")
> points(df,col=3,pch=20)
The last two commands produce a plot of the fitted line together with the data points; the upper
and lower dashed lines are the confidence bounds. (More about plotting later.)
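To see what predict does for a linear model like this one, the fit column can be reproduced by hand from the coefficients. The sketch below refits the same model and compares both results; the variable names by.hand and via.predict are only illustrative.

```r
# Refit the model from the example above.
df  <- data.frame(x = c(0, 0.1, 0.4, 0.6, 0.8, 1),
                  y = c(0.1, 0.2, 0.5, 0.6, 0.67, 0.9))
fit <- lm(y ~ x, data = df)

b    <- coef(fit)                 # named vector: "(Intercept)" and "x"
newx <- seq(0, 1, 0.1)

# Predicted value = intercept + slope * x
by.hand     <- b[["(Intercept)"]] + b[["x"]] * newx
via.predict <- predict(fit, data.frame(x = newx))

all.equal(by.hand, unname(via.predict))   # both approaches should agree

summary(fit)$r.squared   # the multiple R-squared value from the summary output
confint(fit)             # confidence intervals (95% by default) for the coefficients
```

Extracting values with coef and summary(fit)$r.squared is usually more convenient than reading them from the printed output when they are needed in later calculations.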