## martedì 23 dicembre 2008

### Statistical Visualizations

Inspired by this interesting post, I decided to reproduce some of the plots using R code.

The data are c & p from here:

`>original Europe Asia Americas Africa Oceania1820-30 106487 36 11951 17 333331831-40 495681 53 33424 54 699111841-50 1597442 141 62469 55 531441851-60 2452577 41538 74720 210 291691861-70 2065141 64759 166607 312 180051871-80 2271925 124160 404044 358 117041881-90 4735484 69942 426967 857 133631891-00 3555352 74862 38972 350 180281901-10 8056040 323543 361888 7368 465471911-20 4321887 247236 1143671 8443 145741921-30 2463194 112059 1516716 6286 89541931-40 347566 16595 160037 1750 24831941-50 621147 37028 354804 7367 146931951-60 1325727 153249 996944 14092 254671961-70 1123492 427642 1716374 28954 252151971-80 800368 1588178 1982735 80779 412541981-90 761550 2738157 3615225 176893 462371991-00 1359737 2795672 4486806 354939 982632001-06 1073726 2265696 3037122 446792 185986`

`png("immigration_log_scatter_BW.png", width = 560, height = 480)par( mar=c(7, 7, 3, 3) )plot( original\$Europe, log="y", type="l", col="grey20", lty=1,ylim=c(10, 10000000), xlab="Year Interval", ylab="Number of Immigrants Admitted to the United States",lwd=2, xaxt='n', yaxt='n', mgp=c(4.5,1,0) ) # xaxt='n' an d yaxt='n'- do not show x and y axisfor (i in 2:dim(original)[[2]]){lines(original[, i], type="l", lty=i, col="grey20")}axis(1, 1:dim(original)[[1]], rownames(original), las=2)axis(2, at=c(10,100,1000,10000,100000,1000000,10000000), labels=c(10,100,1000,10000,100000,1000000,10000000), las=2, tck=1, col="grey85")box()legend( 14,400, legend=colnames(original), lty=c(1:5) )dev.off()`

`png("immigration_stacked_chart.png", width = 560, height = 480)library(plotrix)par( mar=c(6, 6, 3, 3) , las=1)colori4<-c("yellow", "darkred","green","brown1", "steelblue")stackpoly( original[, 5:1], col=smoothColors(colori4), border=NA,stack=T, xaxlab=rownames(original), ylim=c(10,10000000), staxx=TRUE, axis4=F, main="Immigration to the USA - 1821 to 2006" )legend("topleft", legend=colnames(original), fill=smoothColors(colori4)[5:1] )dev.off()`

## giovedì 11 dicembre 2008

### Tips from Jason

I want to thank Jason Vertrees for the following collection of useful tips!

(1) Use ~/.Rprofile for repeated environment initialization

(2) Ever have the problem of a large data frame only being displayed across 40% of your terminal window? Then, you can resize the R display to fit the size of your terminal window. Use the following "wideScreen" function:

```# define wideScreen wideScreen <- function() { options(width=as.integer(Sys.getenv("COLUMNS"))); } # # Test wideScreen # a <- rnorm(100) a wideScreen() # notice how the data fill the screen a ```

(3) Get familiar with colorspace. For example, if you need to color data points across a range, you can easily do:
``` ## ## lut.R -- small function that returns a cool pallete of nColors ## require(colorspace) lut <- function(nColors=20) { return(hex(HSV(seq(0, 360, length=nColors)[-nColors], 1, 1))); } # Now use lut. plot( rnorm(100), col=lut(100)[1:100] ) # Now use just a range; use colors near purple; pretty # much like gettins subsections of rainbow.colors() plot( rnorm(30), col=lut(100)[71:100] ) ```

(4) Given an N-dimensional data set, (m instances in N dimensions), find the K-nearest neighbors to a given row/instance/point:
``` ## ## neighbors -- find and return the K closest neighbors to "home" ## neighbors <- function( dat, home, k=10 ) { theHood <- apply( dat, 1, function(x) sqrt(sum((x-home)**2))) return(order(theHood)[1:k] ) } # Use it. Create a random 10x10 matrix and find which rows # in D are closest (Euclidean-wise) to row 1. d <- matrix( rnorm(100), nrow=10, ncol=10) neighbors(d, d[1,], k=3)```

(5) A _VERY_ useful tip is to show the users the vast difference in speed between using for, apply, sapply, mapply and tapply. A for loop is typically very slow, where the ?apply family is great. You can use the apply vs for-loop in the neighbors function above with a timer on a large set to show the difference.

(6) Another useful tip, also in neighbors is generating difference vectors and their lengths:
``` # the difference vector between two vectors is very easy, c <- a -b # now the vector length (how far apart in Euclidean space these two points are) sqrt(sum(c**2))```

## mercoledì 3 dicembre 2008

### Retrieving the author of a script

I know that the best/recommended way to manage the authoring of R code consists in building a package containing a DESCRIPTION file.
Nevertheless, I wrote a very basic function retrieving the name of the authors of a script (or any text file) if these names are written within the first three rows of the file (easily changeable) with this format:

##
## Author:Pinco Palla, Paolino Paperino, Topo Gigio
##

The function:

```catch.the.name <- function(filename="myscript.R"){ require(gdata) str <- scan(filename, what='character', nlines=3, sep="\t", quiet=TRUE) author <- grep("Author:([^ ]+)", str, value=T) author <-sub('^.*Author:', "", author) author <-strsplit(author,",") author <- trim(author) return(author[[1]]) }```

## giovedì 23 ottobre 2008

### R Upgrade on Mac Os X 10.5.5 (Leopard)

To reinstall packages from an old version of R to a new one.
In the old version type:
```tmp <- installed.packages() installedpkgs <- as.vector(tmp[is.na(tmp[,"Priority"]), 1]) save(installedpkgs, file="installed_old.rda")```
```# To wipe the old R version rm -rf /Library/Frameworks/R.framework /Applications/R.app rm -rf /Library/Receipts/R-*```
Build from source new R version (see this FAQ).
From inside the decompressed R-?.?.? directory type:
```# See Section 2.2 of RMacOSX FAQ for the flag description ./configure --with-blas='-framework vecLib' --enable-BLAS-shlib make sudo make install```
Install BioConductor packages using the biocLite.R installation script.
In an R command window, type the following:
```source("http://bioconductor.org/biocLite.R") chooseBioCmirror() biocLite()```
If you have other Bioconductor packages missing from the old installation:

```load("installed_old.rda") tmp <- installed.packages() installedpkgs.new <- as.vector(tmp[is.na(tmp[,"Priority"]), 1]) missing <- setdiff(installedpkgs, installedpkgs.new) for (i in 1:length(missing)) biocLite(missing[i])```
Re-install the missing packages from CRAN:
```load("installed_old.rda") tmp <- installed.packages() installedpkgs.new <- as.vector(tmp[is.na(tmp[,"Priority"]), 1]) missing <- setdiff(installedpkgs, installedpkgs.new) install.packages(missing) update.packages()```
If you use some package created by Henrik Bengtsson:
```source("http://www.braju.com/R/hbLite.R") hbLite()```
If you find your X11 broken after the installation procedure (it happens every time to me, at least on Leopard) install the XQuartz App from here.

Update: If you need to install a recent version of R on old hardware (Power PC G4) and OS (Mac OS X 10.4 here) this post can be useful.

## lunedì 15 settembre 2008

### Fitting text under a plot

This is, REALLY, a basic tip, but, since I struggled for some time to fit long labels under a barplot I thought to share my solution for someone else's benefit.

As you can see (first image) the labels can not be displayed entirely:
`counts <- sample(c(1000:10000),10)labels <-list()for (i in 1:10) { labels[i] <- paste("very long label number ",i,sep="")}barplot( height=counts, names.arg=labels, horiz=F, las=2,col="lightblue", main="Before")`

The trick to fit text of whatever dimension is to use the parameter mar to control the margins of the plot.

from ?par:

'mar' A numerical vector of the form 'c(bottom, left, top, right)'
which gives the number of lines of margin to be specified on
the four sides of the plot. The default is 'c(5, 4, 4, 2) + 0.1'.

`op <- par(mar=c(11,4,4,2)) # the 10 allows the names.arg below the barplotbarplot( height=counts, names.arg=labels, horiz=F, las=2,col="skyblue", main="After")rm(op)`

## lunedì 21 luglio 2008

### Bioinformatics career survey 2008

Via Bioinformatics Zen:

## giovedì 10 luglio 2008

### Parsing problem solved thanks to R-Help mailing list

Recently I had the necessity to parse several HUGE text files (~6M lines ~ 600Mb file size) not formatted in a standard way (so not easy import via scan, read.table etc.).
Because of the size of these files I have to avoid loops and find a way to vectorize my problem.
After several hours spent trying to solve this problem without success, I decided to send an help request to the R-help list. In no time i got the answer to this very problematic (at least for me) exercise :-)

You can read the full story here.

I REALLY love the R-Help mailing list! Thanks Guys!

## domenica 1 giugno 2008

### See you in Bressanone in 2 weeks!

Computational and Statistical Aspects
of Microarray Analysis (CSAMA08)

Bressanone-Brixen - June 15th-20th 2008

## martedì 6 maggio 2008

### Replacing Tabs with spaces and back in VIM

Not specifically R-related.
If you need to remove all the blank spaces from a file and replace them with tabs for easy importing in R, you can do it in VIM in no time. In normal mode type:

`:set noexpandtab:%retab!`

The full story here

## mercoledì 16 aprile 2008

### R installing on Unix/Linux - no root access

Thanks and credit to Joern Toedling for this useful and clear how-to!

From The Bioconductor Digest, Vol 62, Issue 14:
versions of R and your favourite packages there. This is how to do it:
ftp://ftp.stat.math.ethz.ch/Software/R/
2. uncompress it to a directory you have write access to, say ~/local/R
3. change into the uncompressed directory, ~/local/R/R-devel
4. run "./configure"
5. run "make"
Afterwards you can start R by executing ~/local/R/R-devel/bin/R;
to simplify that either add the bin directory to your path or create an alias for R
You do not need to run "make install" to work with R.
For packages,
1. create a directory in which you want the packages installed, e.g. ~/local/rpacks
2. create an evironment variable R_LIBS that holds the path to that directory, e.g. "setenv R_LIBS=\${HOME}/local/rpacks" with that directory and a C-shell (use export with a Bash shell)
This environment variable tells R where to look first for installed packages and where to install packages when using "install.packages" or "biocLite".
R_LIBS is safe to use, since it only extends the path to look for packages and does not replace the default R library path.
I would recommend to add the alias for starting R and the R_LIBS
definition to your shell startup file (~/.cshrc or ~/.bashrc).

## mercoledì 27 febbraio 2008

### Classification: a quick and dirty example

`## Thanks to the UCI repository Magic Gamma telescope data setmagic04 = read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/magic/magic04.data", header = F, sep=",")# split the data set in test and training setsplit.data <- function(data, p = 0.7, s = 666){ set.seed(s) index <- sample(1:dim(data)[1]) train <- data[index[1:floor(dim(data)[1] * p)], ] test <- data[index[((ceiling(dim(data)[1] * p)) + 1):dim(data)[1]], ] return(list(train = train, test = test)) } dati = split.data(magic04, p = 0.7)train = dati\$traintest = dati\$test# SVM training just for funlibrary(e1071)model <- svm(train[,1:10],train[,11], probability = T)# prediction on the test setpred <- predict(model, test[,1:(dim(test)[[2]]-1)], probability = T)# Check the predictionstable(pred,test[,dim(test)[2]])pred.prob <- attr(pred, "probabilities")pred.to.roc <- pred.prob[, 1]# performance assessmentlibrary(ROCR)pred.rocr <- prediction(pred.to.roc, as.factor(test[,(dim(test)[[2]])]))perf.rocr <- performance(pred.rocr, measure = "auc", x.measure = "cutoff")cat("AUC =",deparse(as.numeric(perf.rocr@y.values)),"\n")perf.tpr.rocr <- performance(pred.rocr, "tpr","fpr")plot(perf.tpr.rocr, colorize=T) `

## giovedì 10 gennaio 2008

### Hello World for Clustering methods

A hello world program can be a useful sanity test to make sure that the procedure/methods you are analyzing "works" at least for very basic tasks. For this purpose, I create an artificial data set from 4 different 2-dimensional normal distributions to check how well the 4 clusters can be recognized by common clustering methods.

`set1 <- matrix(cbind(rnorm(100,0,2),rnorm(100,0,2)),100,2)set2 <- matrix(cbind(rnorm(100,0,2),rnorm(100,8,2)),100,2)set3 <- matrix(cbind(rnorm(100,8,2),rnorm(100,0,2)),100,2)set4 <- matrix(cbind(rnorm(100,8,2),rnorm(100,8,2)),100,2) dati <- list(values=rbind(set1,set2,set3,set4),classes=c(rep(1,100),rep(2,100),rep(3,100),rep(4,100))) # clustering - common methods op <- par(mfcol = c(2, 2)) par(las =1)plot(dati\$values, col = as.integer(dati\$classes), xlim=c(-6,14), ylim = c(-6,14), xlab="", ylab="", main = "True Groups") party <- kmeans(dati\$values,4)plot(dati\$values, col = party\$cluster, xlab = "", ylab = "", main = "kmeans")hc = hclust(dist(dati\$values), method = "ward")memb <- cutree(hc, k = 4)plot(dati\$values, col = memb, xlab = "", ylab = "", main = "hclust Euclidean ward") hc = hclust(dist(dati\$values), method = "complete") memb <- cutree(hc, k = 4)plot(dati\$values, col = memb, xlab = "", ylab = "", main = "hclust Euclidean complete") par(op)`