Saturday, 12 December 2009

A central hub for R bloggers

I suggest that my readers take a look at, and bookmark, a new blog named R-bloggers, which aims to be "a central hub of content collected from bloggers who write about R".
It seems a nice idea to me to have a centralized source of information for the R blogger community.
Good Luck, Tal!

Monday, 16 November 2009

R in Action - early thoughts

I was invited to review the book R in Action, written by Rob Kabacoff. Since I consider the Quick-R website, created by the same smart guy, one of the most valuable resources about R, it is both an honor and a pleasure to have the opportunity to take an early look at his book and to share some thoughts about it.

First, this book is distributed under an early access policy, which means, as stated on the publisher's web site: "This Early Access version of the book enables you to receive new chapters as they are being written. You can also interact with the authors to ask questions, provide feedback and errata, and help shape the final manuscript on the Author Online." This is a nice publishing approach: the publisher set up an ad hoc forum that allows real-time feedback from early adopters. This beta-test approach is convenient both for the author, who can fix errata and improve the contents before the final version is published, and for the early adopters, who can access useful content in advance and receive valuable explanations directly from the author.

Since only the initial part of the book is available, this short review is necessarily incomplete and presents only preliminary thoughts. I'm going to update the review as soon as I have the chance to read the rest of the book.

R in Action, as its structure reflects, aims to guide new adopters from the very basics of the language through to the most advanced features, using a progressive, task-driven approach carefully curated by the author.

In the initial part of the book, Kabacoff covers all the basic features of the language from data manipulation to the basic statistics required to make sense of the data plus the most common and useful graphical methods for visualizing them.

The author makes extensive use of worked examples. This is one of the most effective teaching techniques, in my opinion, because it encourages readers to apply the knowledge they acquire immediately.

Another nice ingredient of Kabacoff's method is the introduction of effective, high-quality packages from the huge R collection to solve a proposed task. For example, in chapter three the author introduces the rename function from the awesome reshape package to rename the columns of a data.frame. This is a trivial task that can easily be managed in standard R (as the author shows shortly afterward), but the smooth introduction of this useful package, explained and used more extensively in the forthcoming chapters, is a nice touch: it both handles the task in a more elegant way and introduces the user to a powerful tool.
In this fashion, the tasks presented in the text are addressed using several different packages in order to show the various alternative methods available in R.
Furthermore, the numerous notes accompanying the explanations both make the described concepts easier to understand and provide useful insights into R features and idiosyncrasies.

To sum up, the chapters I had the opportunity to examine are a solid base for people getting started with R. I'm impatient to dig through the forthcoming chapters of the book which deal with advanced statistics and graphics!

I warmly recommend this book even at this early stage: if you are new to R programming, this is a sound way to become familiar with the language and make effective use of it from day one.

Thursday, 29 October 2009

Bioconductor 2.5 is out

For all bioinformaticians and R users out there: the new release of the Bioconductor project for the analysis and comprehension of genomic data is out! A lot of interesting new stuff! See the full announcement here.

Monday, 26 October 2009

R 2.10.0 is Out!

The new R 2.10.0 is out! Get it from here.
If you like, take a look at these posts for some miscellaneous advice to make the upgrade easier.
Feel free to contribute with suggestions about how to upgrade your R installation.

Wednesday, 14 October 2009

The Elements of Statistical Learning

The Elements of Statistical Learning, written by Trevor Hastie, Robert Tibshirani and Jerome Friedman, is a MUST-READ for everyone involved in the data mining field! Now you can legally download a copy of the book in PDF format from the authors' website! Grab it here!

Friday, 4 September 2009

R Flashmob #2

As I said before, I consider the R-Help mailing list an invaluable source of information if you want to get things done in R. Recently the stackoverflow website, a site where programmers can post and answer questions about a wide range of programming languages, has been populated with a lot of questions and answers regarding R thanks to a 'virtual' flash mob. Because of this event, stackoverflow has become an extremely valuable web 2.0 resource for the R community.

Another R Flash Mob event is scheduled for Tuesday, 8th September. I warmly recommend that all my readers take part in this event, so as to populate the stackoverflow site with even more useful questions and answers about our beloved R.

You can find both the event details and a letter describing the event, which you may forward to your colleagues/R-fanboys, here.

Wednesday, 5 August 2009

Locate the position of CRAN mirror sites on a map using Google Maps

Inspired by this post (suggested here by the always useful Revolutions blog), I attempted to plot the position of CRAN mirrors on a map taking advantage of the nice R package RgoogleMaps (check the dependencies!). Below the code:

library(XML)         # xmlInternalTreeParse, getNodeSet, xmlValue
library(RgoogleMaps) # qbbox, GetMap.bbox, PlotOnStaticMap
# download.file("",destfile="cran.gml")
cran.gml <- xmlInternalTreeParse("cran.gml")
# Create a data.frame assembling all the information from the gml file
Name <- sapply(getNodeSet(cran.gml, "//ogr:Name"), xmlValue)
Country <- sapply(getNodeSet(cran.gml, "//ogr:Country"), xmlValue)
City <- sapply(getNodeSet(cran.gml, "//ogr:City"), xmlValue)
URL <- sapply(getNodeSet(cran.gml, "//ogr:URL"), xmlValue)
Host <- sapply(getNodeSet(cran.gml, "//ogr:Host"), xmlValue)
Maintainer <- sapply(getNodeSet(cran.gml, "//ogr:Maintainer"), xmlValue)
CountryCode <- sapply(getNodeSet(cran.gml, "//ogr:countryCode"), xmlValue)
lng <- as.numeric(sapply(getNodeSet(cran.gml, "//ogr:lng"), xmlValue))
lat <- as.numeric(sapply(getNodeSet(cran.gml, "//ogr:lat"), xmlValue))
cran.mirrors <- data.frame(Name, Country, City, URL, Host, Maintainer, CountryCode, lng, lat)
# cran.mirrors <- cbind(getCRANmirrors(), lng, lat) ## alternatively
# Define the markers:
cran.markers <- data.frame(lat=cran.mirrors$lat, lon=cran.mirrors$lng,
size=rep('tiny', length(cran.mirrors$lat)), col=colors()[1:length(cran.mirrors$lat)],
char=rep('',length(cran.mirrors$lat)) )
# Get the bounding box:
bb <- qbbox(lat = cran.markers[,"lat"], lon = cran.markers[,"lon"])
num.mirrors <- 1:dim(cran.markers)[1] ## to visualize only a subset of the cran.mirrors
maptype <- c("roadmap", "mobile", "satellite", "terrain", "hybrid", "mapmaker-roadmap", "mapmaker-hybrid")[1]
# Download the map (either jpg or png):
MyMap <- GetMap.bbox(bb$lonR, bb$latR, destfile = paste("Map_", maptype, ".png", sep=""), GRAYSCALE=FALSE, maptype = maptype)
# Plot (one colour per country; factor() makes this work whether Country is a factor or a character vector):
png(paste("CRANMirrorsMap_", maptype,".png", sep=""), 640, 640)
tmp <- PlotOnStaticMap(MyMap,lat = cran.markers[num.mirrors,"lat"], lon = cran.markers[num.mirrors,"lon"],
cex=1, pch="R",col=as.numeric(factor(cran.mirrors$Country)), add=FALSE)
dev.off()

## Hosts from Italy
maptype <- c("roadmap", "mobile", "satellite", "terrain", "hybrid", "mapmaker-roadmap", "mapmaker-hybrid")[4]
# Rows of the Italian mirrors (the names num.it and bb.it are reconstructed: they were lost from this listing)
num.it <- row.names(cran.mirrors[cran.mirrors$CountryCode=="IT",])
# Get the bounding box:
bb.it <- qbbox(lat = cran.markers[num.it,"lat"], lon = cran.markers[num.it,"lon"])
# Download the map (either jpg or png):
ITMap <- GetMap.bbox(bb.it$lonR, bb.it$latR, destfile = paste("ITMap_", maptype, ".png", sep=""), GRAYSCALE=FALSE, maptype = maptype)
#ITMap <- GetMap.bbox(bb.it$lonR, bb.it$latR, destfile = paste("ITMap_", maptype, ".jpg", sep=""), GRAYSCALE=FALSE, maptype = maptype)
# Plot:
png(paste("CRANMirrorsMapIT_", maptype,".png", sep=""), 640, 640)
tmp <- PlotOnStaticMap(ITMap,lat = cran.markers[num.it,"lat"], lon = cran.markers[num.it,"lon"],
cex=2, pch="R",col="dodgerblue", add=FALSE)
# tmp <- PlotOnStaticMap(ITMap,lat = cran.markers[num.it,"lat"], lon = cran.markers[num.it,"lon"],labels=as.character(cran.mirrors[cran.mirrors$CountryCode=="IT",]$Host),col="black", FUN=text, add=TRUE)
dev.off()

CAVEAT: To reproduce the example you need the gml file, which you can download from here, a Google account and a Google Maps API key. Here you can sign up for a free API key.

Sunday, 26 July 2009

Rosetta Code

Today I'd like to suggest the interesting Rosetta Code site:

Rosetta Code is a programming chrestomathy site. The idea is to present solutions to the same task in as many different languages as possible, to demonstrate how languages are similar and different, and to aid a person with a grounding in one approach to a problem in learning another.

Since the R coverage of the different tasks is still largely incomplete, I encourage everyone to populate the missing tasks with appropriate R code.

Friday, 3 July 2009

File Management in R: two recipes

Remove files with a specific pattern in R:

A quick, basic tip which can come in handy when you need to rapidly remove files from a directory:

junk <- dir(path="your_path", pattern="your_pattern", full.names=TRUE) # ?dir; full.names=TRUE keeps the path, so the removal works from any working directory
file.remove(junk) # ?file.remove
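
As a self-contained sketch of the recipe above, the following creates a few throwaway files in a temporary directory and removes only those matching a pattern. The directory name, the file names and the pattern are made up for illustration; list.files() is the same function as dir().

```r
# Work in a scratch directory so that nothing important can be deleted
scratch <- file.path(tempdir(), "remove_demo")
dir.create(scratch, showWarnings = FALSE)
file.create(file.path(scratch, c("a.tmp", "b.tmp", "keep.txt")))

# full.names = TRUE returns paths that are valid from any working directory
junk <- list.files(path = scratch, pattern = "\\.tmp$", full.names = TRUE)
removed <- file.remove(junk)   # logical vector, one entry per file

remaining <- list.files(scratch)
```

Checking the returned logical vector (here stored in removed) is a cheap way to spot files that could not be deleted.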

Compress multiple files/folders in separate zip files:

This tip came in handy when I had to compress into separate .cbz files (zip files with another extension) a vast collection of folders containing scans of different issues of a comic book series (to create .cbz files instead of zip files, just replace .zip with .cbz in the following code).

l <- dir() # assumption: 'l' holds the files/folders to compress; its definition is missing from this listing
for (i in seq_along(l)) zip(paste(l[i],".zip",sep=""),files=l[i]) # ?zip

Clearly, for advanced needs, you can use system() and all the Unix tools installed on your machine.

Note: This post was updated on 1/5/2012

Friday, 26 June 2009

Set the significant digits for each column in an xtable for fancy Sweave output

This tip may be useful when you need to set the number of digits to print for the different columns of a matrix/data.frame that will be output as a LaTeX table.
For example:

library(xtable)
tmp <- matrix(rnorm(9), 3, 3)
xtmp <- xtable(tmp)
digits(xtmp) <- c(0,0,3,4) # one value for the row names, then one per column
print(xtmp, include.rownames = FALSE) # row names suppressed

See here for a nice gallery depicting a large variety of outputs you can produce using the xtable package.

Friday, 19 June 2009

Iran Election analyzed with R

Here you can find a very interesting post showing the strengths of R in 'real-time statistics'.
I'd like to use the occasion to thank David Smith for hosting the best, imho, blog on R!
Follow him on Twitter: @revodavid.

Monday, 15 June 2009

Replacing 0 with NA - an evergreen from the list

This thread from the R-help list describes an evergreen tip that, sooner or later, proves useful in everyone's R practice.
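
As a minimal sketch of the idea in the title (not necessarily the exact solution given on the list), logical indexing replaces every zero with NA in a vector or in a data.frame. The object names are invented for the example.

```r
x <- c(3, 0, 5, 0, 1)
x[x == 0] <- NA            # zeros in a vector become NA

df <- data.frame(a = c(0, 2, 3), b = c(4, 0, 6))
df[df == 0] <- NA          # same idea, applied to every column at once

# Alternative: the is.na<- replacement function marks the selected positions as missing
y <- c(0, 7, 0)
is.na(y) <- y == 0
```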

Saturday, 6 June 2009

Two plots with a common legend - base graphics

If you need to share a common legend between two graphs using the ggplot2 package/paradigm take a look at this post from the Learning R blog.
The code below solves the same task using the R base graphics.

png( "2plot1legend.png", width = 480, height = 680)
par(mfrow = c(2, 1), oma = c(0, 0, 0, 2))
plot(hp~mpg, data=mtcars, col=cyl,pch=19)
plot(disp~wt, data=mtcars, col=cyl,pch=19)
#legend(locator(1), legend=as.numeric(levels(factor(mtcars$cyl))), pch=19, col= as.numeric(levels(factor(mtcars$cyl))) )
# The legend sits outside the plot region of the second plot, so clipping is switched off with xpd=NA
legend(x=5.6, y=690, legend=as.numeric(levels(factor(mtcars$cyl))), pch=19, col= as.numeric(levels(factor(mtcars$cyl))), xpd=NA )
dev.off()

Sunday, 31 May 2009

Nice Interview

Here you can read a nice interview with David Smith, REvolution Computing’s Director of Community, statistician and bloggeR.

Friday, 8 May 2009

Searching through mailing list archives

Romain Francois posted on his blog a very useful function to search through the R mailing list archives. Take a look at it here. Also take a look at the tm package, introduced in R News, 8(2):19–22, Oct. 2008, with an example dealing with the analysis of the R-help mailing list.

Wednesday, 29 April 2009

screen in Ubuntu 9.04

As I said in a previous post, I consider the Unix tool screen essential for my work. It seems that the just-released Ubuntu 9.04 contains a package that makes it easier to configure and use. You can find the full story here.

Tuesday, 28 April 2009

Tips from the R-help list: shadow text in a plot and bumps charts

Browsing the R-help mailing list I found, as often happens, two threads in the spirit of this blog (of course, since they come from the list, the quality is higher): here you can find a function that draws text in a plot with a shadow outline style. From here you can follow an interesting thread showing how to produce bumps charts in R.

Friday, 24 April 2009

Colors in the R terminal

Today, I'd like to suggest a new R package that you can download from here.
Still in its early development, the xterm256 package lets you print text in the R terminal using different colours. You can find more information here.
The picture below depicts a basic example of its use.

Tuesday, 31 March 2009

Multiple plots in a single image using ImageMagick

Sometimes you need to arrange several plots/images, either by row or by column, on a single page/sheet.
If you generate all your plots with R base graphics, you can easily accomplish the task using the par() function, e.g., using par(mfrow=c(2,2)) and then drawing 4 plots of your choice.
However, if you need to create a single image built up from different sources, e.g. external images, plots not compatible with R base graphics, etc., you can create/retrieve the individual images and then merge them together using the tools from the ImageMagick suite, available on Unix systems (Linux, Mac OS X, etc.).

## Example
# we generate some random plots
## the first plot is taken from the seqLogo help ( ?seqLogo )
## I selected this example on purpose because the seqLogo function is based on the grid graphics
## and is coded in such a way that doesn't allow the use of the par() function
library(seqLogo)
mFile <- system.file("Exfiles/pwm1", package="seqLogo")
m <- read.table(mFile)
pwm <- makePWM(m)
png("seqLogo1.png", width=400, height=400)
seqLogo(pwm)
dev.off()
## totally unrelated
png("plot1.png", width=400, height=400)
plot(rnorm(100)) # placeholder: the original plotting call is missing from this listing
dev.off()

Then you can type:
system("convert \\( seqLogo1.png plot1.png +append \\) \\( seqLogo1.png plot1.png +append \\) -background none -append final.png")

Remember that inside an R string each backslash has to be escaped with another '\'!

Or, alternatively, from the command line:
convert \( seqLogo1.png plot1.png +append \) \( seqLogo1.png plot1.png +append \) -background none -append final.png

See man convert and man ImageMagick for the full story.

Wednesday, 25 March 2009

Alternative implementations using ggplot2

Here and here, you can find alternative implementations of two plots (1, 2) I created some time ago using R base graphics. The author recreates the plots taking advantage of the excellent ggplot2 package.

Thursday, 12 March 2009

no "Infinities"

Thanks to Pierre-Yves for the useful tip below!

If you have a dataset from which you want the max or min, but they have to be real numbers and not "Inf" or "-Inf", there is a way to do it:

data <- c(-Inf, 1,2,3,4,5,6,7,8,9,10, Inf)
max(data)
# Return Inf
min(data)
# Return -Inf
# To solve the problem I went to:
range(data, finite=TRUE)
# Then you can do
myMinimum <- range(data, finite=TRUE)[1]
myMaximum <- range(data, finite=TRUE)[2]
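
An equivalent sketch filters the infinite values explicitly with is.finite(), which also drops NA and NaN. This is an alternative to the tip above, not part of Pierre-Yves's message; finite.data is a name made up for the example.

```r
data <- c(-Inf, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, Inf)
finite.data <- data[is.finite(data)]   # keep only ordinary numbers
myMinimum <- min(finite.data)
myMaximum <- max(finite.data)
```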

Sunday, 8 March 2009

Dealing with missing values

Two new quick tips from 'almost regular' contributor Jason:

Handling missing values in R can be tricky. Let's say you have a table
with missing values you'd like to read from disk. Reading in the table

read.table( fileName )

might fail. If your table is properly formatted, then R can determine
what's a missing value by using the "sep" option in read.table:

read.table( fileName, sep="\t" )

This tells R that all my columns will be separated by TABS regardless of
whether there's data there or not. So, make sure that your file on disk
really is fully TAB separated: if there is a missing data point you must
have a TAB to tell R that this datum is missing and to move to the next
field for processing.

Lastly, don't forget the "header=T" option if you have a header line in
your file.
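
A small sketch of the first tip, under the assumption of a made-up three-column file: it writes a TAB-separated file in which one field is left empty and reads it back. The file contents and column names are invented for the example.

```r
# Write a tiny TAB-separated file in which one field is left empty
tmp.file <- tempfile(fileext = ".txt")
writeLines(c("id\tweight\theight",
             "1\t70\t1.80",
             "2\t\t1.65",     # weight is missing, but the TAB is still there
             "3\t82\t1.75"),
           tmp.file)

tab <- read.table(tmp.file, sep = "\t", header = TRUE)
tab$weight      # the empty field is read as NA
```

With the default whitespace separator the second data line would appear to have only two fields, which is why the explicit sep matters here.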

Here's the 2nd tip:

Some algorithms in R don't support missing (NA) values. If you have a
data.frame with missing values and quickly want the ROWS with any
missing data to be removed then try:

myData[rowSums(is.na(myData)) == 0, ]

To find NA values in your data you have to use the "is.na" function.
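
For comparison, base R also provides complete.cases() and na.omit() for the same job. The sketch below uses a made-up data.frame and keeps only the rows without missing values in three equivalent ways.

```r
myData <- data.frame(x = c(1, NA, 3, 4), y = c("a", "b", NA, "d"))

keep1 <- myData[rowSums(is.na(myData)) == 0, ]  # count the NAs in each row, keep rows with none
keep2 <- myData[complete.cases(myData), ]       # same rows, via complete.cases()
keep3 <- na.omit(myData)                        # also records which rows were dropped
```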

Friday, 23 January 2009

Interesting tip about multicolor title of a plot

I suggest taking a look at this interesting post about creating a title with multi-coloured words.

Wednesday, 21 January 2009

Radar chart

I thank David for the following example of a radar chart:

corelations <- c(1:97)
corelation.names <- names(corelations) <- c("Alp12Mn",
"AvrROE", "DivToP", "GrowAPS", "GrowAsst", "GrowBPS", "GrowCFPS",
"GrowDPS", "GrowEPS", "GrowSPS", "HistAlp", "HistSigm", "InvVsSal",
"LevGrow", "Payout5", "PredSigm", "RecVsSal", "Ret12Mn", "Ret3Mn",
"Ret1Mn", "ROE", "_CshPlow", "_DDM", "_EarnMom", "_EstChgs",
"_EstRvMd", "_Neglect", "_NrmEToP", "_PredEToP", "_RelStMd", "_ResRev",
"_SectMom", "AssetToP", "ARM_Pref_Earnings", "AvrCFtoP", "AvrDtoP",
"AvrEtoP", "ARM_Sec_Earnings", "BondSens", "BookToP", "Capt",
"CaptAdj", "CashToP", "CshFlToP", "CurrSen", "DivCuts5", "EarnToP",
"Earnvar", "Earnyld", "Growth", "HistBeta", "IndConc", "Leveflag",
"Leverag", "Leverage", "Lncap", "Momentum", "Payoflag", "PredBeta",
"Ret_11M_Momentum", "PotDilu", "Price", "ProjEgro", "RecEPSGr",
"SalesToP", "Size", "SizeNonl", "Tradactv", "TradVol", "Value",
"VarDPS", "Volatility", "Yield", "CFROI", "ADJUST", "ERC", "RC", "SPX",
"R1000", "MarketCap", "TotalRisk", "Value_AX", "truncate_ret_1mo",
"truncate_PredSigma", "Residual_Returns", "ARM_Revenue",
"ARM_Rec_Comp", "ARM_Revisions_Comp", "ARM_Global_Rank", "ARM_Score",
"TEMP", "EQ_Raw", "EQ_Region_Rank", "EQ_Acc_Comp", "EQ_CF_Comp",
"EQ_Oper_Eff_Comp", "EQ_Exc_Comp")
corelations <- c(0.223, 0.1884, -0.131, 0.1287, 0.0307,
0.2003, 0.2280, 0.1599, 0.2680, 0.2596, 0.3399, 0.0324, 0.0382, -0.173,
-0.177, -0.056, -0.063, 0.2211, 0.0674, -0.023, 0.2641, 0.2369, 0.1652,
-0.023, 0.1070, 0.0791, -0.023, 0.0434, -0.002, -0.001, -0.000, -0.108,
-0.288, 0.1504, -0.127, -0.142, 0.0852, 0, -0.031, -0.320, 0.0785,
0.0465, -0.166, 0.1416, 0.0945, -0.063, 0.1461, -0.305, 0.1215, 0.0776,
0.0449, 0.0823, -0.018, -0.261, -0.318, 0.1194, 0.3151, -0.124, 0.1037,
0.2240, -0.115, 0.1543, 0, 0.1775, -0.153, 0.1194, 0.1407, 0.1047,
0.0926, -0.403, 0.0067, -0.048, -0.136, 0.1068, 0.0381, 0.1878, -0.035,
0.0761, 0.0784, 0, 0, 0, -0.018, 0.1602, 0.0543, 0, -0.013, 0.1439, 0,
0, -0.054, 0.7426, 0.7510, 0.1657, 0.1657, 0.4949, 1.0000)
library(plotrix) # provides radial.plot
radial.plot(corelations, labels=corelation.names,rp.type="p",main="Correlation Radar", radial.lim=c(-1,1),line.col="blue")

Monday, 19 January 2009

Map coordinates to actual pixel locations on a PNG device

Jason emailed me a new tip. Enjoy it!

Use grconvertX and grconvertY to map the X,Y coordinates for an entity on a graphics device to user coordinates. For example if you plotted points to an image and wanted to map those X,Y coordinates to the actual pixel locations on the PNG you would use this family of functions.

# Sample R Code for grconvertX and grconvertY

# make fake data
tDat <- cbind(rnorm(10), rnorm(10));

# Example #1 -- plot them to an X11 window
x11();
plot(tDat);
print(paste(grconvertX(tDat[, 1], "user", "device"), grconvertY(tDat[, 2], "user", "device")));

# turn off the x11 device
dev.off();

# Example 2-- Get the pixel coordinates of the data on a PNG image
# plot to a PNG
png(file="RTip_coordinates.png", height=1000, width=1000);
plot(tDat);
print( paste(grconvertX(tDat[, 1], "user", "device"), grconvertY(tDat[, 2], "user", "device")));
dev.off();

# Now, go into GIMP or Photoshop. Each data point should be at the
# X,Y coordinate listed.

Thursday, 8 January 2009

Monday, 5 January 2009

Statistical Visualizations - Part 2

Two more plots inspired by this post.

Europe Asia Americas Africa Oceania
1820-30 106487 36 11951 17 33333
1831-40 495681 53 33424 54 69911
1841-50 1597442 141 62469 55 53144
1851-60 2452577 41538 74720 210 29169
1861-70 2065141 64759 166607 312 18005
1871-80 2271925 124160 404044 358 11704
1881-90 4735484 69942 426967 857 13363
1891-00 3555352 74862 38972 350 18028
1901-10 8056040 323543 361888 7368 46547
1911-20 4321887 247236 1143671 8443 14574
1921-30 2463194 112059 1516716 6286 8954
1931-40 347566 16595 160037 1750 2483
1941-50 621147 37028 354804 7367 14693
1951-60 1325727 153249 996944 14092 25467
1961-70 1123492 427642 1716374 28954 25215
1971-80 800368 1588178 1982735 80779 41254
1981-90 761550 2738157 3615225 176893 46237
1991-00 1359737 2795672 4486806 354939 98263
2001-06 1073726 2265696 3037122 446792 185986

# 'original' holds the table above; the file name is an assumption (save the table as plain text first)
original <- read.table("immigration.txt", header = TRUE)
png("immigration_barplot_me.png", width = 1419, height = 736)
library(RColorBrewer) # take a look at
# display.brewer.all()
FD.palette <- c("#984EA3","#377EB8","#4DAF4A","#FF7F00","#E41A1C")
par(mar=c(6, 6, 3, 3), las=2)
data4bp <- t(original[,c(5,4,2,3,1)])
barplot( data4bp, beside=FALSE,col=FD.palette, border=FD.palette, space=1, legend=FALSE, ylab="Number of People", main="Migration to the United States by Source Region (1820 - 2006)", mgp=c(4.5,1,0) )
legend( "topleft", legend=rev(rownames(data4bp)), fill=rev(FD.palette) )
dev.off()

I find this 'bubbleplot' visualization quite interesting; unfortunately the R code I was able to produce is quite poor and unsatisfactory. Any improvement or suggestion is more than welcome!
Anyway, this is the code:

png("immigration_bubbleplot_me.png", width=1400, height=400)
par(mar=c(3, 6, 3, 2), col="grey85")
mag = 0.9
original.vec <- as.matrix(original)
dim(original.vec) <- NULL
symbols( rep(1:nrow(original),ncol(original)), rep(5:1, each=nrow(original)), circles = original.vec, inches=mag, ylim=c(1,6),fg="grey85", bg="grey20", ylab="", xlab="", xlim =range(1:nrow(original)), xaxt="n", yaxt="n", main="Immigration to the USA - 1821 to 2006", panel.first = grid())
axis(1, 1:nrow(original), labels=rownames(original), las=1, col="grey85")
axis(2, 1:ncol(original), labels=rev(colnames(original)), las=1, col="grey85")
dev.off()

You can find the first part of this 'series', with code contributed by Yihui (thanks again!), here.