One R Tip A Day: contributed


Sunday, 8 March 2009

Dealing with missing values

Two new quick tips from 'almost regular' contributor Jason:

Handling missing values in R can be tricky. Let's say you have a table
with missing values that you'd like to read from disk. Reading in the
table with

read.table( fileName )

might fail: by default read.table splits fields on any run of
whitespace, so an empty field vanishes and the row comes up short. If
your table is properly formatted, R can work out which values are
missing once you set the "sep" option in read.table:

read.table( fileName, sep="\t" )

This tells R that every column is separated by a single TAB, whether or
not there is data in it. So make sure that your file on disk really is
fully TAB separated: for each missing data point there must still be a
TAB to tell R that this datum is missing and to move on to the next
field.

Lastly, don't forget the "header=TRUE" option if you have a header line
in your file.
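
To make the difference concrete, here is a minimal sketch using a small temporary file. The file contents and the names tmp, dat and res are made up for illustration; only read.table itself comes from the tip above.

```r
# Write a small TAB-separated file with one empty field (made-up data).
tmp <- tempfile(fileext = ".tsv")
writeLines(c("id\tscore\tgroup",
             "1\t3.5\ta",
             "2\t\tb",      # the score is missing, but its TAB is still there
             "3\t4.1\tc"), tmp)

# With the default separator the empty field is swallowed and the read fails.
res <- try(read.table(tmp, header = TRUE), silent = TRUE)
failedWithDefaultSep <- inherits(res, "try-error")

# With sep = "\t" the empty numeric field is read as NA.
dat <- read.table(tmp, sep = "\t", header = TRUE)
print(dat)
print(is.na(dat$score))
```

Note that an empty field in a character column is read as an empty string rather than NA unless you also adjust the "na.strings" option.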

Here's the 2nd tip:

Some algorithms in R don't support missing (NA) values. If you have a
data.frame with missing values and quickly want to remove every ROW
that contains any missing data, try:

myData[rowSums(is.na(myData))==0, ]

To find NA values in your data you have to use the "is.na" function; a
comparison such as x == NA returns NA rather than TRUE or FALSE.
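
A short sketch of the row filter on a toy data.frame follows. The data are invented for illustration, and the comparison with complete.cases (a base R function that is not part of Jason's tip) is only there as a cross-check.

```r
# Toy data.frame with missing values in two different rows (made-up data).
myData <- data.frame(a = c(1, NA, 3, 4),
                     b = c("x", "y", NA, "z"))

# is.na() gives a logical matrix; rowSums() counts the NAs in each row.
keep    <- rowSums(is.na(myData)) == 0
cleaned <- myData[keep, ]
print(cleaned)

# Cross-check: complete.cases() flags the same rows.
sameRows <- all(keep == complete.cases(myData))

# Why is.na() is needed: comparing with == never yields TRUE for NA.
print(NA == NA)      # NA
print(is.na(NA))     # TRUE
```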

Wednesday, 21 January 2009

Radar chart

Thanks to David for the following example of a radar chart:

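# placeholder of length 97; it is overwritten with the real values below,
# while corelation.names keeps the labels
corelations <- c(1:97)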

corelation.names <- names(corelations) <- c("Alp12Mn",

"AvrROE", "DivToP", "GrowAPS", "GrowAsst", "GrowBPS", "GrowCFPS",

"GrowDPS", "GrowEPS", "GrowSPS", "HistAlp", "HistSigm", "InvVsSal",

"LevGrow", "Payout5", "PredSigm", "RecVsSal", "Ret12Mn", "Ret3Mn",

"Ret1Mn", "ROE", "_CshPlow", "_DDM", "_EarnMom", "_EstChgs",

"_EstRvMd", "_Neglect", "_NrmEToP", "_PredEToP", "_RelStMd", "_ResRev",

"_SectMom", "AssetToP", "ARM_Pref_Earnings", "AvrCFtoP", "AvrDtoP",

"AvrEtoP", "ARM_Sec_Earnings", "BondSens", "BookToP", "Capt",

"CaptAdj", "CashToP", "CshFlToP", "CurrSen", "DivCuts5", "EarnToP",

"Earnvar", "Earnyld", "Growth", "HistBeta", "IndConc", "Leveflag",

"Leverag", "Leverage", "Lncap", "Momentum", "Payoflag", "PredBeta",

"Ret_11M_Momentum", "PotDilu", "Price", "ProjEgro", "RecEPSGr",

"SalesToP", "Size", "SizeNonl", "Tradactv", "TradVol", "Value",

"VarDPS", "Volatility", "Yield", "CFROI", "ADJUST", "ERC", "RC", "SPX",

"R1000", "MarketCap", "TotalRisk", "Value_AX", "truncate_ret_1mo",

"truncate_PredSigma", "Residual_Returns", "ARM_Revenue",

"ARM_Rec_Comp", "ARM_Revisions_Comp", "ARM_Global_Rank", "ARM_Score",

"TEMP", "EQ_Raw", "EQ_Region_Rank", "EQ_Acc_Comp", "EQ_CF_Comp",

"EQ_Oper_Eff_Comp", "EQ_Exc_Comp")

corelations <- c(0.223, 0.1884, -0.131, 0.1287, 0.0307,

0.2003, 0.2280, 0.1599, 0.2680, 0.2596, 0.3399, 0.0324, 0.0382, -0.173,

-0.177, -0.056, -0.063, 0.2211, 0.0674, -0.023, 0.2641, 0.2369, 0.1652,

-0.023, 0.1070, 0.0791, -0.023, 0.0434, -0.002, -0.001, -0.000, -0.108,

-0.288, 0.1504, -0.127, -0.142, 0.0852, 0, -0.031, -0.320, 0.0785,

0.0465, -0.166, 0.1416, 0.0945, -0.063, 0.1461, -0.305, 0.1215, 0.0776,

0.0449, 0.0823, -0.018, -0.261, -0.318, 0.1194, 0.3151, -0.124, 0.1037,

0.2240, -0.115, 0.1543, 0, 0.1775, -0.153, 0.1194, 0.1407, 0.1047,

0.0926, -0.403, 0.0067, -0.048, -0.136, 0.1068, 0.0381, 0.1878, -0.035,

0.0761, 0.0784, 0, 0, 0, -0.018, 0.1602, 0.0543, 0, -0.013, 0.1439, 0,

0, -0.054, 0.7426, 0.7510, 0.1657, 0.1657, 0.4949, 1.0000)

library(plotrix)

# shrink the text so that all 97 labels fit around the circle
par(ps=6)

radial.plot(corelations, labels=corelation.names, rp.type="p", main="Correlation Radar", radial.lim=c(-1,1), line.col="blue")

Monday, 19 January 2009

Map coordinates to actual pixel locations on a PNG device

Jason emailed me a new tip. Enjoy it!

Use grconvertX and grconvertY to convert X,Y coordinates between the coordinate systems of a graphics device, for example from user (data) coordinates to device coordinates. If you plotted points to an image and want the actual pixel locations of those points on the PNG, this is the family of functions to use.

#
# Sample R code for grconvertX and grconvertY
#

# make fake data
tDat <- cbind(rnorm(10), rnorm(10))

#
# Example 1 -- plot the data in an X11 window
#
x11()   # X11 only; dev.new() opens the default device on other platforms
plot(tDat)
print(paste(grconvertX(tDat[, 1], "user", "device"), grconvertY(tDat[, 2], "user", "device")))

# to close the x11 device, uncomment:
# dev.off()


#
# Example 2 -- get the pixel coordinates of the data on a PNG image
#
# plot to a PNG
png(filename="RTip_coordinates.png", height=1000, width=1000)
plot(tDat)
print(paste(grconvertX(tDat[, 1], "user", "device"), grconvertY(tDat[, 2], "user", "device")))
dev.off()

# Now open the PNG in GIMP or Photoshop. Each data point should sit at the
# X,Y pixel coordinate listed.
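
As a quick sanity check on the conversion, the sketch below writes to a temporary PNG (the file location and the names px, py, backX and backY are illustrative assumptions), confirms that the device coordinates fall inside the 1000 x 1000 image, and converts them back to user coordinates.

```r
set.seed(42)
tDat <- cbind(rnorm(10), rnorm(10))

pngFile <- tempfile(fileext = ".png")   # illustrative output location
png(filename = pngFile, width = 1000, height = 1000)
plot(tDat)

# user -> device: pixel positions of the plotted points
px <- grconvertX(tDat[, 1], from = "user", to = "device")
py <- grconvertY(tDat[, 2], from = "user", to = "device")

# device -> user: the round trip should recover the original data
backX <- grconvertX(px, from = "device", to = "user")
backY <- grconvertY(py, from = "device", to = "user")

# the conversions must happen while the device is still open
invisible(dev.off())

print(data.frame(x = round(px), y = round(py)))
```

The conversion depends on the currently active device and its plot region, so the calls have to come after plot() and before dev.off().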

Thursday, 11 December 2008

Tips from Jason

I want to thank Jason Vertrees for the following collection of useful tips!

(1) Use ~/.Rprofile for repeated environment initialization
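
As a minimal sketch of what such a file might contain (the specific settings are illustrative assumptions, not Jason's own profile):

```r
# Example contents for ~/.Rprofile (illustrative settings only)

# print numbers with fewer significant digits
options(digits = 4)

# choose a default CRAN mirror so install.packages() does not ask
options(repos = c(CRAN = "https://cloud.r-project.org"))

# .First, if defined here, is called at the start of every session
.First <- function() {
  message("Session started: ", date())
}
```

R sources this file at startup, so anything placed in it is available in every new session.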

(2) Ever had a large data frame displayed across only 40% of your terminal window? You can resize the R display to fit your terminal. Use the following "wideScreen" function, which reads the COLUMNS environment variable:

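# define wideScreen
wideScreen <- function() {
  # COLUMNS must be exported by the shell; if it is not set,
  # keep the current width instead of failing
  cols <- suppressWarnings(as.integer(Sys.getenv("COLUMNS")))
  if (!is.na(cols)) options(width=cols)
}

#
# Test wideScreen
#
a <- rnorm(100)
a
wideScreen()
# notice how the data fill the screen
a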

(3) Get familiar with the colorspace package. For example, if you need to color data points across a range, you can easily do:



##
## lut.R -- small function that returns a cool palette of nColors
##
library(colorspace)

lut <- function(nColors=20) {
  # sample nColors + 1 hues on [0, 360] and drop the last one,
  # because hue 360 is the same red as hue 0
  hues <- seq(0, 360, length.out=nColors + 1)[-(nColors + 1)]
  return(hex(HSV(hues, 1, 1)))
}

# Now use lut.
plot( rnorm(100), col=lut(100) )

# Now use just a range; use colors near purple; pretty
# much like getting subsections of rainbow()
plot( rnorm(30), col=lut(100)[71:100] )
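
If colorspace is not installed, a similar palette can be sketched with hsv() from base R's grDevices package. The name lutBase is made up here, and this is an alternative for illustration, not part of Jason's tip. Note that hsv() takes the hue on a 0 to 1 scale rather than 0 to 360.

```r
# Base-R variant of a hue palette; returns exactly nColors hex colors.
lutBase <- function(nColors = 20) {
  # nColors + 1 hues on [0, 1]; drop the last one, which repeats the first
  hues <- seq(0, 1, length.out = nColors + 1)[-(nColors + 1)]
  hsv(h = hues, s = 1, v = 1)
}

pal <- lutBase(100)
print(length(pal))
print(pal[1])          # full-saturation red

# use a sub-range of the palette, as in the lut example
someColors <- pal[71:100]
```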

(4) Given a data set of m instances in N dimensions, find the K nearest neighbors of a given row/instance/point:



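##
## neighbors -- find and return the K closest neighbors to "home"
##
neighbors <- function( dat, home, k=10 ) {
  # Euclidean distance from every row of dat to home
  theHood <- apply( dat, 1, function(x) sqrt(sum((x-home)^2)))
  # row indices of the k smallest distances (fewer if dat has fewer rows)
  return(head(order(theHood), k))
}

# Use it.  Create a random 10x10 matrix and find which rows
# in d are closest (Euclidean-wise) to row 1.  Row 1 itself comes
# first, because its distance to itself is 0.
d <- matrix( rnorm(100), nrow=10, ncol=10)
neighbors(d, d[1,], k=3)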

(5) A _VERY_ useful tip is to compare the speed of for, apply, sapply, mapply and tapply on your own data. A for loop that grows its result one element at a time is typically very slow; the ?apply family is more compact and avoids that trap, and fully vectorized code is usually faster still. You can time the apply version of the neighbors function above against a for-loop version on a large data set to see the difference.
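
One way to run such a comparison is sketched below. The function names distLoop, distApply and distVec and the data size are assumptions for illustration; timings depend on your machine and R version, so none are quoted here.

```r
set.seed(1)
dat  <- matrix(rnorm(2000 * 10), nrow = 2000, ncol = 10)
home <- dat[1, ]

# for-loop version with a preallocated result vector
distLoop <- function(dat, home) {
  out <- numeric(nrow(dat))
  for (i in seq_len(nrow(dat))) {
    out[i] <- sqrt(sum((dat[i, ] - home)^2))
  }
  out
}

# apply version, the same computation as inside neighbors()
distApply <- function(dat, home) {
  apply(dat, 1, function(x) sqrt(sum((x - home)^2)))
}

# vectorized version: subtract home from every row in one step
distVec <- function(dat, home) {
  sqrt(rowSums(sweep(dat, 2, home)^2))
}

print(system.time(dLoop  <- distLoop(dat, home)))
print(system.time(dApply <- distApply(dat, home)))
print(system.time(dVec   <- distVec(dat, home)))

# all three versions should agree on the distances
sameResult <- isTRUE(all.equal(dLoop, dApply)) && isTRUE(all.equal(dLoop, dVec))
print(sameResult)
```

Checking that the variants return the same distances before comparing their timings keeps the benchmark honest.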

(6) Another useful tip, also used in neighbors, is generating difference vectors and their lengths:



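# two example points in the same space
a <- c(1, 2, 3)
b <- c(4, 6, 3)

# the difference vector between two vectors is very easy
# (called ab rather than c, so that the c() function is not masked)
ab <- a - b

# now the vector length (how far apart in Euclidean space these two points are)
sqrt(sum(ab^2))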