Monday, 15 June 2009

Replacing 0 with NA - an evergreen from the list

This thread from the R-help list describes an evergreen tip that proves useful at least once in any R user's practice.
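
As a minimal sketch of the tip (the objects `v` and `df` are illustrative and not taken from the thread), logical indexing can replace every zero with NA in both a vector and a data frame:

```r
# Illustrative data, not from the original thread
v <- c(3, 0, 5, 0)
df <- data.frame(a = c(1, 0, 3), b = c(0, 2, 0))

# The comparison marks the zeros; the assignment overwrites them with NA
v[v == 0] <- NA
df[df == 0] <- NA

# Count the replaced entries in each object
sum(is.na(v))
colSums(is.na(df))
```

The same expression works unchanged on matrices, which is part of why it tends to be preferred over index-based variants.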

9 comments:

  1. Or, even easier:

    data[which(data==0)] = NA

  2. It's better to use data[data==0] <- NA.
    Example:
    set.seed(123)
    data <- matrix(rnorm(100), ncol = 10)
    data[sample(100, 20)] <- 0
    data <- data.frame(data)
    ## the which() version fails on a data frame:
    data[which(data==0)] = NA
    ## Error in `[<-.data.frame`(`*tmp*`, which(data == 0), value = NA) :
    ##   new columns would leave holes after existing columns
    ## the code below does work
    data[data==0] <- NA

  3. Package gdata has a set of functions for working with missing values in general. See the vignette.

  4. Hey,

    Slightly off topic, but do you know how I can subscribe to r-help via reader?

    T

  5. Google reader that is. Thanks!

  6. Hi Paolo,

    I have a question that I can't seem to find a good answer to. I frequently have missing data that need to be replaced by values of another variable in the data set. For example:

    d<-data.frame(x=c(1,2,3,4,5),y=c(NA,NA,6,7,8))

    Where y is NA, I want it to take the value of x in the same row. The only thing that I've found that works is a loop, but it's enormously sluggish on large data frames. This is what I've used in the past:

    for(i in 1:nrow(d)){
    d[[i,2]][is.na(d[[i,2]])]=d[[i,1]]
    }

    Do you know of a more efficient solution to this?

    Thanks!

    Replies
    1. Have you tried na.locf(d) from the zoo package?

  7. This seems like a good question to post on Stack Overflow.
    In the meantime:

    d <- data.frame(x=c(1:10000), y=c(rep(NA,8000),1:2000))
    system.time( for(i in 1:nrow(d)) d[[i,2]][is.na(d[[i,2]])]=d[[i,1]] )
    # user system elapsed
    # 2.506 0.023 2.529
    d2 <- data.frame(x=c(1:10000), y=c(rep(NA,8000),1:2000))
    system.time( d2 <- ifelse(is.na(d2$y),d2$x, d2$y) )
    # user system elapsed
    # 0.001 0.000 0.001
    identical(d$y,d2)

    HIH

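
For the question in comment 6, direct logical indexing is another loop-free option alongside `ifelse`. This is only a sketch on the toy data frame from that comment; the helper name `idx` is illustrative.

```r
# Same toy data frame as in comment 6
d <- data.frame(x = c(1, 2, 3, 4, 5), y = c(NA, NA, 6, 7, 8))

# Locate the missing rows once, then copy x into y only at those rows
idx <- is.na(d$y)
d$y[idx] <- d$x[idx]

d$y
# [1] 1 2 6 7 8
```

Unlike the `ifelse` version in comment 7, which returns a separate vector, this updates the column in place and leaves the rest of the data frame untouched.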