Wednesday, 27 July 2011

Word Cloud in R

A word cloud (or tag cloud) can be a handy tool when you need to highlight the most frequently used words in a text with a quick visualization. Of course, you can use one of several online services, such as Wordle or Tagxedo, which are feature-rich and have a nice GUI. Being an R enthusiast, I have always wanted to produce this kind of image within R, and now, thanks to Ian Fellows' recently released wordcloud package, I finally can!
To test the package, I retrieved the titles of the XKCD web comics included in my RXKCD package and produced a word cloud based on the titles' word frequencies, calculated with the powerful tm package for text mining (I know, it is like killing a fly with a bazooka!).

library(RXKCD)
library(tm)
library(wordcloud)
library(RColorBrewer)
path <- system.file("xkcd", package = "RXKCD")
datafiles <- list.files(path)
xkcd.df <- read.csv(file.path(path, datafiles))
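# Build the corpus from the third column of the data frame (the comic titles used above)
xkcd.corpus <- Corpus(VectorSource(as.character(xkcd.df[, 3])))
xkcd.corpus <- tm_map(xkcd.corpus, removePunctuation)
# Wrap tolower in content_transformer(), as recent tm versions expect
xkcd.corpus <- tm_map(xkcd.corpus, content_transformer(tolower))
xkcd.corpus <- tm_map(xkcd.corpus, removeWords, stopwords("english"))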
tdm <- TermDocumentMatrix(xkcd.corpus)
m <- as.matrix(tdm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
pal <- brewer.pal(9, "BuGn")
pal <- pal[-(1:2)]
png("wordcloud.png", width=1280,height=800)
wordcloud(d$word, d$freq, scale=c(8,.3), min.freq=2, max.words=100, random.order=TRUE, rot.per=.15, colors=pal, vfont=c("sans serif","plain"))
dev.off()
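For readers who want to see what the tm steps above boil down to, here is a minimal base R sketch of the same kind of frequency computation. The titles and the tiny stop-word list are made up for illustration (they do not come from RXKCD or tm), and the tokenization is much cruder than what tm does, so treat it as a rough picture rather than an equivalent.

```r
# Hypothetical titles, invented for illustration only
titles <- c("The Time Machine", "Time, Money and Science",
            "Science of Love", "Love and Money")

# Tiny illustrative stop-word list (tm's stopwords("english") is far longer)
stop.words <- c("the", "and", "of")

# Lower-case, strip punctuation, split on whitespace
clean <- tolower(gsub("[[:punct:]]", "", titles))
words <- unlist(strsplit(clean, "[[:space:]]+"))
words <- words[!(words %in% stop.words) & words != ""]

# Count occurrences and sort by decreasing frequency
freq <- sort(table(words), decreasing = TRUE)
d.sketch <- data.frame(word = names(freq), freq = as.vector(freq),
                       stringsAsFactors = FALSE)
print(d.sketch)
```

The resulting data frame has the same word/freq layout as d above, so it could be passed to wordcloud() in the same way.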

As a second example, inspired by a post from the eKonometrics blog, I created a word cloud from the descriptions of the 3177 available R packages listed at http://cran.r-project.org/web/packages.
require(XML)
require(tm)
require(wordcloud)
require(RColorBrewer)
u <- "http://cran.r-project.org/web/packages/available_packages_by_date.html"
# Read the first HTML table on the page; column 3 holds the package descriptions used here
pkgs <- readHTMLTable(u)[[1]]
ap.corpus <- Corpus(VectorSource(as.character(pkgs[, 3])))
ap.corpus <- tm_map(ap.corpus, removePunctuation)
# Wrap tolower in content_transformer(), as recent tm versions expect
ap.corpus <- tm_map(ap.corpus, content_transformer(tolower))
ap.corpus <- tm_map(ap.corpus, removeWords, stopwords("english"))
ap.tdm <- TermDocumentMatrix(ap.corpus)
ap.m <- as.matrix(ap.tdm)
ap.v <- sort(rowSums(ap.m),decreasing=TRUE)
ap.d <- data.frame(word = names(ap.v),freq=ap.v)
table(ap.d$freq)
pal2 <- brewer.pal(8,"Dark2")
png("wordcloud_packages.png", width=1280,height=800)
wordcloud(ap.d$word,ap.d$freq, scale=c(8,.2),min.freq=3,
max.words=Inf, random.order=FALSE, rot.per=.15, colors=pal2)
dev.off()
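The table(ap.d$freq) call above is a quick way to choose min.freq, since it shows how many terms occur once, twice, and so on. A small sketch of that reasoning, using an invented frequency table (the words and counts below are hypothetical, not the real CRAN figures):

```r
# Hypothetical word frequencies, invented for illustration only
freq.sketch <- data.frame(
  word = c("data", "analysis", "models", "functions", "bayesian", "tools"),
  freq = c(12, 9, 5, 3, 2, 1),
  stringsAsFactors = FALSE
)

# How many terms occur with each frequency
print(table(freq.sketch$freq))

# Terms that a cut-off of min.freq = 3 would keep
min.freq <- 3
kept <- freq.sketch[freq.sketch$freq >= min.freq, ]
print(kept)
```

With a long tail of rare terms, raising min.freq is usually the simplest way to keep the cloud readable.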

As a third example, thanks to Jim's comment, I take advantage of Duncan Temple Lang's RNYTimes package to access user-generated content on the NY Times and produce a word cloud of today's comments on articles.
Caveat: in order to use the RNYTimes package you need an API key from The New York Times, which you can get free of charge by registering with The New York Times Developer Network.
require(XML)
require(tm)
require(wordcloud)
require(RColorBrewer)
# Install RNYTimes from the Omegahat repository (only needed once)
install.packages("RNYTimes", repos = "http://www.omegahat.org/R", type = "source")
require(RNYTimes)
my.key <- "your API key here"
what= paste("by-date", format(Sys.time(), "%Y-%m-%d"),sep="/")
# what="recent"
recent.news <- community(what=what, key=my.key)
pagetree <- htmlTreeParse(recent.news, error=function(...){}, useInternalNodes = TRUE)
x <- xpathSApply(pagetree, "//*/body", xmlValue)
# do some clean up with regular expressions
x <- unlist(strsplit(x, "\n"))
x <- gsub("\t","",x)
x <- sub("^[[:space:]]*(.*?)[[:space:]]*$", "\\1", x, perl=TRUE)
x <- x[!(x %in% c("", "|"))]
ap.corpus <- Corpus(VectorSource(as.character(x)))
ap.corpus <- tm_map(ap.corpus, removePunctuation)
# Wrap tolower in content_transformer(), as recent tm versions expect
ap.corpus <- tm_map(ap.corpus, content_transformer(tolower))
ap.corpus <- tm_map(ap.corpus, removeWords, stopwords("english"))
ap.tdm <- TermDocumentMatrix(ap.corpus)
ap.m <- as.matrix(ap.tdm)
ap.v <- sort(rowSums(ap.m),decreasing=TRUE)
ap.d <- data.frame(word = names(ap.v),freq=ap.v)
table(ap.d$freq)
pal2 <- brewer.pal(8,"Dark2")
png("wordcloud_NewYorkTimes_Community.png", width=1280,height=800)
wordcloud(ap.d$word,ap.d$freq, scale=c(8,.2),min.freq=2,
max.words=Inf, random.order=FALSE, rot.per=.15, colors=pal2)
dev.off()
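To make the regular-expression clean-up above easier to follow, here is the same sequence of steps applied to a short made-up string. The raw text is invented to mimic what the extracted page body might look like; it is not real NY Times output.

```r
# Hypothetical raw text, invented to mimic the extracted page body
raw <- "  First comment here \n\t\n | \n\tSecond comment, with a tab  \n"

x.sketch <- unlist(strsplit(raw, "\n"))   # one element per line
x.sketch <- gsub("\t", "", x.sketch)      # drop tab characters
# Trim leading and trailing whitespace
x.sketch <- sub("^[[:space:]]*(.*?)[[:space:]]*$", "\\1", x.sketch, perl = TRUE)
# Drop empty lines and lone "|" separators
x.sketch <- x.sketch[!(x.sketch %in% c("", "|"))]
print(x.sketch)
```

Only the two lines that carry actual text survive, which is what the corpus is then built from.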


14 comments:

  1. The post is interesting and I could replicate your second example. But I don't know how to do it if I have the text in a txt file or a Word file. Your example only works with an HTML table, but very often we have whole texts. I would be very grateful if you could make a word cloud using a txt file.

    Noam

  2. Dear Noam,
    You can find both the answer to your question and a nice introduction to text mining in R in the vignette of the tm package:
    install.packages("tm")
    library("tm")
    vignette("tm")
    HIH!

  3. Thank you Paolo, I'm going to read about the tm package. This is my first encounter with text mining because I only use R for my statistics classes.

    Noam

  4. You are welcome Noam!
    I'm not a text mining expert either, but the tm package seems to provide a collection of tools that can be useful for solving both basic and more advanced problems in this interesting field.

  5. Very nice. Liked it and probably will use it.

  6. A great example. However, sometime in the two weeks before 2011/11/30 the directory and file were removed. So "http://cran.r-project.org/web/packages/available_packages_by_date.html" is not found because the "packages" directory is no longer there. How about using another web site as an example?

  7. Thanks for the update! Feel free to suggest a web site of interest as an alternative.

  8. Thanks Paolo, this might be impossible but how about any text on a basic news web page like www.washingtonpost.com. Ignore pictures and pick phrases/sentences based on commas, periods and breaks "-". We could manually input a target web page and the example would wordwrap the page.

    Thanks, Jim

  9. Thanks Jim for your suggestion! I have updated the post in accordance with your advice (more or less).

  10. How does one increase the plotted area of the word cloud? By increasing the dimensions of the png image, I only get a bigger image, with most of it being blank white space and a small word cloud in the middle of the plot.

  11. It seems that this problem (bug?) is related to the graphics device driver you decide to use (pdf, svg, png, etc.). I think that Ian Fellows, the author of the package, could answer your questions better than I can!

  12. Hi,

    You can add some really nice colors to your word cloud using the free ColorBrewer color schemes. Download their spreadsheet to add the capability to your code. See colorbrewer2.org

  13. Thanks for stepping in. The RColorBrewer package allows users to access the beautiful and well conceived colorbrewer palettes from R, take a look at it!
