One R Tip A Day: Word Cloud in R

Wednesday, 27 July 2011

Word Cloud in R

A word cloud (or tag cloud) can be a handy tool when you need to highlight the most commonly cited words in a text using a quick visualization. Of course, you can use one of the several online services, such as wordle or tagxedo, which are very feature-rich and have a nice GUI. Being an R enthusiast, I always wanted to produce this kind of image within R and now, thanks to Ian Fellows' recently released wordcloud package, I finally can!
In order to test the package I retrieved the titles of the XKCD web comics included in my RXKCD package and produced a word cloud based on the titles' word frequencies calculated using the powerful tm package for text mining (I know, it is like killing a fly with a bazooka!).

library(RXKCD)
library(tm)
library(wordcloud)
library(RColorBrewer)
# Load the comic metadata shipped with the RXKCD package
path <- system.file("xkcd", package = "RXKCD")
datafiles <- list.files(path)
xkcd.df <- read.csv(file.path(path, datafiles))
# Build a corpus from the titles (third column). VectorSource is used because
# recent tm versions expect doc_id/text columns when given a DataframeSource
xkcd.corpus <- Corpus(VectorSource(as.character(xkcd.df[, 3])))
xkcd.corpus <- tm_map(xkcd.corpus, removePunctuation)
xkcd.corpus <- tm_map(xkcd.corpus, content_transformer(tolower))
xkcd.corpus <- tm_map(xkcd.corpus, removeWords, stopwords("english"))
# Word frequencies, sorted from most to least frequent
tdm <- TermDocumentMatrix(xkcd.corpus)
m <- as.matrix(tdm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
# Blue-green palette without its two lightest shades
pal <- brewer.pal(9, "BuGn")
pal <- pal[-(1:2)]
png("wordcloud.png", width = 1280, height = 800)
wordcloud(d$word, d$freq, scale = c(8, .3), min.freq = 2, max.words = 100, random.order = TRUE, rot.per = .15, colors = pal, vfont = c("sans serif", "plain"))
dev.off()
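
To make the frequency step less of a black box, here is a minimal base R sketch of what the pipeline above computes. It assumes a handful of made-up titles and a tiny, hypothetical stop word list instead of the RXKCD data and tm's stopwords("english"); it is only an illustration, not a replacement for tm.

```r
# Made-up titles standing in for the XKCD data (not taken from RXKCD)
titles <- c("Barrel - Part 1", "Petit Trees (sketch)",
            "Island (sketch)", "Barrel - Part 2")

# Same clean-up order as in the post: strip punctuation, then lower-case
clean <- tolower(gsub("[[:punct:]]", "", titles))

# Split on whitespace and drop empty tokens
words <- unlist(strsplit(clean, "[[:space:]]+"))
words <- words[nzchar(words)]

# Drop a tiny, hypothetical stop word list
my.stopwords <- c("the", "a", "of")
words <- words[!(words %in% my.stopwords)]

# Count and sort, mirroring sort(rowSums(m), decreasing = TRUE)
v <- sort(table(words), decreasing = TRUE)
d <- data.frame(word = names(v), freq = as.vector(v), stringsAsFactors = FALSE)
print(head(d))
```

Note that TermDocumentMatrix by default keeps only terms with at least three characters, so short tokens such as "1" and "2" survive in this sketch but would not appear in the tm version.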

As a second example, inspired by this post from the eKonometrics blog, I created a word cloud from the descriptions of the 3177 available R packages listed at http://cran.r-project.org/web/packages.

require(XML)
require(tm)
require(wordcloud)
require(RColorBrewer)
u <- "http://cran.r-project.org/web/packages/available_packages_by_date.html"
# Named pkgs rather than t, to avoid masking base::t()
pkgs <- readHTMLTable(u)[[1]]
# Build a corpus from the package descriptions (third column)
ap.corpus <- Corpus(VectorSource(as.character(pkgs[, 3])))
ap.corpus <- tm_map(ap.corpus, removePunctuation)
ap.corpus <- tm_map(ap.corpus, content_transformer(tolower))
ap.corpus <- tm_map(ap.corpus, removeWords, stopwords("english"))
# Word frequencies, sorted from most to least frequent
ap.tdm <- TermDocumentMatrix(ap.corpus)
ap.m <- as.matrix(ap.tdm)
ap.v <- sort(rowSums(ap.m), decreasing = TRUE)
ap.d <- data.frame(word = names(ap.v), freq = ap.v)
# How many words occur once, twice, and so on
table(ap.d$freq)
pal2 <- brewer.pal(8, "Dark2")
png("wordcloud_packages.png", width = 1280, height = 800)
wordcloud(ap.d$word, ap.d$freq, scale = c(8, .2), min.freq = 3,
max.words = Inf, random.order = FALSE, rot.per = .15, colors = pal2)
dev.off()
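
The step from the term-document matrix to the data frame passed to wordcloud can be seen on a toy example. The sketch below assumes a small hand-made matrix (the terms and counts are invented) and shows how rowSums yields the frequencies, what table() reports, and which terms a min.freq = 3 threshold would keep.

```r
# A made-up term-document matrix: rows are terms, columns are documents
m <- matrix(c(1, 0, 2,
              0, 1, 0,
              1, 1, 1,
              0, 0, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("data", "bayesian", "models", "plot"),
                            c("doc1", "doc2", "doc3")))

# Total frequency of each term across all documents, most frequent first
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v, stringsAsFactors = FALSE)

# How many terms occur once, twice, three times, ...
print(table(d$freq))

# Terms that would survive a min.freq = 3 threshold
kept <- d$word[d$freq >= 3]
print(kept)
```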

As a third example, thanks to Jim's comment, I take advantage of Duncan Temple Lang's RNYTimes package to access user-generated content on the NY Times and produce a word cloud of today's comments on articles.
Caveat: in order to use the RNYTimes package you need an API key from The New York Times, which you can get free of charge by registering with The New York Times Developer Network.

require(XML)
require(tm)
require(wordcloud)
require(RColorBrewer)
# Run once to install RNYTimes from the Omegahat repository
install.packages("RNYTimes", repos = "http://www.omegahat.org/R", type = "source")
require(RNYTimes)
my.key <- "your API key here"
# Ask for the comments posted today
what <- paste("by-date", format(Sys.time(), "%Y-%m-%d"), sep = "/")
# what <- "recent"
recent.news <- community(what = what, key = my.key)
pagetree <- htmlTreeParse(recent.news, error = function(...){}, useInternalNodes = TRUE)
x <- xpathSApply(pagetree, "//*/body", xmlValue)
# do some clean up with regular expressions
x <- unlist(strsplit(x, "\n"))
x <- gsub("\t", "", x)
x <- sub("^[[:space:]]*(.*?)[[:space:]]*$", "\\1", x, perl = TRUE)
x <- x[!(x %in% c("", "|"))]
# Build a corpus with one document per cleaned line
ap.corpus <- Corpus(VectorSource(as.character(x)))
ap.corpus <- tm_map(ap.corpus, removePunctuation)
ap.corpus <- tm_map(ap.corpus, content_transformer(tolower))
ap.corpus <- tm_map(ap.corpus, removeWords, stopwords("english"))
ap.tdm <- TermDocumentMatrix(ap.corpus)
ap.m <- as.matrix(ap.tdm)
ap.v <- sort(rowSums(ap.m), decreasing = TRUE)
ap.d <- data.frame(word = names(ap.v), freq = ap.v)
table(ap.d$freq)
pal2 <- brewer.pal(8, "Dark2")
png("wordcloud_NewYorkTimes_Community.png", width = 1280, height = 800)
wordcloud(ap.d$word, ap.d$freq, scale = c(8, .2), min.freq = 2,
max.words = Inf, random.order = FALSE, rot.per = .15, colors = pal2)
dev.off()
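
The regular-expression clean-up in the middle of this example can be tried without an API key. The sketch below applies the same four steps to a made-up string that merely stands in for the parsed page body; the real content returned by the NY Times will of course look different.

```r
# Made-up text standing in for the parsed page body
x <- "\tFirst comment here  \n\n  |  \n\t\tSecond comment\n"

x <- unlist(strsplit(x, "\n"))   # one element per line
x <- gsub("\t", "", x)           # remove tabs
# Trim leading and trailing whitespace
x <- sub("^[[:space:]]*(.*?)[[:space:]]*$", "\\1", x, perl = TRUE)
x <- x[!(x %in% c("", "|"))]     # drop empty lines and "|" separators
print(x)
```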

48 comments:

Noam López, 30 July 2011 at 06:44
The post is interesting and I could replicate your second example. But I don't know how to do it if I have a text in a txt file or a Word file. Your example only works with an HTML table, but very often we have whole texts. I would be very grateful if you could make a word cloud using a txt file.

Noam

Paolo, 30 July 2011 at 07:35
Dear Noam,
You can find both the answer to your question and a nice introduction to text mining in R in the vignette of the tm package:
install.packages("tm")
library("tm")
vignette("tm")
HIH!
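
For a plain text file, one possible starting point is sketched below. It is not taken from the tm vignette: it assumes the file has been read with readLines (the sample file written here is only a stand-in for your own) and then reuses the pipeline from the post, running the tm part only if tm is installed. A Word file would first need to be saved as plain text.

```r
# Write a small sample file so the sketch is self-contained;
# in practice, point txt.file at your own .txt file
txt.file <- tempfile(fileext = ".txt")
writeLines(c("All work and no play", "makes Jack a dull boy."), txt.file)

# Read the file: one element per line
txt.lines <- readLines(txt.file, warn = FALSE)
print(txt.lines)

# Same pipeline as in the post, run only if tm is installed
if (requireNamespace("tm", quietly = TRUE)) {
  txt.corpus <- tm::Corpus(tm::VectorSource(txt.lines))
  txt.corpus <- tm::tm_map(txt.corpus, tm::removePunctuation)
  txt.corpus <- tm::tm_map(txt.corpus, tm::content_transformer(tolower))
  txt.corpus <- tm::tm_map(txt.corpus, tm::removeWords, tm::stopwords("english"))
  txt.m <- as.matrix(tm::TermDocumentMatrix(txt.corpus))
  txt.v <- sort(rowSums(txt.m), decreasing = TRUE)
  txt.d <- data.frame(word = names(txt.v), freq = txt.v)
  print(txt.d)
}
```

The resulting txt.d can then be passed to wordcloud(txt.d$word, txt.d$freq, min.freq = 1) as in the examples above.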

Noam López, 31 July 2011 at 17:09
Thank you Paolo, I'm going to read about the tm package. This is my first encounter with text mining because I only use R for my statistics classes.

Noam

Paolo, 31 July 2011 at 19:23
You are welcome Noam!
I'm not a text mining expert either, but the tm package seems to provide a collection of tools that can be useful for solving both basic and more advanced problems in this interesting field.

cooksappe, 3 September 2011 at 15:34
wow *_*

tankjn, 21 October 2011 at 20:57
Very nice. Liked it and probably will use it.

Anonymous, 30 December 2011 at 19:24
A great example. However, sometime in the past two weeks (counting back from 2011/11/30) the directory and file were removed. So "http://cran.r-project.org/web/packages/available_packages_by_date.html" is not found because the "packages" directory is no longer there. How about using another web site as an example?

Paolo, 30 December 2011 at 22:42
Thanks for the update! Feel free to suggest a web site of interest as an alternative.

Anonymous, 31 December 2011 at 02:05
Thanks Paolo, this might be impossible but how about any text on a basic news web page like www.washingtonpost.com. Ignore pictures and pick phrases/sentences based on commas, periods and breaks "-". We could manually input a target web page and the example would wordwrap the page.

Thanks, Jim

Paolo, 31 December 2011 at 09:58
Thanks Jim for your suggestion! I have updated the post in accordance with your advice (more or less).

Anonymous, 12 February 2012 at 00:51
How does one increase the plotted area of the word cloud? By increasing the dimensions of the png image, I am only getting a bigger image, most of it blank white space, with a small word cloud in the middle of the plot.

Paolo, 14 February 2012 at 08:59
It seems that this problem (bug?) is related to the graphics device driver you decide to use (pdf, svg, png, etc.). I think that Ian Fellows, the author of the package, could answer your questions more appropriately than me!

Hamish, 22 February 2012 at 08:14
Hi,

You can add some really nice colors to your word cloud using the free ColorBrewer color rules. Download their spreadsheet to add the capability into your code. See colorbrewer2.org

Paolo, 22 February 2012 at 08:33
Thanks for stepping in. The RColorBrewer package allows users to access the beautiful and well conceived colorbrewer palettes from R, take a look at it!

Julian, 30 March 2012 at 16:11
This sounds very promising. I didn't know that R let you create this kind of visualization. I used wordle in the past, and R for a bioinformatics project at the University.

ArunD, 23 April 2012 at 08:10
Thanks for the post. When I run the second example nothing happens. Just checking whether the png word cloud files should be in any particular folder.

Thanks
Arun David

Paolo, 23 April 2012 at 08:59
Dear Arun,
I checked the code for the second example and it seems to work without a hitch. If you have used the exact same code presented in the post, the image should be generated in your working directory and named wordcloud_packages.png.

HIH!

Anonymous, 16 May 2012 at 14:39
I am a beginner with R. How can I create a "phrase" cloud? Basically I have a list of 100 strings/phrases which I want to present as a cloud.

Paolo, 16 May 2012 at 15:09
I have difficulty understanding the goal of your exercise. A word cloud is a visual tool which can help in perceiving the most prominent (frequent) terms in a collection of words using either color or size. In your case I imagine your phrases are all different from one another; therefore a representation based on frequency makes little sense to me.

Paolo, 16 May 2012 at 15:38
Try something like this:
Put your data in a file phrases.txt with a single phrase on each row:

all work and no play
makes jack
a dull boy

1) Import the phrases in R

my.phrases <- scan("phrases.txt", what="char", sep="\n")

2) Import or create a vector with the frequencies (convert your order of importance in some way to frequencies), e.g.

my.freq <- c(10,20,15)

3) Plot the wordcloud

library(wordcloud)
wordcloud(my.phrases, my.freq)

Paolo, 17 May 2012 at 08:20
See ?wordcloud. Take a look at min.freq and max.words arguments.

Amol Kokate, 24 May 2012 at 08:30
Hi, this is good! But it depends only on word frequency. I would like to try a sentiment word cloud along with frequency, meaning that positive words are shown in one color and negative words in another. Can anyone help me?
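
One possible approach, sketched under assumptions: build a vector with one color per word and pass it to wordcloud together with ordered.colors = TRUE, so that colors are assigned to words in order rather than by frequency. The words, frequencies and the two sentiment lists below are all made up for illustration, and the plot is written to a pdf file in a temporary directory only if the wordcloud package is installed.

```r
# Made-up words, frequencies and sentiment lists (all hypothetical)
words <- c("great", "awful", "package", "love", "bug")
freq  <- c(12, 5, 20, 8, 6)
positive <- c("great", "love")
negative <- c("awful", "bug")

# One color per word: green for positive, red for negative, grey otherwise
word.col <- ifelse(words %in% positive, "darkgreen",
            ifelse(words %in% negative, "red3", "grey40"))
print(data.frame(words, freq, word.col))

# Plot only if the wordcloud package is installed
if (requireNamespace("wordcloud", quietly = TRUE)) {
  pdf(file.path(tempdir(), "wordcloud_sentiment.pdf"))
  wordcloud::wordcloud(words, freq, min.freq = 1, colors = word.col,
                       ordered.colors = TRUE, random.order = FALSE)
  dev.off()
}
```

With real data, the positive and negative lists would come from a sentiment lexicon of your choice; size still encodes frequency, while color encodes sentiment.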

Hari, 26 December 2012 at 09:22
Hi Paolo, I was able to build a word cloud from a csv file with a little modification of the above code. I would like to know if it is possible to create different shapes, like what tagxedo provides.

Paolo, 26 December 2012 at 09:31
Dear Hari, from what I can see from the help page this feature is not currently available in the wordcloud package. You could suggest it to the author.

Hari, 27 December 2012 at 09:03
FYI
Hello Mr. Ian,
I am using the wordcloud package authored by you in R. The package creates a word cloud in a circle shape; I would like to know if it is possible to make word clouds of different shapes, like what tagxedo provides.

Paolo, 27 December 2012 at 09:07
Dear Hari, in order to contact the author of the wordcloud package you should use the information you can find at http://cran.r-project.org/web/packages/wordcloud/index.html
I am not related to the author of this package nor have any connection with him.

Anonymous, 15 March 2013 at 16:03
Great post!! Super useful!

Anonymous, 21 March 2013 at 17:49
I get a lot of errors on this:
Error: failed to load external entity "http://cran.r-project.org/web/packages/available_packages_by_date.html"
> ap.corpus <- Corpus(DataframeSource(data.frame(as.character(t[,3]))))
Error in t[, 3] : object of type 'closure' is not subsettable
> ap.corpus <- tm_map(ap.corpus, removePunctuation)
Error in tm_map(ap.corpus, removePunctuation) :
object 'ap.corpus' not found
> ap.corpus <- tm_map(ap.corpus, tolower)
Error in tm_map(ap.corpus, tolower) : object 'ap.corpus' not found
> ap.corpus <- tm_map(ap.corpus, function(x) removeWords(x, stopwords("english")))
Error in tm_map(ap.corpus, function(x) removeWords(x, stopwords("english"))) :
object 'ap.corpus' not found
> ap.tdm <- TermDocumentMatrix(ap.corpus)
Error in TermDocumentMatrix(ap.corpus) : object 'ap.corpus' not found
> ap.m <- as.matrix(ap.tdm)
Error in as.matrix(ap.tdm) : object 'ap.tdm' not found
> ap.v <- sort(rowSums(ap.m),decreasing=TRUE)
Error in is.data.frame(x) : object 'ap.m' not found
> ap.d <- data.frame(word = names(ap.v),freq=ap.v)
Error in data.frame(word = names(ap.v), freq = ap.v) :
object 'ap.v' not found
> table(ap.d$freq)
Error in table(ap.d$freq) : object 'ap.d' not found
> pal2 <- brewer.pal(8,"Dark2")
> png("wordcloud_packages.png", width=1280,height=800)
> wordcloud(ap.d$word,ap.d$freq, scale=c(8,.2),min.freq=3,
+ max.words=Inf, random.order=FALSE, rot.per=.15, colors=pal2)
Error in wordcloud(ap.d$word, ap.d$freq, scale = c(8, 0.2), min.freq = 3, :
object 'ap.d' not found

Paolo, 21 March 2013 at 20:15
I checked the code again and, with R 2.15.2 (on both Windows and Linux) and a recent version of the loaded packages, everything works as expected. Two suggestions: 1) check all the packages are installed and loaded. 2) Check your internet connection and firewall settings.
HIH

Unknown, 26 August 2013 at 08:19
Hi, I am using the latest version of R. I copied and pasted your code into R but I am not getting any graph or word cloud. I have installed all the required packages. Can you please help?

Paolo, 26 August 2013 at 08:28
Dear Ved,
Of course I can only guess but, from my experience, if neither the installation of the required packages nor the sourced code threw any error, this means that the word cloud image was produced and saved in your working directory with the name wordcloud.png.

HIH

Pradeepta Mishra, 29 August 2013 at 09:52
Hi,
The second example gives me a picture with a black background, shown as "Invalid Image". Please suggest how to get the word cloud image correctly. I am using the same example as mentioned.

Thanks

Anonymous, 9 January 2014 at 10:04
Can you help me with how you loaded the csv file? I am a newbie in R and I am stuck on it.

Paolo, 9 January 2014 at 10:23
Dear Dhaval, importing a csv file is a very common first step when you use R or any other programming language for analyzing data. I suggest you take a look at any introductory R book or tutorial you can find (see the R/CRAN website for dozens of choices). Furthermore, if you are still stuck at some point, you can get a lot of useful responses on the StackOverflow Q & A website (use the [r] tag).

Anonymous, 9 February 2014 at 05:46
?¿?¿?¿
Error en .overlap(x1, y1, sw1, sh1, boxes) :
el paquete 'dataptr' no ofrece la función 'Rcpp'
> dev.off()
null device

Why do I get this?
Where is my error?
My platform: x86_64-w64-mingw32/x64 (64-bit)

Paolo, 9 February 2014 at 10:08
No easy solution: I checked on a Windows installation (R version 3.0.2) and everything seems to work properly! Some advice: install the current version of R with all the updated packages required by the tutorial.

Anonymous, 27 February 2014 at 16:35
Hi Paolo,
I was trying to build the word cloud for the NYTimes community comments.
After I type
recent.news <- community(what=what, key=my.key)
it gives me an error saying: Error: Forbidden
What could be the reason behind this error? I have successfully obtained the API key for the NYTimes Community API.

singularity, 4 June 2014 at 03:54
I got the same error as Paicyclopedia. I think the problem is related to the security level of your computer.

Diane, 19 July 2014 at 00:13
It works. I have R-3.1.1 on Windows 8.1 64-bit.
In example 1 I have changed line 10 to
xkcd.corpus <- tm_map(xkcd.corpus, content_transformer(tolower))

and in example 2 the same in line 9:
ap.corpus <- tm_map(ap.corpus, content_transformer(tolower))

Example 3: "three" changes
0. I have installed RCURL.
1. I have downloaded the package RNYTimes_0.1-1.tar.gz and changed line 5 to
install.packages("G:\\RFiles\\RNYTimes_0.1-1.tar.gz", repos=NULL, type = "source")
After the first start I deleted line 5.
2. Line 20 is changed to
ap.corpus <- tm_map(ap.corpus, content_transformer(tolower))

I have got the key for the Comment API.

Great, thank you for these useful examples.