# Text mining with R

How to perform basic and advanced operations on text: reading text files, tokenization, cleaning, word counting, and more.

In this article we briefly demonstrate how to perform a basic analysis of textual data. This includes: (i) reading the text files, (ii) cleaning the text, (iii) analyzing the text, and (iv) visualizing word occurrences.

We will consider the e-book Dubliners by James Joyce, available in plain-text format on the Project Gutenberg website.

First we load the libraries and data.

library(tm) # Framework for text mining.
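The step that fills myBook1 is not shown above. A minimal sketch of one way to do it, assuming the plain-text e-book has already been saved locally: the file name dubliners.txt is hypothetical, and a short stand-in file is written to a temporary directory so that the example runs on its own.

```r
# Assumption: the e-book has been saved locally; "dubliners.txt" is a
# hypothetical file name. A short stand-in file is created here so that
# the sketch is self-contained.
book_path <- file.path(tempdir(), "dubliners.txt")
writeLines(c("A first line of text.", "", "A second line, with punctuation!"),
           book_path)

# Read the file line by line into a character vector (one element per line).
myBook1 <- readLines(book_path, encoding = "UTF-8", warn = FALSE)

# Drop empty lines; they carry no words.
myBook1 <- myBook1[nzchar(myBook1)]

length(myBook1) # Number of non-empty lines read.
```

With the real book, only the readLines call and the filtering step would be needed, pointed at wherever the file was saved.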

## Cleaning the text

Next, we split the text in myBook1 into units that were originally separated by whitespace or by the punctuation characters ! " # \$ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.
We use strsplit, which returns a list, so we unlist the result, i.e., transform it into a vector that contains all the atomic components of the list.
Once the words are in a vector, we convert them to lower case and remove digits.

words <- unlist(strsplit(myBook1, split = "[[:space:][:punct:]]")) # Split into tokens and flatten the list.
words <- tolower(words) # Convert to lower case.
words <- gsub("[0-9]", "", words) # Remove digits.
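To see what the splitting step produces, here is a small base-R sketch on two made-up sentences (sample_text and the other names are illustrative, not taken from the book). It shows that consecutive separators and digit-only tokens leave empty strings behind, and that apostrophes split words into fragments, which is a likely explanation for single-letter tokens in the word counts.

```r
# Two made-up lines standing in for the book text (illustrative only).
sample_text <- c("Hello, world!", "It is 9 o'clock.")

# Same splitting rule as in the article: whitespace or punctuation.
tokens <- unlist(strsplit(sample_text, split = "[[:space:][:punct:]]"))

# Lower-case the tokens and strip digits.
tokens <- tolower(tokens)
tokens <- gsub("[0-9]", "", tokens)

# A comma followed by a space yields an empty string, and so does "9"
# once its digit is removed; the apostrophe splits "o'clock" in two.
tokens_nonempty <- tokens[tokens != ""]
print(tokens_nonempty)
```

This is why the empty strings have to be filtered out before counting.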

## Analyzing the text

Next, we remove English stop words, drop the empty strings left over from splitting and cleaning, and count the words using table. Finally, we sort the words by frequency in decreasing order.

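words_clean <- removeWords(words, stopwords("en")) # Stop words become empty strings.
words_clean <- words_clean[words_clean != ""] # Drop empty strings.
words_clean_t <- table(words_clean) # Count occurrences of each word.
words_clean_s <- words_clean_t[order(words_clean_t, decreasing = TRUE)] # Sort by frequency.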
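The filter, count, and sort pattern can be tried on a toy vector without any packages. In this sketch, stop_toy is a hypothetical three-word stand-in for stopwords("en"), and filtering with %in% is a base-R alternative to removeWords, not the method used in the article.

```r
# Hypothetical mini stop-word list standing in for stopwords("en").
stop_toy <- c("the", "a", "and")

# Toy tokens, already lower-cased.
toy <- c("the", "man", "said", "and", "gabriel", "said", "a", "gabriel", "said")

# Keep only tokens that are not stop words.
toy_clean <- toy[!toy %in% stop_toy]

# Count each distinct word, then sort the counts in decreasing order.
toy_t <- table(toy_clean)
toy_s <- toy_t[order(toy_t, decreasing = TRUE)]
print(toy_s)

# sort() on the table gives the same ordering in one step.
toy_s2 <- sort(toy_t, decreasing = TRUE)
```

Indexing with order keeps the result a table, so head can be applied to it directly to inspect the most frequent words.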

We can also stem the words, as follows:

words_stem <- stemDocument(words_clean) # Stem each word (requires the SnowballC package).
words_stem_t <- table(words_stem)
words_stem_t_s <- words_stem_t[order(words_stem_t, decreasing = TRUE)]
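As a small illustration of what stemming does to the counts, the sketch below stems a few made-up words (toy_words is illustrative). It assumes that the tm and SnowballC packages are installed and skips the example otherwise.

```r
# Assumption: the tm and SnowballC packages are installed.
if (requireNamespace("tm", quietly = TRUE) &&
    requireNamespace("SnowballC", quietly = TRUE)) {
  # Made-up tokens; different inflected forms of the same word.
  toy_words <- c("little", "asked", "asking", "words")

  # Reduce each token to its stem (English stemmer by default).
  toy_stems <- tm::stemDocument(toy_words)

  # Forms that share a stem are now counted together.
  print(table(toy_stems))
}
```

Stems are not always dictionary words, so truncated forms in the counts are expected.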

Running head(words_stem_t_s, 10) gives the following result:

words_stem
   said       s      mr     man       t     one   littl       o gabriel     ask
    755     613     576     240     226     225     211     196     158     156