
Text mining with R

How to perform basic and advanced operations on text: reading text files, tokenization, cleaning, word counting, and more.
© Janez Povh

In this article we briefly demonstrate how to perform a basic analysis of textual data. This includes: (i) reading the text files, (ii) cleaning the text, (iii) analyzing the text, and (iv) visualizing word occurrences.

Reading the text

We will consider the e-book Dubliners by James Joyce, available in plain text format on the Project Gutenberg website.

First we load the libraries and data.

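library(tm)    # Framework for text mining (stopwords, removeWords, stemDocument).
library(readr) # Provides read_file() for reading a whole file into one string.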

myBook1 <- read_file(url("https://www.gutenberg.org/files/2814/2814-0.txt"))
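Project Gutenberg files usually wrap the book in licence text, delimited by marker lines that begin with *** START OF and *** END OF. The exact wording of these markers varies between files, so the patterns below are an assumption, and the sample string is a made-up stand-in rather than the real download. A minimal sketch of keeping only the text between the markers might look like this:

```r
# Synthetic stand-in for a downloaded e-book (not the real file).
raw_text <- paste(
  "Licence header text",
  "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***",
  "The body of the book.",
  "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***",
  "Licence footer text",
  sep = "\n"
)

# Work line by line; "\r?\n" also copes with Windows line endings.
book_lines <- strsplit(raw_text, "\r?\n")[[1]]

# Locate the marker lines (assumed patterns, check them against your file).
start <- grep("^\\*\\*\\* ?START OF", book_lines)
end <- grep("^\\*\\*\\* ?END OF", book_lines)

# Keep only the lines strictly between the two markers, if both were found.
if (length(start) == 1 && length(end) == 1 && end > start + 1) {
  book_lines <- book_lines[(start + 1):(end - 1)]
}
book_text <- paste(book_lines, collapse = "\n")
print(book_text)
```

Without this step the licence text is counted together with the book, which slightly inflates the frequencies of some words.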

Cleaning the text

Next, we split the text in myBook1 into units that were originally separated by whitespace or punctuation characters: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.
We use strsplit, which returns a list of words, so we unlist it, i.e., transform it into a vector containing all the atomic components of the list.
Once the words are in a vector, we convert them to lower case and remove digits. Because every single separator character splits the text, consecutive separators produce empty strings; these are removed in the next step.

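# Split on every whitespace or punctuation character and flatten the list.
words <- unlist(strsplit(myBook1, split = "[[:space:][:punct:]]"))
words <- tolower(words)             # Convert to lower case.
words <- gsub("[0-9]", "", words)   # Remove digits.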
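To see what these three steps do, they can be tried on a single short sentence. The sentence below is an illustrative stand-in, not a quotation from the book, and the variable names are hypothetical:

```r
# A short sample sentence (illustrative only).
sample_text <- "Mr. Hunter's 2 dogs barked, loudly!"

# Same three steps as above, applied to the sample.
tokens <- unlist(strsplit(sample_text, split = "[[:space:][:punct:]]"))
tokens <- tolower(tokens)
tokens <- gsub("[0-9]", "", tokens)

# Empty strings appear where two separators were adjacent (e.g. ". ")
# and where a token consisted only of digits.
print(tokens)
print(sum(tokens == ""))
```

Note that the apostrophe in "Hunter's" is also a separator, so the result contains the one-letter token "s".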

Analyzing the text

Next, we remove English stop words and count the remaining words using table. removeWords replaces each stop word with an empty string, so we also drop the empty strings before counting. Finally, we sort the words by frequency in decreasing order.

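# Replace English stop words with empty strings.
words_clean <- removeWords(words, stopwords("en"))
# Drop the empty strings left by splitting, digit removal and stop-word removal.
words_clean <- words_clean[words_clean != ""]
# Count the occurrences of each word and sort by decreasing frequency.
words_clean_t <- table(words_clean)
words_clean_s <- words_clean_t[order(words_clean_t, decreasing = TRUE)]
head(words_clean_s, 10)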
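The same filter-count-sort pattern can be sketched in base R on a tiny vector. This is an alternative to removeWords, not the method used above: it drops stop words with %in% instead of blanking them, and the two-word stop list is a hypothetical stand-in for the much longer stopwords("en").

```r
# Toy tokens, including an empty string as produced by the splitting step.
tokens <- c("the", "man", "said", "", "the", "man", "was", "old", "said", "man")

# Hypothetical tiny stop list (tm's stopwords("en") is much longer).
my_stopwords <- c("the", "was")

# Keep tokens that are neither stop words nor empty strings.
kept <- tokens[!(tokens %in% my_stopwords) & tokens != ""]

# Count and sort by decreasing frequency.
counts <- sort(table(kept), decreasing = TRUE)
print(counts)
```

Here "man" occurs three times, "said" twice and "old" once, so "man" comes first in the sorted table.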

We can also stem the words, as follows:

# Reduce each word to its stem (stemDocument needs the SnowballC package installed).
words_stem <- stemDocument(words_clean)
words_stem_t <- table(words_stem)
words_stem_t_s <- words_stem_t[order(words_stem_t, decreasing = TRUE)]

Running head gives the following result:

head(words_stem_t_s, 10)
words_stem
   said       s      mr     man       t     one   littl       o gabriel     ask
    755     613     576     240     226     225     211     196     158     156
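The one-letter entries s, t and o in this result are presumably fragments left when contractions and names were split at the apostrophe. If they are not wanted, one possible extra step (not part of the original analysis) is to keep only tokens longer than one character before counting. A minimal sketch on a hypothetical sample vector:

```r
# Hypothetical sample of cleaned tokens, including one-letter fragments.
tokens <- c("said", "s", "mr", "t", "little", "o", "gabriel")

# Keep only tokens with more than one character.
longer_tokens <- tokens[nchar(tokens) > 1]
print(longer_tokens)
```

This also removes genuine one-letter words, so whether the filter is appropriate depends on the analysis.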
This article is from the free online

Managing Big Data with R and Hadoop

Created by
FutureLearn - Learning For Life
