Text mining with R

How to perform basic and advanced operations on text: reading text files, tokenization, cleaning, word counting, and more.
© Janez Povh

In this article we briefly demonstrate how to perform a basic analysis of textual data. This includes: (i) reading the text files, (ii) cleaning the text, (iii) analyzing the text, and (iv) visualizing word occurrences.

Reading the text

We will consider the e-book Dubliners by James Joyce, available in plain-text format from the Project Gutenberg website.

First we load the libraries and data.

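library(tm)    # Framework for text mining (stop words, stemming).
library(readr) # Provides read_file(), which reads a whole file into one string.

# Download the book and store its full text as a single character string.
myBook1 <- read_file(url("https://www.gutenberg.org/files/2814/2814-0.txt"))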
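
Project Gutenberg files usually wrap the book in a licence header and footer, and that text would otherwise be counted together with the words of the book. The following is a minimal sketch of how the book body could be isolated before cleaning. The helper name strip_gutenberg, the sample string, and the marker patterns are assumptions that are not part of the original workflow, so the patterns should be checked against the downloaded file.

```r
# Hypothetical helper: keep only the text between the Gutenberg start and end markers.
# Assumes markers of the form "*** START OF ... ***" and "*** END OF ... ***".
strip_gutenberg <- function(txt) {
  start <- regexpr("\\*\\*\\* ?START OF [^*]*\\*\\*\\*", txt)
  end   <- regexpr("\\*\\*\\* ?END OF [^*]*\\*\\*\\*", txt)
  # If either marker is missing, return the text unchanged.
  if (start < 0 || end < 0) return(txt)
  body_start <- as.integer(start) + attr(start, "match.length")
  substr(txt, body_start, as.integer(end) - 1L)
}

# Made-up sample standing in for the downloaded file.
sample_text <- paste0(
  "Licence header\n",
  "*** START OF THE PROJECT GUTENBERG EBOOK X ***\n",
  "The story.\n",
  "*** END OF THE PROJECT GUTENBERG EBOOK X ***\n",
  "Licence footer"
)
book_body <- trimws(strip_gutenberg(sample_text))
print(book_body)
```

Under these assumptions, the same call could be applied to myBook1 before the text is split into words.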

Cleaning the text

Next, we split the text in myBook1 into units that were originally separated by whitespace or by one of the punctuation characters ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.
We use strsplit, which returns a list of words, so we have to unlist it, i.e., transform it into a vector that contains all the atomic components of the list.
Once the words are in a vector, we convert them to lower case and strip all digits. Empty strings left over from this step are removed later, together with the stop words.

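# Split on every whitespace or punctuation character and flatten the list into a vector.
words <- unlist(strsplit(myBook1, split = "[[:space:][:punct:]]"))
# Convert all tokens to lower case.
words <- tolower(words)
# Strip digits; tokens that consisted only of digits become empty strings.
words <- gsub("[0-9]", "", words)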
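
To see what these three steps do, it can help to run them on a single sentence. The sketch below uses a made-up sentence (sample_text is not taken from the book) and base R only. It illustrates two side effects of this simple tokenizer: consecutive separators and pure-digit tokens produce empty strings, and words containing an apostrophe are split into fragments.

```r
# Made-up example sentence, used only for illustration.
sample_text <- "Mr. Browne didn't stay; he left at 10 o'clock."

# Same three steps as in the article.
tokens <- unlist(strsplit(sample_text, split = "[[:space:][:punct:]]"))
tokens <- tolower(tokens)
tokens <- gsub("[0-9]", "", tokens)

# Empty strings appear where two separators were adjacent (". ", "; ")
# and where a token consisted only of digits ("10").
n_empty <- sum(tokens == "")

# Drop the empty strings to obtain the actual words.
tokens_clean <- tokens[tokens != ""]
print(tokens_clean)
```

Note that "didn't" ends up as the two tokens "didn" and "t", and "o'clock" as "o" and "clock". This is why isolated letters can show up among the most frequent words later on.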

Analyzing the text

Next, we remove English stop words and count the remaining words using table. Finally, we sort the words by frequency in decreasing order.

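# Replace stop words with empty strings, then drop all empty strings.
words_clean <- removeWords(words, stopwords("en"))
words_clean <- words_clean[words_clean != ""]
# Count the occurrences of each word and sort by decreasing frequency.
words_clean_t <- table(words_clean)
words_clean_s <- words_clean_t[order(words_clean_t, decreasing = TRUE)]
head(words_clean_s, 10)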
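
The filter, count and sort pattern can be tried on a small vector without loading tm. The sketch below is only an illustration: toy_words and toy_stop are made-up data, and the stop words are removed with %in% instead of removeWords. It also uses sort as a shorter alternative to indexing the table with order.

```r
# Made-up tokens and a tiny hypothetical stop-word list.
toy_words <- c("the", "snow", "was", "falling", "snow", "the", "gabriel", "snow")
toy_stop  <- c("the", "was")

# Keep only the tokens that are not stop words.
kept <- toy_words[!(toy_words %in% toy_stop)]

# Count each distinct word and sort the counts in decreasing order.
freq <- sort(table(kept), decreasing = TRUE)
print(head(freq, 3))
```

Here "snow" comes first with a count of 3, followed by the two words that occur once.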

We can also stem the words, i.e., reduce each word to its root form, as follows:

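# Reduce each word to its stem (stemDocument uses the SnowballC package, which must be installed).
words_stem <- stemDocument(words_clean)
# Count the stems and sort them by decreasing frequency.
words_stem_t <- table(words_stem)
words_stem_t_s <- words_stem_t[order(words_stem_t, decreasing = TRUE)]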

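Running head gives the result below. The single-letter entries s, t and o are fragments left over from splitting on apostrophes, for example in possessives and in contractions such as "don't":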

head(words_stem_t_s, 10)
words_stem
   said       s      mr     man       t     one   littl       o gabriel     ask
    755     613     576     240     226     225     211     196     158     156
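
Stems such as "littl" are not real words, which makes a frequency table harder to read. One possible remedy, sketched below under stated assumptions, is to label each stem with its most frequent original form. The vectors toy_words and toy_stems are made-up data. The stems are typed in by hand here and stand in for the output of stemDocument, so the sketch runs in base R without tm.

```r
# Made-up words and their stems (hand-written stand-in for stemDocument output).
toy_words <- c("little", "asked", "asks", "asked", "little", "ask")
toy_stems <- c("littl",  "ask",   "ask",  "ask",   "littl",  "ask")

# Group the original words by stem and pick the most frequent form in each group.
readable <- vapply(
  split(toy_words, toy_stems),
  function(w) names(which.max(table(w))),
  character(1)
)
print(readable)
```

With the real data, the same idea could be applied to words_clean and words_stem, which are parallel vectors of equal length.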
This article is from the free online course Managing Big Data with R and Hadoop.

