Text mining with R

How to perform basic and advanced operations on text: reading text files, tokenization, cleaning, word counting, and more.

In this article we briefly demonstrate how to perform a basic analysis of textual data. This includes: (i) reading the text files, (ii) cleaning the text, (iii) analyzing the text, and (iv) visualizing word occurrences.

Reading the text

We will use the e-book Dubliners by James Joyce, available in plain text format on the Project Gutenberg website.

First we load the libraries and data.

This content is taken from the Partnership for Advanced Computing in Europe (PRACE) online course, Managing Big Data with R and Hadoop.

library(tm) # Framework for text mining.
library(readr) # Provides read_file(), which reads a whole file into one string.

myBook1 <- read_file(url("https://www.gutenberg.org/files/2814/2814-0.txt"))

Cleaning the text

Next, we split the text in myBook1 into words, using whitespace and the punctuation characters ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~ as separators.
We use strsplit, which returns a list, so we unlist the result, i.e., transform it into a vector that contains all the atomic components of the list.
Once the words are in a vector, we convert them to lower case and remove digits.

words <- unlist(strsplit(myBook1, split = "[[:space:][:punct:]]"))
words <- tolower(words)
words <- gsub("[0-9]", "", words)
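To see what these steps do to a single line of text, here is a minimal sketch on a made-up sentence. The sentence and the variable names (sample_text, tokens, tokens_clean) are illustrative assumptions, not part of the course material. It uses only base R and additionally drops the empty strings that splitting and digit removal leave behind.

```r
# Hypothetical toy sentence, not taken from the book.
sample_text <- "Mr. Brown's 2 sons said: hello!"

# Split on every whitespace or punctuation character, then flatten the list.
tokens <- unlist(strsplit(sample_text, split = "[[:space:][:punct:]]"))

# Lower-case the tokens and strip digits, as in the steps above.
tokens <- tolower(tokens)
tokens <- gsub("[0-9]", "", tokens)

# Adjacent separators (". ") and pure-digit tokens ("2") become empty strings.
tokens_clean <- tokens[tokens != ""]
print(tokens_clean)
```

Note that the apostrophe in "Brown's" is also a separator, so the possessive ends up as a separate one-letter token "s".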

Analyzing the text

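Next, we remove English stop words and count the words using table. Removing stop words, like the earlier splitting and digit removal, leaves empty strings in the vector, so we drop them before counting. Finally, we sort the words by frequency in decreasing order.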

words_clean <- removeWords(words, stopwords("en"))
words_clean <- words_clean[words_clean != ""]
words_clean_t <- table(words_clean)
words_clean_s <- words_clean_t[order(words_clean_t, decreasing = TRUE)]
head(words_clean_s, 10)
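The filter-count-sort pattern can be tried without the tm package. The following is only a sketch under stated assumptions: the token vector and the two-word stop list are invented for illustration, and membership testing with %in% stands in for removeWords. Here sort(..., decreasing = TRUE) is used as a shorter alternative to indexing with order.

```r
# Hypothetical tokens and a tiny stand-in stop list (not tm's stopwords("en")).
tokens <- c("the", "man", "said", "the", "boy", "said", "a", "word", "said")
stop_list <- c("the", "a")

# Keep only tokens that are not stop words.
kept <- tokens[!(tokens %in% stop_list)]

# Count occurrences and sort by frequency, highest first.
counts <- table(kept)
counts_sorted <- sort(counts, decreasing = TRUE)
print(head(counts_sorted, 3))
```

Unlike removeWords, this filter drops stop words outright, so no empty strings need to be removed afterwards.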

We can also stem the words (stemDocument requires the SnowballC package to be installed), as follows:

words_stem <- stemDocument(words_clean)
words_stem_t <- table(words_stem)
words_stem_t_s <- words_stem_t[order(words_stem_t, decreasing = TRUE)]

Running head gives the following result:

head(words_stem_t_s, 10)
words_stem
   said       s      mr     man       t     one   littl       o gabriel     ask
    755     613     576     240     226     225     211     196     158     156
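The introduction lists visualizing word occurrences as the final step. One simple possibility, sketched here with base R's barplot, is to plot the ten most frequent stems. The choice of barplot, the plot labels, and the names top_stems and bar_mids are assumptions for illustration; the counts are copied from the output above so that the example runs on its own.

```r
# Top ten stems and their counts, copied from the output shown above.
top_stems <- c(said = 755, s = 613, mr = 576, man = 240, t = 226,
               one = 225, littl = 211, o = 196, gabriel = 158, ask = 156)

# Draw one bar per stem; las = 2 rotates the labels so they do not overlap.
# barplot returns the x positions of the bar midpoints.
bar_mids <- barplot(top_stems, las = 2,
                    ylab = "Frequency",
                    main = "Most frequent stems in Dubliners")
```

With the full analysis in the workspace, the same call could take head(words_stem_t_s, 10) in place of the hard-coded vector. The one-letter entries s, t and o are most likely fragments of contractions and possessives, because the apostrophe was treated as a separator during splitting.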
