Introduction of relevant libraries in Python

Text Pre-Processing

Once a text is loaded, the first step is to remove unwanted characters and produce a uniform input by removing capital letters, digits and any other potential source of ambiguity. This process is called text normalisation, which includes:

Converting all letters to lower case
Changing numbers into words or removing numbers
Removing punctuation, accent marks, extra white space, etc.
Expanding (or removing in some cases) abbreviations
Removing stop words
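
Several of the steps above can be combined into one small helper. The following is a minimal sketch using only the Python standard library; the function name normalise is hypothetical, and converting numbers into words, expanding abbreviations and removing stop words are deliberately left out.

```python
import re
import string
import unicodedata


def normalise(text):
    """Apply a few of the normalisation steps listed above to a string."""
    # Convert all letters to lower case.
    text = text.lower()
    # Strip accent marks: decompose characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove digits (changing them into words would need extra logic).
    text = re.sub(r"\d+", "", text)
    # Remove ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse runs of white space and trim both ends.
    return " ".join(text.split())


print(normalise("  The Café   opened in 2019, didn't it?  "))
# prints: the cafe opened in didnt it
```

The order of the steps matters: removing punctuation before collapsing white space avoids leaving double spaces where a comma or digit used to be.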

Try the following code

This content is taken from the Edge Hill University online course Introduction to Python for Big Data Analytics.

sentence = "Severe weather has been forecast for the entire United States."
# lower() returns a new string with every letter in lower case
new_sentence = sentence.lower()
print(new_sentence)

Explore the .lower() function.
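
As a starting point for that exploration, here is a small sketch of how .lower() behaves. The variable names are illustrative, and the comparison with .casefold() is an addition that goes slightly beyond the text above.

```python
sentence = "Severe weather has been forecast for the entire United States."
lowered = sentence.lower()

# Strings are immutable, so lower() returns a new string
# and leaves the original unchanged.
print(sentence == lowered)   # False
print(lowered.islower())     # True

# Lower-casing both sides makes a search case-insensitive.
print("UNITED States".lower() in lowered)   # True

# casefold() is a more aggressive variant intended for caseless
# matching, e.g. it maps the German sharp s to "ss".
print("Straße".lower())      # straße
print("Straße".casefold())   # strasse
```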

Now try the following


import string

sentence = "This (sentence) has a lot of punctuation[]!!!"
# The third argument of str.maketrans lists the characters to delete (Python 3)
new_sentence = sentence.translate(str.maketrans("", "", string.punctuation))

print(new_sentence)
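
To see why this removes punctuation, it can help to look at the translation table itself. The sketch below assumes Python 3, where the table is built with str.maketrans; the helper name strip_punctuation is hypothetical.

```python
import string

# string.punctuation is a plain string of the 32 ASCII punctuation characters.
print(string.punctuation)

# With three arguments, str.maketrans maps every character of the third
# argument to None, which translate() interprets as "delete".
DELETE_PUNCTUATION = str.maketrans("", "", string.punctuation)
print(DELETE_PUNCTUATION[ord("!")])   # None


def strip_punctuation(text):
    """Return `text` with all ASCII punctuation characters removed."""
    return text.translate(DELETE_PUNCTUATION)


print(strip_punctuation("This (sentence) has a lot of punctuation[]!!!"))
# prints: This sentence has a lot of punctuation
```

Note that string.punctuation covers ASCII only, so typographic characters such as curly quotes are not removed by this table.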

Another important technique in text pre-processing is tokenisation, which we discussed earlier.
Another aspect to consider is removing stop words, such as “the”, “a”, “on”, “is” and “all”. These words occur very frequently but carry little meaning, so they are usually removed from texts. NLTK makes this very easy.

Try the following code

# Requires the NLTK tokenizer and stop-word data (see nltk.download)
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

input_sentence = "Big Data Analytics is a scientific field, which has been expanded over the last decade by several research communities."
stop_words = set(stopwords.words("english"))
tokenized_words = word_tokenize(input_sentence)

# Print every token that is not a stop word
for token in tokenized_words:
    if token not in stop_words:
        print(token)
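
One detail worth knowing is that NLTK's English stop words are all lower case, so a capitalised token such as "The" is not matched unless it is lower-cased before the comparison. The sketch below illustrates this without NLTK: the stop-word set is a tiny hand-written stand-in, the regular expression is a simplified tokeniser, and the function name content_words is hypothetical.

```python
import re

# A tiny illustrative stop-word set (an assumption for this sketch);
# NLTK's English list is much longer.
STOP_WORDS = {"the", "a", "on", "is", "all", "has", "been", "by", "over", "which"}


def content_words(sentence):
    """Return the words of `sentence` that are not in STOP_WORDS."""
    # Simplified tokenisation: runs of letters, digits or apostrophes.
    # Punctuation is dropped as a side effect.
    tokens = re.findall(r"[A-Za-z0-9']+", sentence)
    # Compare in lower case so that "The" is treated like "the".
    return [token for token in tokens if token.lower() not in STOP_WORDS]


print(content_words("The report is on all the desks."))
# prints: ['report', 'desks']
```

The same lower-casing idea carries over to the NLTK version: comparing token.lower() against the stop-word set filters capitalised stop words as well.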
