Introduction of relevant libraries in Python

Text Pre-Processing

Once a text is loaded, the first step is to strip out unwanted characters and produce a uniform input by removing capital letters, digits and other potential sources of ambiguity. This process is called text normalisation, which includes:

  • Converting all letters to lower case
  • Changing numbers into words or removing numbers
  • Removing punctuation, accent marks, extra white space, etc.
  • Expanding (or removing in some cases) abbreviations
  • Removing stop words
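The steps listed above can be sketched with the standard library alone. This is a minimal illustration, not a definitive pipeline: the function name `normalise` and the tiny `ABBREVIATIONS` table are assumptions made up for this example, numbers are simply removed rather than spelled out, and stop-word removal is left out because it needs a word list.

```python
import re
import string
import unicodedata

# Hypothetical lookup table, for illustration only; a real one would be larger.
ABBREVIATIONS = {"approx.": "approximately", "e.g.": "for example"}


def normalise(text: str) -> str:
    """Apply a few common normalisation steps in a fixed order."""
    # 1. Convert all letters to lower case.
    text = text.lower()
    # 2. Expand known abbreviations before their full stops are stripped.
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)
    # 3. Remove accent marks: decompose characters, then drop combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # 4. Remove numbers.
    text = re.sub(r"\d+", "", text)
    # 5. Remove ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 6. Collapse runs of white space and trim both ends.
    return " ".join(text.split())


example = "  The Caf\u00e9 opened in 2019, approx. 3 km from here!  "
print(normalise(example))
```

The order matters: abbreviations are expanded before punctuation is removed, otherwise "approx." would already have lost its full stop and no longer match the table.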

Try the following code:

sentence = "Severe weather has been forecast for the entire United States."
new_sentence = sentence.lower()
print(new_sentence)

Explore the .lower() function.
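As a starting point for that exploration, the short sketch below shows two properties worth noticing: `.lower()` returns a new string and leaves the original untouched, and lowering both sides makes a comparison case-insensitive. The last lines mention `str.casefold()`, which is not part of the original exercise and is included only as a pointer for non-English text.

```python
sentence = "Severe weather has been forecast for the entire United States."
lowered = sentence.lower()

print(lowered)   # every letter is now lower case
print(sentence)  # the original string is unchanged, because strings are immutable

# Lowering both sides gives a simple case-insensitive comparison.
same_word = "NLTK".lower() == "nltk".lower()
print(same_word)

# casefold() is a more aggressive variant intended for caseless matching.
folded = "Stra\u00dfe".casefold()
print(folded)
```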

Now try the following:

import string
sentence = "This (sentence) has a lot of punctuation[]!!!"
new_sentence = sentence.translate(str.maketrans("", "", string.punctuation))
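To see what the translation step is doing, the sketch below builds the table separately and inspects it. It assumes Python 3, where the table comes from `str.maketrans`; the variable names and the second test string are made up for this illustration.

```python
import string

# Three-argument form: every character in the third argument is mapped to
# None, which tells str.translate to delete it.
table = str.maketrans("", "", string.punctuation)

sentence = "This (sentence) has a lot of punctuation[]!!!"
cleaned = sentence.translate(table)
print(cleaned)

# string.punctuation holds ASCII punctuation only, so typographic characters
# such as curly quotes are not in the table and pass through unchanged.
curly = "\u201cquoted\u201d, really!"
curly_cleaned = curly.translate(table)
print(curly_cleaned)
```

If the input may contain typographic punctuation, those characters would need to be added to the third argument explicitly.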


Another important technique in text pre-processing is tokenisation, which we discussed earlier.
A further aspect to consider is removing stop words, such as “the”, “a”, “on”, “is” and “all”. These words occur very frequently but carry little meaning, so they are usually removed from texts. NLTK makes this easy.

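Try the following code:

# Requires the NLTK data for the tokenizer and the stop-word list (see nltk.download)
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

input_sentence = "Big Data Analytics is a scientific field, which has been expanded over the last decade by several research communities."
stop_words = set(stopwords.words("english"))
tokenized_words = word_tokenize(input_sentence)

for token in tokenized_words:
    if token not in stop_words:
        # Assumed loop body (missing in the source): show each token that is kept
        print(token)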
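The same filtering idea can be tried without installing NLTK or downloading its data. The sketch below is only an approximation of the approach: `STOP_WORDS` is a small hand-written set invented for this example, not NLTK's list, and the regular-expression tokeniser is far cruder than `word_tokenize`.

```python
import re

# Hypothetical, deliberately tiny stop-word set for illustration only.
STOP_WORDS = {"the", "a", "on", "is", "all", "which", "has", "been", "over", "by"}


def remove_stop_words(text: str) -> list[str]:
    """Lower-case the text, split it into alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


input_sentence = (
    "Big Data Analytics is a scientific field, which has been expanded "
    "over the last decade by several research communities."
)
filtered = remove_stop_words(input_sentence)
print(filtered)
```

Lower-casing before the membership test is a deliberate choice: stop-word lists are normally stored in lower case, so a capitalised "The" at the start of a sentence would otherwise slip through.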

This article is from the free online course Introduction to Python for Big Data Analytics.

Created by
FutureLearn - Learning For Life
