Introduction of relevant libraries in Python

Text Pre-Processing

Once a text is loaded, the first step is to remove unwanted characters and produce a uniform input by eliminating capital letters, digits and other potential sources of ambiguity. This process is called text normalisation, which includes:

  • Converting all letters to lower case
  • Changing numbers into words or removing numbers
  • Removing punctuation, accent marks, extra white space, etc.
  • Expanding (or removing in some cases) abbreviations
  • Removing stop words
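
The steps listed above can be sketched in a few lines using only the Python standard library. This is a minimal illustration rather than a definitive pipeline: the function name normalise, the ABBREVIATIONS table and the short STOP_WORDS set are assumptions made for this example, and numbers are simply removed rather than converted into words.

```python
import re
import string
import unicodedata

# Hypothetical lookup tables, kept tiny for illustration only.
ABBREVIATIONS = {"u.s.": "united states"}
STOP_WORDS = {"the", "a", "on", "is", "all"}


def normalise(text):
    """Apply the normalisation steps in order and return a list of words."""
    text = text.lower()                          # convert to lower case
    for short, full in ABBREVIATIONS.items():    # expand abbreviations
        text = text.replace(short, full)
    text = re.sub(r"\d+", "", text)              # remove numbers
    # Strip accent marks: decompose characters, then drop combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    words = text.split()                         # also collapses white space
    return [w for w in words if w not in STOP_WORDS]


print(normalise("The café in the U.S.   has been open for 25 years!"))
```

The order of the steps matters: abbreviations are expanded before punctuation is stripped, because the full stops in "u.s." are what identify the abbreviation.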

Try the following code

sentence = "Severe weather has been forecast for the entire United States."
new_sentence = sentence.lower()
print(new_sentence)

Explore the .lower() function.
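
One way to explore .lower() is to check what it changes and what it leaves alone. The short sketch below uses only built-in string methods; the second example string is made up for illustration.

```python
sentence = "Severe weather has been forecast for the entire United States."
lowered = sentence.lower()

print(lowered)               # every cased letter is now lower case
print(sentence)              # the original is unchanged: strings are immutable
print("Route 66!".lower())   # digits and punctuation pass through untouched
print(lowered.islower())     # True when all cased characters are lower case
```

Because .lower() returns a new string, the result has to be assigned to a variable if it is needed later.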

Now try the following

import string
sentence = "This (sentence) has a lot of punctuation[]!!!"
new_sentence = sentence.translate(str.maketrans("", "", string.punctuation))  # Python 3 form
print(new_sentence)


Another important technique in text pre-processing is tokenisation, which we discussed earlier.
A further step to consider is removing stop words, such as “the”, “a”, “on”, “is”, “all”. These words occur very frequently but carry little meaning, so they are usually removed from texts. NLTK makes this easy.
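
To make the idea concrete before turning to NLTK, here is a minimal sketch using only the standard library. The function remove_stop_words, the regular-expression tokeniser and the five-word STOP_WORDS set are assumptions for illustration; NLTK's tokeniser and English stop word list are considerably more thorough.

```python
import re

# Hypothetical, hand-picked stop list; NLTK's English list is much longer.
STOP_WORDS = {"the", "a", "on", "is", "all"}


def remove_stop_words(text):
    """Split text into word tokens and drop the ones in STOP_WORDS."""
    tokens = re.findall(r"[A-Za-z']+", text)   # crude tokeniser: runs of letters
    # Compare in lower case so that "The" is treated like "the".
    return [t for t in tokens if t.lower() not in STOP_WORDS]


print(remove_stop_words("The cat is on the mat all day"))
```

Storing the stop words in a set keeps each membership check fast, which matters when filtering large texts.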

Try the following code

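from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-off setup: nltk.download("stopwords") plus the "punkt" tokeniser data ("punkt_tab" in newer NLTK releases)
input_sentence = "Big Data Analytics is a scientific field, which has been expanded over the last decade by several research communities."
stop_words = set(stopwords.words("english"))
tokenized_words = word_tokenize(input_sentence)

# Print every token that is not a stop word
for token in tokenized_words:
    if token not in stop_words:
        print(token)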

This article is from the free online course Introduction to Python for Big Data Analytics

Created by
FutureLearn - Learning For Life
