# Introduction to relevant libraries in Python

## Text Pre-Processing

Once a text is loaded, the first step is to provide a uniform input by removing capital letters, digits and any other potential source of ambiguity. This process is called text normalisation, and it includes:

• Converting all letters to lower case
• Changing numbers into words or removing numbers
• Removing punctuation, accent marks, extra white space, etc.
• Expanding (or removing in some cases) abbreviations
• Removing stop words
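The steps above can be sketched as a single function using only the Python standard library. This is a minimal illustration rather than a definitive pipeline: the function name `normalise` and the `ABBREVIATIONS` table are hypothetical, numbers are simply removed rather than converted into words, and stop-word removal is left out because it is covered separately.

```python
import re
import string
import unicodedata

# Hypothetical abbreviation table, for illustration only.
ABBREVIATIONS = {"e.g.": "for example", "approx.": "approximately"}


def normalise(text: str) -> str:
    """Apply a simple sequence of normalisation steps to a string."""
    # Convert all letters to lower case.
    text = text.lower()
    # Expand abbreviations before punctuation is stripped, since they contain dots.
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)
    # Remove numbers (converting them into words would need extra logic).
    text = re.sub(r"\d+", "", text)
    # Decompose accented characters, then drop the combining accent marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse runs of white space and trim both ends.
    return " ".join(text.split())


print(normalise("  The Café sold approx. 120 croissants, e.g. on Monday!  "))
```

The order of the steps matters: expanding abbreviations after punctuation removal would fail, because the dots that identify them would already be gone.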

Try the following code:

    sentence = "Severe weather has been forecast for the entire United States."
    new_sentence = sentence.lower()
    print(new_sentence)

Explore the .lower() string method.
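As a starting point for that exploration, the short sketch below applies `.lower()` to a few sample strings and compares it with `.casefold()`, a related built-in string method intended for caseless comparison. The sample words are chosen purely for illustration.

```python
# Sample strings chosen for illustration only.
words = ["United States", "NASA", "Straße", "R2-D2"]

# lower() only changes cased letters; digits and punctuation pass through untouched.
lowered = [word.lower() for word in words]

# casefold() is more aggressive and is intended for caseless comparison.
folded = [word.casefold() for word in words]

for word, low, fold in zip(words, lowered, folded):
    print(word, "->", low, "|", fold)
```

For plain English text the two methods give the same result; they differ for characters such as the German "ß".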

Now try the following:

    import string

    sentence = "This (sentence) has a lot of punctuation[]!!!"
    # In Python 3, the third argument of str.maketrans lists the characters to delete
    new_sentence = sentence.translate(str.maketrans("", "", string.punctuation))
    print(new_sentence)
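Note that `string.punctuation` only contains ASCII punctuation, so typographic characters such as curly quotes or an ellipsis would survive that approach. One possible alternative, sketched below, is to drop every character whose Unicode category starts with "P" (punctuation). The helper name `strip_unicode_punctuation` is hypothetical.

```python
import unicodedata


def strip_unicode_punctuation(text: str) -> str:
    """Remove every character that Unicode classifies as punctuation."""
    # Categories starting with "P" cover punctuation (Po, Ps, Pe, Pi, Pf, ...).
    # Symbols such as "$" or "+" are in "S" categories and are NOT removed here.
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))


sentence = "“Hello,” she said… ‘really’?"
print(strip_unicode_punctuation(sentence))
```

Whether symbols should also be removed depends on the task, so this is a design choice rather than a fixed rule.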

Another important technique in text pre-processing is tokenisation, which we discussed earlier.
A further step to consider is removing stop words, such as “the”, “a”, “on”, “is” and “all”. These words occur very frequently but carry little meaning, so they are usually removed from texts. NLTK makes this straightforward.

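Try the following code:

    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    # Assumes the NLTK stop-word and tokeniser data have already been downloaded with nltk.download()
    input_sentence = "Big Data Analytics is a scientific field, which has been expanded over the last decade by several research communities."
    stop_words = set(stopwords.words("english"))
    tokenized_words = word_tokenize(input_sentence)

    # Print every token that is not an English stop word
    for token in tokenized_words:
        if token not in stop_words:
            print(token)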
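To see the mechanics of stop-word filtering without any external data, here is a dependency-free sketch of the same idea. It is an illustration only: the `STOP_WORDS` set is a small hand-written sample rather than NLTK's English list, the regular expression is a crude stand-in for a real tokeniser, and the function name `remove_stop_words` is hypothetical. Unlike a direct membership test, it lower-cases each token before the comparison, so a capitalised "The" would also be filtered out.

```python
import re

# Small hand-written sample, NOT the NLTK English stop-word list.
STOP_WORDS = {"the", "a", "on", "is", "all", "which", "has", "been", "over", "by"}


def remove_stop_words(text: str) -> list[str]:
    """Split text into alphabetic tokens and drop those found in STOP_WORDS."""
    # Crude tokenisation: keep runs of letters, which also discards punctuation.
    tokens = re.findall(r"[A-Za-z]+", text)
    # Compare in lower case so that capitalised stop words are caught too.
    return [token for token in tokens if token.lower() not in STOP_WORDS]


input_sentence = (
    "Big Data Analytics is a scientific field, which has been expanded "
    "over the last decade by several research communities."
)
print(remove_stop_words(input_sentence))
```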