
Identification of requirements

Identification of requirements relevant to the activities.

Main Requirements

The main requirements in these case studies include the use of some specific Python libraries. More specifically:

• NLTK
• Seaborn

NLTK stands for Natural Language Toolkit. It provides an interface to vast lexical resources, as well as text processing tools for classification, tokenisation, tagging, parsing, etc. Let’s look at what these mean in more detail.

Activity

Open a new Jupyter notebook

Type the following

import nltk

sentence = "Big Data Analytics is an emergent and significant scientific field, which will drive innovation."

At this stage, all we have is a string which has been stored in the variable sentence.
However, we need to specify to the Python interpreter that the string consists of units (or tokens), which are equivalent to words

# The tokeniser models may need downloading first:
# nltk.download('punkt')
tokens = nltk.word_tokenize(sentence)
print(tokens)

What do you see?
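To see why a dedicated tokeniser is worth using, compare it with plain string splitting. The sketch below uses only the standard library: splitting on whitespace leaves punctuation attached to its neighbouring word, which is exactly what nltk.word_tokenize avoids.

```python
# Plain string splitting only cuts on whitespace, so punctuation
# stays attached to the neighbouring word.
sentence = ("Big Data Analytics is an emergent and significant "
            "scientific field, which will drive innovation.")

naive_tokens = sentence.split()
print(naive_tokens)
# Notice that "field," and "innovation." each come out as a single
# token, punctuation included -- word_tokenize separates them.
```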

The next step is to attach a lexical tag to each of those words.
These tags represent the corresponding lexical properties, for example

• NN: noun, common, singular
• NNP: noun, proper, singular
• NNS: noun, common, plural
• VB: verb, base form, etc.

Run this code

# The tagger model may need downloading first:
# nltk.download('averaged_perceptron_tagger')
tags = nltk.pos_tag(tokens)
print(tags)

What do you see?
Search for the different tags and identify what they refer to

What other commands does NLTK have?
Spend a few minutes familiarising yourself with the library
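As a starting point for that exploration, here is a small sketch of two further NLTK utilities, FreqDist (frequency counts) and bigrams (consecutive token pairs). The token list is hand-split so the example needs no downloaded corpora.

```python
import nltk

# A hand-split token list keeps the example self-contained
# (no corpora need downloading for these two utilities).
tokens = "big data analytics is a big and growing field".split()

# FreqDist counts how often each token occurs
freq = nltk.FreqDist(tokens)
print(freq.most_common(2))

# bigrams yields consecutive token pairs
pairs = list(nltk.bigrams(tokens))
print(pairs[:3])
```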

Visualisation via Matplotlib and Seaborn

Any visualisation approach within Data Analytics needs to address the following points:

• What is the story you want to tell?
• How to present it to optimise the message?

In this course, we will use matplotlib and Seaborn. Seaborn is a library specifically designed for statistical graphics in Python; it is built on matplotlib and integrates closely with pandas data structures.

One of the strongest features of Seaborn is that it helps you explore and better understand your data. It easily plots pandas dataframes and arrays containing whole datasets, automatically performing the pre-processing and statistical aggregation needed to display informative graphs.
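The automatic aggregation can be seen with a tiny hand-made dataframe (a hypothetical stand-in for a real dataset): given two bills per day, barplot computes and draws the mean per day without any explicit groupby.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, suitable for scripts
import pandas as pd
import seaborn as sns

# A tiny hand-made frame: two bills per day.
df = pd.DataFrame({
    "day":  ["Mon", "Mon", "Tue", "Tue"],
    "bill": [10.0, 20.0, 30.0, 50.0],
})

# barplot aggregates for us: each bar shows the mean bill per day.
ax = sns.barplot(data=df, x="day", y="bill")
print([p.get_height() for p in ax.patches])
```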

Let’s look at some examples.
Open a Jupyter notebook and type the following

import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()  # Create a figure containing a single axes.
ax.plot([1, 2, 3, 4], [-1, 2, -2, 4]);  # Plot some data on the axes.

What can you see?

Experiment with other graphs and functions, which are part of matplotlib. Explore scatter plots and try different examples.
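As one starting point for that experimentation, here is a minimal scatter-plot sketch using randomly generated data (the seed and colour mapping are arbitrary choices, not part of the activity):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, suitable for scripts
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
x = rng.random(50)
y = rng.random(50)

fig, ax = plt.subplots()
ax.scatter(x, y, c=y, cmap="viridis")  # colour each point by its y value
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("A simple scatter plot")
```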

Now, let’s consider seaborn. Again, open a Jupyter notebook and type the following

# Import seaborn
import seaborn as sns

# Apply the default theme
sns.set_theme()

# Load an example dataset
tips = sns.load_dataset("tips")

# Let’s plot
sns.relplot(
    data=tips,
    x="total_bill", y="tip",
    col="time",
    hue="smoker", style="smoker", size="size",
)

Understand what the different parameters do.

• What does the relplot() function do?
• What do ‘hue’ and ‘style’ do?
• What is ‘sns.set_theme()’ ?
• Experiment and try different graphs based on the above dataset