
Identification of requirements

Identification of requirements relevant to the activities.

Main Requirements

The main requirements in these case studies include the use of some specific Python libraries. More specifically:

• NLTK
• Seaborn

NLTK stands for Natural Language Toolkit. It provides an interface to vast lexical resources, as well as text processing tools for classification, tokenisation, tagging, parsing, etc. Let’s look at what these mean in more detail.

Activity

Open a new Jupyter notebook

Type the following

import nltk

sentence = "Big Data Analytics is an emergent and significant scientific field, which will drive innovation."

At this stage, all we have is a string which has been stored in the variable sentence.
However, we need to specify to the Python interpreter that the string consists of units (or tokens), which are equivalent to words

# The tokeniser models may need downloading first:
# nltk.download('punkt')
tokens = nltk.word_tokenize(sentence)
print(tokens)

What do you see?
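To see why a dedicated tokeniser is worth using, compare it with plain string splitting. The sketch below uses only the standard library: splitting on whitespace leaves punctuation attached to its neighbouring word, which is exactly what nltk.word_tokenize avoids.

```python
# Plain string splitting only cuts on whitespace, so punctuation
# stays attached to the neighbouring word.
sentence = ("Big Data Analytics is an emergent and significant "
            "scientific field, which will drive innovation.")

naive_tokens = sentence.split()
print(naive_tokens)
# Notice that "field," and "innovation." each come out as a single
# token, punctuation included -- word_tokenize separates them.
```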

The next step is to attach a lexical tag to each of those words.
These tags represent the corresponding lexical properties, for example

• NN: noun, common, singular
• NNP: noun, proper, singular
• NNS: noun, common, plural
• VB: verb, base form, etc.

Run this code

# The tagger model may need downloading first:
# nltk.download('averaged_perceptron_tagger')
tags = nltk.pos_tag(tokens)
print(tags)

What do you see?
Search for the different tags and identify what they refer to

What other commands does NLTK have?
Spend a few minutes familiarising yourself with the library
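As a starting point for that exploration, here is a small sketch of two further NLTK utilities, FreqDist (frequency counts) and bigrams (consecutive token pairs). The token list is hand-split so the example needs no downloaded corpora.

```python
import nltk

# A hand-split token list keeps the example self-contained
# (no corpora need downloading for these two utilities).
tokens = "big data analytics is a big and growing field".split()

# FreqDist counts how often each token occurs
freq = nltk.FreqDist(tokens)
print(freq.most_common(2))

# bigrams yields consecutive token pairs
pairs = list(nltk.bigrams(tokens))
print(pairs[:3])
```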

Visualisation via Matplotlib and Seaborn

Any visualisation approach within Data Analytics needs to address the following points:

• What is the story you want to tell?
• How to present it to optimise the message?

In this course, we will use matplotlib and Seaborn. Seaborn is a library specifically designed for statistical graphics in Python; it is built on matplotlib and integrates closely with pandas data structures.

One of the strongest features of Seaborn is that it helps you explore and better understand your data. It easily plots pandas dataframes and arrays containing whole datasets, automatically performing the pre-processing and statistical aggregation needed to display informative graphs.
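The automatic aggregation can be seen with a tiny hand-made dataframe (a hypothetical stand-in for a real dataset): given two bills per day, barplot computes and draws the mean per day without any explicit groupby.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, suitable for scripts
import pandas as pd
import seaborn as sns

# A tiny hand-made frame: two bills per day.
df = pd.DataFrame({
    "day":  ["Mon", "Mon", "Tue", "Tue"],
    "bill": [10.0, 20.0, 30.0, 50.0],
})

# barplot aggregates for us: each bar shows the mean bill per day.
ax = sns.barplot(data=df, x="day", y="bill")
print([p.get_height() for p in ax.patches])
```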

Let’s look at some examples.
Open a Jupyter notebook and type the following

import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()  # Create a figure containing a single axes.
ax.plot([1, 2, 3, 4], [-1, 2, -2, 4]);  # Plot some data on the axes.

What can you see?

Experiment with other graphs and functions, which are part of matplotlib. Explore scatter plots and try different examples.
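As one starting point for that experimentation, here is a minimal scatter-plot sketch using randomly generated data (the seed and colour mapping are arbitrary choices, not part of the activity):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, suitable for scripts
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
x = rng.random(50)
y = rng.random(50)

fig, ax = plt.subplots()
ax.scatter(x, y, c=y, cmap="viridis")  # colour each point by its y value
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("A simple scatter plot")
```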

Now, let’s consider seaborn. Again, open a Jupyter notebook and type the following

# Import seaborn
import seaborn as sns

# Apply the default theme
sns.set_theme()

# Load an example dataset
tips = sns.load_dataset("tips")

# Let’s plot
sns.relplot(
    data=tips,
    x="total_bill", y="tip",
    col="time",
    hue="smoker", style="smoker", size="size",
)

Understand what the different parameters do.

• What does the relplot() function do?
• What do ‘hue’ and ‘style’ do?
• What is ‘sns.set_theme()’ ?
• Experiment and try different graphs based on the above dataset