Identification of requirements

Identification of requirements relevant to the activities.

Main Requirements

The main requirements in these case studies include the use of some specific Python libraries, namely:

  • NLTK
  • Seaborn

NLTK stands for Natural Language Toolkit. It provides an interface to extensive lexical resources, as well as text processing tools for classification, tokenisation, tagging, parsing, and more. Let’s look at some of these in more detail.

Activity

Open a new Jupyter notebook.

Type the following:

import nltk

sentence = "Big Data Analytics is an emergent and significant scientific field, which will drive innovation."

At this stage, all we have is a string stored in the variable sentence.
However, we need to split the string into units (or tokens), which roughly correspond to words and punctuation marks. If NLTK raises a LookupError about a missing resource, run the nltk.download(...) command given in the error message and try again.

tokens = nltk.word_tokenize(sentence)
print(tokens)

What do you see?
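
To build some intuition for what tokenisation produces, here is a minimal sketch that splits the same sentence using only Python's built-in re module. It is a simplified stand-in, not the algorithm NLTK uses; the function name simple_tokenize and the regular expression are illustrative assumptions.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens.

    A rough illustration only; nltk.word_tokenize handles many more cases
    (contractions, abbreviations, and so on).
    """
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "Big Data Analytics is an emergent and significant scientific field, which will drive innovation."
tokens = simple_tokenize(sentence)
print(tokens)
print(len(tokens), "tokens")
```

Note that the comma and the full stop become tokens of their own, which is also how NLTK treats them in this sentence.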

The next step is to attach a part-of-speech tag to each of those tokens.
These tags describe the grammatical category of each word, for example:

  • NN: noun, common, singular
  • NNP: noun, proper, singular
  • NNS: noun, common, plural
  • VB: verb, base form

Run this code:

tags = nltk.pos_tag(tokens)
print(tags)

What do you see?
Look up the different tags in the output and identify what each one refers to.
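
One way to keep track of the tags while you explore is a small lookup table. The sketch below covers only the four tags listed above; the tagged pairs are hand-written for illustration and are not actual NLTK output, and the names TAG_MEANINGS and describe are hypothetical.

```python
# Meanings of the four tags listed above (the full tag set is much larger)
TAG_MEANINGS = {
    "NN": "noun, common, singular",
    "NNP": "noun, proper, singular",
    "NNS": "noun, common, plural",
    "VB": "verb, base form",
}

def describe(tagged_tokens):
    """Return one readable line per (word, tag) pair."""
    lines = []
    for word, tag in tagged_tokens:
        meaning = TAG_MEANINGS.get(tag, "not in our table yet, look it up")
        lines.append(f"{word}: {tag} ({meaning})")
    return lines

# Hand-written example pairs in the same (word, tag) format that pos_tag returns
example = [("field", "NN"), ("drive", "VB"), ("is", "VBZ")]
for line in describe(example):
    print(line)
```

You can pass the tags list from the previous step to describe() and extend the table as you identify more tags.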

What other commands does NLTK have?
Spend a few minutes familiarising yourself with the library.

Visualisation via Matplotlib and Seaborn

Any visualisation approach within Data Analytics needs to address the following points:

  • Who is your audience?
  • What is the story you want to tell?
  • How to present it to optimise the message?

In this course, we will use matplotlib and Seaborn. Matplotlib is a general-purpose plotting library for Python, while Seaborn is designed specifically for statistical graphics. Seaborn is built on top of matplotlib and integrates closely with pandas data structures.

One of the strongest features of Seaborn is that it helps you explore and better understand your data. Its plotting functions work directly on pandas dataframes and arrays containing whole datasets, automatically performing the pre-processing and statistical aggregation needed to display informative graphs.

Let’s look at some examples.
Open a Jupyter notebook and type the following:


import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots() # Create a figure containing a single axes.
ax.plot([1, 2, 3, 4], [-1, 2, -2, 4]); # Plot some data on the axes.

What can you see?

Experiment with the other plot types and functions that matplotlib provides. Explore scatter plots and try different examples.
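
As a starting point for the scatter plot exercise, here is a minimal sketch using ax.scatter. The data values, labels and title are made up for illustration.

```python
import matplotlib.pyplot as plt

# Made-up sample data for illustration
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

fig, ax = plt.subplots()        # one figure containing a single axes
ax.scatter(x, y, marker="o")    # draw each (x, y) pair as a separate point
ax.set_xlabel("x value")
ax.set_ylabel("y value")
ax.set_title("A simple scatter plot")
```

Try changing the marker argument or adding a second call to ax.scatter with different data to see how the plot changes.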

Now, let’s consider Seaborn. Again, open a Jupyter notebook and type the following:

# Import seaborn
import seaborn as sns

# Apply the default theme
sns.set_theme()

# Load an example dataset
tips = sns.load_dataset("tips")

# Let’s plot (column names must match the dataset exactly: total_bill, tip)
sns.relplot(
    data=tips,
    x="total_bill", y="tip", col="time",
    hue="smoker", style="smoker", size="size",
)

Work out what the different parameters do:

  • What does the relplot() function do?
  • What do ‘hue’ and ‘style’ do?
  • What does ‘sns.set_theme()’ do?
  • Experiment and try different graphs based on the above dataset
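
If you want to experiment without downloading a dataset, the sketch below builds a tiny DataFrame by hand and passes it to relplot(). The values are invented and only the column names mirror the tips dataset; it is meant as a sandbox for trying parameters, not as a replacement for the exercise above.

```python
import pandas as pd
import seaborn as sns

# Invented values; only the column names follow the tips dataset
df = pd.DataFrame({
    "total_bill": [10.5, 21.0, 15.2, 30.4, 18.7, 25.1],
    "tip":        [1.5, 3.0, 2.2, 5.0, 2.8, 4.1],
    "time":       ["Lunch", "Lunch", "Lunch", "Dinner", "Dinner", "Dinner"],
    "smoker":     ["No", "Yes", "No", "Yes", "No", "Yes"],
})

# relplot returns a FacetGrid; col="time" requests one panel per distinct value
grid = sns.relplot(data=df, x="total_bill", y="tip", col="time", hue="smoker")
print(grid.axes.shape)  # (rows, columns) of panels in the grid
```

Try removing col="time" or swapping hue for style and compare the resulting figures.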
This article is from the free online course Introduction to Python for Big Data Analytics

Created by
FutureLearn - Learning For Life
