Introduction to Data Analytics libraries

Pandas

Pandas is a Python package providing fast, flexible, and expressive data structures designed to work with relational or labelled data. It is a fundamental high-level building block for doing practical, real-world data analysis in Python.

Pandas is well suited for:

  • Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
  • Ordered and unordered (not necessarily fixed-frequency) time-series data
  • Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
  • Any other form of observational/statistical data sets. The data actually need not be labelled at all to be placed into a pandas data structure

Key features:

  • Easy handling of missing data
  • Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
  • Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the data can be aligned automatically
  • Powerful, flexible group by functionality to perform split-apply-combine operations on data sets
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
  • Intuitive merging and joining of data sets
  • Flexible reshaping and pivoting of data sets
  • Hierarchical labelling of axes
  • Robust IO tools for loading data from flat files, Excel files, databases, and HDF5
  • Time-series functionality: date range generation and frequency conversion, moving window statistics, date shifting and lagging, etc.
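A few of the features listed above can be sketched in a handful of lines. The data below (city names, unit counts, regions) is made up for illustration and does not come from the course material; the sketch only touches missing data, group by, alignment, and merging.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, invented for this sketch.
sales = pd.DataFrame({
    "city": ["Leeds", "Leeds", "York", "York"],
    "units": [10.0, np.nan, 7.0, 3.0],
})

# Missing data: NaN marks the gap, and sum() skips it by default.
total_units = sales["units"].sum()
filled = sales["units"].fillna(0)

# Split-apply-combine: total units per city.
per_city = sales.groupby("city")["units"].sum()

# Automatic alignment: values are matched on labels, not on position.
a = pd.Series([1, 2], index=["x", "y"])
b = pd.Series([10, 20], index=["y", "z"])
aligned = a + b  # only "y" exists in both, so "x" and "z" become NaN

# Merging: attach a region to each sale by matching on "city".
regions = pd.DataFrame({
    "city": ["Leeds", "York"],
    "region": ["West Yorkshire", "North Yorkshire"],
})
merged = sales.merge(regions, on="city", how="left")

print(per_city)
print(aligned)
print(merged)
```

Printing the intermediate objects one at a time is a reasonable way to see what each feature does before combining them.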

Pandas Data Structures: Series

A Series is a single vector of data (like a NumPy array) with an index that labels each element in the vector. If an index is not specified, a default sequence of integers starting at 0 is assigned as the index. A NumPy array holds the values of the Series, while the index is a pandas Index object.

For example:

import pandas as pd
counts = pd.Series([632, 1638, 569, 115])
counts

0     632
1    1638
2     569
3     115
dtype: int64

counts.values
array([632, 1638, 569, 115])
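To see what the index adds, the same values can be given explicit labels. This is a minimal sketch: the labels "a" to "d" are placeholders chosen for illustration, not part of the original example.

```python
import pandas as pd

# The Series from the example above, with its default integer index.
counts = pd.Series([632, 1638, 569, 115])
print(counts.index)

# The same values with hypothetical string labels as the index.
labelled = pd.Series([632, 1638, 569, 115], index=["a", "b", "c", "d"])

by_label = labelled["b"]            # look up by label
by_position = labelled.iloc[0]      # look up by integer position
large = labelled[labelled > 600]    # boolean filtering keeps the labels

print(by_label, by_position)
print(large)
```

Label lookups and positional lookups are kept separate here (`[...]` with a label, `.iloc` with a position), which avoids ambiguity once the index is no longer the default integers.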

Pandas Data Structures: DataFrame

A DataFrame is a tabular data structure that holds multiple Series as columns sharing one index, like the columns in a spreadsheet. Data are stored internally as a 2-dimensional object, but a DataFrame can also represent and manipulate higher-dimensional data through hierarchical (multi-level) indexing.

See, for example, the following picture depicting a DataFrame loaded from a CSV file.

[Figure: Pandas DataFrame, showing how to import the pandas library and some examples of its use.]
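One way a 2-dimensional DataFrame can carry higher-dimensional data is a multi-level row index. The sketch below assumes a made-up data set of weights recorded per patient and per visit; the names and numbers are illustrative only.

```python
import pandas as pd

# Hypothetical measurements keyed by two dimensions: patient and visit.
index = pd.MultiIndex.from_product(
    [["p1", "p2"], [1, 2]], names=["patient", "visit"]
)
df = pd.DataFrame({"weight": [70.1, 69.8, 82.4, 81.9]}, index=index)

# Each column of a DataFrame is itself a Series.
weight = df["weight"]

# Select every visit for one patient (the result is indexed by visit).
p1 = df.loc["p1"]

# Select a single value using both index levels.
w = df.loc[("p2", 1), "weight"]

# Pivot the "visit" level into columns to get a patient-by-visit table.
wide = weight.unstack("visit")

print(df)
print(wide)
```

The long form (`df`) and the wide form (`wide`) hold the same numbers; which one is more convenient depends on the analysis.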

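Try the following code:

# Import pandas under its conventional alias
import pandas as pd

# Each dictionary key becomes a column; each list holds that column's values
input_users = {'Name': ['Sarah', 'Lucas', 'Debbie', 'Joanna'],
               'Age': [41, 51, 87, 69]}

# Build the DataFrame and display it
df = pd.DataFrame(input_users)
print(df)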

What does it do? Explore the different components and try different examples.
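As one possible starting point for that exploration, the sketch below rebuilds the same DataFrame and inspects a few of its components. The age threshold of 60 is an arbitrary choice for illustration.

```python
import pandas as pd

input_users = {'Name': ['Sarah', 'Lucas', 'Debbie', 'Joanna'],
               'Age': [41, 51, 87, 69]}
df = pd.DataFrame(input_users)

# Overall shape: (number of rows, number of columns).
print(df.shape)

# Column labels and the data type pandas inferred for each column.
print(df.columns.tolist())
print(df.dtypes)

# A single column is a Series that shares the DataFrame's row index.
ages = df['Age']
mean_age = ages.mean()
print(mean_age)

# Boolean filtering: keep only the rows where Age exceeds 60.
older = df[df['Age'] > 60]
print(older)
```

Changing the dictionary (adding a column, or using lists of different lengths) and rerunning is a quick way to learn how the constructor behaves.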

This article is from the free online course Introduction to Python for Big Data Analytics, created by FutureLearn.
