What are the essential operations in Pandas?

Data analysts spend a significant amount of time cleaning and preparing data sets to work on. They must possess the necessary tools and ability to work with messy data sets, missing values, inconsistencies, and ambiguous data.

Pandas has certain essential operations that data analysts need to use to interact with the data stored in Series and DataFrame. These operations allow data analysts to get data into a workable form before the data analysis.

This content is taken from the FutureLearn online course Introduction to Data Analytics with Python.

Reindexing

A common operation on Pandas data structures is reindexing, which creates a new object whose data is rearranged to conform to a new index.

If the new index contains labels that are not present in the original data, missing values (NaN) are inserted for those labels.

Code:

import numpy as np
from pandas import Series
a = Series(np.random.randn(10), index=['a','b','c','d','e','f','g','h','i','j'])
a

Output:

a 0.591050
b -0.952670
c -0.948599
d 0.091596
e -1.096649
f 0.199346
g 0.856941
h -0.086180
i -2.623903
j 0.271230
dtype: float64

Code:

new_index = ['a','A1','b','B1','c','C1','d','e','f','g','h','i','j']
a_new = a.reindex(new_index)
a_new

Output:

a 0.591050
A1 NaN
b -0.952670
B1 NaN
c -0.948599
C1 NaN
d 0.091596
e -1.096649
f 0.199346
g 0.856941
h -0.086180
i -2.623903
j 0.271230
dtype: float64
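
The random values above change on every run. As a minimal, reproducible sketch of the same idea (the fixed values and the names s, s_new, and introduced are illustrative assumptions, not part of the course notebook), the following shows that reindex returns a new object, leaves the original untouched, and marks labels that had no data with NaN:

```python
import pandas as pd

# Illustrative data; the values are arbitrary and fixed so the result is reproducible.
s = pd.Series([10.0, 20.0, 30.0], index=['a', 'b', 'c'])

# 'A1' and 'B1' do not exist in s, so reindex inserts NaN for them.
s_new = s.reindex(['a', 'A1', 'b', 'B1', 'c'])

# Labels that were introduced by the new index.
introduced = s_new[s_new.isna()].index.tolist()
print(introduced)        # ['A1', 'B1']

# The original Series is unchanged.
print(s.index.tolist())  # ['a', 'b', 'c']
```

Filtering on isna() is one simple way to list the gaps that a reindex has introduced.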

Handling missing values during reindexing

Imagine a situation where you are processing employee records. However, many of the employees have supplied incomplete information. You need a way to handle these cases and highlight the gaps to follow up with them. Perhaps you could insert ‘Unknown’ into all the empty fields to make the missing values easy to identify.

There are two main ways to handle the missing values introduced by reindexing. We can:

- either specify a particular value to be filled, by passing the parameter fill_value=<value to be filled> to the reindex method

For example:

Code:

a_fillvalue = a.reindex(new_index, fill_value=0)
a_fillvalue

Output:

a 0.591050
A1 0.000000
b -0.952670
B1 0.000000
c -0.948599
C1 0.000000
d 0.091596
e -1.096649
f 0.199346
g 0.856941
h -0.086180
i -2.623903
j 0.271230
dtype: float64
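
The same parameter can cover the employee-records scenario described earlier. The sketch below is only an illustration: the employee IDs, phone numbers, and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical records: phone numbers keyed by employee ID.
phones = pd.Series({'E001': '555-0100', 'E003': '555-0102'})

# The full staff list includes employees who supplied no phone number.
all_staff = ['E001', 'E002', 'E003', 'E004']

# Employees missing from the data get the placeholder 'Unknown' instead of NaN.
phones_full = phones.reindex(all_staff, fill_value='Unknown')

# Employees to follow up with.
follow_up = phones_full[phones_full == 'Unknown'].index.tolist()
print(follow_up)  # ['E002', 'E004']
```

Note that fill_value applies only to labels introduced by the reindex. NaN values that were already in the data are left as they are.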

Or, we can choose one of the predefined fill methods by passing the parameter method=<predefined method value>. This is handy for operations such as forward fill ('ffill'), backward fill ('bfill'), and nearest-value fill ('nearest'), for instance in time-series data analysis. Note that method requires the original index to be monotonically increasing or decreasing.

For example:

Code:

a = Series(np.random.randn(10), index=[0,2,4,6,8,10,12,14,16,18])
a

Output:

0 1.036439
2 -0.841819
4 0.629621
6 -1.905720
8 1.673387
10 0.792506
12 0.267104
14 0.759571
16 -0.847925
18 -0.598402
dtype: float64

Code:

# Reindex so that the labels 1, 3, 5, ... are introduced into the series
a_new = a.reindex(range(20))
a_new

Output:

0 1.036439
1 NaN
2 -0.841819
3 NaN
4 0.629621
5 NaN
6 -1.905720
7 NaN
8 1.673387
9 NaN
10 0.792506
11 NaN
12 0.267104
13 NaN
14 0.759571
15 NaN
16 -0.847925
17 NaN
18 -0.598402
19 NaN
dtype: float64

Code:

# Perform the same reindex, but forward-fill the missing values that it introduces
a_ffill = a.reindex(range(20), method='ffill')
a_ffill

Output:

0 1.036439
1 1.036439
2 -0.841819
3 -0.841819
4 0.629621
5 0.629621
6 -1.905720
7 -1.905720
8 1.673387
9 1.673387
10 0.792506
11 0.792506
12 0.267104
13 0.267104
14 0.759571
15 0.759571
16 -0.847925
17 -0.847925
18 -0.598402
19 -0.598402
dtype: float64

Look at labels 1, 3, and 5: each has been populated with the value from the preceding label.
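
To see how the fill direction matters, here is a small sketch comparing forward fill with backward fill. The data and variable names are illustrative assumptions, chosen so that the result is easy to check by hand.

```python
import pandas as pd

# Illustrative series with data on even labels only.
s = pd.Series([1.0, 2.0, 3.0], index=[0, 2, 4])
new_index = range(6)  # labels 0..5, which introduces 1, 3 and 5

# Forward fill: each new label takes the value of the previous existing label.
ffill = s.reindex(new_index, method='ffill')

# Backward fill: each new label takes the value of the next existing label.
bfill = s.reindex(new_index, method='bfill')

print(ffill.tolist())  # [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
print(bfill.tolist())  # [1.0, 2.0, 2.0, 3.0, 3.0, nan]
```

With backward fill, label 5 stays NaN because no later label exists to take a value from.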

For the complete list of parameters of the reindex method, refer to the Pandas documentation:

Read: Pandas documentation for Series reindexing [1]

Read: Pandas documentation for DataFrame reindexing [2]
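
The examples so far use a Series. As a brief sketch of the DataFrame case, reindex can conform the rows, the columns, or both. The table below is hypothetical and its numbers are placeholders, not real figures.

```python
import pandas as pd

# Hypothetical table; labels and values are made up for illustration.
df = pd.DataFrame(
    {'population': [1.0, 2.0], 'gdp': [3.0, 4.0]},
    index=['NSW', 'VIC'],
)

# Reindex rows: 'QLD' is new, so its row is filled with NaN.
rows = df.reindex(['NSW', 'QLD', 'VIC'])

# Reindex columns: 'area' is new, so the column is filled with NaN.
cols = df.reindex(columns=['population', 'area', 'gdp'])

print(rows.index.tolist())            # ['NSW', 'QLD', 'VIC']
print(rows.loc['QLD'].isna().all())   # True
print(cols.columns.tolist())          # ['population', 'area', 'gdp']
```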

Deleting entries

We often need to delete data from a Pandas Series or DataFrame. You can do this using the drop() method, which is available on both Series and DataFrame. The method accepts an index label, or a list of index labels, to be dropped.

By default, drop() returns a new object containing only the remaining values. It does not modify the data in place, so the original Series or DataFrame is preserved and still available after the drop operation. In practical terms, the method creates a selective copy of the data.
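
A minimal sketch of this copy behaviour, using illustrative data and variable names that are not from the course notebook:

```python
import pandas as pd

s = pd.Series([0, 1, 2], index=['a', 'b', 'c'])

# drop returns a new Series; s itself is not modified.
s_dropped = s.drop('b')

print(s_dropped.index.tolist())  # ['a', 'c']
print(s.index.tolist())          # ['a', 'b', 'c']

# To keep only the result, rebind the name to the returned object.
s = s.drop('b')
print(s.index.tolist())          # ['a', 'c']
```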

Deleting entries from Pandas Series

Let’s look at how to delete entries from a Pandas Series.

- Drop single index.

Code:

b = Series(np.arange(10), index=['a','b','c','d','e','f','g','h','i','j'])
b

Output:

a 0
b 1
c 2
d 3
e 4
f 5
g 6
h 7
i 8
j 9
dtype: int32

Code:

# Drop the entry with index label 'b'
new_series = b.drop('b')
new_series

Output:

a 0
c 2
d 3
e 4
f 5
g 6
h 7
i 8
j 9
dtype: int32

- Drop multiple indexes.

Code:

# Drop multiple index labels at once,
# e.g. a, g, and j
new_series_1 = b.drop(['a','g','j'])
new_series_1

Output:

b 1
c 2
d 3
e 4
f 5
h 7
i 8
dtype: int32
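
One practical detail when dropping by label: if a requested label is not in the index, drop raises a KeyError by default. The sketch below, with illustrative data, shows that behaviour and the errors='ignore' option of drop, which skips labels that are not found.

```python
import pandas as pd

s = pd.Series([0, 1, 2], index=['a', 'b', 'c'])

# By default, dropping a label that is not in the index raises KeyError.
try:
    s.drop('z')
    raised = False
except KeyError:
    raised = True
print(raised)  # True

# errors='ignore' skips missing labels and drops only the ones that exist.
result = s.drop(['a', 'z'], errors='ignore')
print(result.index.tolist())  # ['b', 'c']
```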

Deleting entries from Pandas DataFrame

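In the case of a DataFrame, labels can be dropped from either axis: row labels (using the index parameter) and column names (using the columns parameter). Alternatively, pass the labels positionally together with axis=0 for rows (the default) or axis=1 for columns, as the snippets here do.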

The following code snippets demonstrate this behaviour:

- Removing a row from DataFrame.

Code:

df_states

Output:

Screenshot from the Jupyter Notebook, showing columns for state name, abbreviation, timezone, population and GDP.

Code:

df_states_noNT = df_states.drop('NT')
df_states_noNT

Output:

Screenshot from the Jupyter Notebook. The screenshot shows an example of removing rows. The NT row has been removed.

- Removing multiple columns from DataFrame by passing a sequence of column labels and axis=1.

Code:

df_states

Output:

Screenshot from the Jupyter Notebook showing columns for state name, abbreviation, timezone, population and GDP.

Code:

df1 = df_states.drop(['state','area'], axis=1)
df1

Output:

Screenshot from the Jupyter Notebook. The screenshot shows an example of removing multiple columns. The columns removed are area and state.

The original DataFrame is unchanged by the drop operations:

Code:

df_states

Output:

Screenshot from the Jupyter Notebook showing columns for state name, abbreviation, timezone, population and GDP.
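
Because df_states appears here only in screenshots, the following self-contained sketch uses a small hypothetical stand-in for it. The column names, index labels, and placeholder numbers are assumptions for illustration. It shows the index and columns parameters of drop, which are equivalent to the positional and axis=1 forms used above.

```python
import pandas as pd

# Hypothetical stand-in for df_states; the course's real data is not reproduced here.
df_states = pd.DataFrame(
    {
        'state': ['Northern Territory', 'Tasmania', 'Victoria'],
        'timezone': ['ACST', 'AEST', 'AEST'],
        'area': [1.0, 2.0, 3.0],  # placeholder numbers, not real figures
    },
    index=['NT', 'TAS', 'VIC'],
)

# Drop a row by label (equivalent to df_states.drop('NT')).
no_nt = df_states.drop(index='NT')

# Drop columns by name (equivalent to df_states.drop(['state', 'area'], axis=1)).
no_state_area = df_states.drop(columns=['state', 'area'])

print(no_nt.index.tolist())            # ['TAS', 'VIC']
print(no_state_area.columns.tolist())  # ['timezone']

# The original DataFrame still has all of its rows and columns.
print(df_states.shape)                 # (3, 3)
```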

References

[1] Pandas documentation for Series reindexing [Internet]. Pandas; [date unknown]. Available from: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.reindex.html
[2] Pandas documentation for DataFrame reindexing [Internet]. Pandas; [date unknown]. Available from: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html
