
What are the essential operations in Pandas?

Pandas has certain essential operations that data analysts need to use to interact with the data stored in Series and DataFrame.

Data analysts spend a significant amount of time cleaning and preparing data sets to work on. They must possess the necessary tools and ability to work with messy data sets, missing values, inconsistencies, and ambiguous data.

The essential operations covered here let data analysts get data into a workable form before the analysis begins.

Reindexing

Reindexing is a common operation on Pandas data structures: it creates a new object in which the data is rearranged to conform to a new index.

If the new index contains labels that are not present in the original data, missing values (NaN) are inserted for those labels.

Code:

import numpy as np
from pandas import Series
a = Series(np.random.randn(10), index=['a','b','c','d','e','f','g','h','i','j'])
a

Output:

a    0.591050
b   -0.952670
c   -0.948599
d    0.091596
e   -1.096649
f    0.199346
g    0.856941
h   -0.086180
i   -2.623903
j    0.271230
dtype: float64

Code:

new_index = ['a','A1','b','B1','c','C1','d','e','f','g','h','i','j']
a_new = a.reindex(new_index)
a_new

Output:

a     0.591050
A1         NaN
b    -0.952670
B1         NaN
c    -0.948599
C1         NaN
d     0.091596
e    -1.096649
f     0.199346
g     0.856941
h    -0.086180
i    -2.623903
j     0.271230
dtype: float64

Handling missing values during reindexing

Imagine a situation where you are processing employee records. However, many of the employees have supplied incomplete information. You need a way to handle these cases and highlight the gaps to follow up with them. Perhaps you could insert ‘Unknown’ into all the empty fields to make the missing values easy to identify.

There are two main ways to handle missing values during reindexing. One is to specify a particular value to be filled, by adding the parameter fill_value=<value to be filled> to the reindex method.

For example:

Code:

a_fillvalue = a.reindex(new_index, fill_value=0)
a_fillvalue

Output:

a     0.591050
A1    0.000000
b    -0.952670
B1    0.000000
c    -0.948599
C1    0.000000
d     0.091596
e    -1.096649
f     0.199346
g     0.856941
h    -0.086180
i    -2.623903
j     0.271230
dtype: float64
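
The employee-records scenario described earlier could be sketched as follows. This is only an illustration: the employee IDs, department names, and variable names are hypothetical and do not come from the course data.

```python
import pandas as pd

# Hypothetical employee records: department keyed by employee ID.
departments = pd.Series(
    ["Finance", "Sales", "IT"],
    index=["E001", "E003", "E004"],
)

# The full staff list includes employees who supplied no department.
all_staff = ["E001", "E002", "E003", "E004", "E005"]

# Labels missing from the original Series receive 'Unknown' instead of NaN.
complete = departments.reindex(all_staff, fill_value="Unknown")
print(complete)

# List the employees to follow up with.
to_follow_up = complete[complete == "Unknown"].index.tolist()
print(to_follow_up)
```

Using a sentinel string such as 'Unknown' makes the gaps easy to spot and filter, at the cost of mixing a placeholder into the real values.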

 

Alternatively, we can choose one of the predefined fill strategies by passing the parameter method=<predefined method value>. Strategies such as forward fill, backward fill, and nearest-value fill are handy in, for instance, time-series data analysis. They require the original index to be sorted (monotonically increasing or decreasing).

For example:

Code:

a = Series(np.random.randn(10), index=[0,2,4,6,8,10,12,14,16,18])
a

 

Output:

0     1.036439
2    -0.841819
4     0.629621
6    -1.905720
8     1.673387
10    0.792506
12    0.267104
14    0.759571
16   -0.847925
18   -0.598402
dtype: float64

Code:

## Reindex so that indexes 1,3,5... are introduced in the series
a_new = a.reindex(range(20))
a_new

Output:

0     1.036439
1          NaN
2    -0.841819
3          NaN
4     0.629621
5          NaN
6    -1.905720
7          NaN
8     1.673387
9          NaN
10    0.792506
11         NaN
12    0.267104
13         NaN
14    0.759571
15         NaN
16   -0.847925
17         NaN
18   -0.598402
19         NaN
dtype: float64

Code:

## Perform the same reindex, but forward-fill the newly introduced indexes

a_ffill = a.reindex(range(20), method='ffill')
a_ffill

Output:

0     1.036439
1     1.036439
2    -0.841819
3    -0.841819
4     0.629621
5     0.629621
6    -1.905720
7    -1.905720
8     1.673387
9     1.673387
10    0.792506
11    0.792506
12    0.267104
13    0.267104
14    0.759571
15    0.759571
16   -0.847925
17   -0.847925
18   -0.598402
19   -0.598402
dtype: float64

Look at indexes 1, 3, and 5: each has been populated with the value from the preceding index.
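
Forward fill is not the only option for the method parameter. The following minimal sketch, using a small Series with made-up values, shows how 'bfill' (backward fill) and 'nearest' behave on labels that are new to the index.

```python
import pandas as pd

# A small Series whose index has gaps (illustrative values only).
s = pd.Series([10.0, 20.0, 30.0], index=[0, 3, 6])

# Backward fill: each new label takes the value of the next existing label.
# Label 7 has no later label to borrow from, so it stays NaN.
s_bfill = s.reindex(range(8), method="bfill")

# Nearest: each new label takes the value of the closest existing label.
s_nearest = s.reindex(range(8), method="nearest")

print(s_bfill)
print(s_nearest)
```

Note that these strategies apply only to labels introduced by the reindex; NaN values that were already present in the original data are left as they are.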

For the complete list of parameters of the reindex method, refer to the documentation available at the following links:

Read: Pandas Document for Series reindexing [1]

Read: Pandas Document for DataFrame reindexing [2]
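
The examples so far use a Series, but a DataFrame can be reindexed along its rows or its columns. Here is a minimal sketch under assumed data: the DataFrame, its labels, and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 3 rows and 2 columns of illustrative values.
df = pd.DataFrame(
    np.arange(6).reshape(3, 2),
    index=["a", "c", "d"],
    columns=["x", "y"],
)

# Reindex the rows: the new label 'b' has no data, so it is filled with NaN.
df_rows = df.reindex(["a", "b", "c", "d"])

# Reindex the columns: 'z' is new (NaN), and 'y' is left out
# because it is not in the requested list.
df_cols = df.reindex(columns=["x", "z"])

print(df_rows)
print(df_cols)
```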

Deleting entries

We often need to delete data from a Pandas Series or DataFrame. You can do this using the drop() method, which is available on both. This method accepts the index label, or a list of index labels, to be dropped.

By default, drop() returns a new object containing only the remaining values. It does not modify the data in place (i.e. the original Pandas Series or DataFrame is preserved and still available after the drop operation). In practical terms, the method creates a selective copy of the data.
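
A short sketch of this behaviour, using an illustrative Series (the variable names are arbitrary):

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5), index=["a", "b", "c", "d", "e"])

# drop() returns a new Series; the original keeps all five entries.
trimmed = s.drop("b")
print(len(s), len(trimmed))

# To discard the entry from the original, reassign the result...
s = s.drop("b")

# ...or pass inplace=True, in which case drop() returns None.
s.drop("c", inplace=True)
print(s.index.tolist())
```

Reassigning the result is usually the easier form to read, since inplace=True returns None and therefore cannot be chained.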

Deleting entries from Pandas Series

Let’s look at how to delete entries from a Pandas Series.

    • Drop single index.

Code:

b = Series(np.arange(10), index=['a','b','c','d','e','f','g','h','i','j'])
b

Output:

a    0
b    1
c    2
d    3
e    4
f    5
g    6
h    7
i    8
j    9
dtype: int32

Code:

# Drop the entry with index label 'b'

new_series = b.drop('b')
new_series

 

Output:

a    0
c    2
d    3
e    4
f    5
g    6
h    7
i    8
j    9
dtype: int32

    • Drop multiple indexes.

Code:

# Dropping multiple indexes,
# e.g., a, g, and j
new_series_1 = b.drop(['a','g','j'])
new_series_1

 

Output:

b    1
c    2
d    3
e    4
f    5
h    7
i    8
dtype: int32

Deleting entries from Pandas DataFrame

 

In the case of a DataFrame, entries can be dropped from either axis: rows by label (using the index parameter, or the default axis=0) and columns by name (using the columns parameter, or axis=1).

The following code snippets demonstrate this behaviour:

 

    • Removing a row from DataFrame.

Code:

df_states

Output:

Screenshot from the Jupyter Notebook, showing columns for state name, abbreviation, timezone, population and GDP.

 

Code:

df_states_noNT = df_states.drop('NT')
df_states_noNT

 

Output:

Screenshot from the Jupyter Notebook. The screenshot shows an example of removing rows. The NT row has been removed.

    • Removing multiple columns from DataFrame by passing a sequence of column names and axis=1.

Code:

~~~ python
df_states
~~~

Output:

Screenshot from the Jupyter Notebook showing columns for state name, abbreviation, timezone, population and GDP.

Code:

df1 = df_states.drop(['state','area'], axis=1)
df1

Output:

Screenshot from the Jupyter Notebook. The screenshot shows an example of removing multiple columns. The columns removed are area and state.

 

Code:

df_states

Output:

Screenshot from the Jupyter Notebook showing columns for state name, abbreviation, timezone, population and GDP.
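
Because df_states is only shown as screenshots, here is a self-contained sketch of the same drop operations. The DataFrame below is a hypothetical stand-in: its labels and values are placeholders, not the course data. It also uses the index and columns keyword parameters, which are an alternative to passing axis.

```python
import pandas as pd

# Hypothetical stand-in for df_states; the values are placeholders, not real figures.
df = pd.DataFrame(
    {
        "state": ["State A", "State B", "State C"],
        "area": [100, 200, 300],
        "population": [10, 20, 30],
    },
    index=["AA", "BB", "CC"],
)

# Drop a row by label (equivalent to df.drop("BB") or df.drop("BB", axis=0)).
no_bb = df.drop(index="BB")

# Drop columns by name (equivalent to df.drop(["state", "area"], axis=1)).
population_only = df.drop(columns=["state", "area"])

# Both axes can be handled in one call.
both = df.drop(index="BB", columns=["state", "area"])
print(both)

# The original DataFrame is unchanged.
print(df.shape)
```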

References

  1. Pandas Document for Series reindexing [Internet]. Pandas; [date unknown]. Available from: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.reindex.html
  2. Pandas Document for DataFrame reindexing [Internet]. Pandas; [date unknown]. Available from: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html

This article is from the free online course Introduction to Data Analytics with Python, created by FutureLearn.
