
What are the essential operations in Pandas?

Data analysts spend a significant amount of time cleaning and preparing data sets to work on. They must possess the necessary tools and ability to work with messy data sets, missing values, inconsistencies, and ambiguous data.
Pandas has certain essential operations that data analysts need to use to interact with the data stored in Series and DataFrame objects. These operations allow data analysts to get data into a workable form before the analysis begins.

Reindexing

Reindexing is a fundamental operation on Pandas data structures. It creates a new object in which the data is rearranged to conform to a new index.
If the new index contains labels that are not present in the original data, missing values (NaN) are inserted for those labels.
Code:
import numpy as np
from pandas import Series
a = Series(np.random.randn(10), index=['a','b','c','d','e','f','g','h','i','j'])
a
Output:
a 0.591050
b -0.952670
c -0.948599
d 0.091596
e -1.096649
f 0.199346
g 0.856941
h -0.086180
i -2.623903
j 0.271230
dtype: float64
 
Code:
new_index = ['a','A1','b','B1','c','C1','d','e','f','g','h','i','j']
a_new = a.reindex(new_index)
a_new
Output:
a 0.591050
A1 NaN
b -0.952670
B1 NaN
c -0.948599
C1 NaN
d 0.091596
e -1.096649
f 0.199346
g 0.856941
h -0.086180
i -2.623903
j 0.271230
dtype: float64
 

Handling missing values during reindexing

Imagine a situation where you are processing employee records. However, many of the employees have supplied incomplete information. You need a way to handle these cases and highlight the gaps to follow up with them. Perhaps you could insert ‘Unknown’ into all the empty fields to make the missing values easy to identify.
There are various ways the missing values can be handled during reindexing. We can:
    • either specify a particular value to be filled – we do this by adding a parameter fill_value = <value to be filled> to the reindex method
For example:
Code:
a_fillvalue = a.reindex(new_index, fill_value=0)
a_fillvalue
Output:
a 0.591050
A1 0.000000
b -0.952670
B1 0.000000
c -0.948599
C1 0.000000
d 0.091596
e -1.096649
f 0.199346
g 0.856941
h -0.086180
i -2.623903
j 0.271230
dtype: float64
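
The fill value does not have to be numeric. As a minimal sketch of the employee-records scenario above, the snippet below reindexes a small Series against a full staff list and marks the gaps with 'Unknown'. The names, phone numbers and the `staff` list are hypothetical placeholders, and note that `fill_value` only applies to labels introduced by the reindex, not to values that were already missing.

```python
import pandas as pd

# Hypothetical employee records: only some staff supplied a phone number.
phones = pd.Series(
    ["555-0101", "555-0102"],
    index=["alice", "bob"],
)

# Reindex against the full (hypothetical) staff list.
# Labels with no record are filled with 'Unknown'.
staff = ["alice", "bob", "carol", "dave"]
phones_full = phones.reindex(staff, fill_value="Unknown")
print(phones_full)

# List the employees we still need to follow up with.
missing = phones_full[phones_full == "Unknown"].index.tolist()
print(missing)
```

Filtering on the placeholder afterwards gives a simple follow-up list.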
 
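Alternatively, we can choose one of the predefined fill strategies by passing the parameter method = <predefined method value>, such as 'ffill' (forward fill), 'bfill' (backward fill) or 'nearest'. This is handy for ordered data, for instance in time-series analysis, where a gap can sensibly be filled from a neighbouring observation. The original index must be sorted (monotonically increasing or decreasing) for these methods to work.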
For example:
Code:
a = Series(np.random.randn(10), index=[0,2,4,6,8,10,12,14,16,18])
a
 
Output:
0 1.036439
2 -0.841819
4 0.629621
6 -1.905720
8 1.673387
10 0.792506
12 0.267104
14 0.759571
16 -0.847925
18 -0.598402
dtype: float64
Code:
## Reindex so that indexes 1,3,5... are introduced in the series
a_new = a.reindex(range(20))
a_new
Output:
0 1.036439
1 NaN
2 -0.841819
3 NaN
4 0.629621
5 NaN
6 -1.905720
7 NaN
8 1.673387
9 NaN
10 0.792506
11 NaN
12 0.267104
13 NaN
14 0.759571
15 NaN
16 -0.847925
17 NaN
18 -0.598402
19 NaN
dtype: float64
Code:
## Perform the same reindex, but forward-fill the newly introduced indexes
a_ffill = a.reindex(range(20), method='ffill')
a_ffill

Output:
0 1.036439
1 1.036439
2 -0.841819
3 -0.841819
4 0.629621
5 0.629621
6 -1.905720
7 -1.905720
8 1.673387
9 1.673387
10 0.792506
11 0.792506
12 0.267104
13 0.267104
14 0.759571
15 0.759571
16 -0.847925
17 -0.847925
18 -0.598402
19 -0.598402
dtype: float64
Look at the odd-numbered indexes (1, 3, 5, and so on): each has been populated with the value from the preceding index.
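
Because the example above uses random numbers, the effect can be easier to see with fixed values. The following is a small sketch, with made-up values, that compares forward fill and backward fill on the same Series.

```python
import pandas as pd

# Small series with known values on even index labels only.
s = pd.Series([10.0, 20.0, 30.0], index=[0, 2, 4])

new_index = range(6)

# Forward fill: each new label takes the value of the previous label.
forward = s.reindex(new_index, method="ffill")

# Backward fill: each new label takes the value of the next label.
# Label 5 has no later label to copy from, so it stays NaN.
backward = s.reindex(new_index, method="bfill")

print(forward)
print(backward)
```

The last label under backward fill remains missing because there is no later observation to copy from.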
For the complete list of parameters of the reindex method, refer to the documentation available at the following links:
Read: Pandas Document for Series reindexing [1]
Read: Pandas Document for DataFrame Reindexing [2]
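
The examples above use a Series, but a DataFrame can be reindexed in the same way. As a brief sketch with a made-up two-by-two DataFrame, the `index` and `columns` keywords of `DataFrame.reindex` conform both axes in one call.

```python
import pandas as pd

# A tiny DataFrame with placeholder values.
df = pd.DataFrame(
    {"x": [1, 2], "y": [3, 4]},
    index=["a", "b"],
)

# Reindex rows and columns together.
# The new row 'c' and the new column 'z' are filled with NaN.
df_new = df.reindex(index=["a", "b", "c"], columns=["x", "y", "z"])
print(df_new)
```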

Deleting entries

We often need to delete data from a Pandas Series or DataFrame. You can do this using the drop() method, which is available on both. This method accepts a single index label, or a list of index labels, to be dropped.
This method returns a new object containing only the remaining values. Note that the drop is not performed in place (i.e. the original Pandas Series or DataFrame is preserved and still available after the drop operation). In practical terms, the method creates a selective copy of the data.
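
A quick way to confirm this behaviour is to compare the original and the result after a drop. The snippet below is a minimal sketch with placeholder values.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])

# drop() returns a new Series without label 'b'.
s_dropped = s.drop("b")

print(s_dropped.index.tolist())  # labels left in the new object
print(s.index.tolist())          # the original still has all three labels
```

To keep the reduced data, assign the result to a name, as the examples in this section do.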

Deleting entries from Pandas Series

Let’s look at how to delete entries from a Pandas Series.
    • Drop single index.
Code:
b = Series(np.arange(10), index=['a','b','c','d','e','f','g','h','i','j'])
b
Output:
a 0
b 1
c 2
d 3
e 4
f 5
g 6
h 7
i 8
j 9
dtype: int32
 
Code:
#Dropping index b
new_series = b.drop('b')
new_series
 
Output:
a 0
c 2
d 3
e 4
f 5
g 6
h 7
i 8
j 9
dtype: int32
 
    • Drop multiple indexes.
Code:
# Dropping multiple indexes,
# e.g., a, g, and j
new_series_1 = b.drop(['a','g','j'])
new_series_1
 
Output:
b 1
c 2
d 3
e 4
f 5
h 7
i 8
dtype: int32

Deleting entries from Pandas DataFrame

 
In the case of DataFrame, we can drop labels from either axis: row labels (by using the index parameter, or by passing the labels directly) and column names (by using the columns parameter, or by passing the labels together with axis = 1).
The following code snippets demonstrate this behaviour:
 
    • Removing a row from DataFrame.
Code:
df_states
Output:
Screenshot from the Jupyter Notebook, showing columns for state name, abbreviation, timezone, population and GDP.
 
Code:
df_states_noNT = df_states.drop('NT')
df_states_noNT
 
Output:
Screenshot from the Jupyter Notebook. The screenshot shows an example of removing rows. The NT row has been removed.
    • Removing multiple columns from DataFrame by passing a sequence of column names and axis = 1.
Code:
~~~ python
df_states
~~~
Output:
Screenshot from the Jupyter Notebook showing columns for state name, abbreviation, timezone, population and GDP.
Code:
df1 = df_states.drop(['state','area'], axis=1)
df1
Output:
Screenshot from the Jupyter Notebook. The screenshot shows an example of removing multiple columns. The columns removed are state and area.
 
Code:
df_states
Output:
Screenshot from the Jupyter Notebook showing columns for state name, abbreviation, timezone, population and GDP.
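
The snippets above pass labels positionally or with axis = 1. The index and columns parameters mentioned at the start of this section can be sketched as follows. Because df_states is only shown as a screenshot, `df_demo` here is a hypothetical stand-in with placeholder labels and values, not the real data.

```python
import pandas as pd

# Hypothetical stand-in for df_states; all values are placeholders.
df_demo = pd.DataFrame(
    {
        "state": ["State A", "State B", "State C"],
        "area": [100, 200, 300],
        "population": [10, 20, 30],
    },
    index=["AA", "BB", "CC"],
)

# Remove a row by its label.
rows_removed = df_demo.drop(index="BB")

# Remove several columns by name (equivalent to passing axis=1).
cols_removed = df_demo.drop(columns=["state", "area"])

# Remove a row and a column in a single call.
both_removed = df_demo.drop(index=["AA"], columns=["area"])

print(rows_removed)
print(cols_removed)
print(both_removed)
print(df_demo)  # the original is unchanged
```

Using the keyword form avoids having to remember which axis number refers to rows and which to columns.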

References

  1. Pandas Document for Series reindexing [Internet]. Pandas; [date unknown]. Available from: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.reindex.html
  2. Pandas Document for DataFrame Reindexing [Internet]. Pandas; [date unknown]. Available from: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html

This article is from the free online course Introduction to Data Analytics with Python, created by FutureLearn.
