
Aligning, mapping, and sorting data in Pandas


As seen earlier, preparing data is the first step in data analysis. Analysts can use the align, map, and sort features in Pandas to make their data more readable and organised.

Data alignment

When we perform arithmetic operations between Pandas objects with different indexes, Pandas matches the values by their row and column labels before computing the result. This operation is known as data alignment.

Code:

import numpy as np
from pandas import DataFrame

df1 = DataFrame(np.arange(9).reshape(3,3), columns=['a','b','c'], index=['SA', 'VIC', 'NSW'])
df1

Output:

Screenshot from the Jupyter Notebook. Screenshot shows example of data alignment.

Code:

df2 = DataFrame(np.arange(12).reshape(4,3), columns=['a','b','e'], index=['SA', 'VIC', 'NSW', 'ACT'])
df2

Output:

Adding DataFrames

When two DataFrames are added, values are matched on both row and column labels. If the labels differ, the result's index and columns are the union of those of the two operands, and any position that is not present in both is filled with NaN (Not a Number).

Code:

df1+df2

Output:
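As a rough, self-contained check of the addition above, the sketch below rebuilds the two frames from the article and inspects the result. The variable name total is introduced here purely for illustration.

```python
import numpy as np
import pandas as pd

# Same frames as in the article, rebuilt so this sketch runs on its own.
df1 = pd.DataFrame(np.arange(9).reshape(3, 3),
                   columns=['a', 'b', 'c'], index=['SA', 'VIC', 'NSW'])
df2 = pd.DataFrame(np.arange(12).reshape(4, 3),
                   columns=['a', 'b', 'e'], index=['SA', 'VIC', 'NSW', 'ACT'])

total = df1 + df2  # 'total' is a name chosen for this example

# The result carries the union of the row labels and of the column labels.
print(sorted(total.index))    # ['ACT', 'NSW', 'SA', 'VIC']
print(sorted(total.columns))  # ['a', 'b', 'c', 'e']

# ('VIC', 'b') exists in both frames, so the values are added: 4 + 4.
print(total.loc['VIC', 'b'])  # 8.0

# 'c' exists only in df1, 'e' only in df2, and 'ACT' only in df2,
# so every cell in those columns and that row is NaN.
print(total['c'].isna().all(), total['e'].isna().all(),
      total.loc['ACT'].isna().all())  # True True True
```

Only the cells whose row label and column label appear in both frames hold a number; everything else is NaN.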

Handling missing data

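Arithmetic methods such as add() accept a fill_value parameter that controls how missing values are handled during alignment: a value that is missing from one DataFrame is replaced with fill_value before the operation. If a position is missing from both DataFrames, the result is still NaN.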

Code:

df1.add(df2, fill_value=0)

Output:
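The effect of fill_value can be sketched with the same two frames, rebuilt here so the example is self-contained. The name filled is an assumption made for this illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9).reshape(3, 3),
                   columns=['a', 'b', 'c'], index=['SA', 'VIC', 'NSW'])
df2 = pd.DataFrame(np.arange(12).reshape(4, 3),
                   columns=['a', 'b', 'e'], index=['SA', 'VIC', 'NSW', 'ACT'])

filled = df1.add(df2, fill_value=0)  # 'filled' is a name chosen for this example

# ('SA', 'c') exists only in df1, so it is computed as 2 + 0.
print(filled.loc['SA', 'c'])   # 2.0
# ('ACT', 'a') exists only in df2, so it is computed as 0 + 9.
print(filled.loc['ACT', 'a'])  # 9.0
# ('ACT', 'c') exists in neither frame, so it stays NaN.
print(pd.isna(filled.loc['ACT', 'c']))  # True
```

In this example exactly one cell remains NaN, because fill_value only substitutes for a value that is missing on one side of the operation.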

Mapping

Often, we want to change the values in a particular row or column by applying a function to the selected values.

For example, think of a data set that captures information about an extensive collection of products (represented as columns in the data set). These products go through an update every year. Here, you need a way to update all the version numbers quickly and easily instead of updating each product individually.

This process is known as mapping, and it can be done with the .apply() method, which takes the following parameters:

  • a function (often a lambda), which specifies the transformation to apply
  • an axis parameter (for DataFrame.apply() only), which defaults to 0 and so passes each column to the function; axis=1 passes each row instead. Series.apply() has no axis parameter and calls the function on each value.
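To make the role of the axis parameter concrete, here is a minimal sketch on a small, made-up DataFrame. The frame scores, its labels, and its values are hypothetical and not part of the course data.

```python
import pandas as pd

# Hypothetical data: two numeric columns and three rows.
scores = pd.DataFrame({'q1': [1, 2, 3], 'q2': [10, 20, 30]},
                      index=['x', 'y', 'z'])

# axis=0 (the default): the function receives each column as a Series.
col_range = scores.apply(lambda col: col.max() - col.min())
print(col_range.to_dict())  # {'q1': 2, 'q2': 20}

# axis=1: the function receives each row as a Series.
row_total = scores.apply(lambda row: row.sum(), axis=1)
print(row_total.to_dict())  # {'x': 11, 'y': 22, 'z': 33}
```

With axis=0 the result has one entry per column; with axis=1 it has one entry per row.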

The following code snippets demonstrate this behaviour:

Code:

df_states

Output:

Screenshot from the Jupyter Notebook. The screenshot shows an example of data frame mapping for rows labeled WA, SA, VIC, NSW, ACT, QLD, NT.

Code:

f = lambda x: x.upper()

Code:

df_states['state'] = df_states['state'].apply(f)
df_states

Output:
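The article does not show how df_states is built, so the following sketch uses a small stand-in frame with an assumed 'state' column to show the same transformation end to end. It also shows the built-in string accessor, which should give the same result for this case.

```python
import pandas as pd

# Hypothetical stand-in for df_states; the real frame is not shown in the article.
df_states = pd.DataFrame(
    {'state': ['western australia', 'south australia', 'victoria']},
    index=['WA', 'SA', 'VIC'])

# Series.apply calls the function once per value in the column.
upper_via_apply = df_states['state'].apply(lambda x: x.upper())

# Equivalent vectorised form using the .str accessor.
upper_via_str = df_states['state'].str.upper()

print(upper_via_apply.tolist())
# ['WESTERN AUSTRALIA', 'SOUTH AUSTRALIA', 'VICTORIA']
print(upper_via_apply.equals(upper_via_str))  # True
```

For simple string operations the .str accessor is usually the more idiomatic choice, while .apply() remains useful for arbitrary functions.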

Sorting

Sometimes, data must be sorted in order to make it clear and meaningful. For instance, consider an on-demand video streaming service that wants to know which TV series in its catalogue are the most popular ones so that they could be renewed for another season. Here, the series titles need to be sorted along with the extent of how much they are being watched.

The sorting function in Pandas comes in handy in situations like the above.

Sorting the indexes/labels

To sort data lexicographically (i.e. in dictionary order) by row or column index, we use the sort_index() method. Note that this method returns a new, sorted object and leaves the original unchanged. The examples below demonstrate sorting by index:

  • Original DataFrame.

Code:

df_states

Output:

  • DataFrame sorted by row index.

Code:

df_states.sort_index()

Output:

  • DataFrame sorted by columns (lexicographically).

Code:

df_states.sort_index(axis=1)

Output:
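The three cases above can be sketched on a small stand-in frame. The name demo, its labels, and the GDP figures are placeholders invented for this example, not real statistics.

```python
import pandas as pd

# Hypothetical frame; the GDP numbers are placeholders, not real data.
demo = pd.DataFrame({'state': ['Victoria', 'Capital Territory', 'New South Wales'],
                     'GDP': [3, 1, 2]},
                    index=['VIC', 'ACT', 'NSW'])

by_row = demo.sort_index()                      # rows in ascending label order
by_row_desc = demo.sort_index(ascending=False)  # rows in descending label order
by_col = demo.sort_index(axis=1)                # columns in ascending label order

print(list(by_row.index))       # ['ACT', 'NSW', 'VIC']
print(list(by_row_desc.index))  # ['VIC', 'NSW', 'ACT']
print(list(by_col.columns))     # ['GDP', 'state']

# sort_index() returned new objects; the original order is untouched.
print(list(demo.index))         # ['VIC', 'ACT', 'NSW']
```

Note that 'GDP' sorts before 'state' because uppercase letters come before lowercase ones in this ordering.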

Sorting by values

Instead of sorting by index labels, we can also sort the data by the actual values in the columns. For this purpose, the sort_values() method can be used. Like sort_index(), it returns a new object, and it sorts in ascending order by default.

See the code snippet below for an example, where we arrange the rows by the GDP column:

Code:

df_states

Output:

Screenshot from the Jupyter Notebook. The screenshot shows an example of sorting by value in the GDP column sorting smallest to largest by Australian state.

Code:

df_states.sort_values(by='GDP')

Output:

Screenshot from the Jupyter Notebook. Screenshot shows an example of sorting by value in the GDP column sorting smallest to largest by Australian state.
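A minimal sketch of sorting by values, again on a made-up frame: demo and its GDP figures are placeholders for illustration only. It also shows the ascending parameter for reversing the order.

```python
import pandas as pd

# Hypothetical frame; the GDP numbers are placeholders, not real data.
demo = pd.DataFrame({'GDP': [3, 1, 2]}, index=['VIC', 'ACT', 'NSW'])

smallest_first = demo.sort_values(by='GDP')                  # ascending (default)
largest_first = demo.sort_values(by='GDP', ascending=False)  # descending

print(list(smallest_first.index))  # ['ACT', 'NSW', 'VIC']
print(list(largest_first.index))   # ['VIC', 'NSW', 'ACT']

# The column name must be passed as a string; a bare GDP would be
# treated as an undefined variable and raise a NameError.
```

The row labels travel with their values, so each state stays attached to its own figure after sorting.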

Next, you will complete an exercise to apply what you have learned about manipulating data in Pandas.

This article is from the free online

Introduction to Data Analytics with Python

Created by
FutureLearn - Learning For Life
