Learn more about this course.

GroupBy mechanics

Learn the mechanics of GroupBy.

Consider this scenario:

You are analysing data for a company that sells computer hardware. The company would like to look at the past year’s sales trends to see which types of hardware were the most profitable, which brands were the most popular, and the average number of sales per region, among other insights. Unfortunately, the company recorded their sales transactions in a rather disorganised way: the transaction records for all offices and regions are stored in a single large file in order of transaction date and time. There are thousands of records and it would be very time consuming to organise them manually. You need to use Python to categorise the transactions, perform necessary calculations for each category, and summarise the results in a useful way.

Fortunately, Python has GroupBy operations to help you solve these kinds of business problems.

A GroupBy operation is a process with one or more of these steps:

Want to keep
learning?

This content is taken from
FutureLearn online course,

Data Wrangling and Ingestion using Python

View Course

Splitting: the data is separated into groups according to some criteria.
Applying: a function is applied to each group independently.
Combining: the individual results are combined into a data structure.

For example, you would use a GroupBy operation to separate some values according to a category, add up the values for each category, then combine the categories for a summary.

This diagram illustrates the process:

Click to enlarge

Out of these steps, splitting is the most straightforward.

Splitting an object into groups

Pandas can split objects along any axes. To create a GroupBy object, call the groupby() method on the dataframe. This will return a dataframe GroupBy object.

You can specify splitting criteria or GroupBy mapping in many different ways; for example:

a Python function called on each axis label
for dataframe objects, a string indicating a column or list of column names for grouping
a list or NumPy array of the same length as the selected axis
a dictionary or series that provides a label to group name mapping.

Collectively, the grouping objects are referred to as ‘keys’. In the example above, ‘class’ was the key.

Try this example that uses the column name as the splitting criterion:

Code:


df_animals = pd.DataFrame([('bird', 'Falconiformes', 389.0),
 ('bird', 'Psittaciformes', 24.0),
 ('mammal', 'Carnivora', 80.2),
 ('mammal', 'Primates', np.nan),
 ('mammal', 'Carnivora', 58)],
 index=['falcon', 'parrot', 'lion', 'monkey', 'leopard'],
 columns=('class', 'order', 'max_speed'))
df_animals

Output:

Graphic shows a table. The Y-axis rows are named falcon, parrot, lion, monkey, leopard. The x-axis columns are named: class, order and max_speed. Row "falcon" reads: bird, Falconiformes, 389. Row "Parrot" reads: bird, Psittaciformes, 24. Row "lion" reads: mammal, carnivora, 80.2. Row "Monkey" reads: mammal, primate, NAN. "Leopard" reads: mammal, carnivora, 58.
Click to enlarge

Code:


grp_class= df_animals.groupby('class')
grp_class

Output:


<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x000001FA37CF8320>

Code:


grp_class_order = df_animals.groupby(['class', 'order'])
grp_class_order

Output:


<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x000001FA37CF8898>

Do you see how the groupby() operation returns a dataframeGroupBy object?

Importantly, no splitting actually occurs until there is a need. Creating the GroupBy object only verifies that you passed a valid mapping for grouping or splitting.

For example, once you call a .sum() method on the two GroupBy objects, the split occurs and returns the summation result.

Use these code snippets for a demonstration:

Code:


grp_class.sum()

Output:

Graphic shows a table. The Y-axis rows are named bird, mammal. The x-axis column is named max-speed. The bird row reads: 413.0. The mammal row reads: 138.2. For birds in the facoliformes order, max_speed is 389, and for psittaciformes the max_speed is 24. In mammals in the carnivora order, max_speed is 138.2. And for primates its 0.0
Click to enlarge

Code:


grp_class_order.sum()

Output:

GroupBy sorting

By default, the group keys are sorted during the GroupBy operation. If you don’t want group keys to be sorted, pass the parameter sort=false.

Try this example:

Code:


df2 = pd.DataFrame({'X': ['B', 'B', 'A', 'A','C','C','C'], 'Y': [2, 4, 3, 4,2,5,6]})
df2

Output:

Graphic shows a table. Rows are named 0, 1, 2, 3, 4, 5, 6. COlumns are named X, Y. Row 0 = B,2. Row 1 = B, 4. Row 2 = A, 3. Row 3 = A, 4. Row 4 = C, 2. Row 5 = C, 5. Row 6 = C, 6.
Click to enlarge

Code:


#Observer that in the output, Keys are sorted lexicographically
df2.groupby(['X']).sum()

Output:

Graphic shows a table. The Y-axis rows are named A, B, C. Row A: 7; Row B: 6, Row C: 13.
Click to enlarge

Code:


#In this statement we are passing sort=False, now group keys will not be sorted
df2.groupby(['X'], sort=False).sum()

Output:

Graphic shows a table. The Y-axis rows are named A, B, C. Row A: 7; Row B: 6, Row C: 13.
Click to enlarge

GroupBy object attributes

If you want to check the created groupings, you can look at the groups attribute of the GroupBy object.

This action returns a dictionary with:

keys: computed unique groups
values: corresponding axis labels for each group.

Let’s check this with an example:

Code:


df=pd.DataFrame({'X': ['A', 'B', 'A', 'B'], 'Y': [1, 4, 3, 2]})
df

Output:

Graphic shows a table. The Y-axis rows are named A, B, C. Row A: 7; Row B: 6, Row C: 13.
Click to enlarge

Code:


df.groupby(['X']).groups

Output:


{'A': Int64Index([0, 2], dtype='int64'),
 'B': Int64Index([1, 3], dtype='int64')}

Notice that:

the first group key is A with indexes 0 and 2 as values
the second group key is B with indexes 1 and 3 as values.

For the next step, applying, we may need to perform one or a combination of these actions:

Aggregation: compute a summary statistic for each group (eg, sums, means, counts).
Transformation: perform a computation on the data in each group and return a like-indexed object.
Filtration: discard some groups, according to a groupwise computation that evaluates as True or False.

You will explore each of these steps in more detail later.

Want to keep learning?

This content is taken from FutureLearn online course

Data Wrangling and Ingestion using Python

View Course

See other articles from this course

This article is from the free online

Data Wrangling and Ingestion using Python

Created by

Join Now

Reach your personal and professional goals

Unlock access to hundreds of expert online courses and degrees from top universities and educators to gain accredited qualifications and professional CV-building certificates.

Join over 18 million learners to launch, switch or build upon your career, all at your own pace, across a wide range of topic areas.

Start Learning now

Learn more about this course.

GroupBy mechanics

Want to keep
learning?

Data Wrangling and Ingestion using Python

Splitting an object into groups

GroupBy sorting

GroupBy object attributes

Want to keep learning?

Data Wrangling and Ingestion using Python

Data Wrangling and Ingestion using Python

Data Wrangling and Ingestion using Python

Reach your personal and professional goals

Register to receive updates

Learn more about this course.

Learn more about this course.

See all FutureLearn courses.

Learn more about this course.

GroupBy mechanics

Want to keep learning?

Data Wrangling and Ingestion using Python

Splitting an object into groups

GroupBy sorting

GroupBy object attributes

Want to keep learning?

Data Wrangling and Ingestion using Python

Share this

Data Wrangling and Ingestion using Python

Data Wrangling and Ingestion using Python

Reach your personal and professional goals

Register to receive updates

Learn more about this course.

Learn more about this course.

See all FutureLearn courses.

Want to keep
learning?