Skip main navigation

GroupBy mechanics

Learn the mechanics of GroupBy.

Consider this scenario:

You are analysing data for a company that sells computer hardware. The company would like to look at the past year’s sales trends to see which types of hardware were the most profitable, which brands were the most popular, and the average number of sales per region, among other insights. Unfortunately, the company recorded their sales transactions in a rather disorganised way: the transaction records for all offices and regions are stored in a single large file in order of transaction date and time. There are thousands of records and it would be very time consuming to organise them manually. You need to use Python to categorise the transactions, perform necessary calculations for each category, and summarise the results in a useful way.

Fortunately, Python has GroupBy operations to help you solve these kinds of business problems.

A GroupBy operation is a process with one or more of these steps:

  • Splitting: the data is separated into groups according to some criteria.
  • Applying: a function is applied to each group independently.
  • Combining: the individual results are combined into a data structure.

For example, you would use a GroupBy operation to separate some values according to a category, add up the values for each category, then combine the categories for a summary.

This diagram illustrates the process:

Graphic shows example of “Python Group By demo”. Key reads: “Group by = Split, Apply, Combine”. Displayed we “key” and “data” information on the left with multiple “Key” letters and “Data” numbers. It is then “Split” into the vales sections for A, B, C, D with their matching “data” values. Next we see “Apply” where the multiple “Key” letters and “Data” numbers have been added together. Lastly, at the end we see the new chart under “Combine”. The original table had two columns reading: “Key” and “Data” with “Key” y-axis reading: A, B, C, D, A, B, C, D, A, B, C, D. The “Data” y-axis reading: 3, 4, 5, 6, 7, 3, 2, 4, 5, 6, 8, 9. After the are combined the new table now reads: “Key” and “Data”. “Key” y-axis: A, B, C, D. The “Data” y axis: 15, 13, 15, 19. Click to enlarge

Out of these steps, splitting is the most straightforward.

Splitting an object into groups

Pandas can split objects along any axes. To create a GroupBy object, call the groupby() method on the dataframe. This will return a dataframe GroupBy object.

You can specify splitting criteria or GroupBy mapping in many different ways; for example:

  • a Python function called on each axis label

  • for dataframe objects, a string indicating a column or list of column names for grouping

  • a list or NumPy array of the same length as the selected axis

  • a dictionary or series that provides a label to group name mapping.

Collectively, the grouping objects are referred to as ‘keys’. In the example above, ‘class’ was the key.

Try this example that uses the column name as the splitting criterion:

Code:


df_animals = pd.DataFrame([('bird', 'Falconiformes', 389.0),
('bird', 'Psittaciformes', 24.0),
('mammal', 'Carnivora', 80.2),
('mammal', 'Primates', np.nan),
('mammal', 'Carnivora', 58)],
index=['falcon', 'parrot', 'lion', 'monkey', 'leopard'],
columns=('class', 'order', 'max_speed'))
df_animals

Output:

Graphic shows a table. The Y-axis rows are named falcon, parrot, lion, monkey, leopard. The x-axis columns are named: class, order and max_speed. Row "falcon" reads: bird, Falconiformes, 389. Row "Parrot" reads: bird, Psittaciformes, 24. Row "lion" reads: mammal, carnivora, 80.2. Row "Monkey" reads: mammal, primate, NAN. "Leopard" reads: mammal, carnivora, 58.
Click to enlarge

Code:


grp_class= df_animals.groupby('class')
grp_class

Output:


<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x000001FA37CF8320>

Code:


grp_class_order = df_animals.groupby(['class', 'order'])
grp_class_order

Output:


<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x000001FA37CF8898>

Do you see how the groupby() operation returns a dataframeGroupBy object?

Importantly, no splitting actually occurs until there is a need. Creating the GroupBy object only verifies that you passed a valid mapping for grouping or splitting.

For example, once you call a .sum() method on the two GroupBy objects, the split occurs and returns the summation result.

Use these code snippets for a demonstration:

Code:


grp_class.sum()

Output:

Graphic shows a table. The Y-axis rows are named bird, mammal. The x-axis column is named max-speed. The bird row reads: 413.0. The mammal row reads: 138.2. For birds in the facoliformes order, max_speed is 389, and for psittaciformes the max_speed is 24. In mammals in the carnivora order, max_speed is 138.2. And for primates its 0.0
Click to enlarge

Code:


grp_class_order.sum()

Output:

Graphic shows a table. The Y-axis rows are named bird, mammal. The x-axis column is named max-speed. The bird row reads: 413.0. The mammal row reads: 138.2. For birds in the facoliformes order, max_speed is 389, and for psittaciformes the max_speed is 24. In mammals in the carnivora order, max_speed is 138.2. And for primates its 0.0
Click to enlarge

GroupBy sorting

By default, the group keys are sorted during the GroupBy operation. If you don’t want group keys to be sorted, pass the parameter sort=false.

Try this example:

Code:


df2 = pd.DataFrame({'X': ['B', 'B', 'A', 'A','C','C','C'], 'Y': [2, 4, 3, 4,2,5,6]})
df2

Output:

Graphic shows a table. Rows are named 0, 1, 2, 3, 4, 5, 6. COlumns are named X, Y. Row 0 = B,2. Row 1 = B, 4. Row 2 = A, 3. Row 3 = A, 4. Row 4 = C, 2. Row 5 = C, 5. Row 6 = C, 6.
Click to enlarge

Code:


#Observer that in the output, Keys are sorted lexicographically
df2.groupby(['X']).sum()

Output:

Graphic shows a table. The Y-axis rows are named A, B, C. Row A: 7; Row B: 6, Row C: 13.
Click to enlarge

Code:


#In this statement we are passing sort=False, now group keys will not be sorted
df2.groupby(['X'], sort=False).sum()

Output:

Graphic shows a table. The Y-axis rows are named A, B, C. Row A: 7; Row B: 6, Row C: 13.
Click to enlarge

GroupBy object attributes

If you want to check the created groupings, you can look at the groups attribute of the GroupBy object.

This action returns a dictionary with:

  • keys: computed unique groups
  • values: corresponding axis labels for each group.

Let’s check this with an example:

Code:


df=pd.DataFrame({'X': ['A', 'B', 'A', 'B'], 'Y': [1, 4, 3, 2]})
df

Output:

Graphic shows a table. The Y-axis rows are named A, B, C. Row A: 7; Row B: 6, Row C: 13.
Click to enlarge

Code:


df.groupby(['X']).groups

Output:


{'A': Int64Index([0, 2], dtype='int64'),
'B': Int64Index([1, 3], dtype='int64')}

Notice that:

  • the first group key is A with indexes 0 and 2 as values

  • the second group key is B with indexes 1 and 3 as values.

For the next step, applying, we may need to perform one or a combination of these actions:

  • Aggregation: compute a summary statistic for each group (eg, sums, means, counts).

  • Transformation: perform a computation on the data in each group and return a like-indexed object.

  • Filtration: discard some groups, according to a groupwise computation that evaluates as True or False.

You will explore each of these steps in more detail later.

This article is from the free online

Data Wrangling and Ingestion using Python

Created by
FutureLearn - Learning For Life

Reach your personal and professional goals

Unlock access to hundreds of expert online courses and degrees from top universities and educators to gain accredited qualifications and professional CV-building certificates.

Join over 18 million learners to launch, switch or build upon your career, all at your own pace, across a wide range of topic areas.

Start Learning now