Skip main navigation

GroupBy mechanics

Learn the mechanics of GroupBy.

Consider this scenario:

You are analysing data for a company that sells computer hardware. The company would like to look at the past year’s sales trends to see which types of hardware were the most profitable, which brands were the most popular, and the average number of sales per region, among other insights. Unfortunately, the company recorded their sales transactions in a rather disorganised way: the transaction records for all offices and regions are stored in a single large file in order of transaction date and time. There are thousands of records and it would be very time consuming to organise them manually. You need to use Python to categorise the transactions, perform necessary calculations for each category, and summarise the results in a useful way.

Fortunately, Python has GroupBy operations to help you solve these kinds of business problems.

A GroupBy operation is a process with one or more of these steps:

  • Splitting: the data is separated into groups according to some criteria.
  • Applying: a function is applied to each group independently.
  • Combining: the individual results are combined into a data structure.

For example, you would use a GroupBy operation to separate some values according to a category, add up the values for each category, then combine the categories for a summary.

This diagram illustrates the process:

Graphic shows example of “Python Group By demo”. Key reads: “Group by = Split, Apply, Combine”. Displayed we “key” and “data” information on the left with multiple “Key” letters and “Data” numbers. It is then “Split” into the vales sections for A, B, C, D with their matching “data” values. Next we see “Apply” where the multiple “Key” letters and “Data” numbers have been added together. Lastly, at the end we see the new chart under “Combine”. The original table had two columns reading: “Key” and “Data” with “Key” y-axis reading: A, B, C, D, A, B, C, D, A, B, C, D. The “Data” y-axis reading: 3, 4, 5, 6, 7, 3, 2, 4, 5, 6, 8, 9. After the are combined the new table now reads: “Key” and “Data”. “Key” y-axis: A, B, C, D. The “Data” y axis: 15, 13, 15, 19. Click to enlarge

Out of these steps, splitting is the most straightforward.

Splitting an object into groups

Pandas can split objects along any axes. To create a GroupBy object, call the groupby() method on the dataframe. This will return a dataframe GroupBy object.

You can specify splitting criteria or GroupBy mapping in many different ways; for example:

  • a Python function called on each axis label

  • for dataframe objects, a string indicating a column or list of column names for grouping

  • a list or NumPy array of the same length as the selected axis

  • a dictionary or series that provides a label to group name mapping.

Collectively, the grouping objects are referred to as ‘keys’. In the example above, ‘class’ was the key.

Try this example that uses the column name as the splitting criterion:

Code:


df_animals = pd.DataFrame([('bird', 'Falconiformes', 389.0),
('bird', 'Psittaciformes', 24.0),
('mammal', 'Carnivora', 80.2),
('mammal', 'Primates', np.nan),
('mammal', 'Carnivora', 58)],
index=['falcon', 'parrot', 'lion', 'monkey', 'leopard'],
columns=('class', 'order', 'max_speed'))
df_animals

Output:

Graphic shows a table. The Y-axis rows are named falcon, parrot, lion, monkey, leopard. The x-axis columns are named: class, order and max_speed. Row "falcon" reads: bird, Falconiformes, 389. Row "Parrot" reads: bird, Psittaciformes, 24. Row "lion" reads: mammal, carnivora, 80.2. Row "Monkey" reads: mammal, primate, NAN. "Leopard" reads: mammal, carnivora, 58.
Click to enlarge

Code:


grp_class= df_animals.groupby('class')
grp_class

Output:


<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x000001FA37CF8320>

Code:


grp_class_order = df_animals.groupby(['class', 'order'])
grp_class_order

Output:


<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x000001FA37CF8898>

Do you see how the groupby() operation returns a dataframeGroupBy object?

Importantly, no splitting actually occurs until there is a need. Creating the GroupBy object only verifies that you passed a valid mapping for grouping or splitting.

For example, once you call a .sum() method on the two GroupBy objects, the split occurs and returns the summation result.

Use these code snippets for a demonstration:

Code:


grp_class.sum()

Output:

Graphic shows a table. The Y-axis rows are named bird, mammal. The x-axis column is named max-speed. The bird row reads: 413.0. The mammal row reads: 138.2. For birds in the facoliformes order, max_speed is 389, and for psittaciformes the max_speed is 24. In mammals in the carnivora order, max_speed is 138.2. And for primates its 0.0
Click to enlarge

Code:


grp_class_order.sum()

Output:

Graphic shows a table. The Y-axis rows are named bird, mammal. The x-axis column is named max-speed. The bird row reads: 413.0. The mammal row reads: 138.2. For birds in the facoliformes order, max_speed is 389, and for psittaciformes the max_speed is 24. In mammals in the carnivora order, max_speed is 138.2. And for primates its 0.0
Click to enlarge

GroupBy sorting

By default, the group keys are sorted during the GroupBy operation. If you don’t want group keys to be sorted, pass the parameter sort=false.

Try this example:

Code:


df2 = pd.DataFrame({'X': ['B', 'B', 'A', 'A','C','C','C'], 'Y': [2, 4, 3, 4,2,5,6]})
df2

Output:

Graphic shows a table. Rows are named 0, 1, 2, 3, 4, 5, 6. COlumns are named X, Y. Row 0 = B,2. Row 1 = B, 4. Row 2 = A, 3. Row 3 = A, 4. Row 4 = C, 2. Row 5 = C, 5. Row 6 = C, 6.
Click to enlarge

Code:


#Observer that in the output, Keys are sorted lexicographically
df2.groupby(['X']).sum()

Output:

Graphic shows a table. The Y-axis rows are named A, B, C. Row A: 7; Row B: 6, Row C: 13.
Click to enlarge

Code:


#In this statement we are passing sort=False, now group keys will not be sorted
df2.groupby(['X'], sort=False).sum()

Output:

Graphic shows a table. The Y-axis rows are named A, B, C. Row A: 7; Row B: 6, Row C: 13.
Click to enlarge

GroupBy object attributes

If you want to check the created groupings, you can look at the groups attribute of the GroupBy object.

This action returns a dictionary with:

  • keys: computed unique groups
  • values: corresponding axis labels for each group.

Let’s check this with an example:

Code:


df=pd.DataFrame({'X': ['A', 'B', 'A', 'B'], 'Y': [1, 4, 3, 2]})
df

Output:

Graphic shows a table. The Y-axis rows are named A, B, C. Row A: 7; Row B: 6, Row C: 13.
Click to enlarge

Code:


df.groupby(['X']).groups

Output:


{'A': Int64Index([0, 2], dtype='int64'),
'B': Int64Index([1, 3], dtype='int64')}

Notice that:

  • the first group key is A with indexes 0 and 2 as values

  • the second group key is B with indexes 1 and 3 as values.

For the next step, applying, we may need to perform one or a combination of these actions:

  • Aggregation: compute a summary statistic for each group (eg, sums, means, counts).

  • Transformation: perform a computation on the data in each group and return a like-indexed object.

  • Filtration: discard some groups, according to a groupwise computation that evaluates as True or False.

You will explore each of these steps in more detail later.

This article is from the free online

Data Wrangling and Ingestion using Python

Created by
FutureLearn - Learning For Life

Our purpose is to transform access to education.

We offer a diverse selection of courses from leading universities and cultural institutions from around the world. These are delivered one step at a time, and are accessible on mobile, tablet and desktop, so you can fit learning around your life.

We believe learning should be an enjoyable, social experience, so our courses offer the opportunity to discuss what you’re learning with others as you go, helping you make fresh discoveries and form new ideas.
You can unlock new opportunities with unlimited access to hundreds of online short courses for a year by subscribing to our Unlimited package. Build your knowledge with top universities and organisations.

Learn more about how FutureLearn is transforming access to education