# GroupBy mechanics

Learn the mechanics of GroupBy.

Consider this scenario:

You are analysing data for a company that sells computer hardware. The company would like to look at the past year’s sales trends to see which types of hardware were the most profitable, which brands were the most popular, and the average number of sales per region, among other insights. Unfortunately, the company recorded its sales transactions in a rather disorganised way: the transaction records for all offices and regions are stored in a single large file in order of transaction date and time. There are thousands of records and it would be very time-consuming to organise them manually. You need to use Python to categorise the transactions, perform the necessary calculations for each category, and summarise the results in a useful way.

A GroupBy operation is a process with one or more of these steps:

• Splitting: the data is separated into groups according to some criteria.
• Applying: a function is applied to each group independently.
• Combining: the individual results are combined into a data structure.

For example, you would use a GroupBy operation to separate some values according to a category, add up the values for each category, then combine the categories for a summary.

This diagram illustrates the process:

Out of these steps, splitting is the most straightforward.
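The three steps can be sketched by hand and compared with a single GroupBy call. This is a minimal illustration only: the `sales` table, its column names and its values are invented here and are not the company data described in the scenario.

```python
import pandas as pd

# Hypothetical sales records, invented for illustration only.
sales = pd.DataFrame({
    "category": ["laptop", "monitor", "laptop", "monitor", "keyboard"],
    "amount": [1200.0, 300.0, 900.0, 250.0, 50.0],
})

# Splitting: separate the rows by category.
groups = {
    name: sales[sales["category"] == name]
    for name in sales["category"].unique()
}

# Applying: add up the amounts within each group independently.
totals = {name: part["amount"].sum() for name, part in groups.items()}

# Combining: collect the per-group results into one Series.
manual = pd.Series(totals).sort_index()

# The same three steps expressed as a single GroupBy call.
grouped = sales.groupby("category")["amount"].sum()

print(manual.to_dict())
print(grouped.to_dict())
```

Both approaches should give the same totals per category; the GroupBy call simply does the bookkeeping for you.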

## Splitting an object into groups

Pandas can split an object along any of its axes. To create a GroupBy object, call the groupby() method on the dataframe. This returns a DataFrameGroupBy object.

You can specify splitting criteria or GroupBy mapping in many different ways; for example:

• a Python function called on each axis label

• for dataframe objects, a string indicating a column or list of column names for grouping

• a list or NumPy array of the same length as the selected axis

• a dictionary or series that provides a label to group name mapping.

Collectively, the grouping objects are referred to as ‘keys’. In the df_animals example that follows, ‘class’ is the key.
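As a rough sketch of the four kinds of key listed above, the snippet below groups the same small table in each way. The `df_keys` table, its labels and the `mapping` dictionary are made up for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for illustration only.
df_keys = pd.DataFrame(
    {"kind": ["x", "y", "x", "y"], "value": [1, 2, 3, 4]},
    index=["a1", "b1", "a2", "b2"],
)

# 1. A function called on each index label (here: its first character).
by_function = df_keys.groupby(lambda label: label[0])["value"].sum()

# 2. A string naming a column.
by_column = df_keys.groupby("kind")["value"].sum()

# 3. A NumPy array with one entry per row.
by_array = df_keys.groupby(np.array(["p", "p", "q", "q"]))["value"].sum()

# 4. A dictionary mapping index labels to group names.
mapping = {"a1": "first", "b1": "first", "a2": "second", "b2": "second"}
by_dict = df_keys.groupby(mapping)["value"].sum()

for result in (by_function, by_column, by_array, by_dict):
    print(result.to_dict())
```

Each call produces a different grouping of the same four rows, because each key assigns the rows to groups in its own way.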

Try this example that uses the column name as the splitting criterion:

Code:

import numpy as np
import pandas as pd
# Each row holds an animal's class, order and maximum speed; the index holds its name
df_animals = pd.DataFrame([('bird', 'Falconiformes', 389.0), ('bird', 'Psittaciformes', 24.0), ('mammal', 'Carnivora', 80.2), ('mammal', 'Primates', np.nan), ('mammal', 'Carnivora', 58)], index=['falcon', 'parrot', 'lion', 'monkey', 'leopard'], columns=('class', 'order', 'max_speed'))
df_animals

Output:

Code:

grp_class = df_animals.groupby('class')
grp_class

Output:

<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x000001FA37CF8320>

Code:

grp_class_order = df_animals.groupby(['class', 'order'])
grp_class_order

Output:

<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x000001FA37CF8898>

Do you see how the groupby() operation returns a DataFrameGroupBy object?

Importantly, no splitting actually occurs until there is a need. Creating the GroupBy object only verifies that you passed a valid mapping for grouping or splitting.
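A small sketch of this behaviour, using a made-up `df_check` table and a deliberately wrong column name (`no_such_column`, hypothetical): creating the GroupBy object with a valid key succeeds without producing any result, while an invalid key is rejected straight away.

```python
import pandas as pd

# Hypothetical data, invented for illustration only.
df_check = pd.DataFrame({
    "class": ["bird", "mammal", "bird"],
    "max_speed": [389.0, 80.2, 24.0],
})

# A valid key: the GroupBy object is created, but nothing is computed yet.
grouped = df_check.groupby("class")

# An invalid key is rejected as soon as groupby() is called.
try:
    df_check.groupby("no_such_column")
    error_raised = False
except KeyError:
    error_raised = True

print(error_raised)

# Only calling a method such as sum() produces a result.
result = grouped["max_speed"].sum()
print(result.to_dict())
```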

For example, once you call a .sum() method on the two GroupBy objects, the split occurs and returns the summation result.

Use these code snippets for a demonstration:

Code:

grp_class.sum() 

Output:

Code:

grp_class_order.sum()

Output:

### GroupBy sorting

By default, the group keys are sorted during the GroupBy operation. If you don’t want the group keys to be sorted, pass the parameter sort=False.

Try this example:

Code:

df2 = pd.DataFrame({'X': ['B', 'B', 'A', 'A', 'C', 'C', 'C'], 'Y': [2, 4, 3, 4, 2, 5, 6]})
df2

Output:

Code:

# Observe that the keys in the output are sorted lexicographically
df2.groupby(['X']).sum()

Output:

Code:

# Passing sort=False keeps the group keys in order of first appearance instead of sorting them
df2.groupby(['X'], sort=False).sum()

Output:

### GroupBy object attributes

If you want to check the created groupings, you can look at the groups attribute of the GroupBy object.

This action returns a dictionary with:

• keys: computed unique groups
• values: corresponding axis labels for each group.

Let’s check this with an example:

Code:

df = pd.DataFrame({'X': ['A', 'B', 'A', 'B'], 'Y': [1, 4, 3, 2]})
df

Output:

Code:

df.groupby(['X']).groups

Output:

{'A': Int64Index([0, 2], dtype='int64'), 'B': Int64Index([1, 3], dtype='int64')}

Notice that:

• the first group key is A with indexes 0 and 2 as values

• the second group key is B with indexes 1 and 3 as values.
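The exact way the index labels are printed depends on the pandas version; newer releases are likely to show `Index([0, 2], dtype='int64')` rather than `Int64Index`. As a version-independent sketch, you can convert the values to plain Python lists (the name `as_lists` is chosen here for illustration):

```python
import pandas as pd

df = pd.DataFrame({"X": ["A", "B", "A", "B"], "Y": [1, 4, 3, 2]})

# groups maps each group key to the index labels of the rows in that group.
groups = df.groupby("X").groups

# Convert each set of labels to a plain list for easier reading.
as_lists = {key: labels.tolist() for key, labels in groups.items()}
print(as_lists)  # {'A': [0, 2], 'B': [1, 3]}
```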

For the next step, applying, you may need to perform one or a combination of these actions:

• Aggregation: compute a summary statistic for each group (eg, sums, means, counts).

• Transformation: perform a computation on the data in each group and return a like-indexed object.

• Filtration: discard some groups, according to a groupwise computation that evaluates as True or False.
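As a brief preview, the three actions above could look like this on a small made-up table (`df_apply` and its values are invented for illustration, and the functions passed in are only one possible choice):

```python
import pandas as pd

# Hypothetical data, invented for illustration only.
df_apply = pd.DataFrame({
    "X": ["A", "B", "A", "B", "C"],
    "Y": [1, 4, 3, 2, 10],
})
grouped_y = df_apply.groupby("X")["Y"]

# Aggregation: one summary value per group.
sums = grouped_y.sum()

# Transformation: subtract each group's mean; the result keeps the original index.
centred = grouped_y.transform(lambda s: s - s.mean())

# Filtration: keep only the rows of groups whose total of Y exceeds 5.
kept = df_apply.groupby("X").filter(lambda g: g["Y"].sum() > 5)

print(sums.to_dict())
print(centred.tolist())
print(kept.index.tolist())
```

Note how aggregation returns one row per group, transformation returns one row per original row, and filtration returns a subset of the original rows.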

You will explore each of these steps in more detail later.