Grouping and Aggregation – AICorr.com


Grouping and aggregation

This is a grouping and aggregation tutorial.

Grouping and aggregation are common operations in data manipulation and analysis. Pandas offers very straightforward and efficient methods for such tasks. These operations are used to summarize data based on certain criteria and compute statistics or metrics over groups of data. Grouping involves splitting the data into groups based on some criteria, applying a function to each group independently, and then combining the results into a data structure.
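The three steps described above (split, apply, combine) can be sketched by hand and compared with a single groupby() call. This is a minimal illustration using the small sample data from this tutorial; the variable names (pieces, totals, manual, automatic) are chosen here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A'],
    'Value': [10, 20, 30, 40, 50],
})

# Split: one sub-DataFrame per distinct category
pieces = {key: df[df['Category'] == key] for key in df['Category'].unique()}

# Apply: compute a statistic on each piece independently
totals = {key: piece['Value'].sum() for key, piece in pieces.items()}

# Combine: collect the per-group results into a single Series
manual = pd.Series(totals, name='Value').sort_index()

# groupby() performs the same three steps in one call
automatic = df.groupby('Category')['Value'].sum()

print(manual.to_dict())
print(automatic.to_dict())
```

Both approaches should give the same totals per category; the manual version is only meant to make the individual steps visible.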

Let’s dive into the techniques practically.

Grouping

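To group data, Pandas provides the “groupby()” method. It returns a GroupBy object rather than a DataFrame, so printing it directly does not show the rows. One way to view the groups is to iterate over the object with a for loop, which yields each group's name together with its rows.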

import pandas as pd

# Sample DataFrame
data = {
    'Category': ['A', 'B', 'A', 'B', 'A'],
    'Value': [10, 20, 30, 40, 50]
}
df = pd.DataFrame(data)

# Grouping by 'Category' column
grouped = df.groupby('Category')

# Viewing groups: each iteration yields the group name and its rows
for name, group in grouped:
    print(name)
    print(group)
A
  Category  Value
0        A     10
2        A     30
4        A     50

B
  Category  Value
1        B     20
3        B     40
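A for loop is not the only way to inspect a grouped object. As a small sketch on the same sample data, the attributes ngroups and groups and the method size() summarise the groups without iterating over them.

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A'],
    'Value': [10, 20, 30, 40, 50],
})
grouped = df.groupby('Category')

# Number of distinct groups
print(grouped.ngroups)

# Mapping of group name -> row labels belonging to that group
print(grouped.groups)

# Number of rows in each group
print(grouped.size())
```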

Aggregation

Once the data is grouped, we can apply one or more aggregation functions with .agg() to compute summary statistics for each group.

# Aggregation functions
agg_result = grouped.agg({
    'Value': ['sum', 'mean', 'min', 'max', 'count']
})

print(agg_result)
         Value                    
           sum  mean min max count
Category                          
A           90  30.0  10  50     3
B           60  30.0  20  40     2

The example applies five aggregation functions to the ‘Value’ column: sum, mean, minimum, maximum, and count (the number of non-null elements in each group).
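Passing a list of functions produces two-level column labels, as in the output above. When flat, descriptive column names are preferred, named aggregation is one alternative. The sketch below uses the same sample data; the output column names (total, average, n_rows) are illustrative choices, not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A'],
    'Value': [10, 20, 30, 40, 50],
})

# Named aggregation: new_column=(source_column, function)
summary = df.groupby('Category').agg(
    total=('Value', 'sum'),
    average=('Value', 'mean'),
    n_rows=('Value', 'count'),
)

print(summary)
```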

Accessing groups

One way to access data groups is through a for loop iteration. Pandas also provides the “get_group()” method, which returns a single group by its name. We continue the same example from above.

# View group A
group_A = grouped.get_group('A')
print(group_A)
  Category  Value
0        A     10
2        A     30
4        A     50

The get_group() method accepts a single group name. It cannot take a list of names and return several groups at once.
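When several groups are needed together, two possible workarounds are to concatenate individual get_group() results or to filter the original DataFrame with isin(). The sketch below adds a hypothetical third category, 'C', to the sample data so that the selection is not simply the whole table.

```python
import pandas as pd

df = pd.DataFrame({
    # The last row ('C', 60) is a hypothetical addition for this example
    'Category': ['A', 'B', 'A', 'B', 'A', 'C'],
    'Value': [10, 20, 30, 40, 50, 60],
})
grouped = df.groupby('Category')

wanted = ['A', 'C']

# Option 1: fetch each group separately and stack the results
selected = pd.concat([grouped.get_group(name) for name in wanted])

# Option 2: filter the original rows by group name
filtered = df[df['Category'].isin(wanted)]

print(selected)
print(filtered)
```

The second option avoids building the GroupBy object at all, which may be simpler when only the rows are needed.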

Multi-grouping

We can group by multiple columns by passing a list of column names to “groupby()”. Each group is then identified by a tuple of values, one per grouping column.

# Multi-grouping
multi_grouped = df.groupby(['Category', 'Value'])

# View groups: the group name is now a tuple (Category, Value)
for name, group in multi_grouped:
    print(name)
    print(group)
('A', 10)
  Category  Value
0        A     10

('A', 30)
  Category  Value
2        A     30

('A', 50)
  Category  Value
4        A     50

('B', 20)
  Category  Value
1        B     20

('B', 40)
  Category  Value
3        B     40
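In the example above every (Category, Value) pair is unique, so each group holds a single row. To show multi-column grouping combined with aggregation, the sketch below adds a hypothetical 'Region' column to the sample data; that column and its values are assumptions made only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A'],
    # Hypothetical second grouping column, not part of the original data
    'Region': ['North', 'North', 'South', 'South', 'North'],
    'Value': [10, 20, 30, 40, 50],
})

# Sum of 'Value' for each (Category, Region) combination
totals = df.groupby(['Category', 'Region'])['Value'].sum()
print(totals)

# reset_index() turns the two-level index back into ordinary columns
print(totals.reset_index())
```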

Custom aggregation

Besides the built-in aggregations, .agg() also accepts custom functions. For example, we can use a lambda function to compute the difference between the maximum and minimum values of the ‘Value’ column for each group. For more information regarding lambda functions, please read here.

# Custom aggregation function: range (max - min) of each group
result = grouped['Value'].agg(lambda x: x.max() - x.min())
print(result)
Category
A    40
B    20
Name: Value, dtype: int64

The maximum value for group A is 50 and the minimum is 10, while the maximum value for group B is 40 and the minimum is 20. The difference is therefore 40 for A and 20 for B.
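A lambda works well for a one-off calculation, but a named function can be easier to reuse and gives the result a readable column label when mixed with built-in aggregations. The following is a minimal sketch on the same sample data; the function name value_range is a hypothetical choice for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A'],
    'Value': [10, 20, 30, 40, 50],
})


def value_range(series):
    """Return the difference between the largest and smallest value."""
    return series.max() - series.min()


# Mix built-in aggregations (by name) with the custom function
result = df.groupby('Category')['Value'].agg(['min', 'max', value_range])
print(result)
```

The custom column takes its label from the function's name, so the output has the columns min, max, and value_range.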


This is original educational material on grouping and aggregation, created by aicorr.com.
