How to aggregate Pandas DataFrame in Python?

In this article, we will study how to aggregate Pandas DataFrame using Python.

What is Aggregation?

Aggregation is a combination of several separate elements. An example of the aggregate would be to describe the total amount of bikes sold by a sales team in one month.

Aggregate Pandas DataFrame

Let us now understand how to aggregate pandas DataFrame using Python.

I) Import Pandas – Pandas is an open-source library provided by Python. Pandas provide an easy way to create and manipulate the data. Since we are dealing with DataFrame we need to import Pandas.

import pandas as pd

II) Create DataFrame – A DataFrame is a two-dimensional data structure. Data in DataFrame is aligned in a tabular fashion in rows and columns. Here we are creating Dataframe since aggregation is to be applied on it. Following code is implemented to create DataFrame:

df = pd.DataFrame({'Gender' : ['Men', 'Women', 'Women', 'Men', 'Men', 'Men'],
                   'Height' : [154,160,135,120,154,132],
                   'Age' : [34,26,50,42,29,38],
                   'Weight' : [90,50,48,67,44,32]})

print(df)

OUTPUT

   Gender    Height    Age    Weight
0  Men       154       34     90
1  Women     160       26     50
2  Women     135       50     48
3  Men       120       42     67
4  Men       154       29     44
5  Men       132       38     32

(I) Grouping the Data

(i) .groupby() – Groupby is a simple concept of grouping the data depending upon the category. Let’s say we want to calculate the total number of employees working in an organization with salary greater than 10000. For this we need to group the employees first and than apply sum function to calculate total.

In real-life scenarios, there exists huge amount of data. In that case, it becomes difficult to handle the data if we want to perform operation on it. Hence, groupby() plays vital role to achieve efficient results. 

group_of_gender = df.groupby("Gender").groups
print(group_of_gender)

OUTPUT

{'Men': Int64Index([0, 3, 4, 5], dtype='int64'), 
'Women': Int64Index([1, 2], dtype='int64')}

In this example, we have grouped the data on the basis of Gender.

(II) Basic Functions required

(i) .mean()- Everyone of us must have came across “Results” during school and college time. The final grade which is present on the Marksheet is “Mean”. We sum the total marks we got in each subject and than simply divide it by total number of subjects. The mean is used to derive the central tendency of the data in question. Let’s look at the code given below:

df.groupby("Gender")['Weight'].mean()

OUTPUT

Gender
Men      58.25
Women    49.00
Name: Weight, dtype: float64

In this example, we have calculated Mean weight of each gender.

(ii) .max()- Whenever results are out, topper exists. How do we define that he/she is topper? If the student has scored highest or maximum marks among all student, than we say the student is topper. Similary, max() is used to calculate maximum value in the given range of values. Let’s look at the code given below:

df.groupby("Gender")['Weight'].max()

OUTPUT

Gender
Men      90
Women    50
Name: Weight, dtype: int64

In this example, we have calculated Maximum weight from each gender.

(iii) .min()- min() calculates minimum value within the given range. Let’s look at the code given below:

df.groupby("Gender")['Weight'].min()

OUTPUT

Gender
Men      32
Women    48
Name: Weight, dtype: int64

In this example, we have calculated Minimum weight from each gender.

(iv) .sum()-  sum() results in total of all the values for a given element. Let’s look at the code given below:
df.groupby("Gender")['Weight'].sum()

OUTPUT

Gender
Men      233
Women     98
Name: Weight, dtype: int64

In this example, we have calculated total/sum of weight for each gender.

(v) .count()- count() calculates the total number of items present. It returns an integer value which is the count of how many times the object as occurred or total number of object present. Let’s look at the code given below:
df.groupby("Gender").count()

OUTPUT

             Height     Age    Weight

Gender

Men          4          4       4

Women        2          2       2

(II) Aggregate DataFrame

agg() means aggregation(). Dataframe.agg() function is used to apply aggregation across one or more column.

image1

Example (i): In this example we are counting the total number of entries for each element. This is done using ‘count’.

df.agg(['count'])

OUTPUT

           Gender    Height    Age    Weight    

count      6         6         6      6

Similarly, you can try this with ‘min’, ‘avg’, ‘sum’ , ‘max’.

Example (ii): In this example, we are dividing the data on the basis of Gender. From each gender, minimum weight is calculated. Here, ‘groupby’ divides the data, ‘min’ finds minimum value and ‘agg’ aggregates the result.

df.groupby(['Gender'], as_index=False)['Weight'].agg('min')

OUTPUT

  Gender    Weight
0 Men       32
1 Women     48

Similarly, you can try this with ‘count’, ‘avg’, ‘sum’ , ‘max’.

Example (iii): In this example, we have aggregated “sum” and “min” of all the details present in the dataframe. “sum” calculates total and “min” finds the smallest value. This is known as Aggregation over Rows.

df.agg(['sum', 'min'])

OUTPUT

                     Gender    Height    Age    Weight 
sum  MenWomenWomenMenMenMen    855       219    331
min                     Men    120       26     32

Similarly, you can try this with other combination.

Example (iv): In this example, we have first grouped the data on the basis of Gender. For each gender we have calculated maximum and minimum weight along with maximum and minimum height. This result is aggregation using agg(). This is known as Aggregation per Column.

df.groupby(['Gender'], as_index=False).agg({'Weight' : ['max', 'min'], 'Height' : ['max', 'min']})

OUTPUT

Gender
Weight
Height
max
min
max
min
0
Men
90
32
154
120
1
Women
50
48
160
135

Similarly, you can try this with other combination.

Example (v): In this example, we have calculated the mean for each column. 92.666667 is the mean of Age, Height and Weight of Men from Row 0. This is also known as aggregation over the columns.

df.agg("mean", axis="columns")

OUTPUT

0    92.666667
1    78.666667
2    77.666667
3    76.333333
4    75.666667
5    67.333333
dtype: float64

Example (vi): In this example, we are grouping the data on the basis of “Gender” and “Height”. Hence, if we see the dataframe properly, there are two entries with same gender and same height. This would be considered as one entry. Therefore, age and weight of that two entries are added. On this data, aggregation is performed.

df.groupby(['Gender', 'Height'], as_index=False)['Age','Weight'].agg('sum')

OUTPUT

         Gender    Height   Age   Weight 

0        Men       120      42    67

1        Men       132      38    32

2        Men       154      63    134

3        Women     135      50    48

4        Women     160      26    50

Thank You.

You may also read: Tuple Constructor in Python.

Leave a Reply

Your email address will not be published. Required fields are marked *