How to aggregate Pandas DataFrame in Python?
In this article, we will study how to aggregate Pandas DataFrame using Python.
What is Aggregation?
Aggregation is a combination of several separate elements. An example of the aggregate would be to describe the total amount of bikes sold by a sales team in one month.
Aggregate Pandas DataFrame
Let us now understand how to aggregate pandas DataFrame using Python.
I) Import Pandas – Pandas is an open-source library provided by Python. Pandas provide an easy way to create and manipulate the data. Since we are dealing with DataFrame we need to import Pandas.
import pandas as pd
II) Create DataFrame – A DataFrame is a two-dimensional data structure. Data in DataFrame is aligned in a tabular fashion in rows and columns. Here we are creating Dataframe since aggregation is to be applied on it. Following code is implemented to create DataFrame:
df = pd.DataFrame({'Gender' : ['Men', 'Women', 'Women', 'Men', 'Men', 'Men'], 'Height' : [154,160,135,120,154,132], 'Age' : [34,26,50,42,29,38], 'Weight' : [90,50,48,67,44,32]}) print(df)
OUTPUT
Gender Height Age Weight 0 Men 154 34 90 1 Women 160 26 50 2 Women 135 50 48 3 Men 120 42 67 4 Men 154 29 44 5 Men 132 38 32
(I) Grouping the Data
(i) .groupby() – Groupby is a simple concept of grouping the data depending upon the category. Let’s say we want to calculate the total number of employees working in an organization with salary greater than 10000. For this we need to group the employees first and than apply sum function to calculate total.
In real-life scenarios, there exists huge amount of data. In that case, it becomes difficult to handle the data if we want to perform operation on it. Hence, groupby() plays vital role to achieve efficient results.
group_of_gender = df.groupby("Gender").groups print(group_of_gender)
OUTPUT
{'Men': Int64Index([0, 3, 4, 5], dtype='int64'), 'Women': Int64Index([1, 2], dtype='int64')}
In this example, we have grouped the data on the basis of Gender.
(II) Basic Functions required
(i) .mean()- Everyone of us must have came across “Results” during school and college time. The final grade which is present on the Marksheet is “Mean”. We sum the total marks we got in each subject and than simply divide it by total number of subjects. The mean is used to derive the central tendency of the data in question. Let’s look at the code given below:
df.groupby("Gender")['Weight'].mean()
OUTPUT
Gender Men 58.25 Women 49.00 Name: Weight, dtype: float64
In this example, we have calculated Mean weight of each gender.
(ii) .max()- Whenever results are out, topper exists. How do we define that he/she is topper? If the student has scored highest or maximum marks among all student, than we say the student is topper. Similary, max() is used to calculate maximum value in the given range of values. Let’s look at the code given below:
df.groupby("Gender")['Weight'].max()
OUTPUT
Gender Men 90 Women 50 Name: Weight, dtype: int64
In this example, we have calculated Maximum weight from each gender.
(iii) .min()- min() calculates minimum value within the given range. Let’s look at the code given below:
df.groupby("Gender")['Weight'].min()
OUTPUT
In this example, we have calculated Minimum weight from each gender.
df.groupby("Gender")['Weight'].sum()
OUTPUT
Gender Men 233 Women 98 Name: Weight, dtype: int64
In this example, we have calculated total/sum of weight for each gender.
df.groupby("Gender").count()
OUTPUT
Height Age Weight Gender Men 4 4 4 Women 2 2 2
(II) Aggregate DataFrame
agg() means aggregation(). Dataframe.agg() function is used to apply aggregation across one or more column.
Example (i): In this example we are counting the total number of entries for each element. This is done using ‘count’.
df.agg(['count'])
OUTPUT
Gender Height Age Weight count 6 6 6 6
Similarly, you can try this with ‘min’, ‘avg’, ‘sum’ , ‘max’.
Example (ii): In this example, we are dividing the data on the basis of Gender. From each gender, minimum weight is calculated. Here, ‘groupby’ divides the data, ‘min’ finds minimum value and ‘agg’ aggregates the result.
df.groupby(['Gender'], as_index=False)['Weight'].agg('min')
OUTPUT
Gender Weight 0 Men 32 1 Women 48
Similarly, you can try this with ‘count’, ‘avg’, ‘sum’ , ‘max’.
Example (iii): In this example, we have aggregated “sum” and “min” of all the details present in the dataframe. “sum” calculates total and “min” finds the smallest value. This is known as Aggregation over Rows.
df.agg(['sum', 'min'])
OUTPUT
Gender Height Age Weight sum MenWomenWomenMenMenMen 855 219 331 min Men 120 26 32
Similarly, you can try this with other combination.
Example (iv): In this example, we have first grouped the data on the basis of Gender. For each gender we have calculated maximum and minimum weight along with maximum and minimum height. This result is aggregation using agg(). This is known as Aggregation per Column.
df.groupby(['Gender'], as_index=False).agg({'Weight' : ['max', 'min'], 'Height' : ['max', 'min']})
OUTPUT
Gender | Weight | Height | |||
---|---|---|---|---|---|
max | min | max | min | ||
0 | Men | 90 | 32 | 154 | 120 |
1 | Women | 50 | 48 | 160 | 135 |
Similarly, you can try this with other combination.
Example (v): In this example, we have calculated the mean for each column. 92.666667 is the mean of Age, Height and Weight of Men from Row 0. This is also known as aggregation over the columns.
df.agg("mean", axis="columns")
OUTPUT
0 92.666667 1 78.666667 2 77.666667 3 76.333333 4 75.666667 5 67.333333 dtype: float64
Example (vi): In this example, we are grouping the data on the basis of “Gender” and “Height”. Hence, if we see the dataframe properly, there are two entries with same gender and same height. This would be considered as one entry. Therefore, age and weight of that two entries are added. On this data, aggregation is performed.
df.groupby(['Gender', 'Height'], as_index=False)['Age','Weight'].agg('sum')
OUTPUT
Gender Height Age Weight 0 Men 120 42 67 1 Men 132 38 32 2 Men 154 63 134 3 Women 135 50 48 4 Women 160 26 50
Thank You.
You may also read: Tuple Constructor in Python.
Leave a Reply