pandas.DataFrame.sort_values in Python
In Data Science, Pandas has now become a great tool for handling an extremely huge amount of data with ease like any array. It is often needed to sort data for analysis. Though iterating through the rows of the data set and sorting them is possible, it might take time for large data sets. Pandas DataFrame object has a method called sort_values that allows sorting data in the way it is needed.
The sort_values() method
The sort_values() method of Pandas DataFrame has the signature,
DataFrame.sort_values(self, by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False)
Arguments
self |
|
axis |
|
by |
|
ascending |
|
inplace |
|
kind |
|
na_position |
|
ignore_index |
|
Example
For example, consider the dataset,
0 1 2 3 4 5 6 column1 Violet Indigo Blue Green NaN Orange Red column2 0 1 2 3 4 5 6 column3 Table Chair Phone Laptop Desktop Tablet Bench
>>>import pandas as pd
>>>import numpy as np
>>>col3 = ['Table',' Chair', 'Phone', 'Laptop', 'Desktop', 'Tablet',' Bench']
>>>col2 = [0, 1, 2, 3, 4, 5, 6]
>>>col1 = [ 'Violet', 'Indigo', 'Blue', 'Green', np.NaN, 'Orange', 'Red']
>>>df = pd.DataFrame({'column1':col1,'column2':col2,'column3':col3})
>>> df
column1 column2 column3
0 Violet 0 Table
1 Indigo 1 Chair
2 Blue 2 Phone
3 Green 3 Laptop
4 NaN 4 Desktop
5 Orange 5 Tablet
6 Red 6 Bench>>> df.sort_values(['column1']) column1 column2 column3 2 Blue 2 Phone 3 Green 3 Laptop 1 Indigo 1 Chair 5 Orange 5 Tablet 6 Red 6 Bench 0 Violet 0 Table 4 NaN 4 Desktop
The data are sorted in ascending order based on the values in ‘column1’. If NaN has to appear at the top, set na_position='first'
>>> df.sort_values(['column1'],na_position='first') column1 column2 column3 4 NaN 4 Desktop 2 Blue 2 Phone 3 Green 3 Laptop 1 Indigo 1 Chair 5 Orange 5 Tablet 6 Red 6 Bench 0 Violet 0 Table
Setting ascending=False,
>>> df.sort_values(['column1'],na_position='first',ascending=False) column1 column2 column3 4 NaN 4 Desktop 0 Violet 0 Table 6 Red 6 Bench 5 Orange 5 Tablet 1 Indigo 1 Chair 3 Green 3 Laptop 2 Blue 2 Phone
The data are sorted alphabetically in descending order based on the values in ‘column1’. Note that the NaN value is retained at the top because na_position is set to ‘first’, otherwise, NaN value will be at the bottom,
>>> df.sort_values(['column1'],ascending=False) column1 column2 column3 0 Violet 0 Table 6 Red 6 Bench 5 Orange 5 Tablet 1 Indigo 1 Chair 3 Green 3 Laptop 2 Blue 2 Phone 4 NaN 4 Desktop
Changing the value of the argument kind will not affect small datasets. They all will give the same result as before,
>>> df.sort_values(['column1'],kind='heapsort') column1 column2 column3 2 Blue 2 Phone 3 Green 3 Laptop 1 Indigo 1 Chair 5 Orange 5 Tablet 6 Red 6 Bench 0 Violet 0 Table 4 NaN 4 Desktop >>> df.sort_values(['column1'],kind='mergesort') column1 column2 column3 2 Blue 2 Phone 3 Green 3 Laptop 1 Indigo 1 Chair 5 Orange 5 Tablet 6 Red 6 Bench 0 Violet 0 Table 4 NaN 4 Desktop
So far, the axis was set to default(0 or ‘index’). To be able to understand the effect of changing the axis to 1, change the index with the set_index() method to ‘column2’. The method set_index can also set the index of the data set to one of the columns in the data set.
>>> df.set_index('column2')
column1 column3
column2
0 Violet Table
1 Indigo Chair
2 Blue Phone
3 Green Laptop
4 NaN Desktop
5 Orange Tablet
6 Red BenchIf the data are sorted with index value 1 and axis 1,
>>> df.set_index('column2').sort_values([1],axis=1)
column3 column1
column2
0 Table Violet
1 Chair Indigo
2 Phone Blue
3 Laptop Green
4 Desktop NaN
5 Tablet Orange
6 Bench Red
Previously, when the axis was 0, and when the data were sorted, the rows of the data changed accordingly. Now when the data is sorted with axis=1, the columns of the data change based on the values in the column. The data is sorted based on the row with index 1. Note the difference before and after the sort. This is similar to sorting the transpose of the data with axis=0. In the examples above, when the data were sorted with axis=0, the indices also changed along with the data. Setting the value of ignore_index to True, the index values can be retained as such.
>>> df.sort_values(['column1'],ignore_index=True) column1 column2 column3 0 Blue 2 Phone 1 Green 3 Laptop 2 Indigo 1 Chair 3 Orange 5 Tablet 4 Red 6 Bench 5 Violet 0 Table 6 NaN 4 Desktop
Otherwise,
>>> df.sort_values(['column1'],ignore_index=False) column1 column2 column3 2 Blue 2 Phone 3 Green 3 Laptop 1 Indigo 1 Chair 5 Orange 5 Tablet 6 Red 6 Bench 0 Violet 0 Table 4 NaN 4 Desktop
Note the difference between the indices of the above two examples.
So far, the value of the argument inplace was set to False. So the Python interpreter printed the data frame that was sorted and returned by the method sort_values. If the value of inplace is set to True, the method will no longer return the sorted data. Instead, it will sort the data and store it in the same object.
>>> df.sort_values(['column1'],inplace=True) >>> df column1 column2 column3 2 Blue 2 Phone 3 Green 3 Laptop 1 Indigo 1 Chair 5 Orange 5 Tablet 6 Red 6 Bench 0 Violet 0 Table 4 NaN 4 Desktop
Notice that after the execution of the statement, the DataFrame is not printed.
Leave a Reply