Grouping Data in Pandas

As you explore your data with Pandas, you will almost always want to see it in different formats. You may want to lok at the relationship between different observations or structure the data into something more easily read or understood.

Using the Pandas groupby function is a tool that you will use often. It is important to really get your head around it to make the most use of it.

There are many tutorials on the web showing how to use groupby. Some are very good and some are very hard to follow. Usually, they can get complicated very quickly.

I like to understand Pandas functions from their simplest use so that I remember the syntax. If I run into something more complicated then I look at the docs.

In short, understand that there is a method you want to use to help solve your problem. Know how it works. read the docs if you need to know more.

To write a Pandas tutorial you of course need data and a DataFrame to work on. I need something fairly simple so decided to construct my own.

normally tutorials show the usual and tortuous dictionary method of creating a DF. But let’s try another way for a bit of fun.

The DF will be 3 columns of 10 rows simulating a little sales table for musical instruments sold by area. In this example the data will be created randomly.

First we import the usual suspects along with random and from the random module, choice.

import random
from random import choice

import pandas as pd

Next create a function which takes a list and integer as arguments and returns a list of items picked randomly from the list.

def rndList(lst, length):
    c = []
    for x in range(length):
        rndChoice = random.choice(lst)
        c.append(rndChoice)
    return c

The next step is to create some lists to send to the function rndList and a list of random integers to simulate quantities of each instrument using the randint method.

area = ['north', 'south', 'east', 'west']
instruments = ['guitars', 'pianos']
quantity = [random.randint(5, 20) for p in range(10)]

Now create the lists of 10 random choices and use Python zip to create a variable lst.

data = list(zip(areas, item, quantity))

Now create a Pandas DataFrame from the list with the required column headings.

dfTest = pd.DataFrame(
    data,
    columns=['area', 'instrument', 'quantity'])

This should create a ‘dummy’ DataFrame with random data which will look something like this:

	area	instrument	quantity
0	north	pianos	17
1	north	guitars	17
2	east	guitars	14
3	north	guitars	15
4	west	guitars	5
5	north	pianos	10
6	west	guitars	9
7	south	guitars	6
8	south	pianos	20
9	south	pianos	11

Remember. This DF will change every time you run the code because it uses random choices from small lists to fill in the data. But it was fun to create a little function to create the data, zip everything up and create a DF. It also shows how to create a DF from lists.

Now imagine we want to see the quantity of sales by instrument for each area.

This is where the df.groupby method comes in.

Let’s look at the steps needed to build up the command:

Step	Entry
Step 1	dfTest.groupby()
Step 2a	[‘area’, ‘instrument’]
Step 2b	dfTest.groupby([‘area’, ‘instrument’])
Step 3	.agg()
Step4	Sales=(‘quantity’, ‘sum’)
Full command	dfTest.groupby([‘area’, ‘instrument’]).agg(Sales(‘quantity’,’sum’))

Create a new DF called dfGroup1 with your new groupby method.

dfGroup1 = dfTest.groupby(
    ["area", "instrument"]).agg(
    Sales=('quantity', 'sum')
    )

Where grouping can get confusing is when you do not use an aggregate method. If you enter:

dfTest.groupby(['area','instrument'])

Then python will return a reference to the grouped object:

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f83050dccf8>

Python is basically saying, ‘I’ve created the group object. Now do something with it.’ Which is where the .agg() method is used. There are many aggregate methods you can choose from. Mean, median, mode, standard deviation etc.

The new dfGroup1 DF will look like this:

		Sales
area	instrument
east	pianos	22
north	pianos	18
south	guitars	42
south	pianos	14
west	pianos	14

Note how we have a new column called ‘Sales’ which was created in by agg() method. You may also sometimes see a dictionary passed to the agg method such as:

dfGroup1 = dfTest.groupby(
    ["area", "instrument"]).agg({
    'quantity': sum
    })

I prefer the ability to rename columns when a new name which might better suit the view of the data that has been created.

You can the make a quick plot to see what your new DF looks like using:

dfGroup1.plot.bar()

This is a very simple look at grouping in pandas but the method of splitting the problem down to simpler blocks makes it easier to understand.

Often, you will receive new data which you will want to use in conjunction with your existing data.

In our example, we may want to look at sales people and their territories to create a view showing their performance. We would need to have a DF with the sale peoples details first. I’ll just make one up.

lst = [['Adam', 'north'],
       ['Bill', 'south'],
       ['Colin', 'east'],
       ['Davina', 'west']]

salesmen = pd.DataFrame(lst, columns=['name', 'area'])

	name	area
0	Adam	north
1	Bill	south
2	Colin	east
3	Davina	west

This will create the DF above. Notice how this is a small DF compared to the main one which has 3 columns with 10 rows. I don’t want to get into the mechanics of merging 2 DataFrames here but because they both share a column called ‘area’, it is quite simple to merge them.

dfNew = pd.merge(dfTest, salesmen)

	area	instrument	quantity	name
0	south	guitars	20	Bill
1	south	guitars	6	Bill
2	south	pianos	18	Bill
3	west	guitars	14	Davina
4	west	guitars	20	Davina
5	west	guitars	16	Davina
6	north	guitars	5	Adam
7	north	pianos	15	Adam
8	north	pianos	20	Adam
9	east	pianos	19	Colin

Don’t worry about the multiple entries in this DF. It has been created completely randomly. Just imagine that some entries are for different months.

Now run a groupby using the steps outlined above to show the sales person and now many of each instrument they have sold.

dfNewGroup = dfNew.groupby(['name', 'instrument']).agg(
    Sales=('quantity', 'sum'))

		Sales
name	instrument
Adam	guitars	17
Adam	pianos	6
Bill	pianos	46
Colin	pianos	14
Davina	guitars	18
Davina	pianos	53

A quick plot will show something like:

Grouping data into different views is a very powerful pandas tool. Hopefully these simple steps will help you to understand grouping and how it is used which you can then apply to real world data.

Grouping Data in Pandas

1 Comment

Leave a comment

Share this:

Related

1 Comment

Leave a comment