Grouping Data in Pandas

As you explore your data with Pandas, you will almost always want to see it in different formats. You may want to lok at the relationship between different observations or structure the data into something more easily read or understood.

Using the Pandas groupby function is a tool that you will use often. It is important to really get your head around it to make the most use of it.

There are many tutorials on the web showing how to use groupby. Some are very good and some are very hard to follow. Usually, they can get complicated very quickly.

I like to understand Pandas functions from their simplest use so that I remember the syntax. If I run into something more complicated then I look at the docs.

In short, understand that there is a method you want to use to help solve your problem. Know how it works. read the docs if you need to know more.

To write a Pandas tutorial you of course need data and a DataFrame to work on. I need something fairly simple so decided to construct my own.

normally tutorials show the usual and tortuous dictionary method of creating a DF. But let’s try another way for a bit of fun.

The DF will be 3 columns of 10 rows simulating a little sales table for musical instruments sold by area. In this example the data will be created randomly.

First we import the usual suspects along with random and from the random module, choice.

import random
from random import choice

import pandas as pd

Next create a function which takes a list and integer as arguments and returns a list of items picked randomly from the list.

def rndList(lst, length):
    c = []
    for x in range(length):
        rndChoice = random.choice(lst)
        c.append(rndChoice)
    return c

The next step is to create some lists to send to the function rndList and a list of random integers to simulate quantities of each instrument using the randint method.

area = ['north', 'south', 'east', 'west']
instruments = ['guitars', 'pianos']
quantity = [random.randint(5, 20) for p in range(10)]

Now create the lists of 10 random choices and use Python zip to create a variable lst.

data = list(zip(areas, item, quantity))

Now create a Pandas DataFrame from the list with the required column headings.

dfTest = pd.DataFrame(
    data,
    columns=['area', 'instrument', 'quantity'])

This should create a ‘dummy’ DataFrame with random data which will look something like this:

area instrument quantity
0 north pianos 17
1 north guitars 17
2 east guitars 14
3 north guitars 15
4 west guitars 5
5 north pianos 10
6 west guitars 9
7 south guitars 6
8 south pianos 20
9 south pianos 11

Remember. This DF will change every time you run the code because it uses random choices from small lists to fill in the data. But it was fun to create a little function to create the data, zip everything up and create a DF. It also shows how to create a DF from lists.

Now imagine we want to see the quantity of sales by instrument for each area.

This is where the df.groupby method comes in.

Let’s look at the steps needed to build up the command:

StepEntry
Step 1dfTest.groupby()
Step 2a[‘area’, ‘instrument’]
Step 2bdfTest.groupby([‘area’, ‘instrument’])
Step 3.agg()
Step4Sales=(‘quantity’, ‘sum’)
Full commanddfTest.groupby([‘area’, ‘instrument’]).agg(Sales(‘quantity’,’sum’))

Create a new DF called dfGroup1 with your new groupby method.

dfGroup1 = dfTest.groupby(
    ["area", "instrument"]).agg(
    Sales=('quantity', 'sum')
    )

Where grouping can get confusing is when you do not use an aggregate method. If you enter:

dfTest.groupby(['area','instrument'])

Then python will return a reference to the grouped object:

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f83050dccf8>

Python is basically saying, ‘I’ve created the group object. Now do something with it.’ Which is where the .agg() method is used. There are many aggregate methods you can choose from. Mean, median, mode, standard deviation etc.

The new dfGroup1 DF will look like this:

Sales
area instrument
east pianos 22
north pianos 18
south guitars 42
pianos 14
west pianos 14

Note how we have a new column called ‘Sales’ which was created in by agg() method. You may also sometimes see a dictionary passed to the agg method such as:

dfGroup1 = dfTest.groupby(
    ["area", "instrument"]).agg({
    'quantity': sum
    })

I prefer the ability to rename columns when a new name which might better suit the view of the data that has been created.

You can the make a quick plot to see what your new DF looks like using:

dfGroup1.plot.bar()

This is a very simple look at grouping in pandas but the method of splitting the problem down to simpler blocks makes it easier to understand.

Often, you will receive new data which you will want to use in conjunction with your existing data.

In our example, we may want to look at sales people and their territories to create a view showing their performance. We would need to have a DF with the sale peoples details first. I’ll just make one up.

lst = [['Adam', 'north'],
       ['Bill', 'south'],
       ['Colin', 'east'],
       ['Davina', 'west']]

salesmen = pd.DataFrame(lst, columns=['name', 'area'])
name area
0 Adam north
1 Bill south
2 Colin east
3 Davina west

This will create the DF above. Notice how this is a small DF compared to the main one which has 3 columns with 10 rows. I don’t want to get into the mechanics of merging 2 DataFrames here but because they both share a column called ‘area’, it is quite simple to merge them.

dfNew = pd.merge(dfTest, salesmen)
area instrument quantity name
0 south guitars 20 Bill
1 south guitars 6 Bill
2 south pianos 18 Bill
3 west guitars 14 Davina
4 west guitars 20 Davina
5 west guitars 16 Davina
6 north guitars 5 Adam
7 north pianos 15 Adam
8 north pianos 20 Adam
9 east pianos 19 Colin

Don’t worry about the multiple entries in this DF. It has been created completely randomly. Just imagine that some entries are for different months.

Now run a groupby using the steps outlined above to show the sales person and now many of each instrument they have sold.

dfNewGroup = dfNew.groupby(['name', 'instrument']).agg(
    Sales=('quantity', 'sum'))
Sales
name instrument
Adam guitars 17
pianos 6
Bill pianos 46
Colin pianos 14
Davina guitars 18
pianos 53

A quick plot will show something like:

Grouping data into different views is a very powerful pandas tool. Hopefully these simple steps will help you to understand grouping and how it is used which you can then apply to real world data.

1 Comment

Leave a comment