Why learn Pandas?

By Andrew Solomon - Geekuni mentor, software engineer and aspiring stall holder.

In this article, we’ll look at real-world problems and solve them both with and without Pandas. Along the way, you’ll see how Pandas can help you solve problems with less code that’s easier to read.

The goal isn’t to say you always need Pandas. It’s to show when it can help you get the answers you need from your data - sometimes even more easily than using a spreadsheet!

Introduction

For a long time I thought of Pandas as the Swiss Army knife for data scientists. However, when I started playing with it I realised that it was going to make a lot of things much more straightforward. Sharing Pandas with other developers was part of my motivation for putting together the Python Essentials course.

This article is a very quick Pandas taster where we walk you through a use-case which will provide you with the motivation to add it to your toolkit too.

Scenario

I’m running a market stall at Bondi Beach (I wish!) and it’s time to look over how my purchases compare with my sales to see what I need to fine-tune.

I have all purchase and sales data from 2025 in a spreadsheet, and here are my questions:

What was my profit over the year?
What was my profit per month? per produce?
What did I buy too much of?

The data I have to work off is this CSV file - my ledger.

Preparation

First, clone the data and examples of this blog rather than copying and pasting:

git clone https://github.com/andrewsolomon/play_with_pandas.git

Because you’ll be installing two Python modules - Pandas and Babel - create a virtual environment first so that package dependencies don’t affect anything else you’re working on:

cd play_with_pandas
python3 -m venv .env
source .env/bin/activate

Finally, install Pandas and Babel:

python -m pip install -r requirements.txt

Example 1: What was my profit over the year?

Let’s start with the easiest, a spreadsheet:

Open Google Sheets and import ledger.csv
In G1 enter profit
In G2 enter =ARRAYFORMULA(E2:E * F2:F - C2:C * D2:D)
In I1 enter Total Profit
In J1 enter =ARRAYFORMULA(SUM(G2:G))

The non-Pandas Python approach involves looping over each row, type-casting the various fields from strings to integers or floats, and calculating the profit of each row in a for-loop.

#!/usr/bin/env python

import csv

with open('ledger.csv', 'r') as file:
    rows = csv.DictReader(file)
    profit = sum(
        int(row['num_sold']) * float(row['retail_price'])
        - int(row['num_purchased']) * float(row['wholesale_price'])
        for row in rows
    )

print(f'Profit: ${profit:,.2f}')

Here’s the Pandas approach:

#!/usr/bin/env python

import pandas as pd

df = pd.read_csv('ledger.csv')
df['profit'] = (
    df['num_sold'] * df['retail_price']
    - df['num_purchased'] * df['wholesale_price']
)
print(f'Profit: ${df["profit"].sum():,.2f}')

Both approaches give the same result:

$ ./ex01_year_profit.py 
Profit: $113,624.06

Reflections

Think of df (short for DataFrame) as a spreadsheet, where we’re adding the column profit using a formula involving columns num_sold, retail_price, num_purchased and wholesale_price. As with a spreadsheet, we didn’t need to loop over the rows - we just did the calculation using columns.

Amazingly, we didn’t need to do any type casting (e.g. float(row['retail_price'])) - Pandas just guessed the column types for us!

Here are Pandas’ inferred column types:

>>> df.dtypes
date                object
produce             object
wholesale_price    float64
num_purchased        int64
num_sold             int64
retail_price       float64
profit             float64
dtype: object

It’s got the prices and numbers right, but casting the date as an object means it will be treated just like a Python string which isn’t perfect. We’ll address this in the next example.

Example 2: What was my profit per month?

Without Pandas, the Python code involves starting with a default dictionary using months for keys, like this:

>>> from collections import defaultdict
>>> monthly_profit = defaultdict(float) 
>>> monthly_profit['2025-01']
0.0

We’re extracting the month 2025-01 as the first 7 characters of the date string 2025-01-04 like this:

month = row['date'][:7]

The full implementation is, once again, a for-loop:

#!/usr/bin/env python

import csv
from collections import defaultdict

monthly_profit = defaultdict(float)

with open('ledger.csv', 'r') as file:
    rows = csv.DictReader(file)

    for row in rows:
        month = row['date'][:7]
        monthly_profit[month] += (
            int(row['num_sold']) * float(row['retail_price'])
            - int(row['num_purchased']) * float(row['wholesale_price'])
        )

print('Profit per month: ')
for month in sorted(monthly_profit):
    print(f'{month}    ${monthly_profit[month]:,.2f}')

The Pandas approach is to make sure the date type is exactly what you expect using the parse_dates and date_format parameters:

>>> import pandas as pd

>>> df = pd.read_csv(
...     'ledger.csv',
...     parse_dates=['date'],
...     date_format={'date': '%Y-%m-%d'},
... )

>>> df.dtypes
date               datetime64[ns]
produce                    object
wholesale_price           float64
num_purchased               int64
num_sold                    int64
retail_price              float64
dtype: object

and with this you just create a month column on which you can group the results:

>>> df['month'] = df['date'].dt.to_period('M')

Here’s the code in full:

#!/usr/bin/env python

import pandas as pd

df = pd.read_csv(
    'ledger.csv',
    parse_dates=['date'],
    date_format={'date': '%Y-%m-%d'},
)

df['profit'] = (
    df['num_sold'] * df['retail_price']
    - df['num_purchased'] * df['wholesale_price']
)

df['month'] = df['date'].dt.to_period('M')
print('Profit per month: ')
print(df.groupby(['month'])['profit'].sum().to_string())

To add currency and commas to the output, replace the last line with:

print(
    df.groupby(['month'])['profit'].sum().map(
        lambda x: f'${x:,.2f}'
    ).to_string()
)

The output:

$ ./ex02_monthly_profit.py
Profit per month: 
month
2025-01    $10,338.84
2025-02    $12,237.55
2025-03    $12,597.41
2025-04     $9,571.15
2025-05    $11,805.39
2025-06     $2,006.74
2025-07     $4,067.32
2025-08     $8,476.62
2025-09     $7,450.12
2025-10    $12,747.84
2025-11    $10,063.17
2025-12    $12,261.91
Freq: M

And how do I do this with Google Sheets? I needed Copilot + GPT 5.4 to work this out - it uses pivot tables. The steps are explained here.

Reflections

The ease of using Pandas is illustrated by the fact that the sum method on a GroupBy object removes the need of a for-loop containing:

monthly_profit[month] += profit

Now, the same functionality was achieved with a pivot table in a spreadsheet, however using Pandas, with a single line you can get an altogether different view of the data - in this case the profit by produce:

print(df.groupby(['produce'])['profit'].sum())

or, even better, ordered by profit:

>>> print(df.groupby(['produce'])['profit'].sum().sort_values())
produce
Grapes (100g)                 1153.09
Onion                         1287.00
Potato                        1451.35
Apple                         1973.91
Sweet potato                  2352.50
Pear                          2356.40
Grapefruit                    2441.30
Green beans (200g)            2518.40
Garlic bulb                   2887.23
Cucumber                      3004.48
Leek                          3176.04
Mango                         4396.24
Ginger root (approx 130g)     5110.31
Bananas (1 bunch)             6066.02
Guava                         6373.25
Lettuce                       6431.55
Pineapple                     6456.27
Papaya                        7560.97
Cabbage                       7866.66
Snow peas (200g)              8187.14
Coconut                       8422.83
Celery (1 bunch)              8985.71
Watermelon                   13165.41
Name: profit, dtype: float64

Example 3: What did I buy too much of?

This is the most important information we’ll derive from the data. For each product, we work out the months where we overstocked to the point that we made a loss on it. Here’s a sample of what I mean, where the profit is a negative number:

Overstock report
                                   num_purchased  num_sold     profit percent_sold
produce                   month                                                   
Apple                     2025-01           1134       798    -$11.43          70%
                          2025-12           1073       687    -$97.16          64%
Bananas (1 bunch)         2025-02           1111       779    -$57.72          70%
Cabbage                   2025-11           1476       974   -$498.57          66%
Coconut                   2025-03           1394       992    -$18.74          71%
                          2025-08           1403       993    -$51.96          71%
Cucumber                  2025-08           1502      1026    -$86.87          68%
...
Watermelon                2025-06            816       456 -$3,570.99          56%
                          2025-07            684       467   -$678.24          68%

Let’s start with the Pandas approach.

The first construct to observe is that we’re grouping the data by both the produce and the month, calculating the sum of the relevant values for each pair such as ('Apple', '2025-01'), or ('Coconut', '2025-05'):

per_produce_per_month = (
    df.groupby(['produce', 'month'])[
        ['num_purchased', 'num_sold', 'profit']
    ]
    .sum()
)

The other point to note is that we can easily filter those pairs to the ones where we were overstocked to the point of making a loss:

per_produce_per_month[per_produce_per_month['profit'] < 0]

The SQL equivalent is:

SELECT * FROM per_produce_per_month WHERE profit < 0

Here’s the full Pandas solution:

#!/usr/bin/env python

import pandas as pd

df = pd.read_csv('ledger.csv')
df['date'] = pd.to_datetime(df['date'])
df['profit'] = (
    df['num_sold'] * df['retail_price']
    - df['num_purchased'] * df['wholesale_price']
)
df['month'] = df['date'].dt.to_period('M')

# totals per produce per month
per_produce_per_month = (
    df.groupby(['produce', 'month'])[
        ['num_purchased', 'num_sold', 'profit']
    ]
    .sum()
)

# percent sold per produce per month
per_produce_per_month['percent_sold'] = (
    per_produce_per_month['num_sold']
    / per_produce_per_month['num_purchased']
    * 100
)

print('Overstock report')
print(per_produce_per_month[per_produce_per_month['profit'] < 0])

You can add Babel formatters to the last line in order to make the percentage and currency more readable:

# For pretty formatting
from babel.numbers import format_currency
print(per_produce_per_month[per_produce_per_month['profit'] < 0].to_string(
        formatters={
            'percent_sold': lambda value: f'{value:.0f}%',
            'profit': lambda value: format_currency(
                value,
                'AUD',
                locale='en_AU',
            ),
        }
))

Reflections

I was so satisfied with the Pandas code that I left the non-Pandas Python and spreadsheet solutions to Copilot + GPT 5.4 AI coding tools.

Here’s how I run the non-Pandas Python code, piping the result through the column command for formatting:

./ex03_overstock_per_produce_per_month_nop.py  | column -t -s $'\t'

This produces output almost as readable as the Pandas output:

produce                    month    num_purchased  num_sold  profit     percent_sold
Apple                      2025-01  1134           798       -$11.43    %70.37
Apple                      2025-12  1073           687       -$97.16    %64.03
Bananas (1 bunch)          2025-02  1111           779       -$57.72    %70.12
Cabbage                    2025-11  1476           974       -$498.57   %65.99
Coconut                    2025-03  1394           992       -$18.74    %71.16
...

At the core of the algorithm is a construct conceptually similar to the Pandas GroupBy, namely, a dictionary with a key consisting of both the produce and month:

key = (row['produce'], row['date'][:7])

I found no pivot tables solution with Google sheets and Chat GPT 5.4 came up with this single-cell formula which looks a lot like SQL:

=QUERY(
  QUERY(
    {
      ledger!B2:B,
      ledger!G2:G,
      ARRAYFORMULA(VALUE(ledger!D2:D)),
      ARRAYFORMULA(VALUE(ledger!E2:E)),
      ARRAYFORMULA(VALUE(ledger!H2:H))
    },
    "select Col1, Col2, sum(Col3), sum(Col4), sum(Col5)
     where Col1 is not null
     group by Col1, Col2
     label
       Col1 'produce',
       Col2 'month',
       sum(Col3) 'num_purchased',
       sum(Col4) 'num_sold',
       sum(Col5) 'profit'",
    0
  ),
  "select * where Col5 < 0",
  1
)

To work through it yourself, follow these instructions.

Finally, why not just do all of this in SQL? At first glance, the query is straightforward:

SELECT
    produce,
    strftime('%Y-%m', date) AS month,
    SUM(num_purchased) AS num_purchased,
    SUM(num_sold) AS num_sold,
    SUM(num_sold) * 100.0 / SUM(num_purchased) AS percent_sold,
    SUM(num_sold * retail_price) - SUM(num_purchased * wholesale_price) AS profit
FROM ledger
GROUP BY produce, month
HAVING profit < 0
ORDER BY produce, month;

And for this example, SQL works perfectly well. But this brings us back to my real job: deploying microservices to the cloud.

Many production microservices follow a simple pattern: retrieve data from one system, transform it, and pass the result on to the next stage. Pandas is a good fit for this kind of in-process data transformation, especially when the surrounding logic is in Python.

In that model, the service does not need to persist intermediate data to its own local disk. The data can be processed in memory and then handed off to the next service via a queue or API.

That doesn’t make the system magically secure, but it does reduce local data-at-rest exposure. There are fewer temporary files, database tables, host volumes, snapshots, and backups containing intermediate data. For cloud-hosted microservices, that can simplify both the architecture and the security model.

Conclusion

Pandas is a powerful, yet accessible, tool that simplifies data analysis. It treats your data like a spreadsheet (a DataFrame), letting you perform calculations on entire columns without looping through rows or worrying about type casting.

The result: grouping, aggregating, and filtering data with significantly less - and more readable - code than raw Python or a spreadsheet.

Software development training - Geekuni blog

Thursday, 4 June 2026

Why learn Pandas?

Introduction

Scenario

Preparation

Example 1: What was my profit over the year?

Reflections

Example 2: What was my profit per month?

Reflections

Example 3: What did I buy too much of?

Reflections

Conclusion

No comments:

Post a Comment