By Andrew Solomon - Geekuni mentor, software engineer and aspiring stall holder.
In this article, we’ll look at real-world problems and solve them both with and without Pandas. Along the way, you’ll see how Pandas can help you solve problems with less code that’s easier to read.
The goal isn’t to say you always need Pandas. It’s to show when it can help you get the answers you need from your data - sometimes even more easily than using a spreadsheet!
Introduction
For a long time I thought of Pandas as the Swiss Army knife for data scientists. However, when I started playing with it I realised that it was going to make a lot of things much more straightforward. Sharing Pandas with other developers was part of my motivation for putting together the Python Essentials course.
This article is a very quick Pandas taster where we walk you through a use-case which will provide you with the motivation to add it to your toolkit too.
Scenario
I’m running a market stall at Bondi Beach (I wish!) and it’s time to look over how my purchases compare with my sales to see what I need to fine-tune.
I have all purchase and sales data from 2025 in a spreadsheet, and here are my questions:
- What was my profit over the year?
- What was my profit per month? per produce?
- What did I buy too much of?
The data I have to work off is this CSV file - my ledger.
Preparation
First, clone the data and examples of this blog rather than copying and pasting:
git clone https://github.com/andrewsolomon/play_with_pandas.git
Because you’ll be installing two Python modules - Pandas and Babel - create a virtual environment first so that package dependencies don’t affect anything else you’re working on:
cd play_with_pandas
python3 -m venv .env
source .env/bin/activate
Finally, install Pandas and Babel:
python -m pip install -r requirements.txt
Example 1: What was my profit over the year?
Let’s start with the easiest, a spreadsheet:
- Open Google Sheets and import
ledger.csv - In
G1enterprofit - In
G2enter=ARRAYFORMULA(E2:E * F2:F - C2:C * D2:D) - In
I1enterTotal Profit - In
J1enter=ARRAYFORMULA(SUM(G2:G))
The non-Pandas Python approach involves looping over each row, type-casting the various fields from strings to integers or floats, and calculating the profit of each row in a for-loop.
#!/usr/bin/env python
import csv
with open('ledger.csv', 'r') as file:
rows = csv.DictReader(file)
profit = sum(
int(row['num_sold']) * float(row['retail_price'])
- int(row['num_purchased']) * float(row['wholesale_price'])
for row in rows
)
print(f'Profit: ${profit:,.2f}')
Here’s the Pandas approach:
#!/usr/bin/env python
import pandas as pd
df = pd.read_csv('ledger.csv')
df['profit'] = (
df['num_sold'] * df['retail_price']
- df['num_purchased'] * df['wholesale_price']
)
print(f'Profit: ${df["profit"].sum():,.2f}')
Both approaches give the same result:
$ ./ex01_year_profit.py
Profit: $113,624.06
Reflections
Think of df (short for DataFrame) as a spreadsheet, where we’re adding the column profit using a formula involving columns num_sold, retail_price, num_purchased and wholesale_price. As with a spreadsheet, we didn’t need to loop over the rows - we just did the calculation using columns.
Amazingly, we didn’t need to do any type casting (e.g. float(row['retail_price'])) - Pandas just guessed the column types for us!
Here are Pandas’ inferred column types:
>>> df.dtypes
date object
produce object
wholesale_price float64
num_purchased int64
num_sold int64
retail_price float64
profit float64
dtype: object
It’s got the prices and numbers right, but casting the date as an object means it will be treated just like a Python string which isn’t perfect. We’ll address this in the next example.
Example 2: What was my profit per month?
Without Pandas, the Python code involves starting with a default dictionary using months for keys, like this:
>>> from collections import defaultdict
>>> monthly_profit = defaultdict(float)
>>> monthly_profit['2025-01']
0.0
We’re extracting the month 2025-01 as the first 7 characters of the date string 2025-01-04 like this:
month = row['date'][:7]
The full implementation is, once again, a for-loop:
#!/usr/bin/env python
import csv
from collections import defaultdict
monthly_profit = defaultdict(float)
with open('ledger.csv', 'r') as file:
rows = csv.DictReader(file)
for row in rows:
month = row['date'][:7]
monthly_profit[month] += (
int(row['num_sold']) * float(row['retail_price'])
- int(row['num_purchased']) * float(row['wholesale_price'])
)
print('Profit per month: ')
for month in sorted(monthly_profit):
print(f'{month} ${monthly_profit[month]:,.2f}')
The Pandas approach is to make sure the date type is exactly what you expect using the parse_dates and date_format parameters:
>>> import pandas as pd
>>> df = pd.read_csv(
... 'ledger.csv',
... parse_dates=['date'],
... date_format={'date': '%Y-%m-%d'},
... )
>>> df.dtypes
date datetime64[ns]
produce object
wholesale_price float64
num_purchased int64
num_sold int64
retail_price float64
dtype: object
and with this you just create a month column on which you can group the results:
>>> df['month'] = df['date'].dt.to_period('M')
Here’s the code in full:
#!/usr/bin/env python
import pandas as pd
df = pd.read_csv(
'ledger.csv',
parse_dates=['date'],
date_format={'date': '%Y-%m-%d'},
)
df['profit'] = (
df['num_sold'] * df['retail_price']
- df['num_purchased'] * df['wholesale_price']
)
df['month'] = df['date'].dt.to_period('M')
print('Profit per month: ')
print(df.groupby(['month'])['profit'].sum().to_string())
To add currency and commas to the output, replace the last line with:
print(
df.groupby(['month'])['profit'].sum().map(
lambda x: f'${x:,.2f}'
).to_string()
)
The output:
$ ./ex02_monthly_profit.py
Profit per month:
month
2025-01 $10,338.84
2025-02 $12,237.55
2025-03 $12,597.41
2025-04 $9,571.15
2025-05 $11,805.39
2025-06 $2,006.74
2025-07 $4,067.32
2025-08 $8,476.62
2025-09 $7,450.12
2025-10 $12,747.84
2025-11 $10,063.17
2025-12 $12,261.91
Freq: M
And how do I do this with Google Sheets? I needed Copilot + GPT 5.4 to work this out - it uses pivot tables. The steps are explained here.
Reflections
The ease of using Pandas is illustrated by the fact that the sum method on a GroupBy object removes the need of a for-loop containing:
monthly_profit[month] += profit
Now, the same functionality was achieved with a pivot table in a spreadsheet, however using Pandas, with a single line you can get an altogether different view of the data - in this case the profit by produce:
print(df.groupby(['produce'])['profit'].sum())
or, even better, ordered by profit:
>>> print(df.groupby(['produce'])['profit'].sum().sort_values())
produce
Grapes (100g) 1153.09
Onion 1287.00
Potato 1451.35
Apple 1973.91
Sweet potato 2352.50
Pear 2356.40
Grapefruit 2441.30
Green beans (200g) 2518.40
Garlic bulb 2887.23
Cucumber 3004.48
Leek 3176.04
Mango 4396.24
Ginger root (approx 130g) 5110.31
Bananas (1 bunch) 6066.02
Guava 6373.25
Lettuce 6431.55
Pineapple 6456.27
Papaya 7560.97
Cabbage 7866.66
Snow peas (200g) 8187.14
Coconut 8422.83
Celery (1 bunch) 8985.71
Watermelon 13165.41
Name: profit, dtype: float64
Example 3: What did I buy too much of?
This is the most important information we’ll derive from the data. For each product, we work out the months where we overstocked to the point that we made a loss on it. Here’s a sample of what I mean, where the profit is a negative number:
Overstock report
num_purchased num_sold profit percent_sold
produce month
Apple 2025-01 1134 798 -$11.43 70%
2025-12 1073 687 -$97.16 64%
Bananas (1 bunch) 2025-02 1111 779 -$57.72 70%
Cabbage 2025-11 1476 974 -$498.57 66%
Coconut 2025-03 1394 992 -$18.74 71%
2025-08 1403 993 -$51.96 71%
Cucumber 2025-08 1502 1026 -$86.87 68%
...
Watermelon 2025-06 816 456 -$3,570.99 56%
2025-07 684 467 -$678.24 68%
Let’s start with the Pandas approach.
The first construct to observe is that we’re grouping the data by both the produce and the month, calculating the sum of the relevant values for each pair such as ('Apple', '2025-01'), or ('Coconut', '2025-05'):
per_produce_per_month = (
df.groupby(['produce', 'month'])[
['num_purchased', 'num_sold', 'profit']
]
.sum()
)
The other point to note is that we can easily filter those pairs to the ones where we were overstocked to the point of making a loss:
per_produce_per_month[per_produce_per_month['profit'] < 0]
The SQL equivalent is:
SELECT * FROM per_produce_per_month WHERE profit < 0
Here’s the full Pandas solution:
#!/usr/bin/env python
import pandas as pd
df = pd.read_csv('ledger.csv')
df['date'] = pd.to_datetime(df['date'])
df['profit'] = (
df['num_sold'] * df['retail_price']
- df['num_purchased'] * df['wholesale_price']
)
df['month'] = df['date'].dt.to_period('M')
# totals per produce per month
per_produce_per_month = (
df.groupby(['produce', 'month'])[
['num_purchased', 'num_sold', 'profit']
]
.sum()
)
# percent sold per produce per month
per_produce_per_month['percent_sold'] = (
per_produce_per_month['num_sold']
/ per_produce_per_month['num_purchased']
* 100
)
print('Overstock report')
print(per_produce_per_month[per_produce_per_month['profit'] < 0])
You can add Babel formatters to the last line in order to make the percentage and currency more readable:
# For pretty formatting
from babel.numbers import format_currency
print(per_produce_per_month[per_produce_per_month['profit'] < 0].to_string(
formatters={
'percent_sold': lambda value: f'{value:.0f}%',
'profit': lambda value: format_currency(
value,
'AUD',
locale='en_AU',
),
}
))
Reflections
I was so satisfied with the Pandas code that I left the non-Pandas Python and spreadsheet solutions to Copilot + GPT 5.4 AI coding tools.
Here’s how I run the non-Pandas Python code, piping the result through the column command for formatting:
./ex03_overstock_per_produce_per_month_nop.py | column -t -s $'\t'
This produces output almost as readable as the Pandas output:
produce month num_purchased num_sold profit percent_sold
Apple 2025-01 1134 798 -$11.43 %70.37
Apple 2025-12 1073 687 -$97.16 %64.03
Bananas (1 bunch) 2025-02 1111 779 -$57.72 %70.12
Cabbage 2025-11 1476 974 -$498.57 %65.99
Coconut 2025-03 1394 992 -$18.74 %71.16
...
At the core of the algorithm is a construct conceptually similar to the Pandas GroupBy, namely, a dictionary with a key consisting of both the produce and month:
key = (row['produce'], row['date'][:7])
I found no pivot tables solution with Google sheets and Chat GPT 5.4 came up with this single-cell formula which looks a lot like SQL:
=QUERY(
QUERY(
{
ledger!B2:B,
ledger!G2:G,
ARRAYFORMULA(VALUE(ledger!D2:D)),
ARRAYFORMULA(VALUE(ledger!E2:E)),
ARRAYFORMULA(VALUE(ledger!H2:H))
},
"select Col1, Col2, sum(Col3), sum(Col4), sum(Col5)
where Col1 is not null
group by Col1, Col2
label
Col1 'produce',
Col2 'month',
sum(Col3) 'num_purchased',
sum(Col4) 'num_sold',
sum(Col5) 'profit'",
0
),
"select * where Col5 < 0",
1
)
To work through it yourself, follow these instructions.
Finally, why not just do all of this in SQL? At first glance, the query is straightforward:
SELECT
produce,
strftime('%Y-%m', date) AS month,
SUM(num_purchased) AS num_purchased,
SUM(num_sold) AS num_sold,
SUM(num_sold) * 100.0 / SUM(num_purchased) AS percent_sold,
SUM(num_sold * retail_price) - SUM(num_purchased * wholesale_price) AS profit
FROM ledger
GROUP BY produce, month
HAVING profit < 0
ORDER BY produce, month;
And for this example, SQL works perfectly well. But this brings us back to my real job: deploying microservices to the cloud.
Many production microservices follow a simple pattern: retrieve data from one system, transform it, and pass the result on to the next stage. Pandas is a good fit for this kind of in-process data transformation, especially when the surrounding logic is in Python.
In that model, the service does not need to persist intermediate data to its own local disk. The data can be processed in memory and then handed off to the next service via a queue or API.
That doesn’t make the system magically secure, but it does reduce local data-at-rest exposure. There are fewer temporary files, database tables, host volumes, snapshots, and backups containing intermediate data. For cloud-hosted microservices, that can simplify both the architecture and the security model.
Conclusion
Pandas is a powerful, yet accessible, tool that simplifies data analysis. It treats your data like a spreadsheet (a DataFrame), letting you perform calculations on entire columns without looping through rows or worrying about type casting.
The result: grouping, aggregating, and filtering data with significantly less - and more readable - code than raw Python or a spreadsheet.

No comments:
Post a Comment