Group by operations comparison in pandas, dplyr and SQL
This is a comparison of the interfaces of pandas, dplyr and SQL to do several calculations based on groups. I’ve come accross many of these calculations very often at work.
The goal is to see which one is simpler to use in each case. Time benchmarks of several data analysis libraries, like dplyr, data.table, pandas, and dask are done very often. What is less common is to see how the API of these libraries compare and I think this is key when choosing one of them.
I’ve attempted to do things as simple as possible in each of the three languages, don’t hesitate to tell me if you’ve come up with simpler ways. I’ve also committed to the outputs of each of my calculations to be “tidy”. That is, if a dataframe comes in, I want a dataframe to come out. I don’t want the calculation to return a vector, array, series or matrix. I also don’t want to keep any row names or indexes, the indexing of the rows should always keep being from 1 to the number of rows.
Load libraries and data
First we load the Palmer Penguins dataset both in python and R. In SQL,
we have a table called penguins
on our database.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(palmerpenguins)
df <- palmerpenguins::penguins
import pandas as pd
from palmerpenguins import load_penguins
df_py = load_penguins()
df_py
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
## 0 Adelie Torgersen 39.1 18.7 181.0 3750.0 male 2007
## 1 Adelie Torgersen 39.5 17.4 186.0 3800.0 female 2007
## 2 Adelie Torgersen 40.3 18.0 195.0 3250.0 female 2007
## 3 Adelie Torgersen NaN NaN NaN NaN NaN 2007
## 4 Adelie Torgersen 36.7 19.3 193.0 3450.0 female 2007
## .. ... ... ... ... ... ... ... ...
## 339 Chinstrap Dream 55.8 19.8 207.0 4000.0 male 2009
## 340 Chinstrap Dream 43.5 18.1 202.0 3400.0 female 2009
## 341 Chinstrap Dream 49.6 18.2 193.0 3775.0 male 2009
## 342 Chinstrap Dream 50.8 19.0 210.0 4100.0 male 2009
## 343 Chinstrap Dream 50.2 18.7 198.0 3775.0 female 2009
##
## [344 rows x 8 columns]
Count based on 1 column
This is by far the simplest of the calculations. We want to count how
many penguins are there for each of the species
. Python seems to be
the more complicated in this case. This is mainly because the size
method returns a series, and we have to rename it to have a meaningful
name for the count.
SELECT species, count(*) FROM "penguins" group by 1
df %>% count(species)
## # A tibble: 3 x 2
## species n
## <fct> <int>
## 1 Adelie 152
## 2 Chinstrap 68
## 3 Gentoo 124
(
df_py
.groupby("species")
.size()
.to_frame()
.reset_index()
.rename(columns={0: 'n'})
)
## species n
## 0 Adelie 152
## 1 Chinstrap 68
## 2 Gentoo 124
Count based on 2 columns
This is a very similar case as the previous one. The only difference is
that we’re counting the number of penguins by species
and island
.
SELECT species, island, count(*) FROM "penguins" group by 1, 2
df %>% count(species, island)
## # A tibble: 5 x 3
## species island n
## <fct> <fct> <int>
## 1 Adelie Biscoe 44
## 2 Adelie Dream 56
## 3 Adelie Torgersen 52
## 4 Chinstrap Dream 68
## 5 Gentoo Biscoe 124
(
df_py
.groupby(["species", "island"])
.size()
.to_frame()
.reset_index()
.rename(columns={0: 'n'})
)
## species island n
## 0 Adelie Biscoe 44
## 1 Adelie Dream 56
## 2 Adelie Torgersen 52
## 3 Chinstrap Dream 68
## 4 Gentoo Biscoe 124
Many aggregates per column
In here we want to compute several aggregations on different columns.
For every species
and island
we want:
- The average
bill_length_mm
- The median
bill_length_mm
- The number of penguins whose
flipper_length_mm
is not null.
Python has a nice thing that allows to compute several aggregations
functions for a given column. However, the price to pay is that the
obtained dataframe is not tidy. This is because we have multiindexed
columns, and we can use the flatten_cols
to transform the columns to a
list. I use it very often to tidy my multiindexed dataframes.
SELECT
species,
island,
avg(bill_length_mm),
median(bill_length_mm),
sum(flipper_length_mm is not null) FROM "penguins"
group by 1, 2
df %>%
group_by(species, island) %>%
summarise(
mean_bill_length_mm = mean(bill_length_mm, na.rm = T),
median_bill_length_mm = median(bill_length_mm, na.rm = T),
not_null_flipper = sum(!is.na(flipper_length_mm))
)
## `summarise()` has grouped output by 'species'. You can override using the `.groups` argument.
## # A tibble: 5 x 5
## # Groups: species [3]
## species island mean_bill_length_mm median_bill_length_mm not_null_flipper
## <fct> <fct> <dbl> <dbl> <int>
## 1 Adelie Biscoe 39.0 38.7 44
## 2 Adelie Dream 38.5 38.6 56
## 3 Adelie Torgersen 39.0 38.9 51
## 4 Chinstrap Dream 48.8 49.6 68
## 5 Gentoo Biscoe 47.5 47.3 123
def flatten_cols(df_py):
df_py.columns = [' '.join(col).strip() for col in df_py.columns.values]
return df_py
(
df_py
.groupby(["species", "island"], as_index=False)
.agg({
"bill_length_mm": ["mean", "median"],
"flipper_length_mm": ["count"]
})
.pipe(flatten_cols)
)
## species island bill_length_mm mean bill_length_mm median flipper_length_mm count
## 0 Adelie Biscoe 38.975000 38.70 44
## 1 Adelie Dream 38.501786 38.55 56
## 2 Adelie Torgersen 38.950980 38.90 51
## 3 Chinstrap Dream 48.833824 49.55 68
## 4 Gentoo Biscoe 47.504878 47.30 123
Keep only groups with an aggregate condition
Here we want to keep groups of species
and islands
with more than 70
penguins in them. In this case, dplyr and python have a very similar
API, the most complicated in here is the SQL one.
with high_penguins as (
select species, island from penguins group by 1, 2 having count(*) > 70
) select p.*
from penguins p
inner join high_penguins using(species, island)
df %>%
group_by(species, island) %>%
filter(n() > 70)
## # A tibble: 124 x 8
## # Groups: species, island [1]
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Gentoo Biscoe 46.1 13.2 211 4500
## 2 Gentoo Biscoe 50 16.3 230 5700
## 3 Gentoo Biscoe 48.7 14.1 210 4450
## 4 Gentoo Biscoe 50 15.2 218 5700
## 5 Gentoo Biscoe 47.6 14.5 215 5400
## 6 Gentoo Biscoe 46.5 13.5 210 4550
## 7 Gentoo Biscoe 45.4 14.6 211 4800
## 8 Gentoo Biscoe 46.7 15.3 219 5200
## 9 Gentoo Biscoe 43.3 13.4 209 4400
## 10 Gentoo Biscoe 46.8 15.4 215 5150
## # … with 114 more rows, and 2 more variables: sex <fct>, year <int>
(
df_py
.groupby(["species", "island"], as_index=False)
.filter(lambda x: len(x) > 70)
)
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
## 152 Gentoo Biscoe 46.1 13.2 211.0 4500.0 female 2007
## 153 Gentoo Biscoe 50.0 16.3 230.0 5700.0 male 2007
## 154 Gentoo Biscoe 48.7 14.1 210.0 4450.0 female 2007
## 155 Gentoo Biscoe 50.0 15.2 218.0 5700.0 male 2007
## 156 Gentoo Biscoe 47.6 14.5 215.0 5400.0 male 2007
## .. ... ... ... ... ... ... ... ...
## 271 Gentoo Biscoe NaN NaN NaN NaN NaN 2009
## 272 Gentoo Biscoe 46.8 14.3 215.0 4850.0 female 2009
## 273 Gentoo Biscoe 50.4 15.7 222.0 5750.0 male 2009
## 274 Gentoo Biscoe 45.2 14.8 212.0 5200.0 female 2009
## 275 Gentoo Biscoe 49.9 16.1 213.0 5400.0 male 2009
##
## [124 rows x 8 columns]
Create a column based on an aggregate calculation
Here we want to compute the average bill_length_mm
by species
and
island
as an extra column of the penguins dataframe, not just as a
summarised table. My least favorite is python since I cannot use a list
of chained operations, but I have to store partial results and assign
them later.
-- Doesn't work in sqlite
select
*,
avg(bill_length_mm) over (partition by species, island)
from penguins
df %>%
group_by(species, island) %>%
mutate(
mean_bill_length = mean(bill_length_mm, na.rm = T)
)
## # A tibble: 344 x 9
## # Groups: species, island [5]
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## 8 Adelie Torgersen 39.2 19.6 195 4675
## 9 Adelie Torgersen 34.1 18.1 193 3475
## 10 Adelie Torgersen 42 20.2 190 4250
## # … with 334 more rows, and 3 more variables: sex <fct>, year <int>,
## # mean_bill_length <dbl>
mean_bill_length = (
df_py
.loc[:, ["species", "island", "bill_length_mm"]]
.groupby(["species", "island"], as_index=False)
.transform(lambda x: x.mean())
)
df_py["mean_bill_length"] = mean_bill_length
df_py
## species island bill_length_mm bill_depth_mm ... body_mass_g sex year mean_bill_length
## 0 Adelie Torgersen 39.1 18.7 ... 3750.0 male 2007 38.950980
## 1 Adelie Torgersen 39.5 17.4 ... 3800.0 female 2007 38.950980
## 2 Adelie Torgersen 40.3 18.0 ... 3250.0 female 2007 38.950980
## 3 Adelie Torgersen NaN NaN ... NaN NaN 2007 38.950980
## 4 Adelie Torgersen 36.7 19.3 ... 3450.0 female 2007 38.950980
## .. ... ... ... ... ... ... ... ... ...
## 339 Chinstrap Dream 55.8 19.8 ... 4000.0 male 2009 48.833824
## 340 Chinstrap Dream 43.5 18.1 ... 3400.0 female 2009 48.833824
## 341 Chinstrap Dream 49.6 18.2 ... 3775.0 male 2009 48.833824
## 342 Chinstrap Dream 50.8 19.0 ... 4100.0 male 2009 48.833824
## 343 Chinstrap Dream 50.2 18.7 ... 3775.0 female 2009 48.833824
##
## [344 rows x 9 columns]
Count relative
For each species
I want to know the fraction of penguins of that
species
living in each island
.
In this case, dplyr is the clear winner since neither python nor SQL provide a very simple way of calcuating this.
with species_island_counts as (
select species, island, count(*) as n from penguins
), species_counts as (
select
species,
island,
sum(n) over (partition by species) as species_n,
n
from species_island_counts
) select
species,
island,
n / species_n as fraction
from species_counts
df %>%
count(species, island) %>%
group_by(species) %>%
mutate(fraction = n / sum(n))
## # A tibble: 5 x 4
## # Groups: species [3]
## species island n fraction
## <fct> <fct> <int> <dbl>
## 1 Adelie Biscoe 44 0.289
## 2 Adelie Dream 56 0.368
## 3 Adelie Torgersen 52 0.342
## 4 Chinstrap Dream 68 1
## 5 Gentoo Biscoe 124 1
full_counts = (
df_py
.groupby(["species", "island"])
.size()
.to_frame()
.reset_index()
.rename(columns={0: 'n'})
)
agg_counts = (
full_counts
.groupby("species")
.transform(lambda x: x.sum())["n"]
)
full_counts["agg_counts"] = agg_counts
full_counts["fraction"] = full_counts["n"] / full_counts["agg_counts"]
full_counts
## species island n agg_counts fraction
## 0 Adelie Biscoe 44 152 0.289474
## 1 Adelie Dream 56 152 0.368421
## 2 Adelie Torgersen 52 152 0.342105
## 3 Chinstrap Dream 68 68 1
## 4 Gentoo Biscoe 124 124 1
Agg with two columns
I want to compute the average of the ratio of bill_length_mm
and
bill_depth_mm
.
This is pretty easy in the three of them although a bit of tidying needs to be done in python to obtain a tidy dataframe.
SELECT species, island, avg(bill_depth_mm / bill_length_mm) FROM "penguins" group by 1, 2
df %>%
group_by(species) %>%
summarise(
depth_length_ratio = mean(bill_depth_mm / bill_length_mm, na.rm = T)
)
## # A tibble: 3 x 2
## species depth_length_ratio
## <fct> <dbl>
## 1 Adelie 0.474
## 2 Chinstrap 0.378
## 3 Gentoo 0.316
(
df_py
.groupby("species")
.apply(lambda d: (d["bill_depth_mm"] / d["bill_length_mm"]).mean())
.to_frame()
.reset_index()
.rename(columns={0: "depth_length_ratio"})
)
## species depth_length_ratio
## 0 Adelie 0.474333
## 1 Chinstrap 0.377923
## 2 Gentoo 0.315829
Conclusion
The main conclusion is that dplyr tends to be the easiest to use, with less code needed for most of the calculations.
The issue with pandas seems to be that the outputs are never tidy enough to keep working with them.