Group by operations comparison in pandas, dplyr and SQL

This is a comparison of the interfaces of pandas, dplyr and SQL to do several calculations based on groups. I’ve come accross many of these calculations very often at work.

The goal is to see which one is simpler to use in each case. Time benchmarks of several data analysis libraries, like dplyr, data.table, pandas, and dask are done very often. What is less common is to see how the API of these libraries compare and I think this is key when choosing one of them.

I’ve attempted to do things as simple as possible in each of the three languages, don’t hesitate to tell me if you’ve come up with simpler ways. I’ve also committed to the outputs of each of my calculations to be “tidy”. That is, if a dataframe comes in, I want a dataframe to come out. I don’t want the calculation to return a vector, array, series or matrix. I also don’t want to keep any row names or indexes, the indexing of the rows should always keep being from 1 to the number of rows.

Load libraries and data

First we load the Palmer Penguins dataset both in python and R. In SQL, we have a table called penguins on our database.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(palmerpenguins)

df <- palmerpenguins::penguins

import pandas as pd
from palmerpenguins import load_penguins

df_py = load_penguins()
df_py

##        species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex  year
## 0       Adelie  Torgersen            39.1           18.7              181.0       3750.0    male  2007
## 1       Adelie  Torgersen            39.5           17.4              186.0       3800.0  female  2007
## 2       Adelie  Torgersen            40.3           18.0              195.0       3250.0  female  2007
## 3       Adelie  Torgersen             NaN            NaN                NaN          NaN     NaN  2007
## 4       Adelie  Torgersen            36.7           19.3              193.0       3450.0  female  2007
## ..         ...        ...             ...            ...                ...          ...     ...   ...
## 339  Chinstrap      Dream            55.8           19.8              207.0       4000.0    male  2009
## 340  Chinstrap      Dream            43.5           18.1              202.0       3400.0  female  2009
## 341  Chinstrap      Dream            49.6           18.2              193.0       3775.0    male  2009
## 342  Chinstrap      Dream            50.8           19.0              210.0       4100.0    male  2009
## 343  Chinstrap      Dream            50.2           18.7              198.0       3775.0  female  2009
## 
## [344 rows x 8 columns]

Count based on 1 column

This is by far the simplest of the calculations. We want to count how many penguins are there for each of the species. Python seems to be the more complicated in this case. This is mainly because the size method returns a series, and we have to rename it to have a meaningful name for the count.

SELECT species, count(*) FROM "penguins" group by 1

df %>% count(species)

## # A tibble: 3 x 2
##   species       n
##   <fct>     <int>
## 1 Adelie      152
## 2 Chinstrap    68
## 3 Gentoo      124

(
    df_py
    .groupby("species")
    .size()
    .to_frame()
    .reset_index()
    .rename(columns={0: 'n'})
)

##      species    n
## 0     Adelie  152
## 1  Chinstrap   68
## 2     Gentoo  124

Count based on 2 columns

This is a very similar case as the previous one. The only difference is that we’re counting the number of penguins by species and island.

SELECT species, island, count(*) FROM "penguins" group by 1, 2

df %>% count(species, island)

## # A tibble: 5 x 3
##   species   island        n
##   <fct>     <fct>     <int>
## 1 Adelie    Biscoe       44
## 2 Adelie    Dream        56
## 3 Adelie    Torgersen    52
## 4 Chinstrap Dream        68
## 5 Gentoo    Biscoe      124

(
    df_py
    .groupby(["species", "island"])
    .size()
    .to_frame()
    .reset_index()
    .rename(columns={0: 'n'})
)

##      species     island    n
## 0     Adelie     Biscoe   44
## 1     Adelie      Dream   56
## 2     Adelie  Torgersen   52
## 3  Chinstrap      Dream   68
## 4     Gentoo     Biscoe  124

Many aggregates per column

In here we want to compute several aggregations on different columns. For every species and island we want:

The average bill_length_mm
The median bill_length_mm
The number of penguins whose flipper_length_mm is not null.

Python has a nice thing that allows to compute several aggregations functions for a given column. However, the price to pay is that the obtained dataframe is not tidy. This is because we have multiindexed columns, and we can use the flatten_cols to transform the columns to a list. I use it very often to tidy my multiindexed dataframes.

SELECT 
    species, 
    island, 
    avg(bill_length_mm),
    median(bill_length_mm), 
    sum(flipper_length_mm is not null) FROM "penguins" 
group by 1, 2

df %>% 
  group_by(species, island) %>% 
  summarise(
    mean_bill_length_mm = mean(bill_length_mm, na.rm = T),
    median_bill_length_mm = median(bill_length_mm, na.rm = T),
    not_null_flipper = sum(!is.na(flipper_length_mm))
  )

## `summarise()` has grouped output by 'species'. You can override using the `.groups` argument.

## # A tibble: 5 x 5
## # Groups:   species [3]
##   species   island    mean_bill_length_mm median_bill_length_mm not_null_flipper
##   <fct>     <fct>                   <dbl>                 <dbl>            <int>
## 1 Adelie    Biscoe                   39.0                  38.7               44
## 2 Adelie    Dream                    38.5                  38.6               56
## 3 Adelie    Torgersen                39.0                  38.9               51
## 4 Chinstrap Dream                    48.8                  49.6               68
## 5 Gentoo    Biscoe                   47.5                  47.3              123

def flatten_cols(df_py):
    df_py.columns = [' '.join(col).strip() for col in df_py.columns.values]
    return df_py

(
    df_py
    .groupby(["species", "island"], as_index=False)
    .agg({
        "bill_length_mm": ["mean", "median"],
        "flipper_length_mm": ["count"]
    })
    .pipe(flatten_cols)
)

##      species     island  bill_length_mm mean  bill_length_mm median  flipper_length_mm count
## 0     Adelie     Biscoe            38.975000                  38.70                       44
## 1     Adelie      Dream            38.501786                  38.55                       56
## 2     Adelie  Torgersen            38.950980                  38.90                       51
## 3  Chinstrap      Dream            48.833824                  49.55                       68
## 4     Gentoo     Biscoe            47.504878                  47.30                      123

Keep only groups with an aggregate condition

Here we want to keep groups of species and islands with more than 70 penguins in them. In this case, dplyr and python have a very similar API, the most complicated in here is the SQL one.

with high_penguins as (
    select species, island from penguins group by 1, 2 having count(*) > 70
) select p.* 
from penguins p
inner join high_penguins using(species, island)

df %>% 
  group_by(species, island) %>%
  filter(n() > 70)

## # A tibble: 124 x 8
## # Groups:   species, island [1]
##    species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
##  1 Gentoo  Biscoe           46.1          13.2               211        4500
##  2 Gentoo  Biscoe           50            16.3               230        5700
##  3 Gentoo  Biscoe           48.7          14.1               210        4450
##  4 Gentoo  Biscoe           50            15.2               218        5700
##  5 Gentoo  Biscoe           47.6          14.5               215        5400
##  6 Gentoo  Biscoe           46.5          13.5               210        4550
##  7 Gentoo  Biscoe           45.4          14.6               211        4800
##  8 Gentoo  Biscoe           46.7          15.3               219        5200
##  9 Gentoo  Biscoe           43.3          13.4               209        4400
## 10 Gentoo  Biscoe           46.8          15.4               215        5150
## # … with 114 more rows, and 2 more variables: sex <fct>, year <int>

(
    df_py
    .groupby(["species", "island"], as_index=False)
    .filter(lambda x: len(x) > 70)
)

##     species  island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex  year
## 152  Gentoo  Biscoe            46.1           13.2              211.0       4500.0  female  2007
## 153  Gentoo  Biscoe            50.0           16.3              230.0       5700.0    male  2007
## 154  Gentoo  Biscoe            48.7           14.1              210.0       4450.0  female  2007
## 155  Gentoo  Biscoe            50.0           15.2              218.0       5700.0    male  2007
## 156  Gentoo  Biscoe            47.6           14.5              215.0       5400.0    male  2007
## ..      ...     ...             ...            ...                ...          ...     ...   ...
## 271  Gentoo  Biscoe             NaN            NaN                NaN          NaN     NaN  2009
## 272  Gentoo  Biscoe            46.8           14.3              215.0       4850.0  female  2009
## 273  Gentoo  Biscoe            50.4           15.7              222.0       5750.0    male  2009
## 274  Gentoo  Biscoe            45.2           14.8              212.0       5200.0  female  2009
## 275  Gentoo  Biscoe            49.9           16.1              213.0       5400.0    male  2009
## 
## [124 rows x 8 columns]

Create a column based on an aggregate calculation

Here we want to compute the average bill_length_mm by species and island as an extra column of the penguins dataframe, not just as a summarised table. My least favorite is python since I cannot use a list of chained operations, but I have to store partial results and assign them later.

-- Doesn't work in sqlite
select 
    *, 
    avg(bill_length_mm) over (partition by species, island) 
from penguins

df %>% 
  group_by(species, island) %>% 
  mutate(
    mean_bill_length = mean(bill_length_mm, na.rm = T)
  )

## # A tibble: 344 x 9
## # Groups:   species, island [5]
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # … with 334 more rows, and 3 more variables: sex <fct>, year <int>,
## #   mean_bill_length <dbl>

mean_bill_length = (
    df_py
    .loc[:, ["species", "island", "bill_length_mm"]]
    .groupby(["species", "island"], as_index=False)
    .transform(lambda x: x.mean())
)

df_py["mean_bill_length"] = mean_bill_length

df_py

##        species     island  bill_length_mm  bill_depth_mm  ...  body_mass_g     sex  year  mean_bill_length
## 0       Adelie  Torgersen            39.1           18.7  ...       3750.0    male  2007         38.950980
## 1       Adelie  Torgersen            39.5           17.4  ...       3800.0  female  2007         38.950980
## 2       Adelie  Torgersen            40.3           18.0  ...       3250.0  female  2007         38.950980
## 3       Adelie  Torgersen             NaN            NaN  ...          NaN     NaN  2007         38.950980
## 4       Adelie  Torgersen            36.7           19.3  ...       3450.0  female  2007         38.950980
## ..         ...        ...             ...            ...  ...          ...     ...   ...               ...
## 339  Chinstrap      Dream            55.8           19.8  ...       4000.0    male  2009         48.833824
## 340  Chinstrap      Dream            43.5           18.1  ...       3400.0  female  2009         48.833824
## 341  Chinstrap      Dream            49.6           18.2  ...       3775.0    male  2009         48.833824
## 342  Chinstrap      Dream            50.8           19.0  ...       4100.0    male  2009         48.833824
## 343  Chinstrap      Dream            50.2           18.7  ...       3775.0  female  2009         48.833824
## 
## [344 rows x 9 columns]

Count relative

For each species I want to know the fraction of penguins of that species living in each island.

In this case, dplyr is the clear winner since neither python nor SQL provide a very simple way of calcuating this.

with species_island_counts as (
    select species, island, count(*) as n from penguins
), species_counts as (
    select 
        species, 
        island, 
        sum(n) over (partition by species) as species_n,
        n
    from species_island_counts
) select 
    species, 
    island, 
    n / species_n as fraction 
    from species_counts

df %>% 
  count(species, island) %>% 
  group_by(species) %>% 
  mutate(fraction = n / sum(n))

## # A tibble: 5 x 4
## # Groups:   species [3]
##   species   island        n fraction
##   <fct>     <fct>     <int>    <dbl>
## 1 Adelie    Biscoe       44    0.289
## 2 Adelie    Dream        56    0.368
## 3 Adelie    Torgersen    52    0.342
## 4 Chinstrap Dream        68    1    
## 5 Gentoo    Biscoe      124    1

full_counts = (
    df_py
    .groupby(["species", "island"])
    .size()
    .to_frame()
    .reset_index()
    .rename(columns={0: 'n'})
)

agg_counts = (
    full_counts
    .groupby("species")
    .transform(lambda x: x.sum())["n"]
)

full_counts["agg_counts"] = agg_counts

full_counts["fraction"] = full_counts["n"] / full_counts["agg_counts"]

full_counts

##      species     island    n agg_counts  fraction
## 0     Adelie     Biscoe   44        152  0.289474
## 1     Adelie      Dream   56        152  0.368421
## 2     Adelie  Torgersen   52        152  0.342105
## 3  Chinstrap      Dream   68         68         1
## 4     Gentoo     Biscoe  124        124         1

Agg with two columns

I want to compute the average of the ratio of bill_length_mm and bill_depth_mm.

This is pretty easy in the three of them although a bit of tidying needs to be done in python to obtain a tidy dataframe.

SELECT species, island, avg(bill_depth_mm / bill_length_mm) FROM "penguins" group by 1, 2

df %>% 
  group_by(species) %>% 
  summarise(
    depth_length_ratio = mean(bill_depth_mm / bill_length_mm, na.rm = T)
  )

## # A tibble: 3 x 2
##   species   depth_length_ratio
##   <fct>                  <dbl>
## 1 Adelie                 0.474
## 2 Chinstrap              0.378
## 3 Gentoo                 0.316

(
    df_py
    .groupby("species")
    .apply(lambda d: (d["bill_depth_mm"] / d["bill_length_mm"]).mean())
    .to_frame()
    .reset_index()
    .rename(columns={0: "depth_length_ratio"})
)

##      species  depth_length_ratio
## 0     Adelie            0.474333
## 1  Chinstrap            0.377923
## 2     Gentoo            0.315829

Conclusion

The main conclusion is that dplyr tends to be the easiest to use, with less code needed for most of the calculations.

The issue with pandas seems to be that the outputs are never tidy enough to keep working with them.